<a href="https://colab.research.google.com/github/Keoni808/KaggleCourses/blob/main/Intermediate_Machine_Learning/XGBoost/XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Mount your Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -rT '/content/drive/MyDrive/Colab Notebooks/KaggleCourses/Intermediate Machine Learning/XGBoost/XGB_DATA' .

In [3]:
# imports

# for data manipulation
import pandas as pd

# To split dataset into training / testing
from sklearn.model_selection import train_test_split

# loss function
from sklearn.metrics import mean_absolute_error

# extreme gradient boosting
# includes several additional features focused on performance and speed
# XGBRegressor - allows us to build and fit a model just as we would in scikit-learn
from xgboost import XGBRegressor

- Gradient boosting is the focus of this chapter.

- So far we have used the random forest method. A random forest performs better than a single decision tree because it averages the predictions of many decision trees.
  - This is known as an *ensemble method*.
- Ensemble method: a method that combines the predictions of several models.

- GRADIENT BOOSTING is a type of ensemble method
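
To make the idea of combining predictions concrete, here is a minimal sketch in plain Python. The five "models" are made up for illustration: each one simply predicts the true target plus independent random noise, which is a stand-in for real models and not how a random forest builds its trees.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


random.seed(0)

# Hypothetical targets, and five "models" that each predict target + noise
y_true = [float(i) for i in range(100)]
model_predictions = [
    [t + random.gauss(0, 10) for t in y_true]
    for _ in range(5)
]

# Ensemble prediction: average the models' predictions for each observation
ensemble_prediction = [
    sum(preds) / len(preds) for preds in zip(*model_predictions)
]

individual_maes = [mean_absolute_error(y_true, p) for p in model_predictions]
ensemble_mae = mean_absolute_error(y_true, ensemble_prediction)

print("Individual MAEs:", [round(m, 2) for m in individual_maes])
print("Ensemble MAE:", round(ensemble_mae, 2))
```

The MAE of the averaged prediction can never exceed the average of the individual MAEs (this follows from the triangle inequality), and when the models' errors are independent it is usually noticeably lower.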

# Gradient Boosting

- Goes through cycles to iteratively add models into an ensemble.
- Initializes the ensemble with a single model.
  - This first model can be fairly naive: even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.


**The Cycle**
1. The current ensemble is used to generate predictions for each observation in the dataset. To make a prediction, we add up the predictions from all models in the ensemble.
2. The predictions are used to calculate a loss function (like mean squared error).
3. The loss function is then used to fit a new model to add to the ensemble.
   - We determine the new model's parameters so that adding it to the ensemble will reduce the loss. (The "gradient" in "gradient boosting" refers to using gradient descent on the loss function to determine these parameters.)
4. Add the new model to the ensemble.
5. Repeat.
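
The cycle above can be sketched from scratch in plain Python. This is a simplified illustration under stated assumptions, not how XGBoost is implemented: it uses a single feature, one-split regression "stumps" as the models, squared error as the loss (for which the residuals are proportional to the negative gradient), and a `learning_rate` factor that scales each new model's contribution. The helper names `fit_stump` and `boost` and the toy data are made up for this example.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def mean_squared_error(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Return the ensemble's training predictions and the loss after each round."""
    # Initialize the ensemble with a single naive model: the mean of y
    base = sum(y) / len(y)
    preds = [base] * len(y)
    losses = [mean_squared_error(y, preds)]
    for _ in range(n_rounds):
        # Steps 1-2: use the current ensemble's predictions to measure the error
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        # Step 3: fit a new model to what the ensemble still gets wrong
        stump = fit_stump(x, residuals)
        # Step 4: add the (scaled) new model to the ensemble
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        losses.append(mean_squared_error(y, preds))
    return preds, losses


# Toy data, made up for illustration: y = x^2
x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]
preds, losses = boost(x, y)
print("Loss before boosting:", round(losses[0], 4))
print("Loss after 20 rounds:", round(losses[-1], 4))
```

Each pass through the loop is one trip through the cycle, so `n_rounds` plays the role of the number of models added to the ensemble.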

# Example

In [4]:
# Read the data
data = pd.read_csv('melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

X_train, X_valid, y_train, y_valid = train_test_split(X, y)

In [5]:
my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [6]:
predictions = my_model.predict(X_valid)
# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
print("Mean Absolute Error:", mean_absolute_error(y_valid, predictions))

Mean Absolute Error: 235331.61681240794


# Parameter Tuning

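**Some parameters that can dramatically affect accuracy and training speed**

- n_estimators

  - How many times to go through the modeling cycle described above.
  - Equal to the number of models included in the ensemble.

    - Too low a value causes underfitting: inaccurate predictions on both the training data and the test data.
    - Too high a value causes overfitting: accurate predictions on the training data, but inaccurate predictions on the test data.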

- learning_rate
  - Instead of getting predictions by simply adding up the predictions from each model, we can multiply the predictions from each model by a small number, known as the *learning rate*, before adding them in.
    - This means that each tree we add to the ensemble helps less.
      - We can therefore set n_estimators higher without overfitting.

- early_stopping_rounds
  - Offers a way to automatically find a good value for n_estimators.
  - Stops iterating when the validation score stops improving, even if n_estimators has not been reached.
  - We need to set how many rounds of deterioration to allow before stopping.
  - We also need to set aside some data for calculating the validation scores.
    - This is done with the eval_set parameter.

- n_jobs
  - On large datasets where runtime is a consideration, you can use parallelism to build your models faster.
  - It is common to set n_jobs equal to the number of cores on your machine.
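
As a small sketch of the n_jobs advice, the core count can be looked up with the standard library instead of being hard-coded. The `params` dictionary is a hypothetical set of keyword arguments that could be passed on as `XGBRegressor(**params)`; the specific values are only illustrative.

```python
import os

# Number of logical CPU cores reported by the operating system.
# os.cpu_count() can return None, so fall back to 1 in that case.
n_cores = os.cpu_count() or 1
print("Logical cores available:", n_cores)

# Hypothetical keyword arguments, e.g. for XGBRegressor(**params)
params = {"n_estimators": 1000, "learning_rate": 0.05, "n_jobs": n_cores}
```

Note that `os.cpu_count()` counts logical cores, which may be higher than the number of physical cores on machines with hyper-threading.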



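**IN GENERAL**

- A small learning rate combined with a large number of estimators tends to yield more accurate XGBoost models, but the model also takes longer to train because it goes through the cycle more times.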

In [7]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

In [8]:
# Modify the example above to include early stopping.

# The purpose of eval_set is to monitor how the model is doing on held-out data.
# The eval_set is not used to fit the trees; it only decides when to stop.

my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)
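
The stopping rule itself can be illustrated without XGBoost. The sketch below is a simplified stand-in, not XGBoost's actual implementation: the helper `find_stopping_round` and the list of validation errors are made up, and the `patience` argument plays the role of early_stopping_rounds.

```python
def find_stopping_round(validation_errors, patience):
    """Return (best_round, stopped_at) for a list of per-round validation errors.

    Training stops once `patience` rounds have passed without a new
    best (lowest) validation error.
    """
    best_round = 0
    for current_round, error in enumerate(validation_errors):
        if error < validation_errors[best_round]:
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    # Ran out of rounds before the patience was exhausted
    return best_round, len(validation_errors) - 1


# Hypothetical validation errors: improve for a while, then deteriorate
errors = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.4, 6.8, 6.9, 7.0, 7.1, 7.2]
best_round, stopped_at = find_stopping_round(errors, patience=5)
print("Best round:", best_round, "stopped at round:", stopped_at)
```

With a patience of 5, the brief deterioration at rounds 4 and 5 is tolerated, so the later improvement at round 6 is still found. A smaller patience would have stopped before reaching it, which is why the value should not be set too low.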

The model below also sets learning_rate and n_jobs, which helps when working with large datasets.

In [10]:
my_model = XGBRegressor(n_estimators=1000, early_stopping_rounds=5, learning_rate=0.05, n_jobs=6)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

In [11]:
preds = my_model.predict(X_valid)

In [12]:
mae_1 = mean_absolute_error(y_valid, preds)
print("Mean Absolute Error:", mae_1)

Mean Absolute Error: 238194.96362297496


XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos).

# EXERCISE EXAMPLE QUESTIONS

In [14]:
# Read the data
X = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns
                        if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns
                if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
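
To see why the align step is needed after one-hot encoding each split separately, here is a minimal sketch on two made-up frames (the column names and values are hypothetical). The validation frame is missing some categories that appear in training and contains one that training never saw.

```python
import pandas as pd

# Hypothetical toy frames: "green" and "blue" never appear in the validation
# split, and "purple" never appears in the training split
train = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "blue"]})
valid = pd.DataFrame({"size": [4, 5], "color": ["red", "purple"]})

# Encoding each frame separately produces different sets of dummy columns
train_encoded = pd.get_dummies(train)
valid_encoded = pd.get_dummies(valid)

# join='left' keeps exactly the columns of the left-hand (training) frame
train_aligned, valid_aligned = train_encoded.align(valid_encoded, join="left", axis=1)

print(list(train_aligned.columns))
print(valid_aligned)
```

After aligning, the validation frame has the same columns in the same order as the training frame. Columns that existed only in validation are dropped, and columns that were missing from validation are filled with NaN, which XGBoost treats as missing values.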

# Step 1: Build model

In [20]:
# Part A
# Build and train your model with gradient boosting

# - Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the
#   random seed to 0 (random_state=0). Leave all other parameters as default
# - Then, fit the model to the training data in X_train and y_train.

from xgboost import XGBRegressor

# Define the model
my_model_1 = XGBRegressor(random_state=0)

# Fit the model
my_model_1.fit(X_train, y_train)

In [21]:
# Part B
# Set predictions_1 to a model's predictions for the validation data. Recall that the validation
# features are stored in X_valid

# Get predictions
predictions_1 = my_model_1.predict(X_valid)

In [22]:
# Part C
# Use mean_absolute_error() to calculate the MAE of the predictions for the validation set

mae_1 = mean_absolute_error(y_valid, predictions_1)
print("Mean Absolute Error:", mae_1)

Mean Absolute Error: 18161.82412510702


# Step 2: Improve the model

In [29]:
# Begin by setting my_model_2 to an XGBoost model using XGBRegressor.
# Change the default parameters (such as n_estimators and learning_rate)
# to get better results.

my_model_2 = XGBRegressor(n_estimators=1000, early_stopping_rounds=5, learning_rate=0.05)

# fit the model to the training data in X_train and y_train
my_model_2.fit(X_train, y_train,
               eval_set=[(X_valid, y_valid)],
               verbose=False)

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(y_valid, predictions_2)

print("Mean Absolute Error:", mae_2)

Mean Absolute Error: 17032.769063035103


# Step 3: Break the model

In [31]:
# The purpose is to build a model that performs worse than my_model_2 above,
# to get a better understanding of how the parameters affect the results.
my_model_3 = XGBRegressor(n_estimators=500, early_stopping_rounds=5, learning_rate=0.5)

# fit the model to the training data in X_train and y_train
my_model_3.fit(X_train, y_train,
               eval_set=[(X_valid, y_valid)],
               verbose=False)

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(y_valid, predictions_3)

print("Mean Absolute Error:", mae_3)

Mean Absolute Error: 19681.582258668663
