
# 8-2 Extreme Gradient Boosting with XGBoost - Regression with XGBoost

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import root_mean_squared_error, mean_squared_error
from sklearn.metrics import mean_absolute_error
import xgboost as xgb
from sklearn.model_selection import train_test_split

## Data

In [2]:
base_url = 'https://drive.google.com/uc?id='

### Ames Housing

In [3]:
id = '1SOsLBYrLdV5YHHnZB5TYpA0ioTwWWCX5'
ames = pd.read_csv(base_url + id)
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 57 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MSSubClass            1460 non-null   int64  
 1   LotFrontage           1460 non-null   float64
 2   LotArea               1460 non-null   int64  
 3   OverallQual           1460 non-null   int64  
 4   OverallCond           1460 non-null   int64  
 5   YearBuilt             1460 non-null   int64  
 6   Remodeled             1460 non-null   int64  
 7   GrLivArea             1460 non-null   int64  
 8   BsmtFullBath          1460 non-null   int64  
 9   BsmtHalfBath          1460 non-null   int64  
 10  FullBath              1460 non-null   int64  
 11  HalfBath              1460 non-null   int64  
 12  BedroomAbvGr          1460 non-null   int64  
 13  Fireplaces            1460 non-null   int64  
 14  GarageArea            1460 non-null   int64  
 15  MSZoning_FV          

## Regression Review

### Regression basics

- Outcome is real-valued

### Common regression metrics

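- Root Mean Squared Error (RMSE)
  - most common
  - squares each error before averaging, so a few large errors dominate the score
- Mean Absolute Error (MAE)
  - weights each error in proportion to its size, so it is less affected by large errors than RMSE
  - not differentiable at zero error, which makes it less convenient to optimize, so it is not as common as RMSE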
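
For reference, the two metrics are usually defined as follows. The notation is introduced here and is not part of the original notes: $y_i$ are the actual values, $\hat{y}_i$ the predictions, and $n$ the number of observations.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i \right\rvert
$$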

In [4]:
y_actual = np.array([10, 3, 6])
y_pred = np.array([20, 8, 1])

In [5]:
rmse = np.sqrt(np.mean((y_actual - y_pred)**2))
rmse.item(), root_mean_squared_error(y_actual, y_pred)

(7.0710678118654755, 7.0710678118654755)

In [6]:
mae = np.abs(y_actual - y_pred).mean()
mae.item(), mean_absolute_error(y_actual, y_pred)

(6.666666666666667, 6.666666666666667)
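
A small sketch of why RMSE is more affected by large differences than MAE. The arrays are hypothetical toy values chosen so that both cases have the same total absolute error; only its distribution changes.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.zeros(4)                                # hypothetical targets
even_errors = np.array([5.0, 5.0, 5.0, 5.0])        # four errors of size 5
one_big_error = np.array([0.0, 0.0, 0.0, 20.0])     # same total error, in one point

print(mae(y_true, even_errors), rmse(y_true, even_errors))      # 5.0 5.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 5.0 10.0
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in a single prediction.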

### Common regression algorithms

- Linear regression
- Decision trees

## Objective (loss) functions and base learners

### Objective Functions and Why We Use Them

- Quantifies how far off a prediction is from the actual result for a given data point
- Measures the difference between estimated and true values for some collection of data
- **GOAL**: Find the model that yields the minimum value of the loss function
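
The goal of minimizing a loss can be illustrated with the simplest possible "model", a single constant prediction. The sketch below uses hypothetical toy targets and a brute-force grid search; it only shows that, under squared-error loss, the best constant is (up to the grid resolution) the mean of the targets.

```python
import numpy as np

y = np.array([10.0, 3.0, 6.0])  # hypothetical toy targets

def squared_error_loss(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

# Try every constant prediction from 0 to 12 in steps of 0.01
candidates = np.linspace(0.0, 12.0, 1201)
losses = np.array([squared_error_loss(y, c) for c in candidates])

best = candidates[np.argmin(losses)]
print(best, y.mean())  # the grid minimizer lies next to the mean of y
```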

### Common loss functions and XGBoost

- Loss function names in `xgboost`:
  - `"reg:squarederror"` - use for regression problems
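  - `"reg:logistic"` - logistic regression; despite the `reg:` prefix, predictions are probabilities, so threshold them yourself if you want just a decision
  - `"binary:logistic"` - logistic regression for binary classification; predictions are probabilities (the usual choice for binary classifiers)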

### Base learners and why we need them

- XGBoost involves creating a meta-model that is composed of many individual models that combine to give a final prediction
- Individual models = base learners
- Want base learners that when combined create a final prediction that is **non-linear**
- Each base learner should be good at distinguishing or predicting different parts of the dataset
- Two kinds of base learners:
  1. Tree
  2. Linear
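
How many weak base learners combine into a non-linear prediction can be sketched with plain gradient boosting on a toy problem. This is an illustration under stated assumptions, not XGBoost's actual algorithm (which also uses regularization and second-order gradient information): the base learners are depth-1 trees (stumps), the data is a hypothetical 1-D sine curve, and `fit_stump`, `predict_stump`, and `learning_rate` are names made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    """Piecewise-constant prediction of one stump."""
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Hypothetical toy data: a smooth non-linear target
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)

learning_rate = 0.5
pred = np.full_like(y, y.mean())  # start from a constant prediction
stumps, errors = [], []

for _ in range(20):
    residual = y - pred                      # what the ensemble still gets wrong
    stump = fit_stump(x, residual)           # next base learner targets the residual
    pred = pred + learning_rate * predict_stump(stump, x)
    stumps.append(stump)
    errors.append(float(np.sqrt(np.mean((y - pred) ** 2))))

print(errors[0], errors[-1])  # training RMSE after the first and the last round
```

Each stump on its own is a crude step function, but the sum of 20 of them follows the curve far more closely than any single one.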

### Trees as base learners example: Scikit-learn API

In [7]:
X, y = ames.iloc[:, :-1], ames.iloc[:, -1]
X.shape, y.shape

((1460, 56), (1460,))

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=123
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1168, 56), (292, 56), (1168,), (292,))

In [9]:
xg_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=10,
    seed=123
)

xg_reg

In [10]:
xg_reg.fit(X_train, y_train)

In [11]:
preds = xg_reg.predict(X_test)

In [12]:
# rmse
root_mean_squared_error(y_test, preds)

31292.9765625

In [13]:
# Alternatively, use the native (learning) API

DM_train = xgb.DMatrix(X_train, label=y_train)
DM_test = xgb.DMatrix(X_test, y_test)

params = {
    'objective': 'reg:squarederror',
    'seed': 123
}

booster = xgb.train(params, DM_train, num_boost_round=10)

preds_alt = booster.predict(DM_test)

root_mean_squared_error(y_test, preds_alt)

31292.9765625

### Linear base learners example: learning API only

In [14]:
params = {
    'booster': 'gblinear',
    'objective': 'reg:squarederror'
}

xg_reg = xgb.train(params, dtrain=DM_train, num_boost_round=10)
xg_reg

<xgboost.core.Booster at 0x7d074c913d90>

In [15]:
preds = xg_reg.predict(DM_test)

In [16]:
root_mean_squared_error(y_test, preds)

42602.24609375

### Decision trees as base learners

In [17]:
X, y = ames.iloc[:, :-1], ames.iloc[:, -1]
X.shape, y.shape

((1460, 56), (1460,))

In [18]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
# booster='gbtree' is the default
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10,
                          booster='gbtree', seed=123)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 31292.976337


### Linear Base Learners

In [19]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(X_train, y_train)
DM_test = xgb.DMatrix(X_test, y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 44756.794881


### Evaluating model quality

In [20]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    nfold=4, num_boost_round=5,
    metrics='rmse',
    as_pandas=True, seed=123)

# Print cv_results
print(cv_results)
print('\n')

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     61729.274347      679.377718    63760.373921    2933.496745
1     49654.722560      757.696043    53641.473273    3504.687699
2     41325.179705      702.570217    46796.539109    3500.230673
3     35351.338939      772.520024    41986.507917    4018.899377
4     31020.037762      574.099506    39337.103754    4583.588151


4    39337.103754
Name: test-rmse-mean, dtype: float64
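
One way to read the RMSE table above is to compare the train and test columns round by round. The sketch below copies the printed means into arrays; `gap` and `best_round` are names introduced for this illustration.

```python
import numpy as np

# Mean RMSE per boosting round, copied from the cross-validation output above
train_rmse = np.array([61729.274347, 49654.722560, 41325.179705,
                       35351.338939, 31020.037762])
test_rmse = np.array([63760.373921, 53641.473273, 46796.539109,
                      41986.507917, 39337.103754])

gap = test_rmse - train_rmse            # how much worse the held-out folds are
best_round = int(np.argmin(test_rmse))  # round with the lowest mean test RMSE

for i, g in enumerate(gap):
    print(f"round {i}: test - train = {g:.1f}")
print("best round:", best_round)
```

The test RMSE is still falling at the last round, which suggests that more boosting rounds might help, while the widening gap shows the model fitting the training folds increasingly more closely than the held-out ones.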


In [21]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params, nfold=4,
    num_boost_round=5,
    metrics='mae',
    as_pandas=True,
    seed=123)

# Print cv_results
print(cv_results)
print('\n')

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))

   train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
0    43978.370783     265.516103   44551.030843    875.854839
1    34677.517623     229.638967   35869.958037   1015.100363
2    28338.053913     290.958119   30144.292723    902.020896
3    24076.657948     451.048602   26492.798309    835.506913
4    21115.815254     428.025143   24289.425664    994.137976


4    24289.425664
Name: test-mae-mean, dtype: float64


## Regularization and base learners in XGBoost