
# 6-4 Machine Learning with Tree-Based Models in Python - Boosting

## Imports

In [1]:
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor

## Data

In [2]:
base_url = 'https://drive.google.com/uc?id='

### Wisconsin Breast-Cancer Dataset

In [3]:
id = '1oqwkLiOXsHomv_Nhm4JhEUf0GQE8h1rp'
breast = pd.read_csv(base_url + id)
breast = breast.drop(['id', 'Unnamed: 32'], axis=1)
breast['diagnosis'] = (breast['diagnosis'] == 'M').astype(int)
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

### Indian Liver Patient Dataset

In [4]:
id = '1ZIKZwQV88fV7RFUSkhrTbGWGBxYxp9Rh'
liver = pd.read_csv(base_url + id)
liver.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


### Auto-mpg Dataset

In [5]:
id = '14qqT73DvmgD0dx9zkcs3pxRLMCwSANii'
auto = pd.read_csv(base_url + id)
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     392 non-null    float64
 1   displ   392 non-null    float64
 2   hp      392 non-null    int64  
 3   weight  392 non-null    int64  
 4   accel   392 non-null    float64
 5   origin  392 non-null    object 
 6   size    392 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 21.6+ KB


## Adaboost

### Boosting

- **Boosting** is an ensemble method in which many predictors are trained and each predictor learns from the errors of its predecessor
- **Boosting** (more formally): Ensemble method combining several weak learners to form a strong learner
- **Weak learner**: Model doing slightly better than random guessing
  - Example of a weak learner: Decision stump (CART whose maximum depth is 1)
- Train an ensemble of predictors sequentially
- Each predictor tries to correct its predecessor
- Popular boosting methods explored in this course:
  - AdaBoost
  - Gradient Boosting
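
As a rough illustration of the weak-learner idea, the sketch below fits a single decision stump and an AdaBoost ensemble of stumps on the same data. The synthetic dataset from `make_classification` and all variable names are assumptions made for this example; they are not part of the course material, and the accuracies will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration, not a course dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# A single weak learner: a decision stump (CART with max_depth=1)
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_tr, y_tr)
stump_acc = stump.score(X_te, y_te)

# An ensemble of weak learners; the default base estimator of
# AdaBoostClassifier is a depth-1 decision tree
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_tr, y_tr)
boosted_acc = boosted.score(X_te, y_te)

print('Single stump accuracy: {:.3f}'.format(stump_acc))
print('Boosted stumps accuracy: {:.3f}'.format(boosted_acc))
```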

### What if you use strong learners?

- If you use strong learners, you could still get good performance, but boosting loses its magic.
- It would be harder to control overfitting.
- It would take longer to train (strong learners are usually more complex).
- You might need to regularize heavily (for example, using shallow trees, limiting depth, adding shrinkage/learning rate).

👉 That’s why in practical boosting methods like Gradient Boosting Machines or XGBoost, they use shallow trees (e.g., depth 3–5) as the base learners.

### AdaBoost

- Stands for **Ada**ptive **Boost**ing
- Each predictor pays more attention to the instances wrongly predicted by its predecessor
  - Achieved by changing the weights of training instances
- Each predictor is assigned a coefficient, $\alpha$
  - $\alpha$ depends on the predictor's training error

<img src='https://drive.google.com/uc?export=view&id=101nnlow_NJd9FAetsqtxyLcs21vvMjA3'>

- First, Predictor 1 is trained on the initial dataset, $(X,y)$
    - The training error for Predictor 1 is determined
    - This error can then be used to determine $\alpha_{1}$, which is Predictor 1's coefficient
- $\alpha_{1}$ is then used to determine the weights, $W^{(2)}$, of the training instances for Predictor 2
    - Notice how the incorrectly predicted instances (shown in green) acquire higher weights
    - When weighted instances are used to train Predictor 2, this predictor is forced to pay more attention to the incorrectly predicted instances
- This process is repeated sequentially, until the $N$ predictors forming the ensemble are trained
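
The notes do not give the formulas for $\alpha$ or for the weight update, so the sketch below assumes one common textbook formulation of discrete AdaBoost for binary labels: the weighted error rate of a predictor gives $\alpha = \eta \ln\frac{1 - \text{err}}{\text{err}}$, and the weights of misclassified instances are multiplied by $e^{\alpha}$ before renormalizing. The synthetic data and the names `w1`, `w2`, `err_1` and `alpha_1` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
n = len(y_demo)

# W^(1): every training instance starts with the same weight
w1 = np.full(n, 1 / n)

# Predictor 1 is trained on the weighted dataset
predictor_1 = DecisionTreeClassifier(max_depth=1, random_state=0)
predictor_1.fit(X_demo, y_demo, sample_weight=w1)
missed = predictor_1.predict(X_demo) != y_demo

# Weighted training error, clipped to keep the logarithm finite
err_1 = np.clip(w1[missed].sum() / w1.sum(), 1e-10, 1 - 1e-10)

# Coefficient of Predictor 1 (assumed formula), shrunk by the learning rate
eta = 1.0
alpha_1 = eta * np.log((1 - err_1) / err_1)

# W^(2): boost the weights of the misclassified instances, then normalize
w2 = w1 * np.exp(alpha_1 * missed)
w2 /= w2.sum()

print('weighted error: {:.3f}, alpha: {:.3f}'.format(err_1, alpha_1))
print('weight of a missed instance: {:.5f}'.format(w2[missed].max()))
print('weight of a correct instance: {:.5f}'.format(w2[~missed].max()))
```

Predictor 2 would then be fit with `sample_weight=w2`, which is what pushes it toward the instances Predictor 1 got wrong.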


### Learning Rate

<img src='https://drive.google.com/uc?export=view&id=1Q2UOUpPFfGQGSFUjT0hNB0H8jca7mRcD'>

- An important parameter used in training is the learning rate, $\eta$, which is between 0 and 1
- $\eta$ is used to shrink the coefficient $\alpha$ of a trained predictor
- There is a trade-off between $\eta$ and the number of estimators
    - A smaller value of $\eta$, should be compensated by a greater number of estimators
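
A minimal sketch of how this trade-off can be explored with the `learning_rate` and `n_estimators` parameters of `AdaBoostClassifier`. The synthetic data and the three settings are assumptions chosen for illustration; which setting scores best depends on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# (learning rate, number of estimators) pairs to compare
settings = [(1.0, 50), (0.1, 50), (0.1, 500)]
test_accuracy = {}

for eta, n_trees in settings:
    clf = AdaBoostClassifier(
        n_estimators=n_trees, learning_rate=eta, random_state=0
    )
    clf.fit(X_tr, y_tr)
    test_accuracy[(eta, n_trees)] = clf.score(X_te, y_te)
    print('eta={}, n_estimators={}: accuracy={:.3f}'.format(
        eta, n_trees, test_accuracy[(eta, n_trees)]))
```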

### AdaBoost: Prediction

- Classification
  - Weighted majority voting
  - In `sklearn`: `AdaBoostClassifier`
- Regression
  - Weighted average
  - In `sklearn`: `AdaBoostRegressor`
- Individual predictors need not be CARTs, but CARTs are used most commonly in boosting because of their high variance
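
The two combination rules can be sketched with plain NumPy. The votes, predictions and coefficients below are made-up numbers, and real implementations may differ in detail from this simplified version.

```python
import numpy as np

# Hypothetical coefficients (alpha) of three trained predictors
alphas = np.array([0.9, 0.3, 0.4])

# Classification: each row holds one predictor's votes in {-1, +1}
votes = np.array([
    [ 1, -1,  1, -1],
    [ 1,  1, -1, -1],
    [-1,  1,  1, -1],
])
# Weighted majority vote: sign of the alpha-weighted sum of votes
y_vote = np.sign(alphas @ votes).astype(int)
print(y_vote)  # [ 1 -1  1 -1]

# Regression: each row holds one predictor's numeric predictions
preds = np.array([
    [20.0, 30.0],
    [22.0, 28.0],
    [30.0, 25.0],
])
# Weighted average of the predictions, using alpha as the weights
y_avg = np.average(preds, axis=0, weights=alphas)
print(y_avg)  # [22.875 28.375]
```

For the second instance, two of the three predictors vote `+1`, but the weighted vote is `-1` because the dissenting predictor carries the largest coefficient.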

### AdaBoost Classification in sklearn (Breast-Cancer dataset)

In [6]:
X = breast.drop('diagnosis', axis=1)
y = breast['diagnosis']

SEED = 1

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=SEED)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((398, 30), (171, 30), (398,), (171,))

In [7]:
y_train.value_counts()

| diagnosis | count |
|-----------|-------|
| 0         | 250   |
| 1         | 148   |


In [8]:
# Instantiate a classification tree
dt = DecisionTreeClassifier(max_depth=1, random_state=SEED)

# Instantiate an AdaBoost classifier
adab_clf = AdaBoostClassifier(
    estimator=dt,
    n_estimators=100
    # default learning_rate=1.0
)

adab_clf

In [9]:
# fit to the training set
adab_clf.fit(X_train, y_train)

In [10]:
# predict the test-set probabilities of the positive class
y_pred_proba = adab_clf.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.31252831, 0.69937093, 0.73079839, 0.25914854, 0.19488555,
       0.18646731, 0.62441114, 0.1868014 , 0.36967934, 0.26235342,
       0.33168798, 0.27627661, 0.50078076, 0.34104086, 0.20481819,
       0.80588115, 0.38606607, 0.47489563, 0.30033296, 0.38015609,
       0.75912557, 0.35680566, 0.48251931, 0.75985749, 0.23630788,
       0.36246375, 0.27362481, 0.75380339, 0.42785663, 0.75796596,
       0.30137037, 0.27451955, 0.28483756, 0.49582039, 0.41581003,
       0.35576117, 0.27171864, 0.40252643, 0.19344355, 0.23668538,
       0.4021515 , 0.19401584, 0.74158329, 0.26774575, 0.78460711,
       0.65983035, 0.221222  , 0.67809901, 0.38191409, 0.75929706,
       0.30026042, 0.6633022 , 0.36078822, 0.24885131, 0.70109777,
       0.81045246, 0.78321946, 0.31790319, 0.79566173, 0.62233916,
       0.47346924, 0.2994392 , 0.27813112, 0.71864862, 0.21990377,
       0.23824689, 0.32751413, 0.65589167, 0.21101589, 0.79828763,
       0.75412903, 0.41661285, 0.25221401, 0.74880886, 0.66544

- Note that the `roc_auc_score` is a good metric for imbalanced datasets

In [11]:
# evaluate the test-set roc_auc_score
# - can range from 0 to 1
# - 0.9 to 1.0 is excellent
# - 0.8 to 0.9 is good
# - 0.5 is random guessing
adab_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
adab_clf_roc_auc_score.item()

0.9856892523364486
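
To make the score easier to interpret, here is a small toy example with made-up labels and scores. It relies on the standard reading of ROC AUC as the fraction of (positive, negative) pairs in which the positive instance receives the higher score; the pairwise count below ignores ties, which do not occur in this data.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (made up for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Same quantity by counting correctly ranked (positive, negative) pairs
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = list(product(pos, neg))
auc_by_pairs = sum(p > n for p, n in pairs) / len(pairs)

print(auc, auc_by_pairs)  # 0.75 0.75
```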

### Define the AdaBoost classifier

In [12]:
# preprocessing
liver = liver.dropna()
liver = liver.copy()
liver['Is_male'] = (liver['Gender'] == 'Male').astype(int)
liver['Dataset'] = (liver['Dataset'] == 1).astype(int)
X = liver.drop(['Gender', 'Dataset'], axis=1)
y = liver['Dataset']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((463, 10), (116, 10), (463,), (116,))

In [14]:
# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Instantiate ada
ada = AdaBoostClassifier(estimator=dt, n_estimators=180, random_state=1)

ada

In [15]:
ada.fit(X_train, y_train)

In [16]:
# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.71140423, 0.54065295, 0.57991289, 0.71008917, 0.60927082,
       0.66037611, 0.45595766, 0.56382698, 0.53410574, 0.50469654,
       0.72196238, 0.52084162, 0.56511164, 0.66592374, 0.5224517 ,
       0.61265821, 0.51280681, 0.67288093, 0.52025069, 0.56469861,
       0.52156516, 0.61794953, 0.57736685, 0.55107369, 0.50096233,
       0.56460195, 0.5402022 , 0.43013395, 0.56389003, 0.48514937,
       0.51119877, 0.56909178, 0.53895322, 0.49086014, 0.58015109,
       0.63654435, 0.60588982, 0.57088322, 0.5831748 , 0.49916874,
       0.67765929, 0.53975699, 0.56610979, 0.57434826, 0.52030628,
       0.56489378, 0.52252853, 0.68751351, 0.65949685, 0.51554278,
       0.58066781, 0.53097236, 0.57769292, 0.54819526, 0.62343819,
       0.51886954, 0.49817948, 0.55666844, 0.53045343, 0.49763336,
       0.49555433, 0.53996365, 0.53668678, 0.61159226, 0.5276606 ,
       0.57208751, 0.67011321, 0.68016478, 0.58073117, 0.53101619,
       0.51940123, 0.50140157, 0.47656453, 0.45490326, 0.66622

In [17]:
# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))

ROC AUC score: 0.73


## Gradient Boosting (GB)

- **Gradient Boosting** is a popular boosting algorithm that has a proven track record of winning many machine learning competitions

### Gradient Boosted Trees

- Sequential correction of predecessor's errors
- In contrast to **AdaBoost**, the weights of the training instances are not tweaked
  - Instead, each predictor is trained using the residual errors of its predecessor as labels
- **Gradient Boosted Trees**: a CART is used as the base learner

### Gradient Boosted Trees for Regression: Training

<img src='https://drive.google.com/uc?export=view&id=1s2V_GA3LB9ijgieGYuvzDSZ_qzJ0cErP'>

### Shrinkage

- **Shrinkage** is an important parameter used in training gradient boosted trees
- Shrinkage means that the prediction of each tree in the ensemble is shrunk by multiplying it by a learning rate, $\eta$, which is a number between 0 and 1
- Similar to **Adaboost**, there is a trade-off between $\eta$ and the number of estimators
  - Decreasing the learning rate needs to be compensated by increasing the number of estimators

<img src='https://drive.google.com/uc?export=view&id=1P_XrCfXCslWRTTZZGgWZCy7SzoOdJSgM'>
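
A minimal sketch of this trade-off using the `learning_rate` and `n_estimators` parameters of `GradientBoostingRegressor`. The synthetic data from `make_regression` and the chosen settings are assumptions for illustration. The comparison uses training-set RMSE only, so it shows how far the ensemble has fit the data, not how well it generalizes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# (learning rate, number of estimators) pairs to compare
settings = [(0.1, 100), (0.01, 100), (0.01, 1000)]
train_rmse = {}

for eta, n_trees in settings:
    model = GradientBoostingRegressor(
        learning_rate=eta, n_estimators=n_trees, max_depth=1, random_state=0
    )
    model.fit(X_demo, y_demo)
    mse = mean_squared_error(y_demo, model.predict(X_demo))
    train_rmse[(eta, n_trees)] = mse ** (1/2)
    print('eta={}, n_estimators={}: train RMSE={:.2f}'.format(
        eta, n_trees, train_rmse[(eta, n_trees)]))
```

With the smaller learning rate, 100 stumps leave a larger training error than they do at the higher rate, and adding more estimators closes the gap.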

### Gradient Boosted Trees: Prediction

- Regression
  - $y_{\text{pred}} = y_1 + \eta r_1 + \cdots + \eta r_N$
  - `sklearn`: `GradientBoostingRegressor`
- Classification
  - `sklearn`: `GradientBoostingClassifier`
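
The regression formula can be sketched by hand with `DecisionTreeRegressor`. This is a simplified illustration under stated assumptions, not how `GradientBoostingRegressor` is implemented: the first tree is fit on the labels, every later tree is fit on the current residuals, and each residual prediction is shrunk by $\eta$. The synthetic data and the helper `predict` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

eta = 0.1      # learning rate (shrinkage)
n_trees = 50   # number of residual trees after the first tree

# Tree 1 is trained on the original labels and gives y_1
first_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
first_tree.fit(X_demo, y_demo)
ensemble_pred = first_tree.predict(X_demo)

residual_trees = []
for _ in range(n_trees):
    # Residual errors of the ensemble built so far
    residuals = y_demo - ensemble_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    residual_trees.append(tree)
    # Add the shrunken residual prediction to the running prediction
    ensemble_pred += eta * tree.predict(X_demo)

def predict(X_new):
    """y_pred = y_1 + eta * r_1 + ... + eta * r_N"""
    shrunk = eta * sum(t.predict(X_new) for t in residual_trees)
    return first_tree.predict(X_new) + shrunk

mse_first = np.mean((y_demo - first_tree.predict(X_demo)) ** 2)
mse_ensemble = np.mean((y_demo - predict(X_demo)) ** 2)
print('Train MSE, first tree only: {:.4f}'.format(mse_first))
print('Train MSE, full ensemble: {:.4f}'.format(mse_ensemble))
```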

### Gradient Boosting in `sklearn` (auto dataset)

In [18]:
preprocessor = make_column_transformer(
    (OneHotEncoder(sparse_output=False, drop='first'), ['origin']),
    remainder='passthrough',
    force_int_remainder_cols=False
)

In [19]:
X = auto.drop('mpg', axis=1)
y = auto['mpg']

index = X.index

X = preprocessor.fit_transform(X)
X = pd.DataFrame(X,
                 columns=preprocessor.get_feature_names_out(),
                 index = index)

X.head()

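|   | onehotencoder__origin_Europe | onehotencoder__origin_US | remainder__displ | remainder__hp | remainder__weight | remainder__accel | remainder__size |
|---|------------------------------|--------------------------|------------------|---------------|-------------------|------------------|-----------------|
| 0 | 0.0 | 1.0 | 250.0 | 88.0 | 3139.0 | 14.5 | 15.0 |
| 1 | 0.0 | 1.0 | 304.0 | 193.0 | 4732.0 | 18.5 | 20.0 |
| 2 | 0.0 | 0.0 | 91.0 | 60.0 | 1800.0 | 16.4 | 10.0 |
| 3 | 0.0 | 1.0 | 250.0 | 98.0 | 3525.0 | 19.0 | 15.0 |
| 4 | 1.0 | 0.0 | 97.0 | 78.0 | 2188.0 | 15.8 | 10.0 |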


In [20]:
SEED = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=SEED
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((274, 7), (118, 7), (274,), (118,))

In [21]:
# `gbt` consists of 300 decision stumps
gbt = GradientBoostingRegressor(
    n_estimators=300,
    max_depth=1,
    random_state=SEED
)
gbt

In [22]:
gbt.fit(X_train, y_train)

In [23]:
y_pred = gbt.predict(X_test)
y_pred

array([33.68725603, 25.27384721, 17.93001557, 17.32484425, 15.37368677,
       25.5519115 , 21.42284454, 22.81194823, 22.39914352, 12.7594566 ,
       34.02411731, 35.70323209, 26.53399047, 13.20665951, 23.83550207,
       27.0107586 , 23.83550207, 18.06845503, 28.32339863, 22.97380871,
       23.66596624, 29.60677106, 28.85625419, 20.14111128, 24.90537577,
       25.76070167, 15.80118815, 34.90970616, 26.94185008, 25.4305163 ,
       14.99613899, 24.18657467, 17.50275567, 35.70826846, 33.89694726,
       28.14029911, 25.5519115 , 19.77002255, 33.68725603, 32.22617914,
       16.20904776, 16.20904776, 15.37368677, 32.64865757, 19.60381245,
       34.71086031, 25.4305163 , 25.4305163 , 23.66596624, 25.34049835,
       34.60446656, 35.21572421, 34.34417825, 33.91605361, 25.76070167,
       15.97051441, 34.29213296, 26.85408531, 14.02221748, 17.71667137,
       29.15064424, 18.77388096, 24.70569008, 35.78143811, 14.40998067,
       26.39662304, 14.83245223, 25.4305163 , 13.92583509, 27.80

In [24]:
rmse_test = mean_squared_error(y_test, y_pred)**(1/2)
rmse_test

4.082222521046934

### Define the GB regressor

```python
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate gb
gb = GradientBoostingRegressor(
    max_depth=4,
    n_estimators=200,
    random_state=2
)
```

### Train the GB regressor

```python
# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)
```

### Evaluate the GB regressor

```python
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute MSE
mse_test = MSE(y_test, y_pred)

# Compute RMSE
rmse_test = mse_test ** (1/2)

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))
```

## Stochastic Gradient Boosting (SGB)

### Gradient Boosting: Cons

- Involves an exhaustive search procedure
- Each CART in the ensemble is trained to find the best split points and features
- May lead to CARTs using the same split points and maybe the same features
- The **Stochastic Gradient Boosting** algorithm mitigates this problem

### Stochastic Gradient Boosting

- Each tree is trained on a random subset of rows of the training data
- The sampled instances (40% to 80% of the training set) are sampled without replacement
- Features are sampled (without replacement) when choosing split points
- Result: further ensemble diversity
- Effect: adding further variance to the ensemble of trees

### Stochastic Gradient Boosting: Training

<img src='https://drive.google.com/uc?export=view&id=1qC5195Et4aLtFrJCkL79sboObLhDcMKV'>

- Instead of providing all the training instances to a tree, only a fraction of these instances are provided through sampling without replacement
- The sampled data is then used to train a tree. However, not all features are considered when a split is made; only a randomly selected fraction of them is used
- Once a tree is trained, predictions are made and the residual errors can be computed
- These residual errors are multiplied by the learning rate, $\eta$, and are fed to the next tree in the ensemble
- This procedure is repeated sequentially until all the trees in the ensemble are trained
- The prediction procedure for stochastic gradient boosting is similar to that of gradient boosting
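
The procedure above can be sketched by hand as follows. This is a simplified illustration rather than scikit-learn's implementation: it assumes the ensemble starts from the mean of the labels, draws 80% of the rows without replacement for each tree, and uses the `max_features` parameter of `DecisionTreeRegressor` to restrict the features considered at each split. The synthetic data and all names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

eta, subsample, n_trees = 0.1, 0.8, 100
n_rows = int(subsample * len(y_demo))  # rows given to each tree

# Start from a constant prediction (assumption: the mean of the labels)
pred = np.full(len(y_demo), y_demo.mean())
mse_start = np.mean((y_demo - pred) ** 2)

trees = []
for i in range(n_trees):
    # Sample a fraction of the training rows without replacement
    rows = rng.choice(len(y_demo), size=n_rows, replace=False)
    residuals = y_demo - pred
    # Only a random fraction of the features is considered at each split
    tree = DecisionTreeRegressor(max_depth=1, max_features=0.4, random_state=i)
    tree.fit(X_demo[rows], residuals[rows])
    trees.append(tree)
    # Shrink the tree's residual prediction and update the ensemble
    pred += eta * tree.predict(X_demo)

mse_final = np.mean((y_demo - pred) ** 2)
print('Train MSE before boosting: {:.3f}'.format(mse_start))
print('Train MSE after boosting: {:.3f}'.format(mse_final))
```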

### Stochastic Gradient Boosting in sklearn (auto dataset)

In [25]:
preprocessor = make_column_transformer(
    (OneHotEncoder(sparse_output=False, drop='first'), ['origin']),
    remainder='passthrough',
    force_int_remainder_cols=False
)

In [26]:
X = auto.drop('mpg', axis=1)
y = auto['mpg']

index = X.index

X = preprocessor.fit_transform(X)
X = pd.DataFrame(X,
                 columns=preprocessor.get_feature_names_out(),
                 index = index)

X.head()

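|   | onehotencoder__origin_Europe | onehotencoder__origin_US | remainder__displ | remainder__hp | remainder__weight | remainder__accel | remainder__size |
|---|------------------------------|--------------------------|------------------|---------------|-------------------|------------------|-----------------|
| 0 | 0.0 | 1.0 | 250.0 | 88.0 | 3139.0 | 14.5 | 15.0 |
| 1 | 0.0 | 1.0 | 304.0 | 193.0 | 4732.0 | 18.5 | 20.0 |
| 2 | 0.0 | 0.0 | 91.0 | 60.0 | 1800.0 | 16.4 | 10.0 |
| 3 | 0.0 | 1.0 | 250.0 | 98.0 | 3525.0 | 19.0 | 15.0 |
| 4 | 1.0 | 0.0 | 97.0 | 78.0 | 2188.0 | 15.8 | 10.0 |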


In [27]:
SEED = 1

X_train, X_test, y_train, y_test = train_test_split(
    X,y,
    test_size=0.3,
    random_state=SEED
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((274, 7), (118, 7), (274,), (118,))

In [28]:
sgbt = GradientBoostingRegressor(
    max_depth=1,
    subsample=0.8,      # each tree is trained on a random 80% of the rows
    max_features=0.2,   # each split considers 20% of the available features
    n_estimators=300,
    random_state=SEED
)

sgbt

In [29]:
sgbt.fit(X_train, y_train)

In [30]:
y_pred = sgbt.predict(X_test)
y_pred

array([33.86064382, 25.87494714, 18.08095657, 16.70402756, 15.26405882,
       25.18155645, 21.43094287, 20.36906222, 22.38125035, 12.52297002,
       32.37911702, 34.54880487, 25.88190187, 13.53378635, 23.18057728,
       27.69953995, 24.28310123, 17.57017652, 28.87202929, 23.06158952,
       24.02615985, 28.62171321, 28.76858709, 19.5120903 , 26.30943489,
       25.98059523, 16.49903335, 33.90770636, 26.39268193, 25.26544947,
       15.33389632, 24.81741139, 16.94546841, 36.27784698, 32.0109485 ,
       27.3091492 , 25.5716921 , 19.27330081, 33.66193148, 32.84151746,
       17.0098134 , 17.0098134 , 15.4499694 , 32.16803724, 19.86647341,
       35.2468379 , 25.26544947, 25.26544947, 24.16337608, 25.74493055,
       32.83473349, 35.76706693, 34.71829014, 34.33850322, 25.90449066,
       16.43766964, 31.11264436, 26.1056531 , 13.9535503 , 16.70402756,
       29.19241521, 19.55668752, 24.34862623, 35.24393204, 14.97273316,
       25.8109949 , 15.16209413, 25.5716921 , 14.11970611, 26.51

In [31]:
rmse_test = mean_squared_error(y_test, y_pred) ** (1/2)

rmse_test

4.066995868649632

### Regression with SGB

```python
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(
    max_depth=4,
    subsample=0.9,
    max_features=0.75,
    n_estimators=200,
    random_state=2
)
```

### Train the SGB regressor

```python
# Fit sgbr to the training set
sgbr.fit(X_train, y_train)

# Predict test set labels
y_pred = sgbr.predict(X_test)
```

### Evaluate the SGB regressor

```python
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute test set MSE
mse_test = MSE(y_test, y_pred)

# Compute test set RMSE
rmse_test = mse_test ** (1/2)

# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))
```