### Boosting Decisions

In this notebook, we build some understanding about the boosting process using Adaptive Boosting and Gradient Boosted Trees methods.

The final application will be the photometric redshift problem introduced in Chapter 6; however, specific solutions for it are explored in the notebook "Flavors of Boosting".

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)


font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_validate, KFold, cross_val_predict, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingRegressor

Reference for comparison of weak learners as base estimators:

https://link.springer.com/chapter/10.1007/978-3-642-20042-7_32

Implementation from scratch (with sample weights - should check if it's an original source)

https://xavierbourretsicotte.github.io/AdaBoost.html

or this, but I think it's inspired by the above:

https://geoffruddock.com/adaboost-from-scratch-in-python/

### We can read the photometric redshifts data set with the selections applied in the previous notebook.

In [None]:
sel_features = pd.read_csv('../data/sel_features.csv', sep = '\t')

In [None]:
sel_target = pd.read_csv('../data/sel_target.csv')

In [None]:
sel_features.shape

In [None]:
sel_target.values.ravel() #changes shape to 1d row-like array

### We can try our usual benchmarking with AdaBoost, using default values.

In [None]:
model = AdaBoostRegressor()

In [None]:
ypred = cross_val_predict(model, sel_features,sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle=True, random_state=10))

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(sel_target,ypred, s =10)
plt.ylim(0,3)
plt.xlim(0,3)

### This is where I started wondering whether the boosting process was working!

In [None]:
model.get_params()

Note (from sklearn docs): If None, then the base estimator is DecisionTreeRegressor(max_depth=3).
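
To double-check the default quoted above, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the photo-z data set). It fits a default AdaBoostRegressor and reads the depth setting of the trees it built from the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data, only for illustration (not the photo-z data set)
rng_demo = np.random.RandomState(0)
X_demo = rng_demo.uniform(0, 4, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng_demo.normal(0, 0.1, 200)

default_model = AdaBoostRegressor(random_state=0).fit(X_demo, y_demo)

# estimators_ holds the fitted base learners; each one carries its max_depth setting
depths = {tree.max_depth for tree in default_model.estimators_}
print('max_depth of the default base learners:', depths)
```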

### I decided to investigate the role of different parameters in the performance.

#### Changing max depth in base estimators: I tried trees with a maximum depth of 3, 6, and 10.

In [None]:
plt.figure(figsize=(12,4))

for i, depth in enumerate([3,6,10]):
    plt.subplot(1,3,i+1)
    # Base estimator passed positionally (keyword: base_estimator before scikit-learn 1.2, estimator from 1.2 on)
    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=depth))
    ypred = cross_val_predict(model, sel_features,sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle=True, random_state=10))
    plt.scatter(sel_target,ypred, s =10, c = 'teal')
    plt.title('Max depth = '+str(depth))
    plt.xlabel('True redshift')
    if i == 0:
        plt.ylabel('Estimated redshift')
    plt.ylim(0,2)
    plt.xlim(0,2)
    
    plt.tight_layout()

#plt.savefig('AdaB_z.png')
#    plt.axes('equal')
#    plt.legend()

#### Changing N of base estimators (stages participating in boosting).

In [None]:
plt.figure(figsize=(7,7))

plt.ylim(0,2)
plt.xlim(0,2)

for nest in [5,10,20]:
    # Base estimator passed positionally (keyword: base_estimator before scikit-learn 1.2, estimator from 1.2 on)
    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), n_estimators=nest)
    ypred = cross_val_predict(model, sel_features,sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle=True, random_state=10))
    plt.scatter(sel_target,ypred, s =10, label = 'N est = '+str(nest))
plt.legend()

#### Changing loss function

In [None]:
plt.figure(figsize=(7,7))

plt.ylim(0,2)
plt.xlim(0,2)

for loss in ['linear','square']:
    # Base estimator passed positionally (keyword: base_estimator before scikit-learn 1.2, estimator from 1.2 on)
    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), loss = loss)
    ypred = cross_val_predict(model, sel_features,sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle=True, random_state=10))
    plt.scatter(sel_target,ypred, s =10, label = 'Loss = '+loss)
plt.legend();

### The conclusion of this process was that for AdaBoost at least, the base estimator needs to be "strong enough" in order for the boosting process to succeed.
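
One way to look at "strong enough" more closely is to inspect what a fitted AdaBoostRegressor records about each stage. The sketch below is an illustration under assumptions: it uses a synthetic data set, and it relies on my understanding that scikit-learn's AdaBoost.R2 implementation stores a weighted error per stage in `estimator_errors_` and can stop early, so that `estimators_` may hold fewer than `n_estimators` trees.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration
rng_demo = np.random.RandomState(1)
X_demo = np.linspace(0, 4, 100)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + np.sin(6 * X_demo).ravel() + rng_demo.normal(0, 0.1, 100)

summary = {}
for depth in [1, 3, 6]:
    ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=depth),
                            n_estimators=30, random_state=0).fit(X_demo, y_demo)
    n_kept = len(ada.estimators_)   # stages actually kept (at most n_estimators)
    summary[depth] = (n_kept, ada.estimator_errors_[:n_kept].mean())
    print('max_depth =', depth, '| stages kept:', n_kept,
          '| mean recorded error:', np.round(summary[depth][1], 3))
```

If the recorded errors of a shallow tree sit close to 0.5, that is a hint that the base learner is too weak for the boosting rounds to add much.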

## Simple regression toy model
### Inspired by

https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html#sphx-glr-auto-examples-ensemble-plot-adaboost-regression-py

#### This is what happens if max_depth = 3

In [None]:
# Create the dataset
plt.figure(figsize=(15,10))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=3)

# Fit regression model, saving each "stage"

regr_1 = weakl
regr_2 = AdaBoostRegressor(weakl,
                          n_estimators=2, random_state=rng)

regr_3 = AdaBoostRegressor(weakl,
                          n_estimators=3, random_state=rng)

regr_4 = AdaBoostRegressor(weakl,
                          n_estimators=4, random_state=rng)

regr_10 = AdaBoostRegressor(weakl,
                          n_estimators=10, random_state=rng)

regr_100 = AdaBoostRegressor(weakl,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)

# r2_score expects the arguments in the order (y_true, y_pred)
for yp in [y_1,y_2,y_3,y_4,y_10]:
    print('r2 score: ', np.round(metrics.r2_score(y,yp),3))

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
plt.plot(X, y_2, "--r", label="n_estimators=2", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
#plt.plot(X, y_4, ":m", label="n_estimators=4", linewidth=1)
#plt.plot(X, y_10, "-k", label="n_estimators=10", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.title("AdaBoost Regression, max depth = 3", fontsize = 14)
plt.legend(fontsize=10);
#plt.tight_layout()
#plt.savefig("AdaBoost_3.png")

#### This is what happens if max_depth = 6

In [None]:
# Create the dataset
plt.figure(figsize=(15,10))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=6)

# Fit regression model, saving each "stage"
regr_1 = weakl
regr_2 = AdaBoostRegressor(weakl,
                          n_estimators=2, random_state=rng)

regr_3 = AdaBoostRegressor(weakl,
                          n_estimators=3, random_state=rng)

regr_4 = AdaBoostRegressor(weakl,
                          n_estimators=4, random_state=rng)

regr_10 = AdaBoostRegressor(weakl,
                          n_estimators=10, random_state=rng)

regr_100 = AdaBoostRegressor(weakl,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)

# r2_score expects the arguments in the order (y_true, y_pred)
for yp in [y_1,y_2,y_3,y_4,y_10]:
    print('r2 score: ', np.round(metrics.r2_score(y,yp),3))

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
plt.plot(X, y_2, "--r", label="n_estimators=2", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
#plt.plot(X, y_4, ":m", label="n_estimators=4", linewidth=1)
#plt.plot(X, y_10, "-k", label="n_estimators=10", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.title("AdaBoost Regression, max depth = 6", fontsize = 14)
plt.legend(fontsize=10);
#plt.tight_layout()
#plt.savefig("AdaBoost_6.png")
plt.show()

### Learning Check-in
    
Based on the figure above, is the boosting process worth it for AdaBoost with a base learner tree with max depth = 3? How about one with max depth = 6?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
It is not worth it for the first case (max depth = 3), as we see that the r2 scores don't improve if we stack more estimators. It may be worth it for max depth = 6, but the scores are essentially stable, so further investigation may be needed.
```
    
</p>
</details>

### Now that we are convinced, let's go back to photo-zs.

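I will create a train/test split because I need to call the "staged_predict" method, which returns the predictions after each boosting stage; "cross_val_predict" only returns the final predictions.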
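
As a small illustration of what staged predictions look like, the sketch below uses synthetic data (all names ending in `_demo` and the `Xd_`/`yd_` splits are made up for this example). It collects the output of `staged_predict` into a list and scores each stage.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration
rng_demo = np.random.RandomState(1)
X_demo = rng_demo.uniform(0, 4, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng_demo.normal(0, 0.1, 300)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=10, random_state=0)
demo_model.fit(Xd_train, yd_train)

# staged_predict is a generator: one array of test predictions per boosting stage
staged_demo = list(demo_model.staged_predict(Xd_test))
staged_r2 = [metrics.r2_score(yd_test, yp) for yp in staged_demo]

print('number of stages:', len(staged_demo))
print('R2 after each stage:', np.round(staged_r2, 3))
```

The last element of the list should coincide with what `predict` returns for the full ensemble.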

In [None]:
X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

In [None]:
#begin with very weak learner (r2 = 0.4)

# Base estimator passed positionally (keyword: base_estimator before scikit-learn 1.2, estimator from 1.2 on)
model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                  n_estimators=30)

In [None]:
model.fit(X_train, y_train)

We can plot the R2 score and the Spearman correlation coefficient between true and predicted values as a function of the number of stages/iterations, beginning with a weak base learner.
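
The two metrics answer different questions, which is why it helps to track both. In the made-up example below (the arrays are invented for illustration), the predictions preserve the ordering of the true values perfectly but are strongly compressed, so the Spearman coefficient is 1 while the R2 score is low.

```python
import numpy as np
from scipy import stats
from sklearn import metrics

y_true_demo = np.linspace(0.1, 2.0, 50)
# Made-up predictions: perfectly ordered, but compressed into a narrow range
y_pred_demo = 1.0 + 0.1 * y_true_demo

rho = stats.spearmanr(y_true_demo, y_pred_demo)[0]   # rank correlation, insensitive to the compression
r2 = metrics.r2_score(y_true_demo, y_pred_demo)      # argument order: (y_true, y_pred)

print('Spearman r:', np.round(rho, 3))
print('R2 score:', np.round(r2, 3))
```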

In [None]:
# Compute the staged predictions once; AdaBoost can stop early, so count the stages actually available
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2 score')

plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.ylim(0,1.0)

plt.title('Max depth = 3')
plt.legend();

### The scores don't seem to improve as we stack more estimators.

We can try again with a stronger base learner (max_depth = 6).

In [None]:
n_estimators = 30

model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=6),
                  n_estimators=n_estimators)

X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

model.fit(X_train, y_train)

# Compute the staged predictions once instead of once per iteration
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2')

plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.title('Base estimator, max depth = 6')
plt.legend();

And an even stronger base learner (max_depth = 10).

In [None]:
n_estimators = 30

model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                  n_estimators=n_estimators)

X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

model.fit(X_train, y_train)

# Compute the staged predictions once instead of once per iteration
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2')

plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.title('Base estimator, max depth = 10')

plt.legend();

### Let's combine all in one figure.

In [None]:
plt.figure(figsize=(12,4))

n_estimators = 30

for i, md in enumerate([3,6,10]):
    
    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=md),
                  n_estimators=n_estimators)

    model.fit(X_train,y_train)
    
    # Compute the staged predictions once per model
    staged = list(model.staged_predict(X_test))
    
    plt.subplot(1,3,i+1)

    plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2 score', c = 'steelblue')

    plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r', c = 'fuchsia')

    plt.xlabel('Iteration')

    plt.ylim(0,1.0)

    plt.title('Max depth = '+str(md)+', AdaBoost')
    
    if i == 2:
        plt.legend();
    
    plt.tight_layout()

plt.savefig('AdaB_performance.png')

### Learning Check-in
    
Based on the figure above, would you recommend using AdaBoost with a base learner with max depth = 6, and 30 iterations, or with max depth = 10, and 10 iterations?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
The R2 scores and the correlation between true and predicted values are both higher for the case of max depth = 10 and 10 iterations, so that would be the correct choice.
```
    
</p>
</details>

We partly have an answer from the third panel of the figure above, but we could also ask whether we should keep boosting (i.e., whether adding more stages is beneficial).
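
One hedged way to turn this question into a number is to locate the stage with the best held-out score and see how much is gained after the first few stages. The sketch below does this on synthetic data (all `_demo`, `Xd_`, and `yd_` names are made up for illustration). In a real analysis the stage should be chosen on a validation set rather than on the test set.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration
rng_demo = np.random.RandomState(2)
X_demo = rng_demo.uniform(0, 4, size=(400, 1))
y_demo = np.sin(X_demo).ravel() + np.sin(6 * X_demo).ravel() + rng_demo.normal(0, 0.1, 400)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

long_model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=10), n_estimators=60, random_state=0)
long_model.fit(Xd_train, yd_train)

# Held-out R2 after each stage
scores = np.array([metrics.r2_score(yd_test, yp) for yp in long_model.staged_predict(Xd_test)])

best_stage = int(np.argmax(scores)) + 1                          # stages are counted from 1
gain_after_10 = scores.max() - scores[min(9, len(scores) - 1)]   # improvement beyond stage 10

print('best stage:', best_stage)
print('R2 gained after stage 10:', np.round(gain_after_10, 4))
```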

In [None]:
#Shall we keep boosting? (max_depth = 10)

n_estimators = 60

model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                  n_estimators=n_estimators)

X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

model.fit(X_train, y_train)

# Compute the staged predictions once instead of once per iteration
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2')

plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.title('Base estimator, max depth = 10')

plt.legend();


### Conclusion: stacking learners that are too weak doesn't help.

### Would this be true also for Gradient Boosted Trees algorithms?

There is only one way to find out!

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

The parameters depend on the particular implementation.

In the sklearn formulation, the parameters of each tree are essentially the same as those we have for Random Forests; additionally, we have the "learning_rate" parameter, which dictates how much each tree contributes to the final estimator, and the "subsample" parameter, which allows one to fit each tree on a fraction (< 1.0) of the samples.
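
To make the role of these two parameters concrete, here is a minimal sketch on synthetic data (the `_demo` arrays and the specific parameter values are choices made for this illustration). With the same small number of shallow trees, a larger learning rate lets each tree contribute more, so the training fit improves faster. Setting `subsample` below 1.0 fits each tree on a random fraction of the samples.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, only for illustration
rng_demo = np.random.RandomState(1)
X_demo = np.linspace(0, 4, 100)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + np.sin(6 * X_demo).ravel() + rng_demo.normal(0, 0.1, 100)

# Same number of shallow trees, two different learning rates
slow = GradientBoostingRegressor(max_depth=3, n_estimators=5, learning_rate=0.1,
                                 random_state=0).fit(X_demo, y_demo)
fast = GradientBoostingRegressor(max_depth=3, n_estimators=5, learning_rate=1.0,
                                 random_state=0).fit(X_demo, y_demo)
r2_slow = metrics.r2_score(y_demo, slow.predict(X_demo))
r2_fast = metrics.r2_score(y_demo, fast.predict(X_demo))

# subsample < 1.0: each tree is fit on a random half of the training samples
stochastic = GradientBoostingRegressor(max_depth=3, n_estimators=5, learning_rate=1.0,
                                       subsample=0.5, random_state=0).fit(X_demo, y_demo)
r2_stochastic = metrics.r2_score(y_demo, stochastic.predict(X_demo))

print('training R2, learning_rate=0.1:', np.round(r2_slow, 3))
print('training R2, learning_rate=1.0:', np.round(r2_fast, 3))
print('training R2, learning_rate=1.0, subsample=0.5:', np.round(r2_stochastic, 3))
```

A small learning rate is not a drawback in itself; it usually just needs to be paired with more stages.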


We can check how this works with a weak learner on our toy data set.

In [None]:
# Create the dataset
plt.figure(figsize=(15,12))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=3)

# Fit regression model
regr_1 = weakl
regr_2 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=2, random_state=rng)

regr_3 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=3, random_state=rng)

regr_4 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=4, random_state=rng)

regr_10 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=10, random_state=rng)

regr_100 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)
y_100 = regr_100.predict(X)

# r2_score expects the arguments in the order (y_true, y_pred)
for yp in [y_1,y_2,y_3,y_4,y_10, y_100]:
    print('R2 score: ', np.round(metrics.r2_score(y,yp),3))

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
#plt.plot(X, y_2, "--r", label="n_estimators=2", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
#plt.plot(X, y_4, ":m", label="n_estimators=4", linewidth=1)
plt.plot(X, y_10, "-k", label="n_estimators=10", linewidth=1)
plt.plot(X, y_100, "-c", label="n_estimators=100", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.ylim(-2.5,2.5)
plt.title("Gradient Boosting Regression, max depth = 3", fontsize = 14)
plt.legend(fontsize=14, loc = 'upper right');
#plt.tight_layout()
#plt.savefig("GradBoost_3.png")

### Learning Check-in
    
The R2 scores printed above, which are computed on the training set, are never negative. Shouldn't R2 scores always be positive?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
No, the requirement only holds for the training set. A negative R2 score on the test set (or on the validation set) simply indicates that the model performs worse than a constant prediction equal to the mean of the sample. So that would be a terrible model, but (possibly ;) ) not a coding mistake.
```
    
</p>
</details>
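
A tiny made-up example may help fix the idea in the answer above. Predicting the mean of the sample for every object gives an R2 score of exactly 0, and any prediction that does worse than that gives a negative score. The four values below are invented for illustration.

```python
import numpy as np
from sklearn import metrics

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

# Constant prediction equal to the sample mean: the reference point of the R2 score
r2_mean = metrics.r2_score(y_true_demo, np.full(4, y_true_demo.mean()))

# Predictions in reversed order: worse than the constant prediction
r2_bad = metrics.r2_score(y_true_demo, y_true_demo[::-1])

print(r2_mean)   # 0.0
print(r2_bad)    # -3.0
```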

In [None]:
plt.figure(figsize=(12,4))

n_estimators = 30

for i, md in enumerate([3,6,10]):
    
    model = GradientBoostingRegressor(max_depth=md,
                  n_estimators=n_estimators)

    model.fit(X_train,y_train)
    
    # Compute the staged predictions once per model
    staged = list(model.staged_predict(X_test))
    
    plt.subplot(1,3,i+1)

    plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2 score', c = 'steelblue')

    plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r', c = 'fuchsia')

    plt.xlabel('Iteration')

    plt.ylim(0,1.0)

    plt.title('Max depth = '+str(md)+', GBR')
    
    if i == 2:
        plt.legend();
    
    plt.tight_layout()

plt.savefig('GBR_performance.png')

### Because of the different boosting process, GBT models tend to work well even with weak base learners.
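
To see what "different boosting process" can mean in practice, here is a from-scratch sketch of gradient boosting for the squared-error loss, on a synthetic data set. It is a simplified illustration, not the sklearn implementation: the learning rate of 0.1, the 100 stages, and all variable names are choices made for this example. Each shallow tree is fit to the current residuals and added with a small weight, so even weak trees keep reducing the training error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration
rng_demo = np.random.RandomState(1)
X_demo = np.linspace(0, 4, 100)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + np.sin(6 * X_demo).ravel() + rng_demo.normal(0, 0.1, 100)

learning_rate = 0.1                                # hypothetical choice for this sketch
prediction = np.full_like(y_demo, y_demo.mean())   # stage 0: constant prediction at the mean
train_mse = [np.mean((y_demo - prediction) ** 2)]

for stage in range(100):
    residuals = y_demo - prediction                # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # add a shrunken copy of the new tree
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print('training MSE after 0, 10, 100 stages:',
      np.round([train_mse[0], train_mse[10], train_mse[100]], 4))
```

The training error alone does not say anything about generalization, which is why the figures in this notebook track the scores on a held-out test set.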

We compare AdaBoost and various GBT models on the photometric redshifts problem in the next notebook (FlavorsOfBoosting).