# Boosting
### Definition
- Boosting refers to an ensemble method in which several models are trained sequentially, with each model learning from the errors of its predecessors.
- **Weak learner**: a model that does only slightly better than random guessing (e.g., a CART with a maximum depth of 1).
    - **Stump**: a tree with only one decision node and two leaves. On its own, a stump is a poor classifier.
        - In a forest of stumps built with AdaBoost, some stumps get more weight in the final vote than others (unlike in a random forest, where every tree's vote counts equally).
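
As a small illustration of the stump described above, the sketch below fits a depth-1 tree with scikit-learn and checks its shape. The dataset (`load_breast_cancer`) and the variable names are choices made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Assumption: any labelled dataset works; breast cancer is used for convenience
X, y = load_breast_cancer(return_X_y=True)

# max_depth=1 allows a single split: one decision node and two leaves
stump = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X, y)

n_nodes = stump.tree_.node_count   # total nodes, including leaves
n_leaves = stump.get_n_leaves()    # terminal nodes only
print(f"nodes: {n_nodes}, leaves: {n_leaves}, decision nodes: {n_nodes - n_leaves}")
print(f"training accuracy of a single stump: {stump.score(X, y):.3f}")
```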

### Advantages
- ***Boosting mainly reduces bias rather than variance, so it is aimed at the problem of underfitting the data.***
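
One rough way to see the effect on underfitting is to compare the training error of a single weak learner with that of a boosted ensemble of the same weak learners. The sketch below uses synthetic data (an assumption for illustration), and the exact numbers depend on the seed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# A single stump underfits: it can only predict two constant values
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
mse_stump = mean_squared_error(y, stump.predict(X))

# Boosting 200 stumps sequentially follows the curve much more closely
boosted = GradientBoostingRegressor(n_estimators=200, max_depth=1, random_state=0).fit(X, y)
mse_boosted = mean_squared_error(y, boosted.predict(X))

print(f"training MSE, single stump:   {mse_stump:.4f}")
print(f"training MSE, boosted stumps: {mse_boosted:.4f}")
```

Training error is only a proxy for bias here; a held-out set would be needed to say anything about variance.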

### Types
- **AdaBoost** (Adaptive Boosting): 
    - Combines many weak learners to make classifications.
    - Sequential, not parallel
        - The errors that the first tree makes influence how the second tree is made, and so on.
    - Some trees have more weight in the final vote than others
    - **Steps**:
        1. Give every sample an equal weight (the weight expresses how important the sample is).
        2. Make the first stump in the forest.
            2.1 Find the feature whose split gives the lowest Gini impurity; that feature becomes the decision node of the stump
        3. ...hehe
    - **Parameters**:
        - Learning rate η (eta), typically between 0 and 1
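
The steps above stop after the first stump. As a hedged sketch of what one full boosting round can look like, the code below follows the commonly described discrete AdaBoost update for binary labels: fit a stump on weighted samples, compute its weighted error, turn that into an "amount of say", and reweight the samples. The helper name `adaboost_round`, the dataset, and the way `eta` scales the amount of say are assumptions for illustration, not necessarily what the notes intended.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def adaboost_round(X, y, weights, eta=1.0):
    """One round of discrete AdaBoost (hypothetical helper, binary labels)."""
    # Fit a stump on the weighted samples; the tree picks the split
    # with the lowest weighted Gini impurity
    stump = DecisionTreeClassifier(max_depth=1, criterion="gini")
    stump.fit(X, y, sample_weight=weights)

    # Weighted error: total weight of the misclassified samples
    miss = stump.predict(X) != y
    error = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)

    # "Amount of say" of this stump, scaled by the learning rate eta
    alpha = eta * 0.5 * np.log((1 - error) / error)

    # Raise the weight of misclassified samples, lower the rest, renormalise
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    new_weights /= new_weights.sum()
    return stump, alpha, new_weights


X, y = load_breast_cancer(return_X_y=True)
weights = np.full(len(y), 1 / len(y))  # step 1: equal weight per sample

for i in range(3):
    stump, alpha, weights = adaboost_round(X, y, weights, eta=1.0)
    print(f"round {i + 1}: amount of say = {alpha:.3f}, max sample weight = {weights.max():.4f}")
```

A smaller `eta` shrinks each stump's say and makes the sample weights change more slowly between rounds.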

<br>

- **Gradient Boosting**:
    - Like AdaBoost, Gradient Boosting builds fixed-size trees based on the previous trees' errors, but unlike AdaBoost, each tree can be larger than a stump
    - The maximum number of leaves of each decision tree is usually set between 8 and 32
    - **Cons**:
        - GB involves an exhaustive search procedure
        - Each CART is trained to find the best split points and features
        - This may lead to different CARTs using the same split points and possibly the same features
        - Solution: Stochastic Gradient Boosting, which adds randomness by training each tree on a random subset of the data
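
To make the idea concrete, here is a minimal sketch of gradient boosting for squared-error regression: start from the mean, fit each small tree to the current residuals, and add a shrunken version of its prediction. Setting `subsample` below 1 gives a simple stochastic variant in which each tree sees a random subset of the rows. The helper `fit_gradient_boosting`, its parameter names, and the synthetic data are assumptions for illustration; scikit-learn's implementation is more general and can also sample features per split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=100, eta=0.1, max_leaves=8, subsample=1.0, seed=0):
    """Squared-error gradient boosting (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    n_samples = len(y)
    base = y.mean()                      # initial prediction: the mean
    pred = np.full(n_samples, base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred             # errors of the current ensemble
        # Stochastic variant: each tree is fit on a random subset of the rows
        n_rows = max(1, int(subsample * n_samples))
        rows = rng.choice(n_samples, size=n_rows, replace=False)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=seed)
        tree.fit(X[rows], residuals[rows])
        pred += eta * tree.predict(X)    # shrink each tree's contribution
        trees.append(tree)
    return base, trees, pred


# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

mse_start = np.mean((y - y.mean()) ** 2)
_, _, pred_full = fit_gradient_boosting(X, y, subsample=1.0)
_, _, pred_stoch = fit_gradient_boosting(X, y, subsample=0.8)
mse_full = np.mean((y - pred_full) ** 2)
mse_stoch = np.mean((y - pred_stoch) ** 2)

print(f"training MSE, mean only:             {mse_start:.4f}")
print(f"training MSE, gradient boosting:     {mse_full:.4f}")
print(f"training MSE, stochastic (80% rows): {mse_stoch:.4f}")
```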

<br>

![title](https://drive.google.com/uc?export=view&id=1nIJIe5xsAAoE2SR5VqET0HBzhMjxvZsu)


### AdaBoost Classification Example

In [5]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Seed for reproducibility
seed = 1

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)

# Instantiate a classification tree
dt = DecisionTreeClassifier(max_depth=1, random_state=seed)

# Instantiate an AdaBoost classifier
# Note: `estimator` replaced `base_estimator` in scikit-learn 1.2 (the old name was removed in 1.4)
adb_clf = AdaBoostClassifier(estimator=dt, n_estimators=100)

# Fit model
adb_clf.fit(X_train, y_train)

# Predict the test set probabilities of the positive class
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]

# Evaluate 
adb_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC Score: {adb_roc_auc_score}")

ROC AUC Score: 0.9941588785046729


### Gradient Boosting Regression Example

In [8]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE 

# Import data
file1 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars1.csv'
file2 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars2.csv'
cars1 = pd.read_csv(file1).dropna(how='all', axis=1)
cars2 = pd.read_csv(file2)  
df = pd.concat([cars1, cars2], ignore_index=True, sort=False)

# Split data
seed = 1
X = df[['displacement']].to_numpy()  # double brackets already give a 2D array of shape (n_samples, 1)
y = df['mpg'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# Instantiate the model
gbt = GradientBoostingRegressor(n_estimators=300, max_depth=1, random_state=seed)

# Fit the model
gbt.fit(X_train, y_train)

# Predict
y_pred = gbt.predict(X_test)

# Evaluate
rmse_test = MSE(y_test, y_pred) ** (1/2)
print(f"Test set RMSE: {rmse_test}")


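# Stochastic gradient boosting example
# subsample=0.8: each tree is fit on a random 80% of the training rows
# max_features=0.2 has no effect here because X has a single feature
sgbt = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=seed)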
sgbt.fit(X_train, y_train)
y_sgbt_pred = sgbt.predict(X_test)
rmse_sgbt = MSE(y_test, y_sgbt_pred) ** (1/2)
print(f"Test set STOCHASTIC: {rmse_sgbt}")

Test set RMSE: 3.8167721913223183
Test set STOCHASTIC: 3.7795402434726606
