In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### The effectiveness of gradual learning

Boosting is based on a technique known as gradual learning.

#### Collective learning

The ensemble methods are based on an idea known as collective learning - that is, the wisdom of the crowd. 

- The idea is that the combined prediction of individual models is superior to any of the individual predictions on their own. 

- For collective learning to be efficient, the estimators need to be independent and uncorrelated. 

- In addition, all the estimators are learning the same task, for the same goal: to predict the target variable given the features. 

- Because the estimators are independent, these can be trained in parallel to speed up the model building. 
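
The effect of independence can be illustrated with a small calculation. This is only a sketch under an idealized assumption: every estimator is a binary classifier that is correct with the same probability `p`, and the estimators are fully independent. The helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (n is assumed to be odd)."""
    k_min = n // 2 + 1  # smallest number of correct votes that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Accuracy of the combined vote for growing ensembles of 60%-accurate voters
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With correlated estimators the gain would be smaller, which is why independence matters for collective learning.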

#### Gradual learning

Gradual learning methods, on the other hand, are based on the principle of iterative learning. 

- In this approach, each subsequent model tries to fix the errors of the previous model. 

- Gradual learning creates dependent estimators, as each model takes advantage of the knowledge from the previous estimator. 

- In iterative learning, each model is learning a different task, but each one contributes to the same goal of accurately predicting the target variable. 

- As gradual learning follows a sequential model building process, models cannot be trained in parallel.
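
The difference in parallelism shows up in the scikit-learn API. The sketch below uses synthetic data (an assumption, not the notebook's dataset) and compares a bagging ensemble with an AdaBoost ensemble: only the former exposes an `n_jobs` parameter.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor

# Synthetic data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Collective learning: estimators are independent, so training can be spread
# across cores by setting n_jobs (left at its default here)
bagging = BaggingRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Gradual learning: each estimator depends on the previous one,
# so AdaBoostRegressor has no n_jobs parameter at all
boosting = AdaBoostRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

print("n_jobs" in bagging.get_params())   # True
print("n_jobs" in boosting.get_params())  # False
```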

### AdaBoost properties

There are two distinctive properties of Adaptive Boosting compared to other Boosting algorithms. 

- First, the instances are drawn using a sample distribution of the training data into each subsequent dataset. This sample distribution makes sure that instances which were harder to predict for the previous estimator have a higher chance to be included in the training set for the next estimator by giving them higher weights. The distribution is initialized to be uniform. 

- Secondly, the estimators are combined through weighted majority voting. The voting weights are based on the estimators' training error. Estimators which have shown good performance are rewarded with higher weights for voting. 

In addition, AdaBoost is guaranteed to improve as the ensemble grows if each estimator has a weighted error rate below 0.5, that is, if it does better than random guessing on a binary problem. In other words, each estimator only needs to be a "weak" model. And, similar to Bagging, AdaBoost can be used for both classification and regression through its two variations.
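
The two properties can be sketched for a single boosting round. The code below follows the classic discrete AdaBoost update for binary labels in {-1, +1}; it is an illustration only and not the AdaBoost.R2 procedure that scikit-learn uses for regression. The function name `adaboost_round` and the toy labels are assumptions made for this example.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One round of the discrete AdaBoost update for labels in {-1, +1}.

    Returns the new instance weights, the estimator's voting weight (alpha)
    and its weighted error rate.
    """
    miss = y_true != y_pred
    err = np.sum(weights[miss])                   # weighted error (weights sum to 1)
    alpha = 0.5 * np.log((1.0 - err) / err)       # lower error -> larger voting weight
    new_weights = weights * np.exp(-alpha * y_true * y_pred)  # boost the misses
    return new_weights / new_weights.sum(), alpha, err

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])             # the second instance is misclassified
weights = np.full(5, 0.2)                         # the distribution starts out uniform

new_weights, alpha, err = adaboost_round(weights, y_true, y_pred)
print(new_weights)  # the misclassified instance now carries half of the total weight
print(alpha, err)
```

After the update, the instance the estimator got wrong is far more likely to be drawn for the next estimator, and an estimator with an error above 0.5 would receive a negative voting weight.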

#### AdaBoost regressor with scikit-learn

We can also find the AdaBoostRegressor class in the scikit-learn ensemble module. To instantiate an AdaBoost regression model, we call it with a few parameters.

- The parameter base_estimator works as usual: it's the weak model template for all the estimators. If not specified, the default is a Decision Tree regressor with a max depth of 3. (For the classifier, the default is a tree with a max depth of 1, also known as a decision stump.) In scikit-learn 1.2 and later, this parameter is named estimator. 

- The second parameter is n_estimators, the number of estimators we want to use. The default is 50. If there's a perfect fit, or an estimator with error higher than 50%, no more estimators are built. 

- Another important parameter is learning_rate, which represents how much each estimator contributes to the ensemble. This is 1.0 by default. There is a trade-off between the number of estimators and the learning rate: a lower learning rate usually needs more estimators to reach the same fit.

- In addition, we have the loss parameter, which is the function used to update weights. By default, it is linear, but you can also use the square or exponential loss.
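
The parameters above can be seen together in a minimal sketch. It uses synthetic data from `make_regression` (an assumption, not the notebook's dataset) and leaves the base model unspecified, so the default regression tree is used.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

# All values below are the documented defaults, written out for clarity
ada = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, loss="linear", random_state=0)
ada.fit(X_demo, y_demo)

# Fewer than n_estimators models may be kept if boosting stops early
print(len(ada.estimators_))
# The default weak model is a regression tree limited to depth 3
print(ada.estimators_[0].max_depth)
# Voting weights of the first three estimators
print(ada.estimator_weights_[:3])
```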

In [2]:
df = pd.read_csv("auto.csv")
df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [3]:
X = df.drop(["mpg", "origin"], axis = 1)

In [4]:
y = df["mpg"]

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error 

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

In [7]:
from sklearn.linear_model import LinearRegression

In [42]:
# Build and fit a linear regression model
# (normalize=True requires scikit-learn < 1.2, where the parameter was removed)
reg_lm = LinearRegression(normalize = True)
reg_lm.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_lr = reg_lm.predict(X_test)

# Evaluate the performance using the RMSE
rmse_lr = np.sqrt(mean_squared_error(y_test,pred_lr))
print('RMSE Linear Regression: {:.3f}'.format(rmse_lr))

RMSE Linear Regression: 5.007


In [43]:
from sklearn.tree import DecisionTreeRegressor

In [44]:
# Instantiate dt
dt = DecisionTreeRegressor(max_depth=3,random_state=500)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Compute y_pred
pred_dt = dt.predict(X_test)


# Evaluate the performance using the RMSE
rmse_dt = np.sqrt(mean_squared_error(y_test, pred_dt))
print('RMSE Decision Tree: {:.3f}'.format(rmse_dt))

RMSE Decision Tree: 4.365


In [35]:
from sklearn.ensemble import AdaBoostRegressor

In [46]:
# Instantiate a normalized linear regression model
reg_lm = LinearRegression(normalize = True)

# Build and fit an AdaBoost regressor
# (base_estimator was renamed to estimator in scikit-learn 1.2)
reg_ada = AdaBoostRegressor(base_estimator=reg_lm, n_estimators=100, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_lr_ada = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse_lr_ada = np.sqrt(mean_squared_error(y_test, pred_lr_ada))
print('RMSE AdaBoost with LR: {:.3f}'.format(rmse_lr_ada))

RMSE AdaBoost with LR: 4.855


In [47]:
# Build and fit an AdaBoost regressor on top of the decision tree
reg_ada = AdaBoostRegressor(base_estimator=dt, n_estimators=100, learning_rate=0.1, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_dt_ada = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse_dt_ada = np.sqrt(mean_squared_error(y_test, pred_dt_ada))
print('RMSE AdaBoost with DT : {:.3f}'.format(rmse_dt_ada))

RMSE AdaBoost with DT : 4.265


### Intro to gradient boosting machine

To understand the intuition behind the Gradient Boosting Machine, consider the following:

- Suppose that you want to estimate an objective function, let's say y as a function of X. That means, 

                                        Objective: y = f(X)

- On the first iteration, our initial model is a weak estimator that is fit to the dataset. Let's call it f1(X). 

                          Initial model (weak estimator): y ∼ f1(X)

- Then, on each subsequent iteration, a new model is built and fitted to the residual error from the previous iteration. The error is calculated as y minus f1(X). That means, 

                        New model, fits to residuals: y − f1(X) ∼ f2(X)

- After each individual estimator is built, the result is a new additive model, which is an improvement on the previous estimate. We repeat this process n times or until the error is small enough such that the difference in performance is negligible. 

- After the algorithm is finished, the result is a final improved additive model. 

                        Final additive model: y ∼ f1(X) + f2(X) + ... + fn(X)

This is a peculiarity of Gradient Boosting, as the individual estimators are not combined through voting or averaging, but by addition. This is because only the first model is fitted to the target variable, and the rest are estimates of the residual errors.
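
The additive scheme can be sketched by hand with plain regression trees. This is a simplified illustration on synthetic data, not scikit-learn's implementation, which additionally starts from a constant prediction and shrinks each tree by the learning rate. The helper names `fit_additive_model` and `predict_additive_model` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data, used only for this illustration
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

def fit_additive_model(X, y, n_stages=20, max_depth=2):
    """Fit f1 to y, then each following tree to the current residuals."""
    trees = []
    prediction = np.zeros_like(y, dtype=float)
    for _ in range(n_stages):
        residuals = y - prediction           # on the first stage this is just y
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction += tree.predict(X)        # add the new estimator to the model
        trees.append(tree)
    return trees

def predict_additive_model(trees, X):
    """Combine the estimators by addition, not by voting or averaging."""
    return np.sum([tree.predict(X) for tree in trees], axis=0)

trees = fit_additive_model(X_demo, y_demo)
mse_first = np.mean((y_demo - predict_additive_model(trees[:1], X_demo)) ** 2)
mse_final = np.mean((y_demo - predict_additive_model(trees, X_demo)) ** 2)
print(mse_first, mse_final)  # training error of f1 alone vs. the full additive model
```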


### Gradient boosting regressor

To build a Gradient Boosting Regressor, we first import the class from the sklearn ensemble module. This allows us to instantiate the Gradient Boosting Regressor. Unlike with other ensemble methods, here we don't specify the base_estimator, as Gradient Boosting is implemented and optimized with regression trees as the individual estimators.  

- The first parameter is n_estimators, which is 100 by default. 

- Then, we also specify the learning rate, which is 0.1 by default. 

- In addition, we have the tree-specific parameters: the maximum depth, which is 3 by default, the minimum number of samples required to split a node, the minimum number of samples required in a leaf node, and the maximum number of features. 

In Gradient Boosting, it is recommended to use all the features, which is the default behavior (max_features=None).
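
As a minimal sketch, the parameters above can be written out explicitly. The data is synthetic (an assumption, not the notebook's dataset), and every value shown is the scikit-learn default.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=100,      # number of boosting stages
    learning_rate=0.1,     # shrinks the contribution of each tree
    max_depth=3,           # tree-specific: maximum depth
    min_samples_split=2,   # tree-specific: samples needed to split a node
    min_samples_leaf=1,    # tree-specific: samples needed in a leaf
    max_features=None,     # tree-specific: None means all features are considered
    random_state=0,
)
gbm.fit(X_demo, y_demo)

print(gbm.estimators_.shape)  # one regression tree per boosting stage
```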

In [12]:
from sklearn.ensemble import GradientBoostingRegressor

In [48]:
# Build and fit a Gradient Boosting regressor
reg_gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=500)
reg_gbm.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_gbm = reg_gbm.predict(X_test)

# Evaluate the performance using the RMSE
rmse_gbm = np.sqrt(mean_squared_error(y_test, pred_gbm))
print('RMSE Gradient Boosting: {:.3f}'.format(rmse_gbm))

RMSE Gradient Boosting: 4.228


### Variations of gradient boosting

In this lesson, you'll learn about some variations, or flavors, of the gradient boosting family of algorithms, along with their implementations in Python.

#### Categorical boosting (CatBoost)

Categorical Boosting (or CatBoost) is the most recent of the Gradient Boosting flavors covered here. It was open sourced by Yandex, a Russian tech company, in 2017. 

CatBoost has built-in capacity to handle categorical features, so you don't need to do the preprocessing yourself. It is a fast implementation which can scale to large datasets and run on a GPU if required. CatBoost also provides a user-friendly interface that integrates well with scikit-learn. 

To build a CatBoost estimator, we import catboost and give it the alias cb. This gives us access to CatBoostClassifier and CatBoostRegressor.

In [16]:
import catboost as cb

In [19]:
# Build and fit a CatBoost regressor
reg_cat = cb.CatBoostRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=500)
reg_cat.fit(X_train,y_train)

# Calculate the predictions on the test set
pred = reg_cat.predict(X_test)

# Evaluate the performance using the RMSE
rmse_cat = np.sqrt(mean_squared_error(y_test, pred))

print('RMSE (CatBoost): {:.3f}'.format(rmse_cat))

0:	learn: 7.1822880	total: 467us	remaining: 46.3ms
1:	learn: 6.7767775	total: 896us	remaining: 43.9ms
2:	learn: 6.4103424	total: 1.42ms	remaining: 46.1ms
3:	learn: 6.0857905	total: 1.87ms	remaining: 44.9ms
4:	learn: 5.7591769	total: 2.37ms	remaining: 45.1ms
5:	learn: 5.4745471	total: 2.82ms	remaining: 44.1ms
6:	learn: 5.2242875	total: 3.19ms	remaining: 42.3ms
7:	learn: 4.9832336	total: 3.56ms	remaining: 40.9ms
8:	learn: 4.7795984	total: 3.96ms	remaining: 40.1ms
9:	learn: 4.6128593	total: 4.32ms	remaining: 38.9ms
10:	learn: 4.4647771	total: 4.71ms	remaining: 38.1ms
11:	learn: 4.3544540	total: 5.09ms	remaining: 37.3ms
12:	learn: 4.2356742	total: 5.45ms	remaining: 36.5ms
13:	learn: 4.1338963	total: 5.82ms	remaining: 35.7ms
14:	learn: 4.0415002	total: 6.18ms	remaining: 35ms
15:	learn: 3.9727747	total: 6.53ms	remaining: 34.3ms
16:	learn: 3.9199435	total: 6.89ms	remaining: 33.6ms
17:	learn: 3.8522944	total: 7.25ms	remaining: 33ms
18:	learn: 3.7997483	total: 7.6ms	remaining: 32.4ms
...

#### Extreme gradient boosting (XGBoost)

XGBoost is a more advanced implementation of the Gradient Boosting algorithm, optimized for distributed computing in both the training and prediction phases. 

Gradient boosting remains a sequential ensemble, but XGBoost parallelizes the work of building each individual estimator, which speeds up training. It's described as a scalable, portable, and accurate solution that can work with huge datasets. 

To build an XGBoost model, we first import the library with the alias xgb. This allows us to call the classes XGBClassifier or XGBRegressor. The parameters are similar to the ones for Gradient Boosting.

In [20]:
import xgboost as xgb

In [51]:
# Build and fit an XGBoost regressor
reg_xgb = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, objective='reg:squarederror', random_state=500)
reg_xgb.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_xgb = reg_xgb.predict(X_test)

# Evaluate the performance using the RMSE
rmse_xgb = np.sqrt(mean_squared_error(y_test, pred_xgb))

print('RMSE (XGBoost): {:.3f}'.format(rmse_xgb))

RMSE (XGBoost): 3.991


#### Light gradient boosting machine

Let's move on to Light Gradient Boosting, or LightGBM, which is a framework developed by Microsoft. Compared to XGBoost, LightGBM provides faster training and higher efficiency. It is also lighter in terms of space and memory usage. Being a distributed algorithm means it's optimized for parallel and GPU processing. LightGBM is useful when you are dealing with big datasets but have speed or memory constraints. 

In order to train a LightGBM ensemble model, you must import the lightgbm library and alias it as lgb, which stands for Light Gradient Boosting. Then, you can use the LGBMClassifier or LGBMRegressor depending on your problem. 

The parameters are similar to the ones for Gradient Boosting, except for max_depth, which is -1 by default, meaning no limit. Therefore, we must specify its value if a limit is desired. After instantiating the model, you can use the fit and predict methods as with any scikit-learn estimator.

In [22]:
import lightgbm as lgb

In [52]:
# Build and fit a LightGBM regressor
reg_lgb = lgb.LGBMRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, objective='mean_squared_error', seed=500)

reg_lgb.fit(X_train, y_train)

# Calculate the predictions on the test set
pred_lgb = reg_lgb.predict(X_test)

# Evaluate the performance using the RMSE
rmse_lgb = np.sqrt(mean_squared_error(y_test, pred_lgb))

print('RMSE (LightGBM): {:.3f}'.format(rmse_lgb))

RMSE (LightGBM): 4.250


In [54]:
print('RMSE Linear Regression: {:.3f}'.format(rmse_lr))
print('RMSE AdaBoost with LR: {:.3f}'.format(rmse_lr_ada))
print('RMSE Decision Tree: {:.3f}'.format(rmse_dt))
print('RMSE AdaBoost with DT : {:.3f}'.format(rmse_dt_ada))
print('RMSE (LightGBM): {:.3f}'.format(rmse_lgb))
print('RMSE (CatBoost): {:.3f}'.format(rmse_cat))
print('RMSE (XGBoost): {:.3f}'.format(rmse_xgb))

RMSE Linear Regression: 5.007
RMSE AdaBoost with LR: 4.855
RMSE Decision Tree: 4.365
RMSE AdaBoost with DT : 4.265
RMSE (LightGBM): 4.250
RMSE (CatBoost): 4.147
RMSE (XGBoost): 3.991
