# Machine Learning with AdaBoost

Now I will train a regression model using AdaBoost.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

In [2]:
# Bring in the data
%store -r X
%store -r y

In [3]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

### AdaBoost Regressor Using Decision Trees

In [4]:
# Define an AdaBoost regressor function that takes lists of n_estimators
# and learning_rate values and returns a table of R squared and RMSE.
def ab_R2_RMSE(n_estimators, learning_rate):
    data = []
    for n in n_estimators:
        for lr in learning_rate:
            # Fit one model per (n_estimators, learning_rate) combination
            model = AdaBoostRegressor(n_estimators=n, learning_rate=lr)
            model.fit(X_train, y_train)

            # R squared on the training and test sets
            model_r2_train = model.score(X_train, y_train)
            model_r2_test = model.score(X_test, y_test)

            # RMSE on the training and test sets
            model_y_pred_train = model.predict(X_train)
            model_y_pred_test = model.predict(X_test)
            model_RMSE_train = np.sqrt(mean_squared_error(y_train, model_y_pred_train))
            model_RMSE_test = np.sqrt(mean_squared_error(y_test, model_y_pred_test))

            data.append([n, lr, model_r2_train, model_r2_test, model_RMSE_train, model_RMSE_test])

    table = pd.DataFrame(data, columns=['Estimators', 'Learning Rate', 'Training R2', 'Test R2', 'Training RMSE', 'Test RMSE'])

    return table
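
The function above compares settings by their scores on the held-out test set. A common alternative, not used in this notebook, is to pick hyperparameters by cross-validation on the training data only and keep the test set for a final check. The sketch below shows one way this might look with scikit-learn's `GridSearchCV`. The synthetic data from `make_regression`, the small parameter grid, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, since the notebook's `X` and `y` are not available here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Small illustrative grid (assumption), not the grid used in the notebook.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.01, 0.1, 1.0]}

# 3-fold cross-validation on the training split, scored by negative RMSE.
search = GridSearchCV(
    AdaBoostRegressor(random_state=1),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# Negative RMSE of the refit best model on the untouched test split.
print(search.score(X_te, y_te))
```

Because the scorer is negative RMSE, values closer to zero are better.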

In [5]:
# Try different models to find the ideal number of estimators.
n_estimators = [50, 100, 150, 200, 250, 300]
learning_rate = [1.0]

ab_R2_RMSE(n_estimators, learning_rate)

| | Estimators | Learning Rate | Training R2 | Test R2 | Training RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| 0 | 50 | 1.0 | 0.830095 | 0.789454 | 5056.330881 | 5321.811441 |
| 1 | 100 | 1.0 | 0.833214 | 0.806653 | 5009.707514 | 5099.824981 |
| 2 | 150 | 1.0 | 0.838206 | 0.801861 | 4934.16983 | 5162.62986 |
| 3 | 200 | 1.0 | 0.836824 | 0.804277 | 4955.1942 | 5131.054698 |
| 4 | 250 | 1.0 | 0.836021 | 0.795072 | 4967.37575 | 5250.336279 |
| 5 | 300 | 1.0 | 0.834825 | 0.802342 | 4985.447358 | 5156.362737 |


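I use 250 estimators going forward. Note that the six settings are close (test R2 between 0.789454 and 0.806653), and in this particular table 100 estimators has the best test scores (test R2 of 0.806653, test RMSE of 5099.82). Because the model is fit without a fixed `random_state`, these rankings can change from run to run.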

In [7]:
# Try different models to find the ideal learning rate.
n_estimators = [250]
learning_rate = [0.0001, 0.001, 0.01, 0.1, 1.0, 10]

ab_results = ab_R2_RMSE(n_estimators, learning_rate)
ab_results

| | Estimators | Learning Rate | Training R2 | Test R2 | Training RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| 0 | 250 | 0.0001 | 0.863211 | 0.840023 | 4536.886634 | 4638.898958 |
| 1 | 250 | 0.001 | 0.863074 | 0.839777 | 4539.164894 | 4642.46537 |
| 2 | 250 | 0.01 | 0.854671 | 0.826572 | 4676.372037 | 4829.988263 |
| 3 | 250 | 0.1 | 0.83675 | 0.802858 | 4956.319767 | 5149.622936 |
| 4 | 250 | 1.0 | 0.838946 | 0.799202 | 4922.875307 | 5197.159404 |
| 5 | 250 | 10.0 | 0.019038 | -0.185856 | 12149.509375 | 12629.961178 |


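In this table, learning rates of 0.0001 and 0.001 perform almost identically: 0.0001 has a marginally higher test R2 (0.840023 vs. 0.839777) and a marginally lower test RMSE (4638.90 vs. 4642.47). Test R2 falls as the learning rate grows, and at a learning rate of 10 the model breaks down (test R2 of -0.185856). I take 0.001 as the learning rate; its gap to 0.0001 is negligible.

So the chosen AdaBoost model has 250 estimators and a learning rate of 0.001.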
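
As a cross-check on reading the table by eye, the best row can be picked out programmatically. The sketch below rebuilds the relevant columns of the learning-rate table above by hand (the notebook's `ab_results` DataFrame is not available here, so the name `results` is a stand-in) and selects the row with the lowest test RMSE.

```python
import pandas as pd

# Values copied from the learning-rate table above (250 estimators).
results = pd.DataFrame({
    "Learning Rate": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
    "Test R2": [0.840023, 0.839777, 0.826572, 0.802858, 0.799202, -0.185856],
    "Test RMSE": [4638.898958, 4642.46537, 4829.988263,
                  5149.622936, 5197.159404, 12629.961178],
})

# Row with the lowest test RMSE.
best = results.loc[results["Test RMSE"].idxmin()]
print(best["Learning Rate"])  # 0.0001

# Gap in test RMSE between the two smallest learning rates.
rmse_gap = results["Test RMSE"].iloc[1] - results["Test RMSE"].iloc[0]
print(round(rmse_gap, 2))  # 3.57
```

The same two lines of selection logic would work on `ab_results` directly, since it has the same column names.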

This model does not outperform the previous best model, a random forest with depth 3 and 400 estimators.

In [8]:
# Save the table for final metrics file
%store ab_results

Stored 'ab_results' (DataFrame)
