## Gradient Boosting Regression

This notebook repeats the procedure from *DecisionTree.ipynb*, replacing the regression model with Gradient Boosting.


In [1]:
# Project helpers: data loading, cross-validation, and RMSE scoring
from utils import preparing_data, cross_validation, rmse
from math import sqrt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error  # error metric
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict, train_test_split
# Model
from sklearn.ensemble import GradientBoostingRegressor


### Model with outliers

In [2]:
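# IQR=False: load the data with outliers kept, as the section heading indicates
X, y = preparing_data(IQR=False)
# Standardize the features, then hold out 20% of the rows for testing
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)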

  


In [3]:
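# 'squared_error' is the current name of the least-squares loss; the old alias 'ls'
# was deprecated in scikit-learn 1.0 and later removed
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='squared_error')
est.fit(X_train, y_train)
predictions = est.predict(X_test)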

In [4]:
print("RMSE train: ", rmse(est,y_train,X_train),", RMSE test: ", rmse(est,y_test,X_test))

RMSE train:  11610.313505791 , RMSE test:  11576.743562558253


In [5]:
list_rmse_train,list_rmse_test,trained_model = cross_validation(est,pd.DataFrame(X_scaled),y)

In [6]:
print("RMSE Test:", list_rmse_test, "\nRMSE Train: ", list_rmse_train)

RMSE Test: [13036.990968860116, 9569.979444607088, 17301.19364759242, 7207.086590090035, 11830.19563390642, 9139.341256195206, 14268.887498569173, 13588.609158878326, 16109.249035655615, 7825.3056131950425] 
RMSE Train:  [11466.093172242945, 11904.204534389226, 10817.463064153812, 11991.91712920512, 11678.007461435265, 11852.042404392245, 11298.58967603471, 11368.328211837355, 11094.382910955639, 11954.778792262581]
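The `cross_validation` helper also lives in `utils` and is not shown. The output above has ten values per list, and a later cell indexes the third return value as `trained_model[5]`, so it plausibly runs 10-fold cross-validation and returns one fitted model per fold. The sketch below works under those assumptions; the function name, the shuffling, and the seed are hypothetical choices, not the author's implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold


def cross_validation_sketch(model, X, y, n_splits=10, seed=42):
    """Return per-fold train RMSEs, test RMSEs, and the fitted fold models.

    Hypothetical stand-in for utils.cross_validation.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float).ravel()
    rmse_train, rmse_test, models = [], [], []

    # Shuffling and the seed are assumptions; the original helper may differ
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        # Fit a fresh copy so the folds do not share state
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        for idx, scores in ((train_idx, rmse_train), (test_idx, rmse_test)):
            residuals = y[idx] - fold_model.predict(X[idx])
            scores.append(float(np.sqrt(np.mean(residuals ** 2))))
        models.append(fold_model)

    return rmse_train, rmse_test, models
```

Returning the fitted fold models is what makes it possible to inspect a single fold afterwards, at the cost of keeping ten estimators in memory.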


### Without outliers

In [19]:
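# Default arguments: load the data with outliers removed, as the section heading indicates
X_cleaner, y_cleaner = preparing_data()
# Standardize the features, then hold out 20% of the rows for testing
X_scaled_cleaner = preprocessing.scale(X_cleaner)
X_train_cleaner, X_test_cleaner, y_train_cleaner, y_test_cleaner = train_test_split(X_scaled_cleaner, y_cleaner, test_size=0.2, random_state=42)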
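The `IQR` flag of `preparing_data` suggests that outliers are removed with an interquartile-range rule, but the helper's code is not shown. The sketch below illustrates the usual form of that rule under that assumption. The function name, the column name `target`, and the multiplier `k=1.5` are hypothetical; the original may filter different columns or use a different threshold.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Made-up example: the value 100 lies far outside the bulk of the data
demo = pd.DataFrame({"target": [1, 2, 3, 4, 100]})
cleaned = drop_iqr_outliers(demo, "target")
print(cleaned["target"].tolist())
```

A rule of this kind discards rows rather than capping them, so the cleaned dataset is smaller than the original.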

  


**Here, we create the regression model:**

In [20]:
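# 'squared_error' is the current name of the least-squares loss; the old alias 'ls'
# was deprecated in scikit-learn 1.0 and later removed
est_cleaner = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='squared_error')
est_cleaner.fit(X_train_cleaner, y_train_cleaner)
predictions_cleaner = est_cleaner.predict(X_test_cleaner)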

In [9]:
print("RMSE train: ", rmse(est_cleaner,y_train_cleaner,X_train_cleaner),", RMSE test: ", rmse(est_cleaner,y_test_cleaner,X_test_cleaner))

RMSE train:  7561.369734384837 , RMSE test:  7564.564281280412


**Cross-validation:**

In [21]:
list_rmse_train_cleaner,list_rmse_test_cleaner,trained_model = cross_validation(est_cleaner,pd.DataFrame(X_scaled_cleaner),y_cleaner)

In [11]:
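# 'train cv' / 'test cv': per-fold RMSE with outliers; 'train wo cv' / 'test wo cv': per-fold RMSE without outliers
pd.DataFrame(data={'train cv': list_rmse_train, 'test cv': list_rmse_test, 'train wo cv': list_rmse_train_cleaner, 'test wo cv': list_rmse_test_cleaner})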

| fold | test cv | test wo cv | train cv | train wo cv |
|---|---|---|---|---|
| 0 | 13036.990969 | 8317.720316 | 11466.093172 | 7549.958743 |
| 1 | 9569.979445 | 8387.136547 | 11904.204534 | 7574.144157 |
| 2 | 17301.193648 | 10395.383778 | 10817.463064 | 7230.962284 |
| 3 | 7207.08659 | 8200.93859 | 11991.917129 | 7533.001856 |
| 4 | 11830.195634 | 8720.544354 | 11678.007461 | 7512.756387 |
| 5 | 9139.341256 | 7402.522271 | 11852.042404 | 7599.403913 |
| 6 | 14268.887499 | 8504.123869 | 11298.589676 | 7469.020184 |
| 7 | 13588.609159 | 8645.039479 | 11368.328212 | 7447.815507 |
| 8 | 16109.249036 | 9039.82366 | 11094.382911 | 7526.008896 |
| 9 | 7825.305613 | 7868.764639 | 11954.778792 | 7604.760005 |


In [28]:
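# Mean absolute error, on the test split, of the model stored at index 5 of the cross-validation output.
# cross_validation received the full cleaned dataset, so this model has likely seen part of this split.
df = pd.DataFrame(data={'pred': trained_model[5].predict(X_test_cleaner), 'target': y_test_cleaner})
df['error'] = abs(df['target'] - df['pred'])
df = df.sort_values(by=['error'])  # sorting does not affect the mean
np.mean(df['error'])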

6044.006439998721

In [29]:
np.std(df['error'])

4663.939028694304

In [26]:
# Same computation on the training split
df = pd.DataFrame(data={'pred': trained_model[5].predict(X_train_cleaner), 'target': y_train_cleaner})
df['error'] = abs(df['target'] - df['pred'])
df = df.sort_values(by=['error'])  # sorting does not affect the mean
np.mean(df['error'])

6042.09293848066

In [27]:
np.std(df['error'])

4660.861434651604

**Standard deviation of the training RMSE across cross-validation folds (without outliers)**

In [12]:
np.std(list_rmse_train_cleaner)

103.25348068240042

**Standard deviation of the test RMSE across cross-validation folds (without outliers)**

In [13]:
np.std(list_rmse_test_cleaner)

752.9055435196367

**Mean of the training RMSE across cross-validation folds (without outliers)**

In [14]:
np.mean(list_rmse_train_cleaner)

7504.78319312853

**Mean of the test RMSE across cross-validation folds (without outliers)**

In [15]:
np.mean(list_rmse_test_cleaner)

8548.199750133077
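The four statistics above are easier to compare side by side. As a small sketch, the snippet below recomputes them from the per-fold RMSE values printed in the comparison table (columns `train wo cv` and `test wo cv`, rounded as shown there) and adds the gap between the mean test and mean train RMSE. The variable names are chosen for this illustration only.

```python
import numpy as np

# Per-fold RMSE without outliers, copied from the comparison table
train_rmse = [7549.958743, 7574.144157, 7230.962284, 7533.001856, 7512.756387,
              7599.403913, 7469.020184, 7447.815507, 7526.008896, 7604.760005]
test_rmse = [8317.720316, 8387.136547, 10395.383778, 8200.93859, 8720.544354,
             7402.522271, 8504.123869, 8645.039479, 9039.82366, 7868.764639]

for name, scores in (("train", train_rmse), ("test", test_rmse)):
    # np.std uses the population formula (ddof=0), as in the cells above
    print(f"{name}: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")

# Difference between average test and average train RMSE
gap = np.mean(test_rmse) - np.mean(train_rmse)
print(f"gap (test - train): {gap:.2f}")
```

On these numbers the mean test RMSE is about 1,043 higher than the mean train RMSE, and the test RMSE varies far more from fold to fold than the train RMSE does.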