## Linear Regression

Here, we repeat the procedure from the *DecisionTree.ipynb* notebook, changing only the regression model to ridge regression (a linear model with an L2 penalty).


In [1]:
# project helpers: data preparation, k-fold cross-validation, RMSE
from utils import preparing_data, cross_validation, rmse
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error  # error metric
from sklearn.model_selection import cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd
# model
from sklearn import linear_model


In [2]:
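# load the data with the default settings of preparing_data()
X_cleaner, y_cleaner = preparing_data()
# standardize every feature to zero mean and unit variance
X_scaled_cleaner = preprocessing.scale(X_cleaner)
# hold out 20% of the rows for testing
X_train_cleaner, X_test_cleaner, y_train_cleaner, y_test_cleaner = train_test_split(
    X_scaled_cleaner, y_cleaner, test_size=0.2, random_state=42)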

  


In [3]:
reg_cleaner = linear_model.Ridge(alpha=.5)
reg_cleaner.fit(X_train_cleaner, y_train_cleaner)
predictions_cleaner = reg_cleaner.predict(X_test_cleaner)
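
For reference, `linear_model.Ridge(alpha=.5)` fits a linear model with an L2 penalty on the coefficients. In scikit-learn's formulation, the coefficient vector $w$ minimizes the objective below, where $X$ is the scaled feature matrix, $y$ the target vector, and $\alpha = 0.5$; with the default `fit_intercept=True`, the intercept is not penalized.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$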

In [20]:
print("RMSE train: ", rmse(reg_cleaner,y_train_cleaner,X_train_cleaner),", RMSE test: ", rmse(reg_cleaner,y_test_cleaner,X_test_cleaner))

RMSE train:  7952.906389540915 , RMSE test:  7954.592151225256
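
The `rmse` helper comes from the project's `utils` module, which is not shown in this notebook. The following is a minimal sketch of what such a helper might compute, assuming it takes the arguments in the order used above (model, targets, features) and returns the root mean squared error of `model.predict(X)`. The names `rmse_sketch` and `ConstantModel` are hypothetical and exist only for this illustration.

```python
import numpy as np


def rmse_sketch(model, y_true, X):
    """Hypothetical stand-in for utils.rmse (assumed behaviour):
    root mean squared error of model.predict(X) against y_true."""
    y_pred = np.asarray(model.predict(X), dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


class ConstantModel:
    """Toy model that always predicts the same value (illustration only)."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)


# Every prediction is off by exactly 1, so the RMSE is 1.0.
toy_X = np.zeros((4, 1))
toy_y = np.array([1.0, 3.0, 1.0, 3.0])
toy_rmse = rmse_sketch(ConstantModel(2.0), toy_y, toy_X)
```

Any fitted scikit-learn regressor, such as `reg_cleaner`, could be passed in place of the toy model, since only a `predict` method is required.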


**Here, we create the same regression model on the data prepared with `IQR=False`:**

In [4]:
# same pipeline as above, but with IQR=False passed to preparing_data()
X, y = preparing_data(IQR=False)
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

  


In [5]:
reg = linear_model.Ridge(alpha=.5)
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)

In [12]:
pd.DataFrame(data={'target':y_test[152094:152144].tolist(), 'prediction':predictions[152094:152144].tolist()}).head()

Unnamed: 0,prediction,target
0,24923.415562,27900
1,14378.554358,7494
2,27112.678984,34850
3,24121.276254,22974
4,24575.656316,30349


In [10]:
print("RMSE Train: ", rmse(reg, y_train, X_train), "RMSE Test: ", rmse(reg, y_test, X_test))

RMSE Train:  7952.906389540915 RMSE Test:  7954.592151225256


**Cross-validation:**

In [7]:
list_rmse_train,list_rmse_test,scores_train = cross_validation(reg,pd.DataFrame(X_scaled),y)

In [6]:
list_rmse_train_scaled,list_rmse_test_scaled,trained_model_cleaner = cross_validation(reg,pd.DataFrame(X_scaled_cleaner),y_cleaner)

In [8]:
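# 'train cv' / 'test cv': folds on the IQR=False data (X_scaled, y)
# 'train wo cv' / 'test wo cv': folds on the "cleaner" data (X_scaled_cleaner, y_cleaner)
pd.DataFrame(data={'train cv':list_rmse_train, 'test cv':list_rmse_test, 'train wo cv': list_rmse_train_scaled, 'test wo cv': list_rmse_test_scaled})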

Unnamed: 0,test cv,test wo cv,train cv,train wo cv
0,13482.908969,8650.055538,11991.676329,7912.252281
1,8567.673295,7676.567978,12473.335229,8002.679936
2,17181.781449,9411.029793,11428.928385,7797.589698
3,6941.695667,7394.696477,12562.504687,8018.134389
4,11540.76266,8482.10075,12206.330273,7908.291761
5,9258.664513,7385.995027,12403.458008,8020.066567
6,14305.03217,8308.02008,11851.148527,7913.633081
7,13424.146316,8383.597261,11969.257071,7916.426606
8,16194.683135,8659.332165,11614.147349,7896.787683
9,7911.438125,7636.362799,12502.086979,8009.791952
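
The `cross_validation` helper also lives in `utils` and is not shown here. Judging only from how it is called (three return values, ten rows per list, and a third value that is indexed like a list of fitted models in the `trained_model_cleaner` call), it might look roughly like the sketch below. The fold count, the absence of shuffling, and the name `cross_validation_sketch` are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cross_validation_sketch(model, X, y, n_splits=10):
    """Hypothetical stand-in for utils.cross_validation (assumed behaviour).

    Returns per-fold train RMSEs, per-fold test RMSEs, and the fitted fold models.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rmse_train, rmse_test, models = [], [], []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        # fit a fresh copy of the estimator on this fold's training rows
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        pred_train = fold_model.predict(X[train_idx])
        pred_test = fold_model.predict(X[test_idx])
        rmse_train.append(float(np.sqrt(mean_squared_error(y[train_idx], pred_train))))
        rmse_test.append(float(np.sqrt(mean_squared_error(y[test_idx], pred_test))))
        models.append(fold_model)
    return rmse_train, rmse_test, models


# Synthetic linear data with small noise, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_train, demo_test, demo_models = cross_validation_sketch(Ridge(alpha=0.5), X_demo, y_demo)
```

Cloning the estimator for each fold keeps the fold models independent of one another, so a single fold model can be inspected later by index.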


**Absolute error of a single cross-validation fold model (index 5) on the test and training splits**

In [9]:
rmse(trained_model_cleaner[5],y_test_cleaner,X_test_cleaner)

8182.063412675323

In [10]:
df = pd.DataFrame(data={'pred':trained_model_cleaner[5].predict(X_test_cleaner), 'target':y_test_cleaner})
df['error'] = abs(df['target']-df['pred'])
np.mean(df['error'])

6682.386580892794

In [11]:
np.std(df['error'])

4721.426825922865
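
The mean of the absolute errors is the mean absolute error (MAE), and `np.std` returns the population standard deviation by default (`ddof=0`). These two numbers are tied to the RMSE by the identity RMSE² = MAE² + std², which is why the MAE is never larger than the RMSE. The sketch below checks this on toy values and on the three figures reported above; the function name `abs_error_summary` is hypothetical.

```python
import math

import numpy as np


def abs_error_summary(y_true, y_pred):
    """Return (mean, population std) of the absolute errors."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(errors.mean()), float(errors.std())


# Toy example: absolute errors are 2, 2, 0, 4.
toy_mae, toy_std = abs_error_summary([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 30.0, 44.0])
toy_rmse = math.sqrt(toy_mae ** 2 + toy_std ** 2)

# Figures reported in this notebook for fold model 5 on the test split.
reported_mae = 6682.386580892794
reported_std = 4721.426825922865
reported_rmse = 8182.063412675323
rmse_from_identity = math.sqrt(reported_mae ** 2 + reported_std ** 2)
```

Here `rmse_from_identity` agrees with the reported RMSE, so the three outputs describe the same set of predictions.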

In [12]:
# same absolute-error summary, now on the training split
df = pd.DataFrame(data={'pred':trained_model_cleaner[5].predict(X_train_cleaner), 'target':y_train_cleaner})
df['error'] = abs(df['target']-df['pred'])
np.mean(df['error'])

6683.092800371078

In [13]:
np.std(df['error'])

4711.908422294835

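**Standard deviation of error in training set after cross-validation**

In [13]:
# folds on the "cleaner" data ('train wo cv' column of the table above), which reproduce the value below
np.std(list_rmse_train_scaled)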

68.18189367605405

**Standard deviation of error in test set after cross-validation**

In [14]:
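# folds on the "cleaner" data ('test wo cv' column of the table above), which reproduce the value below
np.std(list_rmse_test_scaled)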

625.1284844207817

**Mean of the error in the training set after cross-validation**

In [15]:
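# folds on the "cleaner" data ('train wo cv' column of the table above), which reproduce the value below
np.mean(list_rmse_train_scaled)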

7939.565395575459

**Mean of the error in the test set after cross-validation**

In [16]:
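# folds on the "cleaner" data ('test wo cv' column of the table above), which reproduce the value below
np.mean(list_rmse_test_scaled)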

8198.77606055087
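
The four summary values in this section can be recomputed from the cross-validation table: they coincide with the `train wo cv` and `test wo cv` columns, up to the rounding of the printed table. The sketch below copies those two columns by hand; the variable names are illustrative only.

```python
import numpy as np

# 'train wo cv' and 'test wo cv' columns, copied from the table in this notebook
train_wo_cv = np.array([
    7912.252281, 8002.679936, 7797.589698, 8018.134389, 7908.291761,
    8020.066567, 7913.633081, 7916.426606, 7896.787683, 8009.791952,
])
test_wo_cv = np.array([
    8650.055538, 7676.567978, 9411.029793, 7394.696477, 8482.10075,
    7385.995027, 8308.02008, 8383.597261, 8659.332165, 7636.362799,
])

# np.std uses the population formula (ddof=0) by default
summary = {
    "train mean": float(np.mean(train_wo_cv)),
    "train std": float(np.std(train_wo_cv)),
    "test mean": float(np.mean(test_wo_cv)),
    "test std": float(np.std(test_wo_cv)),
}
```

The test RMSE varies roughly nine times as much across folds as the training RMSE, while the two means differ by only about 3%.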