## Decision Tree Regression

First, we import the modules needed to run this notebook.

In [31]:
from utils import preparing_data,cross_validation,rmse # read data
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from math import sqrt # RMSE
from sklearn.metrics import mean_squared_error # error metric
from sklearn.model_selection import cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold # import KFold
import numpy as np
# model
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

Next, we use a helper we wrote, `preparing_data`, to pre-process the data. This step transforms non-numeric features into numeric ones and, optionally, removes outliers (controlled by the `IQR` argument). We then standardize the features and split the data into a training set (80%) and a test set (20%).
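The body of `preparing_data` is not shown in this notebook, so the following is only a minimal sketch of what an interquartile-range (IQR) outlier filter might look like. The function name `remove_outliers_iqr`, the column name `price`, and the multiplier `k = 1.5` are assumptions made for illustration, not the actual implementation.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper: the real preparing_data may filter differently.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Toy data (not from the real dataset): one extreme value among ordinary ones.
toy = pd.DataFrame({"price": [9000, 10000, 11000, 12000, 13000, 250000]})
filtered = remove_outliers_iqr(toy, "price")
print(filtered["price"].tolist())
```

On the toy data, only the extreme value falls outside the IQR fences, so it is the only row dropped.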

We use the `rmse` function imported from `utils` to calculate the root mean square error (RMSE).
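The `rmse` helper lives in `utils` and is not reproduced here. Judging only from how it is called in this notebook (`rmse(model, y, X)`), it probably resembles the sketch below, which computes the square root of the mean squared difference between targets and predictions. The name `rmse_sketch` and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def rmse_sketch(model, y_true, X):
    """Root mean squared error of model.predict(X) against y_true.

    Argument order mirrors the calls in this notebook; the real utils.rmse
    may differ in its details.
    """
    return float(np.sqrt(mean_squared_error(y_true, model.predict(X))))


# Toy data (not from the real dataset).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 2.0, 5.0])

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
train_error = rmse_sketch(tree, y_demo, X_demo)
print(train_error)
```

Because RMSE is expressed in the same units as the target, it is easier to interpret than the raw mean squared error.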

#### Without removing outliers

In [2]:
X,y=preparing_data(IQR=False)
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

  


In [3]:
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

In [4]:
print("RMSE Train: ", rmse(regressor,y_train, X_train), "RMSE Test: ", rmse(regressor,y_test, X_test))

RMSE Train:  226.88013077758367 RMSE Test:  5737.4721369755225


**Next, we fit the same regression model on the data with outliers removed:**

In [4]:
X_cleaner,y_cleaner=preparing_data()
X_scaled_cleaner = preprocessing.scale(X_cleaner)
X_train_cleaner, X_test_cleaner, y_train_cleaner, y_test_cleaner = train_test_split(X_scaled_cleaner, y_cleaner, test_size=0.2, random_state=42)

  


In [5]:
regressor_cleaner = DecisionTreeRegressor(random_state=0)
regressor_cleaner.fit(X_train_cleaner, y_train_cleaner)
predictions_cleaner = regressor_cleaner.predict(X_test_cleaner)

Let's take a sample of targets and predictions to plot later.

In [7]:
print(y_test_cleaner[152094:152144].tolist())

[27900, 7494, 34850, 22974, 30349, 17497, 19500, 26967, 30987, 23257, 8488, 34996, 21495, 37987, 9896, 20811, 6788, 9998, 30487, 13750, 31999, 23799, 18900, 9747, 16988, 17986, 26991, 16966, 15988, 19989, 20995, 21488, 11942, 31804, 6900, 26730, 12049, 11599, 17728, 10250, 14950, 8295, 21492, 19988, 12995, 19000, 47375, 12999, 9989, 6900]


In [8]:
# Note: this slices `predictions` (the model trained with outliers), not `predictions_cleaner`.
print(predictions[152094:152144].tolist())

[27990.0, 18629.0, 13907.0, 20298.0, 4495.0, 39988.0, 8990.0, 15933.0, 6999.0, 12995.0, 15396.0, 14995.0, 57322.0, 21000.0, 15950.0, 13995.0, 18929.0, 6525.0, 23900.0, 13987.0, 44990.0, 26999.0, 16495.0, 10312.0, 16988.0, 4995.0, 18617.0, 42890.0, 12900.0, 6250.0, 12598.0, 14995.0, 14500.0, 57844.0, 5495.0, 10988.0, 16495.0, 16999.0, 4495.0, 16733.0, 23999.0, 27989.0, 23394.0, 15195.0, 10590.0, 13995.0, 10749.0, 38999.0, 14884.0, 26500.0]


Checking the root mean square error on the training and test sets:

In [9]:
print("RMSE Train: ", rmse(regressor_cleaner, y_train_cleaner, X_train_cleaner), "RMSE Test: ", rmse(regressor_cleaner, y_test_cleaner, X_test_cleaner))

RMSE Train:  86.60302625104681 RMSE Test:  3855.118118325681


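As expected, the error on the test set is greater than on the training set. The gap is large (about 87 vs. 3,855), which suggests that the unconstrained tree overfits the training data. Next, we perform cross-validation and compute the RMSE on each of 10 folds.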



**Cross-validation:**

We use k-fold cross-validation with k = 10.
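The `cross_validation` helper is imported from `utils` and its body is not shown. From its call sites, it appears to return a list of training RMSEs, a list of test RMSEs, and the models fitted on each fold. The sketch below is one plausible way to write such a helper with scikit-learn's `KFold`. The name `cross_validation_sketch`, the shuffling, and the `seed` parameter are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def cross_validation_sketch(model, X, y, k=10, seed=42):
    """Fit a fresh copy of `model` on each of k folds.

    Returns (train RMSEs, test RMSEs, fitted models), one entry per fold.
    Hypothetical stand-in for utils.cross_validation.
    """
    X = pd.DataFrame(X).reset_index(drop=True)
    y = pd.Series(np.asarray(y))
    train_rmse, test_rmse, models = [], [], []
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
        X_te, y_te = X.iloc[test_idx], y.iloc[test_idx]
        fitted = clone(model).fit(X_tr, y_tr)  # clone: no state shared across folds
        train_rmse.append(float(np.sqrt(mean_squared_error(y_tr, fitted.predict(X_tr)))))
        test_rmse.append(float(np.sqrt(mean_squared_error(y_te, fitted.predict(X_te)))))
        models.append(fitted)
    return train_rmse, test_rmse, models


# Synthetic data (not the real dataset), just to exercise the helper.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

tr, te, fitted_models = cross_validation_sketch(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo
)
print(len(tr), len(te), len(fitted_models))
```

Keeping the per-fold models makes it possible to inspect a single fold afterwards, as the notebook does with index 7.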


In [6]:
list_rmse_train_cleaner,list_rmse_test_cleaner,trained_model_cleaner=cross_validation(regressor_cleaner,pd.DataFrame(X_scaled_cleaner),y_cleaner)

In [7]:
list_rmse_train,list_rmse_test,trained_model=cross_validation(regressor,pd.DataFrame(X_scaled),y)

In [8]:
# 'train cv' / 'test cv': data with outliers; 'train wo cv' / 'test wo cv': outliers removed
pd.DataFrame(data={'train cv':list_rmse_train, 'test cv':list_rmse_test, 'train wo cv': list_rmse_train_cleaner, 'test wo cv': list_rmse_test_cleaner})

| fold | test cv | test wo cv | train cv | train wo cv |
|---|---|---|---|---|
| 0 | 58297.193427 | 12843.41296 | 213.122189 | 77.678136 |
| 1 | 14601.43218 | 8282.601874 | 233.547461 | 90.458496 |
| 2 | 16016.095211 | 9069.81379 | 204.475986 | 85.8151 |
| 3 | 11466.014484 | 8782.436797 | 214.483023 | 68.010685 |
| 4 | 17576.20553 | 9676.502805 | 213.914336 | 76.866666 |
| 5 | 27759.660236 | 9160.585906 | 233.892278 | 91.668077 |
| 6 | 18373.183273 | 12448.039617 | 221.00998 | 89.043968 |
| 7 | 16764.883427 | 8205.878149 | 231.070519 | 89.962801 |
| 8 | 18807.985569 | 9797.068042 | 186.881443 | 90.461965 |
| 9 | 15363.799339 | 11501.170035 | 233.623218 | 90.685518 |


In [22]:
pred = trained_model[7].predict(X_test_cleaner)
pred.sort()
pred[0:500]

93

In [43]:
df = pd.DataFrame(data={'pred':trained_model_cleaner[7].predict(X_test_cleaner), 'target':y_test_cleaner})
df['error'] = abs(df['target']-df['pred'])
df.sort_values(by=['error'])
np.mean(df['error'])

861.3870095220325

In [44]:
np.std(df['error'])

3513.2817756483737

In [45]:
df = pd.DataFrame(data={'pred':trained_model_cleaner[7].predict(X_train_cleaner), 'target':y_train_cleaner})
df['error'] = abs(df['target']-df['pred'])
df.sort_values(by=['error'])
np.mean(df['error'])

872.457842070373

In [46]:
np.std(df['error'])

3537.1365317932905

In [37]:
rmse(trained_model_cleaner[7],y_test_cleaner, X_test_cleaner)

3617.33830533948

**Standard deviation of error in training set after cross-validation**

In [13]:
np.std(list_rmse_train_cleaner)

7.657801576247329

**Standard deviation of error in test set after cross-validation**

In [14]:
np.std(list_rmse_test_cleaner)

1602.887564760597

**Mean of the error in the training set after cross-validation**

In [15]:
np.mean(list_rmse_train_cleaner)

85.06514123212096

**Mean of the error in the test set after cross-validation**

In [16]:
np.mean(list_rmse_test_cleaner)

9976.750997537993
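The four summary cells above can be collapsed into one small helper. The sketch below is an illustration rather than part of the original notebook: the name `summarize_rmse` is an assumption, and the scores are the per-fold test RMSEs for the outlier-free data, copied from the `test wo cv` column of the cross-validation table. Note that `np.std` computes the population standard deviation (`ddof=0`) by default.

```python
import numpy as np


def summarize_rmse(scores):
    """Return the mean and population standard deviation of per-fold RMSEs."""
    scores = np.asarray(scores, dtype=float)
    return {"mean": float(scores.mean()), "std": float(scores.std())}


# Per-fold test RMSEs without outliers ('test wo cv' column of the table above).
test_rmse_cleaner = [
    12843.41296, 8282.601874, 9069.81379, 8782.436797, 9676.502805,
    9160.585906, 12448.039617, 8205.878149, 9797.068042, 11501.170035,
]

summary = summarize_rmse(test_rmse_cleaner)
print(summary)
```

Reporting the spread alongside the mean shows how much the test error varies from fold to fold, which the mean alone hides.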