## Decision Tree Regression

First, let's import all the modules needed to run the file.

In [1]:
from preparingData import preparing_data # read data
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from math import sqrt # RMSE
from sklearn.metrics import mean_squared_error # error metric
from sklearn.model_selection import cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold # import KFold
# model
from sklearn.tree import DecisionTreeRegressor


Next, we pre-process the data with a script we wrote (*preparingData*). This step transforms non-numeric data and handles outliers. We then standardize the features and split the data into a training set (80%) and a test set (20%).

In [2]:
X,y=preparing_data()
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

  


We use the function defined below to calculate the root mean squared error (RMSE).

In [3]:
def rmse(model, y, x):
    return sqrt(mean_squared_error(y, model.predict(x)))
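
To make the metric concrete, here is a minimal sketch that computes the RMSE by hand, without scikit-learn. The helper name `rmse_from_values` and the toy numbers are assumptions made up for illustration; they are not part of the notebook's data.

```python
from math import sqrt


def rmse_from_values(y_true, y_pred):
    """Root mean squared error of two equal-length sequences (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, invented for this example.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Residuals are 1, 0 and -2, so the squared errors are 1, 0 and 4.
toy_rmse = rmse_from_values(y_true, y_pred)
print(toy_rmse)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, and the result is expressed in the same units as the target.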


**Here, we create the regression model:**

In [4]:
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

Verifying the root mean square error in the training set:

In [127]:
rmse(regressor, y_train, X_train)

86.60302625104681

Verifying the root mean square error in the test set:

In [128]:
rmse(regressor, y_test, X_test)

3855.118118325681

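As expected, the error on the test set is larger than on the training set. The gap is large (roughly 3855 versus 87), which suggests that the unconstrained tree is overfitting the training data. Next, we run 10-fold cross-validation and compute the error on each fold.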



**Cross-validation:**

In [129]:
kf = KFold(n_splits=10) # define the split: 10 folds
kf.get_n_splits(X_scaled) # returns the number of splitting iterations in the cross-validator
print(kf) 

KFold(n_splits=10, random_state=None, shuffle=False)
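
With `shuffle=False`, `KFold` assigns consecutive blocks of rows to each test fold, and the remaining rows form the training set for that fold. The sketch below reproduces that partitioning in plain Python to show what the index arrays look like. The function `contiguous_folds` is a hypothetical helper written for this illustration, not scikit-learn's implementation, and the sizes (10 samples, 5 folds) are chosen only to keep the printout short.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for unshuffled k-fold splitting."""
    # When n_samples is not divisible by n_splits, the first `extra` folds
    # receive one additional sample each.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(contiguous_folds(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Every row lands in exactly one test fold. If the rows are stored in a meaningful order (for example sorted by the target), unshuffled folds can differ systematically from one another, which is one reason to consider `shuffle=True`.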


In [135]:
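list_rmse_train = []
list_rmse_test = []
scores_train = []
regressor = DecisionTreeRegressor(random_state=0)
# X and y are indexed with .iloc, so they are expected to be pandas objects (unscaled features)
for train_index, test_index in kf.split(X):
    # note: this overwrites X_train, X_test, y_train and y_test from the earlier split
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    regressor.fit(X_train, y_train) # refit the tree on this fold's training rows
    list_rmse_train.append(rmse(regressor, y_train, X_train)) # RMSE on the rows used for fitting
    list_rmse_test.append(rmse(regressor, y_test, X_test)) # RMSE on the held-out fold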

    

0.9999252317544923
0.9999049494295896
0.9999103495313576
0.9999451051628483
0.9999283853503074
0.9999021270943964
0.9999047106014244
0.9999039580869413
0.9999025207391459
0.9999030687492263


In [132]:
print(list_rmse_test)

[12841.046954241307, 8300.998118126668, 9070.19370539217, 8782.315009046111, 9776.709419040835, 9212.774486275368, 12448.17400704062, 8200.30156918181, 9797.408806699643, 11501.193226668924]
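
A list of ten numbers is hard to compare against the single test-set RMSE reported earlier, so it can help to summarize the folds by their mean and spread. The sketch below does that with the standard library, using the values copied from the output above. The variable names are assumptions introduced for this example.

```python
from statistics import mean, stdev

# Per-fold test RMSE values copied from the output above.
fold_rmse_test = [
    12841.046954241307, 8300.998118126668, 9070.19370539217,
    8782.315009046111, 9776.709419040835, 9212.774486275368,
    12448.17400704062, 8200.30156918181, 9797.408806699643,
    11501.193226668924,
]

mean_rmse = mean(fold_rmse_test)
spread_rmse = stdev(fold_rmse_test)  # sample standard deviation across folds

print(f"mean test RMSE over {len(fold_rmse_test)} folds: {mean_rmse:.1f}")
print(f"standard deviation across folds: {spread_rmse:.1f}")
```

The mean gives one estimate of the out-of-sample error, and the standard deviation indicates how much that estimate depends on which rows were held out.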


In [133]:
scores = cross_val_score(regressor, X_train, y_train, cv=10) # no scoring argument: a regressor is scored with R^2
scores

array([-0.14677915, -0.06747099, -0.16831508,  0.13520964, -0.04985047,
       -0.18935497, -0.02631869, -1.40687043,  0.29789286, -0.10280012])
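
These values are not RMSEs. When no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which for a regressor is the coefficient of determination R². A negative R² means the model's predictions on that fold were worse than always predicting the fold's mean target. The sketch below computes R² by hand on toy numbers to show how that happens. The helper `r2_by_hand` and the data are assumptions invented for illustration.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for this example.
y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the targets
mean_pred = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of y_true
bad_pred = [4.0, 3.0, 2.0, 1.0]    # reverses the trend

r2_good = r2_by_hand(y_true, good_pred)
r2_mean = r2_by_hand(y_true, mean_pred)
r2_bad = r2_by_hand(y_true, bad_pred)
print(r2_good, r2_mean, r2_bad)
```

The three cases land above zero, at zero, and below zero respectively. To obtain cross-validated errors on the same scale as the RMSE used elsewhere in this notebook, recent scikit-learn versions accept `scoring="neg_root_mean_squared_error"` in `cross_val_score`, which returns the negated RMSE for each fold.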