## Decision Tree Regression

First, let's import all the modules this notebook needs.

In [123]:
from preparingData import preparing_data # read data
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from math import sqrt # RMSE
from sklearn.metrics import mean_squared_error # error metric
from sklearn.model_selection import cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold # import KFold
# model
from sklearn.tree import DecisionTreeRegressor


Now we use a script we created to pre-process the data *(preparingData)*; this step includes transforming non-numeric data and handling the outliers. In addition, we standardize the features and split the data into a training set (80%) and a test set (20%).

In [124]:
X,y=preparing_data()
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

  


We use the function below to calculate the root mean squared error (RMSE).

In [125]:
def rmse(model, y, x):
    return sqrt(mean_squared_error(y, model.predict(x)))
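
As a quick sanity check of what this helper computes, here is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical and not taken from the dataset). It compares `sqrt(mean_squared_error(...))` with the formula written out by hand.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the mean squared error is 5 / 3
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of the MSE, in the same units as the target
rmse_value = sqrt(mse)

# The same quantity computed directly from its definition
manual = sqrt(np.mean((y_true - y_pred) ** 2))

print(mse, rmse_value, manual)
```

Because RMSE is expressed in the units of the target, the values reported below can be read directly as price errors.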


**Here, we create the regression model:**

In [126]:
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

Verifying the root mean square error in the training set:

In [127]:
rmse(regressor, y_train, X_train)

86.60302625104681

Verifying the root mean square error in the test set:

In [128]:
rmse(regressor, y_test, X_test)

3855.118118325681

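As expected, the error on the test set is greater than on the training set. The gap is large (about 87 versus 3855), which suggests that the unconstrained tree overfits the training data. Next, we perform 10-fold cross-validation and compute the error on each fold.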
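
To illustrate why a fully grown tree can show such a gap, here is a minimal sketch on synthetic data (the data, the helper `rmse_of`, and the choice `max_depth=3` are assumptions for illustration, not part of the original analysis). It fits one unconstrained tree and one shallow tree and prints the training and test RMSE of each.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption: stands in for the real dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=500)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def rmse_of(model, X, y):
    """Root mean squared error of model predictions on (X, y)."""
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# max_depth=None grows the tree until every leaf is pure;
# max_depth=3 is an arbitrary shallow setting used for comparison
results = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(Xtr, ytr)
    results[depth] = (rmse_of(tree, Xtr, ytr), rmse_of(tree, Xte, yte))
    print(depth, results[depth])
```

The unconstrained tree memorizes the training points, so its training error is essentially zero while its test error is not. Limiting the depth raises the training error and will often, though not always, narrow the gap.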



**Cross-validation:**

In [129]:
kf = KFold(n_splits=10) # Define the split - into 10 folds
kf.get_n_splits(X_scaled) # returns the number of splitting iterations in the cross-validator
print(kf)

KFold(n_splits=10, random_state=None, shuffle=False)


In [None]:
list_rmse_train = []
list_rmse_test = []
regressor = DecisionTreeRegressor(random_state=0)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    regressor.fit(X_train, y_train)
    list_rmse_train.append(rmse(regressor, y_train, X_train))
    list_rmse_test.append(rmse(regressor, y_test, X_test))


In [122]:
print(list_rmse_train)

[77.67813575158814, 90.4584955714768, 85.81509990711267, 68.01068546766068, 76.86666634106744, 91.6680768550825, 89.04396809364674, 89.96280088167087, 90.46196525692962, 90.6855181949743]


In [120]:
print(y)

0          8995
1         10888
2          8995
3         10999
4         14799
5          7989
6         14490
7         13995
8         10495
9          9995
10        12921
11        12000
12         7750
13        17628
14        13999
15        14995
16        14990
17        14590
18         9500
19         7990
20        16994
21        15499
22        13499
23        13999
24        14999
25        15995
26        14500
27        13995
28        16000
29        17419
          ...  
851994    42994
851995    41559
851997    43990
851998    31720
852002    42540
852005    47215
852010    45980
852011    33900
852014    45885
852015    42988
852016    41795
852017    33995
852019    44991
852021    44991
852022    46935
852023    32995
852024    44670
852028    25900
852029    33995
852030    26986
852038    46885
852043    42995
852048    38888
852051    35995
852052    35900
852076    37999
852089    44565
852108    45280
852111    46500
852112    46530
Name: Price, Length: 770

In [121]:
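# With no `scoring` argument, cross_val_score uses the estimator's default score,
# which is R^2 for a regressor (not RMSE); negative values are worse than predicting the mean.
scores = cross_val_score(regressor, X_train, y_train, cv=10)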
scores

array([-0.14677915, -0.06747099, -0.16831508,  0.13520964, -0.04985047,
       -0.18935497, -0.02631869, -1.40687043,  0.29789286, -0.10280012])
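
The scores above are R^2 values, so they are not directly comparable with the RMSE figures computed earlier. One possible way to obtain a per-fold RMSE from `cross_val_score` is to request the `"neg_mean_squared_error"` scorer and take the square root of its negation. The sketch below assumes synthetic data from `make_regression` and a shuffled `KFold`; neither is part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Shuffling guards against any ordering in the rows influencing the folds
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so the MSE is returned negated
neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Flip the sign and take the square root to get one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Reporting the mean and spread of the per-fold RMSE gives a more stable estimate of out-of-sample error than a single train/test split.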