## Predictive Modelling: Decision Tree Regression

In this section we will use Decision Tree Regression to predict the trip duration given the dataset prepared earlier. From the very beginning we will be using the truncated dataset in which the geographical distance between pickup and dropoff locations is smaller than 60 km, and the trip durations are limited to 6 hours.

In [1]:
import time
start_time = time.time()

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [3]:
#import the garbage collection module

import gc
gc.enable()

In [4]:
from sklearn.tree import DecisionTreeRegressor

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score




In [5]:
#import the module that shows the memory usage

import os, psutil

def usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()[0] / float(2 ** 20)
    
usage()

113.84375

In [6]:
df_modif = pd.read_csv('NYCTripDuration_modified.csv')

#df_modif.columns
df_modif = df_modif.drop(['Unnamed: 0','effective_speed(kmph)'], axis = 1)

usage()

269.234375

In [7]:
#shuffle the dataframe arbitrarily
df_modif.sample(frac=1, random_state = 20).reset_index(drop=True, inplace = True)

gc.collect()

7

In [8]:
# Create the function that calculates the accuracy of the predicted quantities.

def accur_func(y_pred, y):
    return 1.0 - (np.abs(y_pred - y)/np.abs(y))

In [9]:
# create the list containing the relevant columns
column_names = df_modif.drop('trip_duration(hrs)',axis =1).columns.values

In [10]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

In [11]:
gc.collect()
usage()

424.26171875

We will use the Decision Tree Regressor with default values of parameters. We see that the training score is 0.999, while the test score is 0.547. This confirms that Decision Tree regressor is prone to overfitting. The accuracy of predictions for mean and standard ceviation is very high, however, 
98.56 and 97.49 percent respectively.

In [12]:
%%time


dctreg = DecisionTreeRegressor(random_state = 50)

#X =  StandardScaler().fit_transform(X)
#y =  StandardScaler().fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

dctreg.fit(X_train, y_train)


y_train_pred = dctreg.predict(X_train)
y_test_pred = dctreg.predict(X_test)

# compute the coefficients of determination
dctreg_training_score = dctreg.score(X_train, y_train)
dctreg_test_score = dctreg.score(X_test, y_test)

y_test_mean_dct =  y_test.mean()
y_test_pred_mean_dct =  y_test_pred.mean()
y_test_mean_accur_dct = accur_func(y_test_pred_mean_dct, y_test_mean_dct)

y_test_std_dct =  y_test.std()
y_test_pred_std_dct =  y_test_pred.std()
y_test_std_accur_dct = accur_func(y_test_pred_std_dct, y_test_std_dct)

print('Training score:', dctreg_training_score)
print('Test score:', dctreg_test_score)

print('\n')
print('Actual mean of the test set:', y_test_mean_dct)
print('Predicted mean of the test set:', y_test_pred_mean_dct)
print('Accuracy of prediction of the mean:', y_test_mean_accur_dct)

print('\n')
print('Actual std of the test set:', y_test_std_dct)
print('Predicted std of the test set:', y_test_pred_std_dct)
print('Accuracy of prediction of the std:', y_test_std_accur_dct)
print('\n')


Training score: 0.999999916641
Test score: 0.547308294622


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.236139810602
Accuracy of prediction of the mean: 0.985626918719


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.18867155115
Accuracy of prediction of the std: 0.974901909026


Wall time: 1min 11s


Examining the relative importances of features (their sum is equal to one in this method), wee see that the most important feature is 'geographical_dist(km)', followed by 'pickup_hour' and 'dropoff_latitude'. One should pay attention to the importance of 'pickup_hour' feature; this feature had relatively low importance in the linear methods. 

In [13]:
# examining the feature importances

df_dctreg_coeff = pd.DataFrame(column_names, columns = ['COLUMN_NAME'])
df_dctreg_coeff['DCTREG_FEATURES'] = np.transpose(dctreg.feature_importances_)

df_dctreg_coeff.sort_values('DCTREG_FEATURES', ascending = False, inplace = True)
df_dctreg_coeff.reset_index(drop = True, inplace = True)

writer3 = pd.ExcelWriter('dctreg_features.xlsx')
df_dctreg_coeff.to_excel(writer3)
writer3.save()

df_dctreg_coeff

Unnamed: 0,COLUMN_NAME,DCTREG_COEFF
0,geographical_dist(km),0.651356
1,pickup_hour,0.073308
2,dropoff_latitude,0.051921
3,pickup_longitude,0.046204
4,dropoff_longitude,0.04517
5,pickup_latitude,0.035911
6,pickup_day_of_week,0.028917
7,pickup_minute,0.022425
8,pickup_day,0.021194
9,pickup_month,0.014274


In [14]:
del(X_train, X_test, y_train, y_test)
gc.collect()

usage()

557.38671875

## Predictive Modelling: Random Forest Regression

In this section we will use Decision Tree Regression to predict the trip duration. We again, from the very beginning we will be using the truncated dataset in which the geographical distance between pickup and dropoff locations is smaller than 60 km, and the trip durations are limited to 6 hours.

In [15]:
from sklearn.ensemble import RandomForestRegressor

In [16]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

The Random Forest Regressor is run with default values of parameters. We see that the training score is 0.958, while the test score is 0.764, which is much better than the result obtained using Decision Tree regressor. The accuracy of predictions for mean and standard ceviation are also acceptable, 98.51 and 90.47 percent respectively. Note that we did not use the scaling of features for both Decision Tree and Random Forest methods, because the accuracy is almost not affected by such a scaling.

In [17]:
%%time


rfreg = RandomForestRegressor(random_state = 50)

#X =  StandardScaler().fit_transform(X)
#y =  StandardScaler().fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

rfreg.fit(X_train, y_train)


y_train_pred = rfreg.predict(X_train)
y_test_pred = rfreg.predict(X_test)

# compute the coefficients of determination
rfreg_training_score = rfreg.score(X_train, y_train)
rfreg_test_score = rfreg.score(X_test, y_test)

y_test_mean_rf =  y_test.mean()
y_test_pred_mean_rf =  y_test_pred.mean()
y_test_mean_accur_rf = accur_func(y_test_pred_mean_rf, y_test_mean_rf)

y_test_std_rf =  y_test.std()
y_test_pred_std_rf =  y_test_pred.std()
y_test_std_accur_rf = accur_func(y_test_pred_std_rf, y_test_std_rf)

print('Training score:', rfreg_training_score)
print('Test score:', rfreg_test_score)


print('\n')
print('Actual mean of the test set:', y_test_mean_rf)
print('Predicted mean of the test set:', y_test_pred_mean_rf)
print('Accuracy of prediction of the mean:', y_test_mean_accur_rf)

print('\n')
print('Actual std of the test set:', y_test_std_rf)
print('Predicted std of the test set:', y_test_pred_std_rf)
print('Accuracy of prediction of the std:', y_test_std_accur_rf)
print('\n')

Training score: 0.958195278787
Test score: 0.764729328433


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.236244040919
Accuracy of prediction of the mean: 0.985179182154


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.166518983201
Accuracy of prediction of the std: 0.904737840703


Wall time: 8min 16s


The relative importance of features is the same as in the Decision Tree method, although numerical values for feature importances are slightly different.

In [18]:
# examining the coefficients

df_rfreg_coeff = pd.DataFrame(column_names, columns = ['COLUMN_NAME'])
df_rfreg_coeff['RFREG_FEATURES'] = np.transpose(rfreg.feature_importances_)

df_rfreg_coeff.sort_values('RFREG_FEATURES', ascending = False, inplace = True)
df_rfreg_coeff.reset_index(drop = True, inplace = True)

writer4 = pd.ExcelWriter('rfreg_features.xlsx')
df_rfreg_coeff.to_excel(writer4)
writer4.save()

df_rfreg_coeff

Unnamed: 0,COLUMN_NAME,RFREG_COEFF
0,geographical_dist(km),0.652705
1,pickup_hour,0.073319
2,dropoff_latitude,0.051691
3,dropoff_longitude,0.04645
4,pickup_longitude,0.045452
5,pickup_latitude,0.035374
6,pickup_day_of_week,0.029227
7,pickup_minute,0.021646
8,pickup_day,0.020097
9,pickup_month,0.014454


In [19]:
del(X_train, X_test, y_train, y_test)
gc.collect()

usage()

1308.12890625

The results are summarized in the table below. We see that applying Random Forest regression leads to relatively acceptable accuracy. One may think of applying PolynomialFeatures() preprocessing to introduce square interactions between the features and try to further increase the accuracy. However, this will require larger computational resources.

In [20]:
charac = ['Training score', 'Test score', 'Actual mean of the test set',
         'Predicted mean of the test set', 'Accuracy of prediction of the mean',
         'Actual std of the test set', 'Predicted std of the test set', 
          'Accuracy of prediction of the std']

values_dct = [dctreg_training_score, dctreg_test_score, y_test_mean_dct, y_test_pred_mean_dct,
             y_test_mean_accur_dct, y_test_std_dct,
             y_test_pred_std_dct, y_test_std_accur_dct ]

values_rf = [rfreg_training_score, rfreg_test_score, y_test_mean_rf, y_test_pred_mean_rf,
             y_test_mean_accur_rf, y_test_std_rf,
             y_test_pred_std_rf, y_test_std_accur_rf ]

comptable_tree = pd.DataFrame({'Characteristics': charac, 'Decision Tree': values_dct, 
                                'Random Forest': values_rf})


writer20 = pd.ExcelWriter('comptable_tree.xlsx')
comptable_tree.to_excel(writer20)
writer20.save()

comptable_tree.to_csv('comptable_tree.csv')

comptable_tree

Unnamed: 0,Characteristics,Decision Tree,Random Forest
0,Training score,1.0,0.958195
1,Test score,0.547308,0.764729
2,Actual mean of the test set,0.232794,0.232794
3,Predicted mean of the test set,0.23614,0.236244
4,Accuracy of prediction of the mean,0.985627,0.985179
5,Actual std of the test set,0.184052,0.184052
6,Predicted std of the test set,0.188672,0.166519
7,Accuracy of prediction of the std,0.974902,0.904738


In [21]:
print("--- Time to execute the program is {} minutes ---".format((time.time() - start_time)/60))

--- Time to execute the program is 10.178076072533925 minutes ---
