## Predictive Modelling: Elastic Net Regression

In this section we will use Elastic Net Regression to predict the trip duration given the dataset prepared earlier. We use this type of regerssion because of its ability to combine Lasso and Ridge regressions.

In [14]:
import time
start_time = time.time()

In [15]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [16]:
#import the garbage collection module

import gc
gc.enable()

In [17]:
from sklearn.linear_model import ElasticNet

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import PolynomialFeatures


In [18]:
#import the module that shows the memory usage

import os, psutil

def usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()[0] / float(2 ** 20)
    
usage()

610.96875

In [19]:
df_modif = pd.read_csv('NYCTripDuration_modified.csv')
usage()

621.140625

After creating the dataframe from the 'NYCTripDuration_modified.csv' file prepared earlier, we drop the 'Unnamed: 0' and 'effective_speed(kmph)' columns, and shuffle the dataframe arbitrarily.

In [20]:
#df_modif.columns
df_modif = df_modif.drop(['Unnamed: 0','effective_speed(kmph)'], axis = 1)

#df_modif.head()
usage()

599.01171875

In [21]:
#shuffle the dataframe arbitrarily
df_modif.sample(frac=1, random_state = 20).reset_index(drop=True, inplace = True)

gc.collect()

154

We then write the function that calculates the accuracy of the predicted quantities. This function will be used to compute the accuracies of the predicted mean and standard deviation of the continuous target vafriable.

In [22]:
def accur_func(y_pred, y):
    return 1.0 - (np.abs(y_pred - y)/np.abs(y))

After that, we split the dataset into arrays of feature and target variables, and create the list of column names. There are 1450573 instances at this point.

In [23]:
X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

# create the list containing the relevant columns
column_names = df_modif.drop('trip_duration(hrs)',axis =1).columns.values

n_instances = len(y)
print(n_instances)

1450573


After running ElasticNet Regression on the original dataset (with the values alpha = 0.01 and l1_ratio = 0.5), we obtain miserable results. Not only the coefficient of determination is low (0.527), the accuracy of prediction of the standard deviation for the target variable is also low (72.03 percent) (although mean was predicted well enough). This is due to the fact that this method is very susceptible to outliers. Indeed, the target variable 'trip_duration(hrs)', as well as on of the feature variables 'geographical_dist(km)' are very broadly distributed.  

In [24]:
%%time

elnetreg = ElasticNet(alpha=0.01, l1_ratio=0.5, random_state=50, selection='random')

#X =  StandardScaler().fit_transform(X)
#y =  StandardScaler().fit_transform(y)

#adding polynomial features
#X = PolynomialFeatures().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

elnetreg.fit(X_train, y_train)


y_train_pred = elnetreg.predict(X_train)
y_test_pred = elnetreg.predict(X_test)

elnetreg_training_score = elnetreg.score(X_train, y_train)
elnetreg_test_score = elnetreg.score(X_test, y_test)

y_test_mean_elnet =  y_test.mean()
y_test_pred_mean_elnet =  y_test_pred.mean()
y_test_mean_accur_elnet = accur_func(y_test_pred_mean_elnet, y_test_mean_elnet)

y_test_std_elnet =  y_test.std()
y_test_pred_std_elnet =  y_test_pred.std()
y_test_std_accur_elnet = accur_func(y_test_pred_std_elnet, y_test_std_elnet)

print('Training score:', elnetreg_training_score)
print('Test score:', elnetreg_test_score)

print('\n')
print('Actual mean of the test set:', y_test_mean_elnet)
print('Predicted mean of the test set:', y_test_pred_mean_elnet)
print('Accuracy of prediction of the mean:', y_test_mean_accur_elnet)

print('\n')
print('Actual std of the test set:', y_test_std_elnet)
print('Predicted std of the test set:', y_test_pred_std_elnet)
print('Accuracy of prediction of the std:', y_test_std_accur_elnet)
print('\n')

Training score: 0.482870812746
Test score: 0.527476050547


Actual mean of the test set: 0.233571084792
Predicted mean of the test set: 0.233850228374
Accuracy of prediction of the mean: 0.998804888105


Actual std of the test set: 0.195445086378
Predicted std of the test set: 0.140797277942
Accuracy of prediction of the std: 0.720393029833


Wall time: 4.38 s


In [25]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()

usage()

300.31640625

We can try to improve the accuracy of predictions by truncating our dataset. Namely, let's remove the instances where 'geographical_dist(km)' is greater than 60 km and those that last more than 6 hours. 

Once the dataset is truncated, we run Elastic Net once again using 3-fold cross-validation to determine the optimal values of parameters alpha (possible values are 0.0001, 0.001, 0.01 and 0.1) and l1_ratio (possible values are 0.1, 0.3 and 0.8). We see that the accuracy of the model has increased considerably; the test score is 0.597, while the mean and std of the target variable are predicted with accuracy 99.93 and 77.31 percent respectively.  

In [26]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

In [27]:
%%time

elnetreg = ElasticNet(random_state=50, selection='random')

# use the StandardScaler
X =  StandardScaler().fit_transform(X)
scaler = StandardScaler().fit(y)
y =  scaler.transform(y)

#adding polynomial features
#X = PolynomialFeatures().fit_transform(X)

# parameters fitted during cross-validation
params = {"alpha": [0.0001, 0.001, 0.01, 0.1], "l1_ratio": [0.1, 0.3, 0.8]}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

cv_elnetreg = GridSearchCV(elnetreg, param_grid = params, cv = 3)
cv_elnetreg.fit(X_train, y_train)

print('The value of best parameters are:',cv_elnetreg.best_params_)


y_train_pred = cv_elnetreg.predict(X_train)
y_test_pred = cv_elnetreg.predict(X_test)

# compute the coefficients of determination
elnetreg_training_score = cv_elnetreg.score(X_train, y_train)
elnetreg_test_score = cv_elnetreg.score(X_test, y_test)

#scale back the target variable
y_test = scaler.inverse_transform(y_test)
y_test_pred = scaler.inverse_transform(y_test_pred)

y_test_mean_elnet =  y_test.mean()
y_test_pred_mean_elnet =  y_test_pred.mean()
y_test_mean_accur_elnet = accur_func(y_test_pred_mean_elnet, y_test_mean_elnet)

y_test_std_elnet =  y_test.std()
y_test_pred_std_elnet =  y_test_pred.std()
y_test_std_accur_elnet = accur_func(y_test_pred_std_elnet, y_test_std_elnet)

print('Training score:', elnetreg_training_score)
print('Test score:', elnetreg_test_score)

print('\n')
print('Actual mean of the test set:', y_test_mean_elnet)
print('Predicted mean of the test set:', y_test_pred_mean_elnet)
print('Accuracy of prediction of the mean:', y_test_mean_accur_elnet)

print('\n')
print('Actual std of the test set:', y_test_std_elnet)
print('Predicted std of the test set:', y_test_pred_std_elnet)
print('Accuracy of prediction of the std:', y_test_std_accur_elnet)
print('\n')



The value of best parameters are: {'alpha': 0.0001, 'l1_ratio': 0.1}
Training score: 0.598602025012
Test score: 0.596641050462


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.232950521004
Accuracy of prediction of the mean: 0.999326978479


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.142300756827
Accuracy of prediction of the std: 0.773154369475


Wall time: 2min 24s


Let's examine the weights after running the regression. The most important feature (highest weight in absolute value) is 'geographical_dist(km)', as expected. The second and third most important features are 'dropoff_latitude' and 'dropoff_longitude' respectively.

In [28]:
# examining the coefficients

elnetreg_coeff = ElasticNet(alpha = cv_elnetreg.best_params_['alpha'],
                            l1_ratio = cv_elnetreg.best_params_['l1_ratio'],
    random_state=50, selection='random').fit(X_train, y_train).coef_


df_elnetreg_coeff = pd.DataFrame(column_names, columns = ['COLUMN_NAME'])
df_elnetreg_coeff['ELNETREG_COEFF'] = np.transpose(elnetreg_coeff)

df_elnetreg_coeff.sort_values('ELNETREG_COEFF', ascending = False, inplace = True)
df_elnetreg_coeff.reset_index(drop = True, inplace = True)

writer1 = pd.ExcelWriter('elnetreg_features.xlsx')
df_elnetreg_coeff.to_excel(writer1)
writer1.save()

df_elnetreg_coeff

Unnamed: 0,COLUMN_NAME,ELNETREG_COEFF
0,geographical_dist(km),0.803694
1,pickup_month,0.046158
2,pickup_latitude,0.043823
3,pickup_hour,0.039277
4,pickup_day,0.007016
5,passenger_count,0.006895
6,store_and_fwd_flag,0.005766
7,vendor_id,-0.000806
8,pickup_minute,-0.004201
9,pickup_longitude,-0.031822


In [29]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()
usage()

302.265625

Trying to improve the accuracy, let's drop the least important features: 'pickup_day', 'passenger_count', 'store_and_fwd_flag' 	'vendor_id' and 'pickup_minute', and use the PolynomialFeatures() to generate the squared number of the remaining features. We use the parameters alpha = 0.0001 and l1_ratio = 0.1. We see that the accuracy has increased although not as much as we hoped; the test score is 0.649, while the mean and std of the target variable are predicted with accuracy 99.91 and 80.72 percent respectively. This level of accuracy can hardly be regarded acceptable. It is possible that the accuracy can be improved further by generating cubic interactions between the features. However, this approach would require large computational resources, and we think it is better to try other methods. 

In [30]:
df_modif = df_modif.drop(['pickup_day', 'passenger_count', 'store_and_fwd_flag',
                          'vendor_id','pickup_minute'], axis =1)


X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

In [31]:
%%time

elnetreg = ElasticNet(alpha=0.0001, l1_ratio=0.1, random_state=50, selection='random')

#adding polynomial features
X = PolynomialFeatures().fit_transform(X)

#doing the scaling
X =  StandardScaler().fit_transform(X)
scaler = StandardScaler().fit(y)
y =  scaler.transform(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

elnetreg.fit(X_train, y_train)


y_train_pred = elnetreg.predict(X_train)
y_test_pred = elnetreg.predict(X_test)

elnetreg_training_score = elnetreg.score(X_train, y_train)
elnetreg_test_score = elnetreg.score(X_test, y_test)

#scale back the target variable
y_test = scaler.inverse_transform(y_test)
y_test_pred = scaler.inverse_transform(y_test_pred)

y_test_mean_elnet =  y_test.mean()
y_test_pred_mean_elnet =  y_test_pred.mean()
y_test_mean_accur_elnet = accur_func(y_test_pred_mean_elnet, y_test_mean_elnet)

y_test_std_elnet =  y_test.std()
y_test_pred_std_elnet =  y_test_pred.std()
y_test_std_accur_elnet = accur_func(y_test_pred_std_elnet, y_test_std_elnet)

print('Training score:', elnetreg_training_score)
print('Test score:', elnetreg_test_score)

print('\n')
print('Actual mean of the test set:', y_test_mean_elnet)
print('Predicted mean of the test set:', y_test_pred_mean_elnet)
print('Accuracy of prediction of the mean:', y_test_mean_accur_elnet)

print('\n')
print('Actual std of the test set:', y_test_std_elnet)
print('Predicted std of the test set:', y_test_pred_std_elnet)
print('Accuracy of prediction of the std:', y_test_std_accur_elnet)
print('\n')



Training score: 0.651995467025
Test score: 0.64910691301


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.232984344321
Accuracy of prediction of the mean: 0.999181685477


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.148579306352
Accuracy of prediction of the std: 0.807267244959


Wall time: 10min 40s


In [32]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()
usage()

125.0546875

## Predictive Modelling: Stochastic Gradient Decsent

In this section we will use Stochastic Gradient Descent Regression to make the prediction of the trip duration. 

In [33]:
from sklearn.linear_model import SGDRegressor

In [34]:
df_modif = pd.read_csv('NYCTripDuration_modified.csv')

df_modif = df_modif.drop(['Unnamed: 0','effective_speed(kmph)'], axis = 1)

df_modif.sample(frac=1, random_state = 20).reset_index(drop=True, inplace = True)

gc.collect()
usage()

280.3671875

In [35]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

In [36]:
X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

We run the Stochastic Gradient Descent regression with the optimal (not constant) learning rate, and use 3-fold cross-validation to determine the best values of parameters alpha (possible values are 0.0001, 0.001, 0.01, 0.1 and 1.0) and l1_ratio (possible values are 0.1, 0.3 and 0.8). We see that the accuracy of the model is not very high; the test score is 0.596, while the mean and std of the target variable are predicted with accuracy 99.82 and 75.47 percent respectively.  

In [37]:
%%time

sgdreg = SGDRegressor(learning_rate = "optimal", random_state=50, 
                      eta0 = 0.01, penalty = "elasticnet")


# use the StandardScaler
X =  StandardScaler().fit_transform(X)
scaler = StandardScaler().fit(y)
y =  scaler.transform(y)

#adding polynomial features
#X = PolynomialFeatures(2).fit_transform(X)

# parameters fitted during cross-validation
params = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.3, 0.8]}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

cv_sgdreg = GridSearchCV(sgdreg, param_grid = params, cv = 3)
cv_sgdreg.fit(X_train, y_train)

print('The value of best parameters are:',cv_sgdreg.best_params_)

y_train_pred = cv_sgdreg.predict(X_train)
y_test_pred = cv_sgdreg.predict(X_test)

# compute the coefficients of determination
sgdreg_training_score = cv_sgdreg.score(X_train, y_train)
sgdreg_test_score = cv_sgdreg.score(X_test, y_test)

#scale back the target variable
y_test = scaler.inverse_transform(y_test)
y_test_pred = scaler.inverse_transform(y_test_pred)

y_test_mean_sgd =  y_test.mean()
y_test_pred_mean_sgd =  y_test_pred.mean()
y_test_mean_accur_sgd = accur_func(y_test_pred_mean_sgd, y_test_mean_sgd)

y_test_std_sgd =  y_test.std()
y_test_pred_std_sgd =  y_test_pred.std()
y_test_std_accur_sgd = accur_func(y_test_pred_std_sgd, y_test_std_sgd)

print('Training score:', sgdreg_training_score)
print('Test score:', sgdreg_test_score)

print('\n')
print('Actual mean of the test set:', y_test_mean_sgd)
print('Predicted mean of the test set:', y_test_pred_mean_sgd)
print('Accuracy of prediction of the mean:', y_test_mean_accur_sgd)

print('\n')
print('Actual std of the test set:', y_test_std_sgd)
print('Predicted std of the test set:', y_test_pred_std_sgd)
print('Accuracy of prediction of the std:', y_test_std_accur_sgd)
print('\n')





The value of best parameters are: {'alpha': 0.01, 'l1_ratio': 0.1}
Training score: 0.598166445123
Test score: 0.596248681381


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.232380234738
Accuracy of prediction of the mean: 0.998223273488


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.138914834759
Accuracy of prediction of the std: 0.754757837369


Wall time: 3min 12s


As in the case of linear regression, most important feature is 'geographical_dist(km)', as expected. The second and third most important features are 'dropoff_latitude' and 'dropoff_longitude' respectively.

In [38]:
# examining the coefficients

sgdreg_coeff = SGDRegressor(alpha = cv_sgdreg.best_params_['alpha'],
                            l1_ratio = cv_sgdreg.best_params_['l1_ratio'],
                     learning_rate = "optimal", random_state=50, 
                      eta0 = 0.1, penalty = "elasticnet").fit(X_train, y_train).coef_


df_sgdreg_coeff = pd.DataFrame(column_names, columns = ['COLUMN_NAME'])
df_sgdreg_coeff['SGDREG_COEFF'] = np.transpose(sgdreg_coeff)

df_sgdreg_coeff.sort_values('SGDREG_COEFF', ascending = False, inplace = True)
df_sgdreg_coeff.reset_index(drop = True, inplace = True)

writer14 = pd.ExcelWriter('sgdreg_features.xlsx')
df_sgdreg_coeff.to_excel(writer14)
writer14.save()

df_sgdreg_coeff

Unnamed: 0,COLUMN_NAME,SGDREG_COEFF
0,geographical_dist(km),0.780251
1,pickup_month,0.042058
2,pickup_hour,0.039743
3,pickup_latitude,0.036967
4,passenger_count,0.005884
5,pickup_day,0.005792
6,store_and_fwd_flag,0.003663
7,vendor_id,0.0
8,pickup_minute,-0.005859
9,pickup_longitude,-0.024414


In [39]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()
usage()

283.65625

Trying to improve the accuracy, we attempted to drop the least important features: 'pickup_day', 'passenger_count', 'store_and_fwd_flag', 'vendor_id' and 'pickup_minute', and use the PolynomialFeatures() to generate the squared number of the remaining features. However, doing this resulted in the learning procedure picking up unstable solution. We also even tried to go to cubic interaction and run the SGD regressor on smaller dataset due to memory constraint; the result was the same -- conversgence to wrong solution resulting in negative coefficient of dtermination.
Thus, we abandon attempts to improve the accuracy further using SGD. 

The results of two methods are summarized in the table below. The conclusion one can make looking at the results is that it is better to use the nonlinear methods in an attempt to improve the accuracy of predictions. So, let's consider the Decision Tree and Random Forest regressions.

In [40]:
charac = ['Training score', 'Test score', 'Actual mean of the test set',
         'Predicted mean of the test set', 'Accuracy of prediction of the mean',
         'Actual std of the test set', 'Predicted std of the test set', 
          'Accuracy of prediction of the std']

values_elnet = [elnetreg_training_score, elnetreg_test_score, y_test_mean_elnet, 
                y_test_pred_mean_elnet,
             y_test_mean_accur_elnet, y_test_std_elnet,
             y_test_pred_std_elnet, y_test_std_accur_elnet ]

values_sgd = [sgdreg_training_score, sgdreg_test_score, y_test_mean_sgd, y_test_pred_mean_sgd,
             y_test_mean_accur_sgd, y_test_std_sgd,
             y_test_pred_std_sgd, y_test_std_accur_sgd ]

comptable_linear = pd.DataFrame({'Characteristics': charac, 'Elastic Net': values_elnet, 
                                'Stochastic Grad. Descent': values_sgd})


writer26 = pd.ExcelWriter('comptable_linear.xlsx')
comptable_linear.to_excel(writer26)
writer26.save()

comptable_linear.to_csv('comptable_linear.csv')

comptable_linear

Unnamed: 0,Characteristics,Elastic Net,Stochastic Grad. Descent
0,Training score,0.651995,0.598166
1,Test score,0.649107,0.596249
2,Actual mean of the test set,0.232794,0.232794
3,Predicted mean of the test set,0.232984,0.23238
4,Accuracy of prediction of the mean,0.999182,0.998223
5,Actual std of the test set,0.184052,0.184052
6,Predicted std of the test set,0.148579,0.138915
7,Accuracy of prediction of the std,0.807267,0.754758


In [41]:
print("--- Time to execute the program is {} minutes ---".format((time.time() - start_time)/60))

--- Time to execute the program is 17.46738317410151 minutes ---
