## Predictive Modelling Multilayer Perceptron

In this section we will use the Multilayer Perceptron Regression to predict the trip duration given the dataset prepared earlier. From the very beginning we will be using the truncated dataset in which the geographical distance between pickup and dropoff locations is smaller than 60 km, and the trip durations are limited to 6 hours.

In [1]:
import time
start_time = time.time()

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [3]:
#import the garbage collection module

import gc
gc.enable()

In [4]:
from sklearn.neural_network import MLPRegressor

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

#from sklearn.model_selection import GridSearchCV

#from sklearn.preprocessing import PolynomialFeatures



In [5]:
#import the module that shows the memory usage

import os, psutil

def usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()[0] / float(2 ** 20)
    
usage()

113.12109375

In [6]:
df_modif = pd.read_csv('NYCTripDuration_modified.csv')

#df_modif.columns
df_modif = df_modif.drop(['Unnamed: 0','effective_speed(kmph)'], axis = 1)

usage()

268.328125

In [7]:
#shuffle the dataframe arbitrarily
df_modif.sample(frac=1, random_state = 20).reset_index(drop=True, inplace = True)

gc.collect()

7

In [8]:
# Create the function that calculates the accuracy of the predicted quantities.

def accur_func(y_pred, y):
    return 1.0 - (np.abs(y_pred - y)/np.abs(y))

In [9]:
# create the list containing the relevant columns
column_names = df_modif.drop('trip_duration(hrs)',axis =1).columns.values

In [10]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

In [11]:
gc.collect()
usage()

423.33203125

Let us start with the regressor in which there is one hidden layer with 100 elements (default value) and each element has the rectified linear unit (ReLU) activation function. The regularization parameter alpha is set to 0.001 which is also the default value. We use the default 'adam' solver, a kind of stochastic gradient descent based optimizer. The results show that despite the relatively large running time, MLP Regressor gives the coefficient of determination equal to 0.769, while the mean and standard deviations are predicted with the accuracies of 99.46 and 85.08 percent respectively.  

In [12]:
%%time


mlpreg = MLPRegressor(alpha = 0.0001, hidden_layer_sizes=(100, ), activation='relu',
                      learning_rate = 'adaptive', learning_rate_init=0.001)

# use the StandardScaler
X =  StandardScaler().fit_transform(X)
scaler = StandardScaler().fit(y)
y =  scaler.transform(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

mlpreg.fit(X_train, y_train)


y_train_pred = mlpreg.predict(X_train)
y_test_pred = mlpreg.predict(X_test)

# compute the coefficients of determination
mlpreg_training_score = mlpreg.score(X_train, y_train)
mlpreg_test_score = mlpreg.score(X_test, y_test)

#scale back the target variable
y_test = scaler.inverse_transform(y_test)
y_test_pred = scaler.inverse_transform(y_test_pred)


y_test_mean_mlp =  y_test.mean()
y_test_pred_mean_mlp =  y_test_pred.mean()
y_test_mean_accur_mlp = accur_func(y_test_pred_mean_mlp, y_test_mean_mlp)

y_test_std_mlp =  y_test.std()
y_test_pred_std_mlp =  y_test_pred.std()
y_test_std_accur_mlp = accur_func(y_test_pred_std_mlp, y_test_std_mlp)


print('Training score:', mlpreg_training_score)
print('Test score:', mlpreg_test_score)


print('\n')
print('Actual mean of the test set:', y_test_mean_mlp)
print('Predicted mean of the test set:', y_test_pred_mean_mlp)
print('Accuracy of prediction of the mean:', y_test_mean_accur_mlp)

print('\n')
print('Actual std of the test set:', y_test_std_mlp)
print('Predicted std of the test set:', y_test_pred_std_mlp)
print('Accuracy of prediction of the std:', y_test_std_accur_mlp)
print('\n')



Training score: 0.76889132659
Test score: 0.765317259036


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.23311303342
Accuracy of prediction of the mean: 0.998628882637


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.157547640479
Accuracy of prediction of the std: 0.855994369618


Wall time: 6min 13s


In [13]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()

0

We see that the MLP Regressor shows reasonably good results. However, can we improve the results further, for example, by adding one more hidden layer and changing the activation function for the elements in layers? Let's add one more layer with 100 elements, and employ the 'tanh' activation function. The motivation for the latter step is that the 'tanh' activation function, contrary to the 'relu' one, does not nullify the negative inputs to layers that are possible given the substantial number of negative weights seen in the Elastic Net regression study. As a result of changing the activation function and adding the second layer, we are able to reach the coefficient of determination equal to 0.798 and the accuracy of predicting the mean and std equal to 99.80 and 87.94 percent respectively. We see also that the amount of time necessary to run the computations nearly quadrupled.

In [14]:
df_modif = df_modif[(df_modif['geographical_dist(km)'] < 60) & \
                    (df_modif['trip_duration(hrs)'] < 6)]

X = df_modif.drop('trip_duration(hrs)', axis = 1).values

y = df_modif['trip_duration(hrs)'].values

In [15]:
%%time


mlpreg = MLPRegressor(alpha = 0.0001, hidden_layer_sizes=(100, 100, ), activation='tanh',
                      learning_rate = 'adaptive', learning_rate_init=0.001)

# use the StandardScaler
X =  StandardScaler().fit_transform(X)
scaler = StandardScaler().fit(y)
y =  scaler.transform(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 45, 
                                                    stratify = None)

mlpreg.fit(X_train, y_train)


y_train_pred = mlpreg.predict(X_train)
y_test_pred = mlpreg.predict(X_test)

# compute the coefficients of determination
mlpreg_training_score = mlpreg.score(X_train, y_train)
mlpreg_test_score = mlpreg.score(X_test, y_test)

#scale back the target variable
y_test = scaler.inverse_transform(y_test)
y_test_pred = scaler.inverse_transform(y_test_pred)


y_test_mean_mlp =  y_test.mean()
y_test_pred_mean_mlp =  y_test_pred.mean()
y_test_mean_accur_mlp = accur_func(y_test_pred_mean_mlp, y_test_mean_mlp)

y_test_std_mlp =  y_test.std()
y_test_pred_std_mlp =  y_test_pred.std()
y_test_std_accur_mlp = accur_func(y_test_pred_std_mlp, y_test_std_mlp)


print('Training score:', mlpreg_training_score)
print('Test score:', mlpreg_test_score)


print('\n')
print('Actual mean of the test set:', y_test_mean_mlp)
print('Predicted mean of the test set:', y_test_pred_mean_mlp)
print('Accuracy of prediction of the mean:', y_test_mean_accur_mlp)

print('\n')
print('Actual std of the test set:', y_test_std_mlp)
print('Predicted std of the test set:', y_test_pred_std_mlp)
print('Accuracy of prediction of the std:', y_test_std_accur_mlp)
print('\n')



Training score: 0.805849008171
Test score: 0.798172035233


Actual mean of the test set: 0.232793845736
Predicted mean of the test set: 0.235460139889
Accuracy of prediction of the mean: 0.988546543639


Actual std of the test set: 0.18405219248
Predicted std of the test set: 0.164005113119
Accuracy of prediction of the std: 0.891079377589


Wall time: 44min 41s


In [16]:
del(X, y, X_train, X_test, y_train, y_test)
gc.collect()

usage()

93.97265625

The results of using Multilayer Perceptron Regression are summarized below. It is possible that they can be further improved by increasing the sizes of layers, as well as the number of elements in them. It is possible that one needs to simply find the appropriate relation between these two parameters as a result of fine tuning. This will require, however, more powerfull computational resources, and the necessity to parallelize the process of learning which is beyond the scope of this project.

In [17]:
charac = ['Training score', 'Test score', 'Actual mean of the test set',
         'Predicted mean of the test set', 'Accuracy of prediction of the mean',
         'Actual std of the test set', 'Predicted std of the test set', 
          'Accuracy of prediction of the std']

values_mlp = [mlpreg_training_score, mlpreg_test_score, y_test_mean_mlp, y_test_pred_mean_mlp,
             y_test_mean_accur_mlp, y_test_std_mlp,
             y_test_pred_std_mlp, y_test_std_accur_mlp ]



comptable_neuron = pd.DataFrame({'Characteristics': charac, 'Multilayer Perceptron': values_mlp})


writer30 = pd.ExcelWriter('comptable_neuron.xlsx')
comptable_neuron.to_excel(writer30)
writer30.save()

comptable_neuron.to_csv('comptable_neuron.csv')

comptable_neuron

Unnamed: 0,Characteristics,Mulilayer Perceptron
0,Training score,0.805849
1,Test score,0.798172
2,Actual mean of the test set,0.232794
3,Predicted mean of the test set,0.23546
4,Accuracy of prediction of the mean,0.988547
5,Actual std of the test set,0.184052
6,Predicted std of the test set,0.164005
7,Accuracy of prediction of the std,0.891079


In [18]:
print("--- Time to execute the program is {} minutes ---".format((time.time() - start_time)/60))

--- Time to execute the program is 52.11184963782628 minutes ---
