# Feature Engineering
## Objectives:
* Use Deep Feature Synthesis (DFS) to automatically generate rich features from a relational dataset. Starting with simple features, incrementally improve the feature definitions and examine the accuracy of the model.
* Feature stacking: more robust features can be built by extracting related instances from a parent entity, aggregating them, and adding the results as additional columns of the entity's feature matrix. For example, in this case study we may want to take the trips made over a weekend to a given set of destinations and apply an aggregate function such as “mean()” to the trip duration column. Stacking information in this way exploits feature relationships at deeper levels of a multi-table dataset and should, in theory, improve the model's accuracy.
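
As a rough illustration of the stacking idea, the sketch below filters weekend trips, averages their durations per drop-off neighborhood, and attaches that mean to every trip as an extra column. The rows and the helper name `stack_weekend_mean` are made up for the example; this is plain Python, not how Featuretools implements it.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy rows: (trip id, pickup time, drop-off neighborhood, duration in seconds)
trips = [
    (1, datetime(2016, 1, 2, 10, 0), "AA", 600.0),   # Saturday
    (2, datetime(2016, 1, 3, 11, 0), "AA", 900.0),   # Sunday
    (3, datetime(2016, 1, 4, 9, 0), "AA", 300.0),    # Monday
    (4, datetime(2016, 1, 9, 14, 0), "C", 450.0),    # Saturday
]


def stack_weekend_mean(rows):
    """Append the mean weekend trip duration of each row's drop-off neighborhood."""
    durations = defaultdict(list)
    for _, pickup, dropoff, duration in rows:
        if pickup.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            durations[dropoff].append(duration)
    means = {hood: sum(vals) / len(vals) for hood, vals in durations.items()}
    # The aggregate computed on the group becomes a new column on every trip.
    return [row + (means.get(row[2]),) for row in rows]


feature_rows = stack_weekend_mean(trips)
for row in feature_rows:
    print(row)
```

The Monday trip still receives the weekend mean of its neighborhood, because the aggregate describes the parent (the neighborhood), not the individual trip.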

## Steps: 
1) Transform Primitive: Build Feature Matrix with 1 primitive 
  (1 primitive: weekend)

2) Train the model using Gradient Boosting Regressor, predict the results, evaluate RMSE, examine how important 
   each feature is for the model 
    
3) Adding more Transform Primitives: Build Feature Matrix
  (7 primitives: Minute, Hour, Day, Week, Month, Weekday, Weekend)

4) Train the model using Gradient Boosting Regressor, comparing results with 2)
  
5) Adding Aggregation Primitives: Build Feature Matrix
   (7 primitives: Count, Sum, Mean, Median, Std, Max, Min) 

6) Train the model using Gradient Boosting Regressor, comparing results with 2) & 4)

7) Apply k-fold Cross Validation 

Scenario: Build a predictive model that predicts the duration of a taxi ride.

### Datasets: 
##### Read in 3 CSV files
pickup_neighborhoods: longitude and latitude of each passenger pickup area code

Matrix of Features: 51 x 3


dropoff_neighborhoods: longitude and latitude of each passenger dropoff area code

Matrix of Features: 180 x 14


trips: information on each trip, with a unique id per trip

Matrix of Features: 1,020,004 x 15

In [1]:
import numpy as np
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features, preview, feature_importances
from sklearn.ensemble import GradientBoostingRegressor
from featuretools.primitives import (Minute, Hour, Day, Week, Month,
                                     Weekday, Weekend, Count, Sum, Mean, Median, Std, Min, Max)

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error as mse
import math

In [2]:
trips, pickup_neighborhoods, dropoff_neighborhoods = load_nyc_taxi_data()
preview(trips, 10)  # return the n rows with the fewest null values

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,trip_duration,pickup_neighborhood,dropoff_neighborhood
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,-73.95005,40.787312,2,372.0,AH,C
672146,672146,1,2016-04-29 07:01:31,2016-04-29 07:15:46,1,3.3,-73.949951,40.784653,-73.982536,40.75547,1,855.0,C,AA
672147,672147,2,2016-04-29 07:01:43,2016-04-29 07:09:15,1,1.14,-73.967331,40.75737,-73.954277,40.765282,1,452.0,N,K
672148,672148,1,2016-04-29 07:01:46,2016-04-29 07:07:54,1,1.1,-74.003082,40.727509,-73.984703,40.724377,1,368.0,AB,AC
672149,672149,2,2016-04-29 07:01:46,2016-04-29 07:06:48,2,1.4,-73.990158,40.77235,-73.982147,40.7598,1,302.0,AR,AA
672150,672150,1,2016-04-29 07:01:59,2016-04-29 07:07:33,1,1.2,-73.983681,40.746677,-73.971703,40.762463,2,334.0,AO,A
672151,672151,2,2016-04-29 07:02:11,2016-04-29 07:15:24,2,2.13,-73.994209,40.750999,-73.969391,40.761539,1,793.0,D,AK
672152,672152,1,2016-04-29 07:02:11,2016-04-29 07:06:44,1,1.0,-73.983276,40.770985,-73.98011,40.760666,1,273.0,AR,A
672153,672153,2,2016-04-29 07:02:13,2016-04-29 07:08:36,1,1.17,-73.980141,40.743168,-73.983391,40.754665,1,383.0,Y,AA
672154,672154,1,2016-04-29 07:02:16,2016-04-29 07:04:07,1,0.5,-73.965973,40.765381,-73.970558,40.758724,1,111.0,AK,N


In [3]:
# check: keep only the rows that contain at least one NaN
df = trips
df = df[df.isnull().any(axis = 1)]

## Create entities and relationships
### Parent entities: pickup_neighborhoods, dropoff_neighborhoods
### Child entities: trips  
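
Conceptually, a parent-child relationship is a key lookup: each trip stores the id of its neighborhood, and the parent's columns can be copied onto the trip through that key. The sketch below shows this with plain dictionaries; the coordinates are rounded, hypothetical values and `add_parent_columns` is an illustrative helper, not a Featuretools function.

```python
# Hypothetical parent table: one row per neighborhood, keyed by neighborhood_id.
pickup_neighborhoods = {
    "AH": {"latitude": 40.80, "longitude": -73.96},
    "C": {"latitude": 40.78, "longitude": -73.95},
}

# Hypothetical child table: each trip refers to its parent through pickup_neighborhood.
trips = [
    {"id": 0, "pickup_neighborhood": "AH", "trip_distance": 1.32},
    {"id": 1, "pickup_neighborhood": "C", "trip_distance": 3.30},
]


def add_parent_columns(children, parents, key, prefix):
    """Copy every parent column onto the child rows that reference that parent."""
    enriched = []
    for child in children:
        row = dict(child)  # copy, so the original child rows stay untouched
        parent = parents[child[key]]
        for column, value in parent.items():
            row["%s.%s" % (prefix, column)] = value
        enriched.append(row)
    return enriched


joined = add_parent_columns(trips, pickup_neighborhoods,
                            "pickup_neighborhood", "pickup_neighborhoods")
print(joined[0])
```

This mirrors the naming of features such as pickup_neighborhoods.latitude that appear in the feature lists later in the notebook.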

In [4]:
entities = {
            "trips": (trips, "id", 'pickup_datetime' ),
            "pickup_neighborhoods": (pickup_neighborhoods, "neighborhood_id"),
            "dropoff_neighborhoods": (dropoff_neighborhoods, "neighborhood_id"),
            }

relationships = [("pickup_neighborhoods", "neighborhood_id", "trips", "pickup_neighborhood"),
                 ("dropoff_neighborhoods", "neighborhood_id", "trips", "dropoff_neighborhood")]
 

## Specify the cutoff time for each instance of target_entity, i.e., trips
The dataset (trips) contains all trips made between 1 January 2016 at 00:00:19 and 30 June 2016 at 23:59:41, 1,020,003 trips in total. The preview function lists the top n rows with the fewest missing values. On closer inspection, 45,594 of the 1,020,003 trips have missing information. The missing values are spread reasonably evenly across the months, so no specific time period should skew or bias the overall prediction.
For this case study, the objective is to predict the duration of a taxi ride. Because the prediction must be made before each trip starts, pickup_datetime is used as the cutoff time for each instance of the target entity (i.e., trips): when the Deep Feature Synthesis (DFS) algorithm calculates the features for a trip, it only uses data available up to that timestamp.
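
To make the role of the cutoff time concrete, here is a minimal sketch in plain Python. It computes, for each trip, the mean duration of trips in the same neighborhood that started strictly earlier. The rows and the helper `mean_duration_before` are hypothetical, and the strict inequality is a simplification of the idea rather than Featuretools' exact boundary handling.

```python
from datetime import datetime

# Hypothetical trips: (id, pickup_datetime, pickup_neighborhood, trip_duration in seconds)
trips = [
    (1, datetime(2016, 1, 12, 8, 0), "AA", 600.0),
    (2, datetime(2016, 1, 12, 9, 0), "AA", 900.0),
    (3, datetime(2016, 1, 12, 10, 0), "AA", 300.0),
]


def mean_duration_before(rows, neighborhood, cutoff):
    """Mean duration of trips in `neighborhood` that started strictly before `cutoff`."""
    past = [d for _, pickup, hood, d in rows if hood == neighborhood and pickup < cutoff]
    return sum(past) / len(past) if past else None


# Each trip uses its own pickup time as the cutoff, so it never sees itself or the future.
for trip_id, pickup, hood, _ in trips:
    print(trip_id, mean_duration_before(trips, hood, pickup))
```

Without a cutoff, the aggregate for trip 1 would already include the durations of trips 2 and 3, which would not be known at prediction time.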

In [5]:
cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
preview(cutoff_time, 10)  # return the n rows with the fewest null values
cutoff_time.head(5)

Unnamed: 0,id,pickup_datetime
56311,56311,2016-01-12 00:00:25
56312,56312,2016-01-12 00:02:09
56313,56313,2016-01-12 00:02:25
56314,56314,2016-01-12 00:02:41
56315,56315,2016-01-12 00:03:44


In [6]:
# check: keep only the rows that contain at least one NaN
df_c = cutoff_time
df_c = df_c[df_c.isnull().any(axis = 1)] 

## 1) Transform Primitive: Build Feature Matrix with 1 primitive (Weekend) 
Automatically create transform features using transform primitives.
For each entry in a datetime column, the Weekend primitive checks whether the date falls on a weekend and returns a boolean.

In [7]:
trans_primitives = [Weekend] #it can be applied to any datetime column in the data

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True) 
print("Number of features: %d" % len(features))
features

Number of features: 13


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: dropoff_neighborhood>,
 <Feature: payment_type>,
 <Feature: pickup_neighborhood>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: dropoff_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>]

## 2) Train the model using Gradient Boosting Regressor
Predict the results, evaluate RMSE, examine how important each feature is for the model 
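
The two metrics used throughout can be written out by hand. The sketch below is a plain-Python version of RMSE and of the R^2 coefficient (the quantity that `model.score` returns for a scikit-learn regressor); the sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical log-durations and predictions, purely for illustration.
y_true = [6.0, 6.5, 7.0, 7.5]
y_pred = [6.1, 6.4, 7.2, 7.3]
print("RMSE: %f" % rmse(y_true, y_pred))
print("R^2:  %f" % r_squared(y_true, y_pred))
```

Because the labels in this notebook are log(duration + 1), an RMSE computed on them is an error in log space, not in seconds.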

In [8]:
#### Computing features - only 1 transform primitive (Weekend)
feature_matrix = compute_features(features, cutoff_time)

preview(feature_matrix, 5)

### Train the model
# Split the whole feature matrix into training features, training labels,
# test features, and test labels
# Take the log of the trip duration so that a more linear relationship can be found
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)

Progress: 100%|██████████| 5/5 [00:14<00:00,  2.86s/cutoff time]
Finishing computing...


In [9]:
#Train the model using a GradientBoostingRegressor
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train, y_train)
model.score(X_test, y_test)

      Iter       Train Loss   Remaining Time 
         1           0.4925            1.72m
         2           0.4333            1.69m
         3           0.3843            1.66m
         4           0.3446            1.67m
         5           0.3119            1.69m
         6           0.2852            1.69m
         7           0.2634            1.66m
         8           0.2454            1.62m
         9           0.2305            1.60m
        10           0.2183            1.58m
        20           0.1666            1.40m
        30           0.1558            1.18m
        40           0.1514           58.08s
        50           0.1488           46.14s
        60           0.1472           35.39s
        70           0.1458           25.69s
        80           0.1448           16.79s
        90           0.1440            8.21s
       100           0.1433            0.00s


0.72200175704571445

In [10]:
#Predict on the training set, so the RMSE printed below is a training-set RMSE
y_pred = model.predict(X_train)

#Root Mean Squared Error
print (('\nNumber of features: %d' % len(features)))
print ('Score(Coefficient R^2): %f'%(model.score(X_test,y_test)))
print ('RMSE(L2 loss function): %f' %math.sqrt(mse(y_train,y_pred)))


Number of features: 13
Score(Coefficient R^2): 0.722002
RMSE(L2 loss function): 0.378550


In [11]:
#Further analysis - look at how important each feature was for the model
feature_importances(model, feature_matrix.columns, n=15)

1: Feature: trip_distance, 0.373
2: Feature: dropoff_neighborhoods.latitude, 0.125
3: Feature: dropoff_neighborhoods.longitude, 0.103
4: Feature: trip_duration, 0.087
5: Feature: pickup_neighborhoods.longitude, 0.062
6: Feature: IS_WEEKEND(pickup_datetime), 0.045
7: Feature: pickup_neighborhoods.latitude, 0.027
8: Feature: dropoff_neighborhood = AA, 0.026
9: Feature: pickup_neighborhood = D, 0.025
10: Feature: dropoff_neighborhood = A, 0.022
11: Feature: dropoff_neighborhood = AO, 0.017
12: Feature: vendor_id, 0.016
13: Feature: dropoff_neighborhood = D, 0.013
14: Feature: pickup_neighborhood = R, 0.010
15: Feature: passenger_count, 0.010


## 3) Adding more Transform Primitives: Build Feature Matrix (7 primitives: Minute, Hour, Day, Week, Month, Weekday, Weekend)
Transform features are generated over the existing set of features. Using Deep Feature Synthesis (DFS), related features are extracted automatically from a parent entity and added as additional columns of the entity's feature matrix. A transform primitive is applied to an entire column of the parent entity and/or its child entity and produces a new column of the same length.
For example, the “Weekend” primitive transforms the “_datetime” columns and returns a boolean indicating whether the date falls on a weekend. The “Weekday” primitive transforms a datetime column such as “dropoff_datetime” and computes a value from 0 to 6 that designates the day of the week for that trip. The same idea applies to the remaining transform primitives (“Minute”, “Hour”, “Day”, “Week”, “Month”). All of these transform primitives apply to datetime columns.
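
A minimal sketch of what these seven primitives compute for a single timestamp, using only the standard library. The function `datetime_features` is an illustrative helper, and the ISO week convention for WEEK is an assumption; Featuretools applies the equivalent operations column-wise.

```python
from datetime import datetime


def datetime_features(ts):
    """Return the seven calendar features used in this step for one timestamp."""
    return {
        "MINUTE": ts.minute,
        "HOUR": ts.hour,
        "DAY": ts.day,
        "WEEK": ts.isocalendar()[1],   # ISO week number (an assumption about the convention)
        "MONTH": ts.month,
        "WEEKDAY": ts.weekday(),       # 0 = Monday ... 6 = Sunday
        "IS_WEEKEND": ts.weekday() >= 5,
    }


# Pickup time of the first trip in the cutoff-time preview (id 56311).
print(datetime_features(datetime(2016, 1, 12, 0, 0, 25)))
```

Each key corresponds to one new column per datetime column, which is why the feature count grows when more transform primitives are added.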

In [12]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
 
features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)
                  
print("Number of features: %d" % len(features))
features
#### Computing features - with more transform primitives
feature_matrix = compute_features(features, cutoff_time)
preview(feature_matrix, 10)

Number of features: 25
Progress: 100%|██████████| 5/5 [00:19<00:00,  3.86s/cutoff time]
Finishing computing...


id,WEEKDAY(dropoff_datetime),dropoff_neighborhoods.latitude,MINUTE(dropoff_datetime),WEEK(dropoff_datetime),passenger_count,trip_duration,HOUR(pickup_datetime),pickup_neighborhoods.latitude,vendor_id,dropoff_neighborhoods.longitude,...,dropoff_neighborhood = D,dropoff_neighborhood = AR,dropoff_neighborhood = C,dropoff_neighborhood = O,dropoff_neighborhood = N,dropoff_neighborhood = AO,dropoff_neighborhood = AK,HOUR(dropoff_datetime),IS_WEEKEND(dropoff_datetime),trip_distance
56311,1,40.721435,11,2,1,645.0,0,40.720245,2,-73.998366,...,0,0,0,0,0,0,0,0,False,1.61
691284,0,40.721435,24,18,2,160.0,12,40.729652,2,-73.998366,...,0,0,0,0,0,0,0,12,False,0.61
691285,0,40.785005,27,18,2,295.0,12,40.77627,2,-73.97605,...,0,0,0,0,0,0,0,12,False,0.88
691286,0,40.757707,48,18,1,1573.0,12,40.742531,1,-73.986446,...,0,0,0,0,0,0,0,12,False,1.9
691288,0,40.761087,30,18,1,404.0,12,40.747126,1,-73.995736,...,0,0,0,0,0,0,0,12,False,1.0
691289,0,40.761492,55,18,1,1906.0,12,40.721435,2,-73.975899,...,0,0,0,0,0,0,0,12,False,3.24
691290,0,40.764723,26,18,1,156.0,12,40.764723,1,-73.966696,...,0,0,0,0,0,0,1,12,False,0.1
691291,0,40.77627,37,18,1,827.0,12,40.766809,1,-73.982322,...,0,0,0,0,0,0,0,12,False,1.6
691292,0,40.764723,39,18,1,883.0,12,40.752186,1,-73.966696,...,0,0,0,0,0,0,1,12,False,1.5
691293,0,40.766488,34,18,2,592.0,12,40.775299,2,-73.983998,...,0,1,0,0,0,0,0,12,False,1.89


## 4) Train the model using Gradient Boosting Regressor, comparing results with 2) above

In [13]:
### Train the new model
# Split the whole feature matrix into training features, training labels,
# test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)

In [14]:
#Train the model using a GradientBoostingRegressor
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train,y_train)
model.score(X_test,y_test)

      Iter       Train Loss   Remaining Time 
         1           0.4925            2.72m
         2           0.4333            2.71m
         3           0.3843            2.66m
         4           0.3444            2.62m
         5           0.3117            2.60m
         6           0.2848            2.56m
         7           0.2620            2.52m
         8           0.2435            2.51m
         9           0.2282            2.49m
        10           0.2152            2.47m
        20           0.1588            2.14m
        30           0.1415            1.78m
        40           0.1332            1.48m
        50           0.1283            1.19m
        60           0.1252           54.97s
        70           0.1227           39.47s
        80           0.1207           25.47s
        90           0.1191           12.40s
       100           0.1177            0.00s


0.7755608981558122

In [15]:
#Predict on the training set, so the RMSE printed below is a training-set RMSE
y_pred = model.predict(X_train)

#Root Mean Squared Error
mse(y_train,y_pred)
print (('\nNumber of features: %d' % len(features)))
print ('Score(Coefficient R^2): %f'%(model.score(X_test,y_test)))
print ('RMSE(L2 loss function): %f' %math.sqrt(mse(y_train,y_pred)))


Number of features: 25
Score(Coefficient R^2): 0.775561
RMSE(L2 loss function): 0.343023


## Results Analysis: 
The model’s RMSE (computed on the training set) has decreased by about 9.4%, from 0.378550 to 0.343023. This suggests that increasing the number of features from 13 to 25 by adding transform features has improved the accuracy of the model. It shows that applying the Deep Feature Synthesis algorithm lets us exploit feature relationships at deeper levels of a multi-table dataset, which improves the model’s accuracy.
The model’s score has increased by 7.4%, from 0.722002 to 0.775561, which likewise suggests that the additional transform features have improved the model. However, as explained earlier in section 2.2, the score is the R^2 coefficient, and R^2 is biased upward as additional terms are added. Adjusted R^2 would be a better measure of the model’s accuracy.
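
The comparison can be reproduced with a few lines of arithmetic. The sketch below computes the relative changes from the printed figures and applies the standard adjusted R^2 formula. The sample count `n_samples` is a hypothetical placeholder, and using the DFS feature counts (13 and 25) as the number of predictors is an assumption, since categorical features expand into several columns.

```python
def relative_change(old, new):
    """Signed change from old to new, as a fraction of old."""
    return (new - old) / old


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 penalizes R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Figures printed by the notebook for the 13-feature and 25-feature models.
print("RMSE change: %.1f%%" % (100 * relative_change(0.378550, 0.343023)))
print("R^2 change:  %.1f%%" % (100 * relative_change(0.722002, 0.775561)))

# n_samples is a hypothetical test-set size; substitute len(y_test).
n_samples = 1000
print("Adjusted R^2, 13 features: %.4f" % adjusted_r2(0.722002, n_samples, 13))
print("Adjusted R^2, 25 features: %.4f" % adjusted_r2(0.775561, n_samples, 25))
```

With a test set as large as the one in this notebook the penalty is small, so the adjustment matters most when the number of features approaches the number of samples.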

In [16]:
#Further analysis - look at how important each feature was for the model
feature_importances(model, feature_matrix.columns, n=15)

1: Feature: IS_WEEKEND(dropoff_datetime), 0.316
2: Feature: trip_duration, 0.117
3: Feature: dropoff_neighborhood = AK, 0.092
4: Feature: dropoff_neighborhoods.latitude, 0.078
5: Feature: MONTH(dropoff_datetime), 0.068
6: Feature: vendor_id, 0.060
7: Feature: payment_type, 0.046
8: Feature: HOUR(pickup_datetime), 0.029
9: Feature: MINUTE(pickup_datetime), 0.023
10: Feature: WEEKDAY(dropoff_datetime), 0.023
11: Feature: dropoff_neighborhood = A, 0.022
12: Feature: DAY(dropoff_datetime), 0.018
13: Feature: pickup_neighborhood = D, 0.017
14: Feature: dropoff_neighborhood = AD, 0.014
15: Feature: HOUR(dropoff_datetime), 0.013


## 5) Adding Aggregation Primitives: Build Feature Matrix (7 primitives: Count, Sum, Mean, Median, Std, Max, Min)
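
As a small illustration of what the seven aggregation primitives compute, the sketch below applies them to one group of child rows, for example the trip distances of all trips that ended in one neighborhood. The values and the helper `aggregate` are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics

# Hypothetical trip distances (miles) for the trips that ended in one neighborhood.
trip_distances = [1.0, 1.5, 2.0, 3.5]


def aggregate(values):
    """Apply the seven aggregations named in this step to one group of child rows."""
    return {
        "COUNT": len(values),
        "SUM": sum(values),
        "MEAN": statistics.mean(values),
        "MEDIAN": statistics.median(values),
        "STD": statistics.stdev(values),  # sample standard deviation (an assumption)
        "MAX": max(values),
        "MIN": min(values),
    }


print(aggregate(trip_distances))
```

Each result becomes one column on the parent, such as dropoff_neighborhoods.MEAN(trips.trip_distance), which is then carried back to every trip in that neighborhood.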

In [17]:
#################################
### Add Aggregation Primitives
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Count, Sum, Mean, Median, Std, Max, Min]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=aggregation_primitives,
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)
print("Number of features: %d" % len(features))
features

Number of features: 63


[<Feature: passenger_count>,
 <Feature: dropoff_neighborhood>,
 <Feature: payment_type>,
 <Feature: vendor_id>,
 <Feature: pickup_neighborhood>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.longitude>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: dropoff_neighborhoods.SUM(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.STD(trips.trip

In [18]:
#### Computing new set of features - with aggregation primitives
feature_matrix = compute_features(features, cutoff_time)
preview(feature_matrix, 10)

Progress: 100%|██████████| 5/5 [00:37<00:00,  7.43s/cutoff time]
Finishing computing...


id,dropoff_neighborhoods.MAX(trips.trip_duration),pickup_neighborhoods.MEDIAN(trips.passenger_count),pickup_neighborhoods.MEDIAN(trips.trip_distance),HOUR(dropoff_datetime),dropoff_neighborhoods.COUNT(trips),DAY(pickup_datetime),pickup_neighborhoods.latitude,pickup_neighborhoods.SUM(trips.passenger_count),pickup_neighborhoods.STD(trips.trip_distance),dropoff_neighborhoods.SUM(trips.passenger_count),...,pickup_neighborhoods.STD(trips.passenger_count),MONTH(pickup_datetime),pickup_neighborhoods.MEAN(trips.trip_duration),WEEK(pickup_datetime),dropoff_neighborhoods.MEAN(trips.trip_distance),MONTH(dropoff_datetime),payment_type,MINUTE(dropoff_datetime),WEEK(dropoff_datetime),dropoff_neighborhoods.MAX(trips.passenger_count)
56311,3572.0,1.0,2.4,0,1396.0,12,40.720245,2283.0,2.51706,2375.0,...,1.331649,1,740.870871,2,2.495358,1,1,11,2,6.0
691284,3603.0,1.0,1.6,12,16736.0,2,40.729652,34521.0,2.099009,28154.0,...,1.310235,5,753.81368,18,2.338798,5,1,24,18,6.0
691285,3602.0,1.0,1.6,12,19017.0,2,40.77627,36299.0,2.111243,31836.0,...,1.315396,5,681.405688,18,2.176976,5,1,27,18,6.0
691286,3606.0,1.0,1.49,12,28805.0,2,40.742531,31158.0,2.137177,49208.0,...,1.330198,5,682.62444,18,2.36529,5,1,48,18,6.0
691288,3580.0,1.0,1.4,12,16985.0,2,40.747126,43543.0,2.382449,28197.0,...,1.319326,5,714.648716,18,2.067381,5,1,30,18,6.0
691289,3606.0,1.0,1.9,12,31541.0,2,40.721435,30913.0,4.278882,52591.0,...,1.315238,5,818.141251,18,2.102551,5,1,55,18,6.0
691290,3580.0,1.0,1.3,12,21894.0,2,40.764723,43212.0,1.846378,36175.0,...,1.315462,5,637.726834,18,1.732215,5,2,26,18,6.0
691291,3604.0,1.0,1.63,12,21272.0,2,40.766809,32656.0,2.206183,35282.0,...,1.332742,5,707.024093,18,2.061938,5,1,37,18,6.0
691292,3580.0,1.0,1.49,12,21894.0,2,40.752186,57862.0,2.488034,36175.0,...,1.306561,5,749.696305,18,1.732215,5,1,39,18,6.0
691293,3587.0,1.0,1.37,12,24592.0,2,40.775299,39612.0,1.904818,41249.0,...,1.341478,5,670.677993,18,2.200316,5,1,34,18,6.0


## 6) Train the model using Gradient Boosting Regressor, comparing results with 2) & 4)
Train the new model: split the whole feature matrix into training features, training labels, test features, and test labels.

In [19]:
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)

In [20]:
#Train the model using a GradientBoostingRegressor
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train,y_train)
model.score(X_test,y_test)

      Iter       Train Loss   Remaining Time 
         1           0.4925            5.74m
         2           0.4333            5.57m
         3           0.3843            5.41m
         4           0.3444            5.28m
         5           0.3117            5.28m
         6           0.2848            5.24m
         7           0.2620            5.15m
         8           0.2435            5.16m
         9           0.2282            5.16m
        10           0.2152            5.12m
        20           0.1585            4.43m
        30           0.1420            3.77m
        40           0.1332            3.16m
        50           0.1271            2.54m
        60           0.1238            1.96m
        70           0.1211            1.46m
        80           0.1191           57.65s
        90           0.1176           28.36s
       100           0.1163            0.00s


0.77808885638984859

In [21]:
#Predict on the training set, so the RMSE printed below is a training-set RMSE
y_pred = model.predict(X_train)

#Root Mean Squared Error
mse(y_train,y_pred)
print (('\nNumber of features: %d' % len(features)))
print ('Score(Coefficient R^2): %f'%(model.score(X_test,y_test)))
print ('RMSE(L2 loss function): %f' %math.sqrt(mse(y_train,y_pred)))


Number of features: 63
Score(Coefficient R^2): 0.778089
RMSE(L2 loss function): 0.340997


## Results Analysis:
The model’s RMSE has decreased by 0.59%, from 0.343023 to 0.340997. This suggests that increasing the number of features from 25 to 63 by adding aggregation features yields a negligible improvement in the accuracy of the model.
The model’s score (R^2) has increased by 0.3%, from 0.775561 to 0.778089, which is likewise negligible.
The total training time has roughly doubled.

In [22]:
#Further analysis - look at how important each feature was for the model
feature_importances(model, feature_matrix.columns, n=15)

1: Feature: trip_distance, 0.314
2: Feature: HOUR(pickup_datetime), 0.126
3: Feature: HOUR(dropoff_datetime), 0.089
4: Feature: WEEKDAY(pickup_datetime), 0.052
5: Feature: dropoff_neighborhoods.latitude, 0.046
6: Feature: dropoff_neighborhoods.longitude, 0.036
7: Feature: dropoff_neighborhoods.STD(trips.trip_distance), 0.027
8: Feature: dropoff_neighborhoods.MIN(trips.passenger_count), 0.022
9: Feature: dropoff_neighborhoods.MEDIAN(trips.trip_duration), 0.022
10: Feature: pickup_neighborhoods.MEDIAN(trips.trip_distance), 0.021
11: Feature: IS_WEEKEND(pickup_datetime), 0.021
12: Feature: WEEKDAY(dropoff_datetime), 0.020
13: Feature: WEEK(pickup_datetime), 0.019
14: Feature: dropoff_neighborhoods.MEAN(trips.trip_duration), 0.019
15: Feature: MONTH(dropoff_datetime), 0.018


In [23]:
######################################################
#Predicting the test set results
y_pred = model.predict(X_test)
y_pred = np.exp(y_pred) - 1 # undo the log we took earlier
y_pred[5:]

array([  557.67992664,   590.2602792 ,  1497.39684679, ...,  1063.48382696,
        1800.89361932,   739.60249439])
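
The inverse transform in the cell above mirrors the log(y + 1) applied to the labels before training. A small round-trip check in plain Python, using three durations taken from the trips preview, shows that exp(p) - 1 returns values in seconds:

```python
import math

# Trip durations in seconds, taken from the trips preview earlier in the notebook.
durations = [111.0, 372.0, 855.0]

# Forward transform applied to the labels before training: log(y + 1).
log_durations = [math.log(d + 1) for d in durations]

# Inverse transform applied to predictions: exp(p) - 1 recovers seconds.
recovered = [math.exp(p) - 1 for p in log_durations]

for original, back in zip(durations, recovered):
    print("%.1f -> %.1f" % (original, back))
```

The + 1 keeps the logarithm defined for a duration of zero; `math.log1p` and `math.expm1` are equivalent alternatives that are more accurate for very small values.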

## 7) Apply k-fold Cross Validation 

In [24]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

      Iter       Train Loss   Remaining Time 
         1           0.4942            6.67m
         2           0.4348            6.63m
         3           0.3857            6.66m
         4           0.3458            6.49m
         5           0.3129            6.45m
         6           0.2859            6.34m
         7           0.2634            6.28m
         8           0.2443            6.21m
         9           0.2292            6.15m
        10           0.2159            6.10m
        20           0.1588            5.30m
        30           0.1417            4.54m
        40           0.1324            3.82m
        50           0.1267            3.13m
        60           0.1234            2.47m
        70           0.1204            1.84m
        80           0.1186            1.22m
        90           0.1172           36.19s
       100           0.1159            0.00s
      Iter       Train Loss   Remaining Time 
         1           0.4942            6.52m
        

         2           0.4309            7.48m
         3           0.3820            7.21m
         4           0.3422            7.19m
         5           0.3096            7.18m
         6           0.2828            7.01m
         7           0.2600            6.87m
         8           0.2415            6.73m
         9           0.2263            6.66m
        10           0.2133            6.58m
        20           0.1569            5.91m
        30           0.1406            5.04m
        40           0.1317            4.26m
        50           0.1264            3.47m
        60           0.1225            2.73m
        70           0.1198            2.03m
        80           0.1183            1.34m
        90           0.1167           39.98s
       100           0.1155            0.00s


In [25]:
print('\nThe mean accuracy is {:.2%} with a standard deviation of {:.2%}'.format(accuracies.mean(),accuracies.std()))  


The mean accuracy is 79.03% with a standard deviation of 0.98%
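
For a regressor, `cross_val_score` uses the estimator's own `score` method by default, so the "accuracy" reported above is a mean R^2 over the folds. The sketch below shows, in plain Python, how k-fold splitting partitions the sample indices and how the per-fold scores are summarized. The splitter `k_fold_indices` is an illustrative helper (contiguous folds, no shuffling) and the scores are hypothetical.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)

# Hypothetical per-fold scores, summarized the same way as in the cell above.
scores = [0.78, 0.80, 0.79, 0.80, 0.78]
print("mean %.4f, std %.4f" % (statistics.mean(scores), statistics.pstdev(scores)))
```

Every sample is used for testing exactly once, which is why the spread of the fold scores gives a more reliable picture than a single train/test split.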


In [26]:
#Further analysis - look at how important each feature was for the model
feature_importances(model, feature_matrix.columns, n=15)

1: Feature: trip_distance, 0.314
2: Feature: HOUR(pickup_datetime), 0.126
3: Feature: HOUR(dropoff_datetime), 0.089
4: Feature: WEEKDAY(pickup_datetime), 0.052
5: Feature: dropoff_neighborhoods.latitude, 0.046
6: Feature: dropoff_neighborhoods.longitude, 0.036
7: Feature: dropoff_neighborhoods.STD(trips.trip_distance), 0.027
8: Feature: dropoff_neighborhoods.MIN(trips.passenger_count), 0.022
9: Feature: dropoff_neighborhoods.MEDIAN(trips.trip_duration), 0.022
10: Feature: pickup_neighborhoods.MEDIAN(trips.trip_distance), 0.021
11: Feature: IS_WEEKEND(pickup_datetime), 0.021
12: Feature: WEEKDAY(dropoff_datetime), 0.020
13: Feature: WEEK(pickup_datetime), 0.019
14: Feature: dropoff_neighborhoods.MEAN(trips.trip_duration), 0.019
15: Feature: MONTH(dropoff_datetime), 0.018
