**New York City Taxi Ride Duration Prediction**

In this case study, I will build a model to predict the duration of a taxi ride. I will go through the following steps:
  * Install the dependencies
  * Load the data as pandas dataframe
  * Define the outcome variable - the variable we are trying to predict.
  * Build features with Deep Feature Synthesis using the [featuretools](https://featuretools.com) package. We will start with simple features and incrementally improve the feature definitions and examine the accuracy of the system

In [None]:
conda install -c conda-forge featuretools

### Importing libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Featuretools for automated feature engineering
import featuretools as ft

# utils.py contains helper functions that are reused throughout this notebook
import utils
from utils import load_nyc_taxi_data, compute_features, preview, feature_importances

# Importing gradient boosting regressor, to make prediction
from sklearn.ensemble import GradientBoostingRegressor

#importing primitives
from featuretools.primitives import (Minute, Hour, Day, Week, Month,
                                     Weekday, IsWeekend, Count, Sum, Mean, Median, Std, Min, Max)

print(ft.__version__)
%load_ext autoreload
%autoreload 2

0.27.1


### Step 1: Load the raw data

In [4]:
trips, pickup_neighborhoods, dropoff_neighborhoods = load_nyc_taxi_data()
preview(trips, 10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,trip_duration,pickup_neighborhood,dropoff_neighborhood
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,-73.95005,40.787312,2,372.0,AH,C
672146,672146,1,2016-04-29 07:01:31,2016-04-29 07:15:46,1,3.3,-73.949951,40.784653,-73.982536,40.75547,1,855.0,C,AA
672147,672147,2,2016-04-29 07:01:43,2016-04-29 07:09:15,1,1.14,-73.967331,40.75737,-73.954277,40.765282,1,452.0,N,K
672148,672148,1,2016-04-29 07:01:46,2016-04-29 07:07:54,1,1.1,-74.003082,40.727509,-73.984703,40.724377,1,368.0,AB,AC
672149,672149,2,2016-04-29 07:01:46,2016-04-29 07:06:48,2,1.4,-73.990158,40.77235,-73.982147,40.7598,1,302.0,AR,AA
672150,672150,1,2016-04-29 07:01:59,2016-04-29 07:07:33,1,1.2,-73.983681,40.746677,-73.971703,40.762463,2,334.0,AO,A
672151,672151,2,2016-04-29 07:02:11,2016-04-29 07:15:24,2,2.13,-73.994209,40.750999,-73.969391,40.761539,1,793.0,D,AK
672152,672152,1,2016-04-29 07:02:11,2016-04-29 07:06:44,1,1.0,-73.983276,40.770985,-73.98011,40.760666,1,273.0,AR,A
672153,672153,2,2016-04-29 07:02:13,2016-04-29 07:08:36,1,1.17,-73.980141,40.743168,-73.983391,40.754665,1,383.0,Y,AA
672154,672154,1,2016-04-29 07:02:16,2016-04-29 07:04:07,1,0.5,-73.965973,40.765381,-73.970558,40.758724,1,111.0,AK,N


The ``trips`` table has the following fields
* ``id`` which uniquely identifies the trip
* ``vendor_id`` is the taxi cab company - in our case study we have data from three different cab companies
* ``pickup_datetime`` the time stamp for pickup
* ``dropoff_datetime`` the time stamp for drop-off
* ``passenger_count`` the number of passengers for the trip
* ``trip_distance`` total distance of the trip in miles 
* ``pickup_longitude`` the longitude for pickup
* ``pickup_latitude`` the latitude for pickup
* ``dropoff_longitude`` the longitude of dropoff
* ``dropoff_latitude`` the latitude of dropoff
* ``payment_type`` a numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided
* ``trip_duration`` this is the duration we would like to predict using other fields 
* ``pickup_neighborhood`` a one or two letter id of the neighborhood where the trip started
* ``dropoff_neighborhood`` a one or two letter id of the neighborhood where the trip ended
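
As a small illustration of the ``payment_type`` codes listed above, the sketch below maps the numeric codes to readable labels with pandas. The `sample` frame and the `PAYMENT_LABELS` name are made up for this example; only the code-to-label pairs come from the field description.

```python
import pandas as pd

# Code-to-label pairs taken from the payment_type field description.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge",
                  4: "Dispute", 5: "Unknown", 6: "Voided"}

# Tiny hand-made sample (hypothetical) that reuses the trips column name.
sample = pd.DataFrame({"id": [0, 1, 2], "payment_type": [2, 1, 1]})

# Map each numeric code to its label for easier inspection.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

The numeric column is what the model uses; the label column is only a convenience for reading previews.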

### Step 2: Prepare the Data

Let's create entities and relationships. The three entities in this data are
* trips 
* pickup_neighborhoods
* dropoff_neighborhoods

This data has the following relationships
* pickup_neighborhoods --> trips (one neighborhood can have multiple trips that start in it. This means pickup_neighborhoods is the ``parent_entity`` and trips is the child entity)
* dropoff_neighborhoods --> trips (one neighborhood can have multiple trips that end in it. This means dropoff_neighborhoods is the ``parent_entity`` and trips is the child entity)
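
To make the parent and child roles concrete, here is a minimal pandas sketch of a one-to-many relationship. The two small frames are hypothetical stand-ins for ``pickup_neighborhoods`` and ``trips``; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical miniature parent table (one row per neighborhood).
pickup_nb = pd.DataFrame({"neighborhood_id": ["A", "AA", "C"]})

# Hypothetical miniature child table (one row per trip).
trips_small = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "pickup_neighborhood": ["A", "A", "C", "AA"],
})

# Every child row should reference a key that exists in the parent.
all_linked = bool(
    trips_small["pickup_neighborhood"].isin(pickup_nb["neighborhood_id"]).all()
)

# One parent can have many children: count trips per neighborhood.
trips_per_nb = trips_small.groupby("pickup_neighborhood")["id"].count()
print(all_linked, trips_per_nb.to_dict())
```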

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:


### Question 1: Define entities and relationships for the Deep Feature Synthesis (5 Marks)

In [5]:
entities = {
        "trips": (trips, "id", 'pickup_datetime' ),
        "pickup_neighborhoods": (pickup_neighborhoods, "neighborhood_id"),
        "dropoff_neighborhoods": (dropoff_neighborhoods, "neighborhood_id"),
        }

relationships = [("pickup_neighborhoods", "neighborhood_id", "trips", "pickup_neighborhood"),
                 ("dropoff_neighborhoods", "neighborhood_id", "trips", "dropoff_neighborhood")]

Next, we specify the cutoff time for each instance of the target_entity, in this case ``trips``. This timestamp represents the last time data can be used for calculating features by DFS. In this scenario, that would be the pickup time because we would like to make the duration prediction using data before the trip starts.

For the purposes of the case study, we select only trips that started after midnight on January 12th, 2016 (the filter is ``pickup_datetime > "2016-01-12"``).
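
The idea of a cutoff time can be sketched in plain pandas. This is not how featuretools implements it; the `visible_rows` helper and the small frame are hypothetical, and treating the cutoff as inclusive is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical miniature trips table.
trips_small = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "pickup_datetime": pd.to_datetime([
        "2016-01-11 23:50:00", "2016-01-12 00:00:25",
        "2016-01-12 08:15:00", "2016-01-13 09:30:00"]),
})

# Same style of filter as the notebook: keep trips starting after 2016-01-12 00:00.
mask = trips_small["pickup_datetime"] > "2016-01-12"
cutoff_small = trips_small.loc[mask, ["id", "pickup_datetime"]]

def visible_rows(cutoff):
    """Rows whose pickup time is at or before `cutoff` (inclusive by assumption)."""
    return trips_small[trips_small["pickup_datetime"] <= cutoff]

# For trip 2, later trips must not leak into its features.
cutoff_for_trip_2 = cutoff_small.loc[cutoff_small["id"] == 2, "pickup_datetime"].iloc[0]
visible_ids = visible_rows(cutoff_for_trip_2)["id"].tolist()
print(cutoff_small["id"].tolist(), visible_ids)
```

Trip 3 starts after trip 2, so it is excluded when features for trip 2 are computed.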

In [6]:
cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
preview(cutoff_time, 10)

Unnamed: 0,id,pickup_datetime
56311,56311,2016-01-12 00:00:25
698765,698765,2016-05-03 18:54:53
698766,698766,2016-05-03 18:55:37
698767,698767,2016-05-03 18:55:38
698768,698768,2016-05-03 18:55:49
698769,698769,2016-05-03 18:55:58
698770,698770,2016-05-03 18:56:22
698771,698771,2016-05-03 18:56:24
698772,698772,2016-05-03 18:56:51
698773,698773,2016-05-03 18:56:56


### Step 3: Create baseline features using Deep Feature Synthesis

Instead of manually creating features, such as "month of pickup datetime", we can let DFS come up with them automatically. It does this by 
* interpreting the variable types of the columns e.g categorical, numeric and others 
* matching the columns to the primitives that can be applied to their variable types
* creating features based on these matches

**Create transform features using transform primitives**

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``IsWeekend``, and here is what it does:

* It can be applied to any ``datetime`` column in the data. 
* For each entry in the column, it assesses whether the date falls on a ``weekend`` and returns a boolean.

In this specific data, there are two ``datetime`` columns ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns as shown below. 
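
For intuition, a weekend flag can be computed by hand in pandas. This is an equivalent-in-spirit sketch, not the featuretools implementation; the sample timestamps are chosen for illustration.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2016-01-01 00:00:19",  # Friday
    "2016-01-02 10:00:00",  # Saturday
    "2016-04-29 07:01:31",  # Friday
    "2016-05-01 12:00:00",  # Sunday
]))

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are weekend days.
is_weekend = pickups.dt.dayofweek >= 5
print(is_weekend.tolist())
```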

### Question 2: Create a model with only 1 transform primitive (10 Marks)

**Question: 2.1 Define transform primitive for weekend and define features using dfs?**

In [7]:
# Define a transform feature that flags whether the ride took place on a weekend, since this may affect ride duration.
trans_primitives = [IsWeekend]


# Define the features to create with featuretools Deep Feature Synthesis (DFS)
features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

*If you're interested in parameters to DFS such as `ignore_variables`, you can learn more about these parameters [here](https://docs.featuretools.com/generated/featuretools.dfs.html#featuretools.dfs)*
<p>Here are the features created.</p>

In [7]:
print ("Number of features: %d" % len(features))
features

Number of features: 13


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]

Now let's compute the features. 

**Question: 2.2 Compute features and define feature matrix**

In [8]:
# Note: this definition shadows the compute_features imported from utils above.
def compute_features(features, cutoff_time):
    # Shuffle in place so the encoded features are not all at the front or back.
    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True,entities=entities, relationships=relationships)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix
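
The encoding step turns each categorical neighborhood id into boolean indicator columns. A rough pandas analogue is `pd.get_dummies`; the sketch below is only an illustration on a hypothetical frame, and which categories `ft.encode_features` keeps depends on its own settings.

```python
import pandas as pd

# Hypothetical miniature feature matrix indexed by trip id.
fm = pd.DataFrame(
    {"trip_distance": [1.32, 3.3, 1.14],
     "pickup_neighborhood": ["AH", "C", "N"]},
    index=pd.Index([0, 1, 2], name="id"),
)

# One indicator column per category, named like "pickup_neighborhood = C".
encoded = pd.get_dummies(fm, columns=["pickup_neighborhood"], prefix_sep=" = ")
print(encoded.columns.tolist())
```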

In [9]:
feature_matrix1 = compute_features(features, cutoff_time)

Elapsed: 00:08 | Progress: 100%|██████████
Finishing computing...


In [10]:
preview(feature_matrix1, 5)

Unnamed: 0_level_0,dropoff_neighborhood = AD,dropoff_neighborhood = A,dropoff_neighborhood = AA,dropoff_neighborhood = D,dropoff_neighborhood = AR,dropoff_neighborhood = C,dropoff_neighborhood = O,dropoff_neighborhood = N,dropoff_neighborhood = AO,dropoff_neighborhood = AK,...,pickup_neighborhood = AA,pickup_neighborhood = D,pickup_neighborhood = A,pickup_neighborhood = AR,pickup_neighborhood = AK,pickup_neighborhood = AO,pickup_neighborhood = N,pickup_neighborhood = R,pickup_neighborhood = O,IS_WEEKEND(dropoff_datetime)
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
691284,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
691285,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
691286,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
691288,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


### Step 4: Build the Model

To build a model, we
* Separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing`` 
* Get the log of the trip duration so that a more linear relationship can be found.
* Train a model using a ``GradientBoostingRegressor``
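
The three steps above can be sketched end to end on synthetic data. The contents of `utils.get_train_test_fm` are not shown in this notebook, so the head/tail split below is an assumption, and the data is generated purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for a feature matrix: duration grows with distance.
rng = np.random.default_rng(0)
n = 400
fm = pd.DataFrame({"trip_distance": rng.uniform(0.5, 10.0, n)})
fm["trip_duration"] = 200.0 * fm["trip_distance"] * rng.uniform(0.8, 1.2, n)

# 1) First 75% of rows for training, the rest for testing (assumed split rule).
head = int(n * 0.75)
train, test = fm.iloc[:head], fm.iloc[head:]

# 2) Log-transform the target; log1p(x) equals log(x + 1).
X_train, y_train = train.drop(columns="trip_duration"), np.log1p(train["trip_duration"])
X_test, y_test = test.drop(columns="trip_duration"), np.log1p(test["trip_duration"])

# 3) Fit the regressor and score it on the held-out rows.
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(r2, 3))
```

Note that the target column is dropped from the inputs before fitting, so the model never sees the value it is asked to predict.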

**Question: 2.3 What was the Modeling Score after your last training round?**

**Question: 2.4 Hypothesize on how including more robust features will change the accuracy.**

In [11]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train1, y_train1, X_test1, y_test1 = utils.get_train_test_fm(feature_matrix1,.75)
y_train1 = np.log(y_train1 +1)
y_test1 = np.log(y_test1 +1)

In [12]:
model1 = GradientBoostingRegressor(verbose=True)
model1.fit(X_train1, y_train1)
model1.score(X_test1, y_test1)

      Iter       Train Loss   Remaining Time 
         1           0.4925            2.69m
         2           0.4333            2.68m
         3           0.3843            2.61m
         4           0.3446            2.54m
         5           0.3119            2.47m
         6           0.2852            2.43m
         7           0.2634            2.39m
         8           0.2454            2.36m
         9           0.2305            2.33m
        10           0.2183            2.30m
        20           0.1666            2.03m
        30           0.1558            1.80m
        40           0.1514            1.55m
        50           0.1488            1.29m
        60           0.1472            1.05m
        70           0.1458           47.09s
        80           0.1448           31.48s
        90           0.1440           15.77s
       100           0.1433            0.00s


0.7220107526801756

**Write your answers here:** 

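Question: 2.3

   The model with only 1 transform primitive scores ~0.722 on the test set. For a scikit-learn regressor, `score` returns the coefficient of determination (R²), so the model explains roughly 72.2% of the variance in the log trip duration.
   
Question: 2.4

   Adding more informative features, such as the hour or day of pickup, should give the model more signal and raise this score. The following steps add features incrementally to check this.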
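
Since the modeling score is an R² value, a short worked example may help. The numbers below are made up; the sketch computes R² by hand and checks it against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions (think of them as log durations).
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_hat = np.array([5.5, 6.0, 6.5, 8.0])

# Residual sum of squares: 0.25 + 0 + 0.25 + 0 = 0.5
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares around the mean 6.5: 2.25 + 0.25 + 0.25 + 2.25 = 5.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot  # 1 - 0.5 / 5.0 = 0.9
print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1.0 would mean perfect predictions, and 0.0 would mean the model does no better than always predicting the mean.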


### Step 5: Adding more Transform Primitives

* Add the ``Minute``, ``Hour``, ``Day``, ``Week``, and ``Month`` primitives
* All these transform primitives apply to ``datetime`` columns
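
As a hand-written analogue of these datetime primitives, the pandas `.dt` accessor extracts the same kinds of parts. Using the ISO week number for ``WEEK`` is an assumption of this sketch, and the two timestamps are chosen for illustration.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime(["2016-01-12 00:00:25", "2016-05-02 12:20:00"]))

# One column per datetime part, named after the corresponding primitive.
parts = pd.DataFrame({
    "MINUTE": pickups.dt.minute,
    "HOUR": pickups.dt.hour,
    "DAY": pickups.dt.day,
    "WEEK": pickups.dt.isocalendar().week.astype(int),  # ISO week (assumption)
    "MONTH": pickups.dt.month,
})
print(parts)
```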

### Question 3: Create a model with more transform primitives (10 Marks)

**3.1 Define more transform primitives and define features using dfs?**

In [14]:
trans_primitives = [Minute, Hour, Day, Week, Month]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

In [15]:
print ("Number of features: %d" % len(features))
features

Number of features: 21


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]

Now let's compute the features. 

**Question: 3.2 Compute features and define feature matrix**

In [16]:
feature_matrix2 = compute_features(features, cutoff_time)

Elapsed: 00:10 | Progress: 100%|██████████
Finishing computing...


In [17]:
preview(feature_matrix2, 10)

Unnamed: 0_level_0,pickup_neighborhoods.longitude,DAY(dropoff_datetime),HOUR(dropoff_datetime),WEEK(pickup_datetime),DAY(pickup_datetime),WEEK(dropoff_datetime),pickup_neighborhood = AD,pickup_neighborhood = AA,pickup_neighborhood = D,pickup_neighborhood = A,...,dropoff_neighborhood = AO,dropoff_neighborhood = AK,trip_duration,MINUTE(dropoff_datetime),HOUR(pickup_datetime),trip_distance,MONTH(dropoff_datetime),dropoff_neighborhoods.longitude,vendor_id,dropoff_neighborhoods.latitude
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,-73.987205,12,0,2,12,2,False,False,False,False,...,False,False,645.0,11,0,1.61,1,-73.998366,2,40.721435
691284,-73.991595,2,12,18,2,18,False,False,False,False,...,False,False,160.0,24,12,0.61,5,-73.998366,2,40.721435
691285,-73.982322,2,12,18,2,18,False,False,False,False,...,False,False,295.0,27,12,0.88,5,-73.97605,2,40.785005
691286,-73.977943,2,12,18,2,18,False,False,False,False,...,False,False,1573.0,48,12,1.9,5,-73.986446,1,40.757707
691288,-73.985336,2,12,18,2,18,False,False,False,False,...,False,False,404.0,30,12,1.0,5,-73.995736,1,40.761087
691289,-73.998366,2,12,18,2,18,False,False,False,False,...,False,False,1906.0,55,12,3.24,5,-73.975899,2,40.761492
691290,-73.966696,2,12,18,2,18,False,False,False,False,...,False,True,156.0,26,12,0.1,5,-73.966696,1,40.764723
691291,-73.956886,2,12,18,2,18,False,False,False,False,...,False,False,827.0,37,12,1.6,5,-73.982322,1,40.77627
691292,-73.976515,2,12,18,2,18,True,False,False,False,...,False,True,883.0,39,12,1.5,5,-73.966696,1,40.764723
691293,-73.960551,2,12,18,2,18,False,False,False,False,...,False,False,592.0,34,12,1.89,5,-73.983998,2,40.766488


### Step 6: Build the new model

**Question: 3.3 What was the Modeling Score after your last training round when including the transform primitives?**

**Question: 3.4 Comment on how the modeling accuracy differs when including more transform features.**

In [18]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train2, y_train2, X_test2, y_test2 = utils.get_train_test_fm(feature_matrix2,.75)
y_train2 = np.log(y_train2+1)
y_test2 = np.log(y_test2+1)

In [19]:
model2 = GradientBoostingRegressor(verbose=True)
model2.fit(X_train2,y_train2)
model2.score(X_test2,y_test2)

      Iter       Train Loss   Remaining Time 
         1           0.4925            3.70m
         2           0.4333            3.72m
         3           0.3843            3.79m
         4           0.3444            3.77m
         5           0.3117            3.69m
         6           0.2848            3.76m
         7           0.2620            3.69m
         8           0.2435            3.64m
         9           0.2282            3.57m
        10           0.2152            3.51m
        20           0.1588            3.07m
        30           0.1433            2.68m
        40           0.1368            2.30m
        50           0.1331            1.89m
        60           0.1308            1.49m
        70           0.1290            1.10m
        80           0.1275           43.74s
        90           0.1263           21.76s
       100           0.1250            0.00s


0.7588895947809645

**Write your answers here:**

Question: 3.3

   The score for the model with more transform primitives is ~0.759.
   
Question: 3.4

   Compared to the previous model, the score rose from ~0.722 to ~0.759, a modest improvement of about 3.7 percentage points.


### Step 7: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives generate features for the parent entities ``pickup_neighborhoods`` and ``dropoff_neighborhoods`` and then add them to the ``trips`` entity, which is the entity for which we are trying to make predictions.
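
Conceptually, an aggregation feature is a group-by on the child table whose result is joined back onto each child row. The pandas sketch below shows that idea on a hypothetical frame. It ignores cutoff times, which DFS takes into account, and the generated column names are illustrative rather than the names featuretools produces.

```python
import pandas as pd

# Hypothetical miniature trips table.
trips_small = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "pickup_neighborhood": ["A", "A", "C", "A"],
    "trip_distance": [1.0, 3.0, 2.0, 5.0],
})

# Aggregate child rows per parent key (count, mean, max of trip_distance).
agg = (trips_small.groupby("pickup_neighborhood")["trip_distance"]
       .agg(["count", "mean", "max"])
       .add_prefix("pickup_neighborhoods.trip_distance."))

# Attach the neighborhood-level values to every trip in that neighborhood.
enriched = trips_small.join(agg, on="pickup_neighborhood")
print(enriched)
```

Every trip that starts in neighborhood ``A`` receives the same three aggregate values, which is why these features describe the neighborhood rather than the individual trip.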

### Question 4: Create a model with transform and aggregate primitive (10 Marks)
**4.1 Define more transform and aggregate primitive and define features using dfs?**

In [20]:
trans_primitives = [Minute, Hour, Day, Week, Month]
aggregation_primitives = [Count, Sum, Mean, Median, Std, Max, Min]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=aggregation_primitives,
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude","dropoff_neighborhoods","pickup_neighborhoods"]},
                  features_only=True)

In [21]:
print ("Number of features: %d" % len(features))
features

Number of features: 59


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>,
 <Feature: pickup_neighborhoods.COUNT(trips)>,
 <Feature: pickup_neighborhoods.MAX(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.MAX(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.MAX(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.MEAN(tri

**Question: 4.2 Compute features and define feature matrix**

In [22]:
feature_matrix3 = compute_features(features, cutoff_time)

Elapsed: 00:25 | Progress: 100%|██████████
Finishing computing...


In [23]:
preview(feature_matrix3, 10)

Unnamed: 0_level_0,pickup_neighborhoods.MEDIAN(trips.trip_duration),pickup_neighborhoods.longitude,dropoff_neighborhoods.MEAN(trips.trip_distance),WEEK(pickup_datetime),trip_duration,dropoff_neighborhoods.MEDIAN(trips.trip_distance),pickup_neighborhood = AD,pickup_neighborhood = AA,pickup_neighborhood = D,pickup_neighborhood = A,...,dropoff_neighborhoods.SUM(trips.trip_duration),dropoff_neighborhoods.STD(trips.trip_distance),MONTH(pickup_datetime),payment_type,dropoff_neighborhoods.MEDIAN(trips.passenger_count),HOUR(pickup_datetime),dropoff_neighborhoods.MAX(trips.passenger_count),MONTH(dropoff_datetime),dropoff_neighborhoods.COUNT(trips),pickup_neighborhoods.MEAN(trips.passenger_count)
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,679.0,-73.987205,2.495358,2,645.0,1.75,False,False,False,False,...,1053070.0,2.714009,1,1,1.0,0,6.0,1,1396.0,1.713964
691284,640.0,-73.991595,2.338798,18,160.0,1.69,False,False,False,False,...,13871368.0,2.477577,5,1,1.0,12,6.0,5,16736.0,1.671072
691285,560.0,-73.982322,2.176976,18,295.0,1.6,False,False,False,False,...,12276557.0,2.461582,5,1,1.0,12,6.0,5,19017.0,1.665092
691286,585.5,-73.977943,2.36529,18,1573.0,1.37,False,False,False,False,...,24003066.0,4.294641,5,1,1.0,12,6.0,5,28805.0,1.66052
691288,601.0,-73.985336,2.067381,18,404.0,1.5,False,False,False,False,...,12359849.0,2.303356,5,1,1.0,12,6.0,5,16985.0,1.680872
691289,698.0,-73.998366,2.102551,18,1906.0,1.4,False,False,False,False,...,24417513.0,2.598888,5,1,1.0,12,6.0,5,31541.0,1.69507
691290,517.0,-73.966696,1.732215,18,156.0,1.3,False,False,False,False,...,14790756.0,1.843108,5,2,1.0,12,6.0,5,21894.0,1.642417
691291,600.0,-73.956886,2.061938,18,827.0,1.52,False,False,False,False,...,14027834.0,4.096928,5,1,1.0,12,6.0,5,21272.0,1.64946
691292,643.0,-73.976515,1.732215,18,883.0,1.3,True,False,False,False,...,14790756.0,1.843108,5,1,1.0,12,6.0,5,21894.0,1.63586
691293,540.0,-73.960551,2.200316,18,592.0,1.48,False,False,False,False,...,18350446.0,2.747908,5,1,1.0,12,6.0,5,24592.0,1.676131


### Step 8: Build the new model

**Question 4.3 What was the Modeling Score after your last training round when including the aggregate transforms?**

**Question 4.4 How do these aggregate transforms impact performance? How do they impact training time?**

In [25]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train3, y_train3, X_test3, y_test3 = utils.get_train_test_fm(feature_matrix3,.75)
y_train3 = np.log(y_train3 + 1)
y_test3 = np.log(y_test3 + 1)

In [26]:
model3 = GradientBoostingRegressor(verbose=True)
model3.fit(X_train3, y_train3)

model3.score(X_test3,y_test3)

      Iter       Train Loss   Remaining Time 
         1           0.4925            9.73m
         2           0.4333            9.82m
         3           0.3843            9.65m
         4           0.3444            9.61m
         5           0.3117            9.50m
         6           0.2848            9.33m
         7           0.2620            9.28m
         8           0.2435            9.15m
         9           0.2282            9.08m
        10           0.2152            8.98m
        20           0.1585            8.06m
        30           0.1426            6.92m
        40           0.1354            5.87m
        50           0.1317            4.86m
        60           0.1293            3.87m
        70           0.1273            2.89m
        80           0.1260            1.91m
        90           0.1246           57.02s
       100           0.1237            0.00s


0.7636476931897523

**Write your answers here:**

- The model score improved from ~0.759 to ~0.764 after adding aggregation features on top of the transform features. The gain over model2 is small, about 0.5 percentage points.
- Training time increased significantly: the initial estimate of remaining time rose from about 3.7 minutes for model2 to about 9.7 minutes for model3.

#### Based on the above 3 models, we make predictions with model2, as it scores almost as well as model3 and takes much less time to train

In [27]:
y_pred = model2.predict(X_test2)
y_pred = np.exp(y_pred) - 1 # undo the log we took earlier
y_pred[5:]

array([ 517.79065872,  523.8562343 , 1440.46465995, ..., 1065.93980266,
       1881.34304208,  782.00731943])
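
The log transform and its inverse can be checked on a few durations. NumPy's `log1p` and `expm1` compute the same quantities as `np.log(x + 1)` and `np.exp(x) - 1`; the sample values are taken from the ``trip_duration`` column previewed earlier.

```python
import numpy as np

durations = np.array([372.0, 855.0, 452.0])  # seconds

log_target = np.log1p(durations)   # same as np.log(durations + 1)
restored = np.expm1(log_target)    # same as np.exp(log_target) - 1

print(np.allclose(restored, durations))
```

Undoing the transform is what turns the model's predictions back into seconds.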

### Question 5: What are some important features based on model2 and how can they affect the duration of the rides? (5 Marks)

In [28]:
feature_importances(model3, feature_matrix3.columns, n=10)

1: Feature: dropoff_neighborhoods.MIN(trips.trip_distance), 0.907
2: Feature: MINUTE(pickup_datetime), 0.027
3: Feature: dropoff_neighborhoods.MEDIAN(trips.passenger_count), 0.023
4: Feature: WEEK(dropoff_datetime), 0.016
5: Feature: trip_duration, 0.005
6: Feature: dropoff_neighborhoods.STD(trips.trip_duration), 0.003
7: Feature: dropoff_neighborhoods.MEAN(trips.trip_distance), 0.003
8: Feature: dropoff_neighborhoods.MAX(trips.trip_duration), 0.003
9: Feature: dropoff_neighborhoods.STD(trips.passenger_count), 0.002
10: Feature: pickup_neighborhoods.MEAN(trips.trip_distance), 0.002


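**Write your answers here:**

- The importances printed above come from model3, since the call passes `model3` and `feature_matrix3.columns`. The top-ranked name is `dropoff_neighborhoods.MIN(trips.trip_distance)` with an importance of 0.907, followed by `MINUTE(pickup_datetime)` (0.027) and `dropoff_neighborhoods.MEDIAN(trips.passenger_count)` (0.023).
- `trip_duration`, the target itself, also appears in the list. Since the target should not be a model input, this suggests that the column names passed to `feature_importances` may not line up with the columns the model was trained on, so the names in this ranking should be read with caution.
- Taken at face value, most of the listed features are neighborhood-level aggregates of trip distance, trip duration, and passenger count. This would indicate that typical trip lengths and durations around the drop-off neighborhood carry most of the signal, with pickup time features (minute, week) contributing a little.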
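
Because the ranking depends on matching names to importances, here is a minimal sketch of that pairing on synthetic data. It is not the `feature_importances` helper from `utils`, whose code is not shown here; the key point is that the names come from the exact columns passed to `fit`, in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic inputs: the target depends only on trip_distance.
rng = np.random.default_rng(0)
X = pd.DataFrame({"trip_distance": rng.uniform(0.5, 10.0, 300),
                  "noise": rng.normal(size=300)})
y = np.log1p(200.0 * X["trip_distance"])

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with the name of the column it was computed for.
ranked = sorted(zip(X.columns, model.feature_importances_), key=lambda p: -p[1])
for rank, (name, importance) in enumerate(ranked, start=1):
    print(f"{rank}: Feature: {name}, {importance:.3f}")
```

If the name list contained an extra column, such as the target, every name after it would be shifted by one position relative to its importance.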