<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> New York City Taxi Ride Duration Prediction </h2>

In this case study, we will build a model to predict the duration of a taxi ride. We will take the following steps:
* Install the dependencies
* Load the data as pandas dataframes
* Define the outcome variable, that is, the variable we are trying to predict
* Build features using the featuretools package, which implements Deep Feature Synthesis. We will start with simple features, incrementally improve the feature definitions, and examine the accuracy of the system.

Allocate at least 2-3 hours to go through this case study end to end.

<h2>Install Dependencies</h2>
<p>If you have not done so already, download this repository <a href="https://github.com/Featuretools/DSx/archive/master.zip">from GitHub</a>. Once you have downloaded the archive, unzip it and cd into the directory from the command line. Next, run ``./install_osx.sh`` if you are on a Mac or ``./install_linux.sh`` if you are on Linux. This should install all of the dependencies.</p>
<p> If you are on a Windows machine, open the requirements.txt file and make sure to install each of the dependencies listed (featuretools, jupyter, pandas, sklearn, xgboost, numpy). </p>
<p> Once you have installed all of the dependencies, open this notebook. On Mac and Linux, navigate to the directory that you downloaded and run ``jupyter notebook`` to be taken to this notebook in your default web browser. When you open the NewYorkCity_taxi_case_study.ipynb file in the web browser, you can step through the code by clicking the ``Run`` button at the top of the page. If you have any questions about how to use Jupyter, refer to Google or the discussion forum.</p>

<h2>Running the Code</h2>

In [43]:
import pandas as pd
import numpy as np
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features
from sklearn.metrics import mean_squared_error
from math import sqrt
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday, Week, Weekend, Sum, Mean, Median, Std)
ft.__version__
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload



<h2>Step 1: Download and load the raw data as pandas dataframes</h2>
<p>If you have not yet downloaded the data, you can download it <a href="https://s3.amazonaws.com/mit-dsx-data/nyc-taxi-data.zip">from S3</a>. Once you have downloaded the archive, unzip it and place trips.csv, passenger_cnt.csv, and vendors.csv in the nyc-taxi-data folder.
</p>

In [44]:
trips, passenger_cnt, vendors = load_nyc_taxi_data()
trips.head(10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,trip_duration
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,False,-73.95005,40.787312,2,372.0
1,1,2,2016-01-01 00:01:45,2016-01-01 00:27:38,1,13.7,-73.956169,40.707756,False,-73.939949,40.839558,1,1553.0
2,2,1,2016-01-01 00:01:47,2016-01-01 00:21:51,2,5.3,-73.993103,40.752632,False,-73.953903,40.81654,2,1204.0
3,3,2,2016-01-01 00:01:48,2016-01-01 00:16:06,1,7.19,-73.983009,40.731419,False,-73.930969,40.80846,2,858.0
4,4,1,2016-01-01 00:02:49,2016-01-01 00:20:45,2,2.9,-74.004631,40.747234,False,-73.976395,40.777237,1,1076.0
5,5,2,2016-01-01 00:03:21,2016-01-01 00:12:18,1,2.76,-73.956947,40.76638,False,-73.943008,40.796822,1,537.0
6,6,1,2016-01-01 00:04:20,2016-01-01 00:13:16,4,1.0,-73.98912,40.738045,False,-73.991638,40.748993,1,536.0
7,7,1,2016-01-01 00:05:06,2016-01-01 00:32:46,1,10.6,-73.972755,40.764198,False,-73.834953,40.692356,1,1660.0
8,8,2,2016-01-01 00:05:06,2016-01-01 00:12:27,3,2.32,-73.962997,40.765808,False,-73.967758,40.79039,2,441.0
9,9,2,2016-01-01 00:05:15,2016-01-01 00:08:27,1,0.73,-73.973824,40.792049,False,-73.977913,40.78376,2,192.0


The ``trips`` table has the following fields
* ``id`` which uniquely identifies the trip
* ``vendor_id`` is the taxi cab company - in our case study we have data from three different cab companies
* ``pickup_datetime`` the time stamp for pickup
* ``dropoff_datetime`` the time stamp for drop-off
* ``passenger_count`` the number of passengers for the trip
* ``trip_distance`` total distance of the trip in miles 
* ``pickup_longitude`` the longitude for pickup
* ``pickup_latitude`` the latitude for pickup
* ``dropoff_longitude`` the longitude of dropoff
* ``dropoff_latitude`` the latitude of dropoff
* ``payment_type``
* ``trip_duration`` this is the duration we would like to predict using other fields 

## Step 2: Prepare the Data 
Let's create entities and relationships. The three entities in this data are:
* trips
* vendors (these are the cab companies)
* passenger_cnt (a simple entity that holds the unique passenger counts, 1-8)

This data has the following relationships
* vendors --> trips (the same vendor can have multiple trips; vendors is the ``parent_entity`` and trips is the child entity)
* passenger_cnt --> trips (the same passenger_cnt can appear in multiple trips; passenger_cnt is the ``parent_entity`` and trips is the child entity)
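
To make the parent/child idea concrete, here is a minimal sketch in plain Python, with no featuretools involved. The toy rows and the helper name `children_by_parent` are made up for illustration; the sketch only shows that one parent row (a vendor) can own many child rows (trips) through a shared key.

```python
# Toy rows standing in for the real tables; the values are illustrative only.
vendors = [{"vendor_id": 1}, {"vendor_id": 2}]
trips = [
    {"id": 0, "vendor_id": 2},
    {"id": 1, "vendor_id": 2},
    {"id": 2, "vendor_id": 1},
]


def children_by_parent(parents, children, key):
    """Group child ids under the parent row that shares the same key value."""
    groups = {parent[key]: [] for parent in parents}
    for child in children:
        # A KeyError here would mean a trip refers to a vendor that does not exist.
        groups[child[key]].append(child["id"])
    return groups


trips_per_vendor = children_by_parent(vendors, trips, "vendor_id")
print(trips_per_vendor)
```

Each relationship tuple in this case study names exactly such a pair: the parent table and its key first, then the child table and the column that points back to the parent.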

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:

In [45]:
entities = {
        "trips": (trips, "id", 'pickup_datetime' ),
        "vendors": (vendors, "vendor_id"),
        "passenger_cnt": (passenger_cnt,"passenger_count")
        }

relationships = [("vendors", "vendor_id","trips", "vendor_id"), 
                ("passenger_cnt", "passenger_count","trips", "passenger_count")]

<p>We also specify, for each instance of the target entity (here ``trips``), the time at which its features are calculated. This timestamp is the last point in time from which DFS may use data when calculating features for that instance. It is supplied as a dataframe of cutoff times. Here, the cutoff time for each trip is its pickup time.</p>

In [46]:
# one row per trip: the trip id and the last time from which data may be used
cutoff_time = trips[['id', 'pickup_datetime']]
print(cutoff_time.head(10))

   id     pickup_datetime
0   0 2016-01-01 00:00:19
1   1 2016-01-01 00:01:45
2   2 2016-01-01 00:01:47
3   3 2016-01-01 00:01:48
4   4 2016-01-01 00:02:49
5   5 2016-01-01 00:03:21
6   6 2016-01-01 00:04:20
7   7 2016-01-01 00:05:06
8   8 2016-01-01 00:05:06
9   9 2016-01-01 00:05:15
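
The effect of a cutoff time can be sketched in plain Python. This is not how featuretools implements it; the helper `rows_visible_at` is hypothetical, and treating the cutoff as inclusive ("at or before") is an assumption based on the description above.

```python
from datetime import datetime


def rows_visible_at(rows, time_key, cutoff):
    """Keep only rows whose timestamp is at or before the cutoff."""
    return [row for row in rows if row[time_key] <= cutoff]


# The first three pickup times from the table above.
trips = [
    {"id": 0, "pickup_datetime": datetime(2016, 1, 1, 0, 0, 19)},
    {"id": 1, "pickup_datetime": datetime(2016, 1, 1, 0, 1, 45)},
    {"id": 2, "pickup_datetime": datetime(2016, 1, 1, 0, 1, 47)},
]

# Features for trip 1 may only draw on data available at its pickup time.
cutoff = trips[1]["pickup_datetime"]
visible_ids = [row["id"] for row in rows_visible_at(trips, "pickup_datetime", cutoff)]
print(visible_ids)
```

Trip 2 starts two seconds after the cutoff, so it is excluded when features for trip 1 are calculated.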


<h2>Step 3: Create baseline features using DFS </h2>
<p>Instead of manually creating features, such as month of <b>pickup_datetime</b>, we can let featuretools come up with them. </p> 

Featuretools does this in two steps:
* It interprets the types of the variables: categorical, numeric, and others. We can override this interpretation by specifying the types. In this case study, we wanted <b>passenger_count</b> to be of type Ordinal and <b>vendor_id</b> to be of type Categorical. This override occurred while loading the CSV files.
* Then, based on the primitives we specify, it matches each primitive to the columns to which it can be applied.

### Create transform features using transform primitives

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``weekend``. Here is what it does:

* It can be applied to any ``datetime`` column in the data.
* For each entry in the column, it assesses whether the date falls on a weekend and returns a boolean.

In this specific data, there are two ``datetime`` columns ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns as shown below. 

In [47]:
trans_primitives = [Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

<p>Here are the features created.</p>

In [48]:
print(len(features))
features

12


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: payment_type>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_latitude>,
 <Feature: trip_duration>,
 <Feature: store_and_fwd_flag>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: pickup_longitude>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>]
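
To see what an ``IS_WEEKEND`` feature holds, here is a small standard-library sketch of the same idea. The function `is_weekend` is our own illustration, not the featuretools primitive, and the second timestamp is made up.

```python
from datetime import datetime


def is_weekend(timestamp):
    """Return True when the timestamp falls on a Saturday or Sunday."""
    # datetime.weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6.
    return timestamp.weekday() >= 5


pickups = [
    datetime(2016, 1, 1, 0, 0, 19),  # a Friday, taken from the trips table
    datetime(2016, 1, 2, 9, 30, 0),  # a Saturday, made up for illustration
]
flags = [is_weekend(ts) for ts in pickups]
print(flags)
```

A transform primitive works row by row in this way: one input value in, one output value out, with no reference to other rows.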

Now let's compute the features. 

In [49]:
feature_matrix = compute_features(features,cutoff_time)

<h2>Step 4: Build the Model </h2>

To build a model:
* We first separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing``.
* We take the log of the trip duration so that a more linear relationship can be found.
* We use ``XGBoost`` to train a model.
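
The first two steps can be sketched with the standard library alone. How ``utils.get_train_test_fm`` actually splits the data is not shown in this notebook, so the ordered split in `split_train_test` below is an assumption. The log transform mirrors ``np.log(y + 1)``, and `math.exp(v) - 1` is its inverse.

```python
import math


def split_train_test(rows, train_fraction):
    """Put the first train_fraction of rows in train and the rest in test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# The first four trip durations (in seconds) from the trips table.
durations = [372.0, 1553.0, 1204.0, 858.0]
train, test = split_train_test(durations, 0.75)

# Train on log(duration + 1); the +1 keeps a zero-second trip well defined.
log_train = [math.log(d + 1) for d in train]

# Undo the transform to get back to seconds.
recovered = [math.exp(v) - 1 for v in log_train]
print(train, test)
```

Because the model is trained on log durations, the rmse values printed during training are on the log scale, not in seconds.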

In [50]:
# separates the whole feature matrix into train features, train labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [51]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:4.99751	valid-rmse:4.99615
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[10]	train-rmse:0.93005	valid-rmse:0.929421
[20]	train-rmse:0.433608	valid-rmse:0.433675
[30]	train-rmse:0.381003	valid-rmse:0.382241
[40]	train-rmse:0.373603	valid-rmse:0.375856
[50]	train-rmse:0.369393	valid-rmse:0.372375
[60]	train-rmse:0.364222	valid-rmse:0.368266
[70]	train-rmse:0.362767	valid-rmse:0.367407
[80]	train-rmse:0.360245	valid-rmse:0.365683
[90]	train-rmse:0.355881	valid-rmse:0.362394
[100]	train-rmse:0.354977	valid-rmse:0.361988
[110]	train-rmse:0.353991	valid-rmse:0.36151
[120]	train-rmse:0.35106	valid-rmse:0.359381
[130]	train-rmse:0.349586	valid-rmse:0.35839
[140]	train-rmse:0.348239	valid-rmse:0.35747
[150]	train-rmse:0.347029	valid-rmse:0.356792
[160]	train-rmse:0.346215	valid-rmse:0.356345
[170]	train-rmse:0.345722	valid-rmse:0.356136
[180]	train-rmse:0.345057	valid-rmse:0.355904

<h2>Step 5: Adding more Transform Primitives</h2>

* We add the ``Minute``, ``Hour``, ``Day``, ``Week``, ``Month``, and ``Weekday`` primitives.
* All of these transform primitives apply to ``datetime`` columns.
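
As a rough guide to what these primitives extract, here is a standard-library sketch applied to one pickup time. The helper `datetime_parts` is hypothetical. Monday-based weekday numbering and ISO week numbers are assumptions, though they agree with the ``WEEKDAY`` and ``WEEK`` values shown for 2016-01-01 in the feature matrix later in this notebook.

```python
from datetime import datetime


def datetime_parts(ts):
    """Break a timestamp into the parts named by the transform primitives."""
    return {
        "minute": ts.minute,
        "hour": ts.hour,
        "day": ts.day,
        "week": ts.isocalendar()[1],  # ISO week number
        "month": ts.month,
        "weekday": ts.weekday(),      # Monday is 0
    }


parts = datetime_parts(datetime(2016, 1, 1, 0, 1, 45))
print(parts)
```

Note that 1 January 2016 belongs to ISO week 53 of the previous year, which is why a week value of 53 can appear next to a month value of 1.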

In [52]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

In [53]:
print(len(features))
features

36


[<Feature: passenger_count>,
 <Feature: dropoff_longitude>,
 <Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: vendor_id>,
 <Feature: pickup_latitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.WEEK(first_trips_time)>,
 <Feature: vendors.DAY(first_trips_time)>,
 <Feature: passenger_cnt.WEEKDAY(first_trips_time)>,
 <Feature: vendors.WEEKDAY(first_trips_time)>,
 <Fe

Now let's compute the features. 

In [54]:
feature_matrix = compute_features(features,cutoff_time)

In [55]:
feature_matrix.head(10)

id,passenger_count,dropoff_longitude,payment_type,store_and_fwd_flag,vendor_id,pickup_latitude,pickup_longitude,trip_duration,trip_distance,dropoff_latitude,...,passenger_cnt.WEEKDAY(first_trips_time),vendors.WEEKDAY(first_trips_time),vendors.MONTH(first_trips_time),passenger_cnt.DAY(first_trips_time),passenger_cnt.MINUTE(first_trips_time),passenger_cnt.HOUR(first_trips_time),vendors.HOUR(first_trips_time),passenger_cnt.MONTH(first_trips_time),vendors.MINUTE(first_trips_time),vendors.WEEK(first_trips_time)
0,3,-73.95005,2,False,2,40.7962,-73.961258,372.0,1.32,40.787312,...,4,4,1,1,0,0,0,1,1,53
1,1,-73.939949,1,False,2,40.707756,-73.956169,1553.0,13.7,40.839558,...,4,4,1,1,45,1,0,1,1,53
2,2,-73.953903,2,False,1,40.752632,-73.993103,1204.0,5.3,40.81654,...,4,4,1,1,47,1,0,1,1,53
3,1,-73.930969,2,False,2,40.731419,-73.983009,858.0,7.19,40.80846,...,4,4,1,1,45,1,0,1,1,53
4,2,-73.976395,1,False,1,40.747234,-74.004631,1076.0,2.9,40.777237,...,4,4,1,1,47,1,0,1,1,53
5,1,-73.943008,1,False,2,40.76638,-73.956947,537.0,2.76,40.796822,...,4,4,1,1,45,1,0,1,1,53
6,4,-73.991638,1,False,1,40.738045,-73.98912,536.0,1.0,40.748993,...,4,4,1,1,4,0,0,1,1,53
7,1,-73.834953,1,False,1,40.764198,-73.972755,1660.0,10.6,40.692356,...,4,4,1,1,45,1,0,1,1,53
8,3,-73.967758,2,False,2,40.765808,-73.962997,441.0,2.32,40.79039,...,4,4,1,1,0,0,0,1,1,53
9,1,-73.977913,2,False,2,40.792049,-73.973824,192.0,0.73,40.78376,...,4,4,1,1,45,1,0,1,1,53


<h2>Step 5: Build the new model</h2>

In [56]:
# separates the whole feature matrix into train features, train labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [57]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:4.98698	valid-rmse:4.98587
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[10]	train-rmse:0.908664	valid-rmse:0.907791
[20]	train-rmse:0.378251	valid-rmse:0.378217
[30]	train-rmse:0.332551	valid-rmse:0.33422
[40]	train-rmse:0.317389	valid-rmse:0.320492
[50]	train-rmse:0.31068	valid-rmse:0.315051
[60]	train-rmse:0.292872	valid-rmse:0.298602
[70]	train-rmse:0.288614	valid-rmse:0.295648
[80]	train-rmse:0.27091	valid-rmse:0.278915
[90]	train-rmse:0.255523	valid-rmse:0.264503
[100]	train-rmse:0.250732	valid-rmse:0.260512
[110]	train-rmse:0.240201	valid-rmse:0.251177
[120]	train-rmse:0.226158	valid-rmse:0.238181
[130]	train-rmse:0.217158	valid-rmse:0.230031
[140]	train-rmse:0.207297	valid-rmse:0.220993
[150]	train-rmse:0.200401	valid-rmse:0.215108
[160]	train-rmse:0.191279	valid-rmse:0.207002
[170]	train-rmse:0.186321	valid-rmse:0.202622
[180]	train-rmse:0.183084	valid-rmse:0.1999

<h2>Step 6: Add Aggregation Primitives</h2>

Now let's add aggregation primitives. These primitives generate features for the parent entities, in this case both ``vendors`` and ``passenger_cnt``, and then add them to the ``trips`` entity (the entity for which we are trying to make a prediction).
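
The idea behind an aggregation feature such as ``vendors.MEAN(trips.trip_distance)`` can be sketched in plain Python: summarize the child rows per parent, then copy the summary back onto each child. This sketch is an illustration only. It ignores cutoff times, which featuretools applies when it computes the real features, and it uses only three toy rows.

```python
from statistics import mean

# Three trips from the table above, reduced to the columns we need.
trips = [
    {"id": 0, "vendor_id": 2, "trip_distance": 1.32},
    {"id": 1, "vendor_id": 2, "trip_distance": 13.7},
    {"id": 2, "vendor_id": 1, "trip_distance": 5.3},
]

# Step 1: collect the child values under each parent (vendor).
by_vendor = {}
for trip in trips:
    by_vendor.setdefault(trip["vendor_id"], []).append(trip["trip_distance"])

# Step 2: aggregate per parent.
vendor_mean = {vendor: mean(values) for vendor, values in by_vendor.items()}

# Step 3: attach the parent-level value to every child row.
for trip in trips:
    trip["vendors.MEAN(trips.trip_distance)"] = vendor_mean[trip["vendor_id"]]

print(vendor_mean)
```

Every trip of the same vendor receives the same aggregated value, which is why these features describe the parent rather than the individual trip.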

In [58]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Sum, Mean, Median, Std]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=aggregation_primitives,
                   features_only=True)

In [59]:
print(len(features))
features

92


[<Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: pickup_latitude>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.STD(trips.pickup_longitude)>,
 <Feature: passenger_cnt.SUM(trips.pickup_longitude)>,
 <Feature: vendors.SUM(trips.dropoff_longitude)>,
 <Feature: passenger_cnt.WEEKDAY(firs

In [60]:
feature_matrix = compute_features(features,cutoff_time)

In [61]:
feature_matrix.head(10)

id,payment_type,store_and_fwd_flag,dropoff_longitude,pickup_longitude,trip_duration,vendor_id,passenger_count,pickup_latitude,trip_distance,dropoff_latitude,...,passenger_cnt.MEAN(trips.pickup_longitude),vendors.MEDIAN(trips.dropoff_latitude),passenger_cnt.SUM(trips.trip_distance),passenger_cnt.MEDIAN(trips.payment_type),passenger_cnt.STD(trips.dropoff_latitude),passenger_cnt.MEDIAN(trips.trip_duration),vendors.MEDIAN(trips.payment_type),passenger_cnt.MEDIAN(trips.dropoff_latitude),vendors.MEAN(trips.pickup_longitude),vendors.MEAN(trips.dropoff_longitude)
0,2,False,-73.95005,-73.961258,372.0,2,3,40.7962,1.32,40.787312,...,,,,,,,,,,
1,1,False,-73.939949,-73.956169,1553.0,2,1,40.707756,13.7,40.839558,...,,,,,,,,,,
2,2,False,-73.953903,-73.993103,1204.0,1,2,40.752632,5.3,40.81654,...,,,,,,,,,,
3,2,False,-73.930969,-73.983009,858.0,2,1,40.731419,7.19,40.80846,...,,,,,,,,,,
4,1,False,-73.976395,-74.004631,1076.0,1,2,40.747234,2.9,40.777237,...,,,,,,,,,,
5,1,False,-73.943008,-73.956947,537.0,2,1,40.76638,2.76,40.796822,...,,,,,,,,,,
6,1,False,-73.991638,-73.98912,536.0,1,4,40.738045,1.0,40.748993,...,,,,,,,,,,
7,1,False,-73.834953,-73.972755,1660.0,1,1,40.764198,10.6,40.692356,...,,,,,,,,,,
8,2,False,-73.967758,-73.962997,441.0,2,3,40.765808,2.32,40.79039,...,,,,,,,,,,
9,2,False,-73.977913,-73.973824,192.0,2,1,40.792049,0.73,40.78376,...,,,,,,,,,,


<h2>Step 6: Build the new model</h2>

In [62]:
# separates the whole feature matrix into train features, train labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [63]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:4.99815	valid-rmse:4.99687
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[10]	train-rmse:0.936621	valid-rmse:0.9361
[20]	train-rmse:0.376382	valid-rmse:0.377363
[30]	train-rmse:0.323934	valid-rmse:0.326726
[40]	train-rmse:0.310511	valid-rmse:0.315125
[50]	train-rmse:0.303277	valid-rmse:0.309302
[60]	train-rmse:0.28603	valid-rmse:0.293976
[70]	train-rmse:0.272503	valid-rmse:0.281673
[80]	train-rmse:0.255918	valid-rmse:0.26648
[90]	train-rmse:0.249071	valid-rmse:0.26074
[100]	train-rmse:0.246214	valid-rmse:0.259197
[110]	train-rmse:0.240819	valid-rmse:0.254303
[120]	train-rmse:0.234597	valid-rmse:0.248932
[130]	train-rmse:0.225729	valid-rmse:0.240803
[140]	train-rmse:0.223231	valid-rmse:0.23897
[150]	train-rmse:0.221887	valid-rmse:0.238123
[160]	train-rmse:0.22111	valid-rmse:0.237765
[170]	train-rmse:0.210751	valid-rmse:0.227995
[180]	train-rmse:0.201788	valid-rmse:0.219886
[

<h2>Step 7: Evaluate on test data</h2>


In [64]:
y_pred = utils.predict_xgb(model, X_test)
y_pred.head(5)

id,trip_duration
765003,641.368591
765004,538.560974
765005,1349.380615
765006,970.936584
765007,2034.524536


In [65]:
print("rmse: {}".format(np.sqrt(mean_squared_error(y_test, y_pred['trip_duration']))))

rmse: 194.882283782
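
For reference, the root mean squared error reported above can be computed by hand as follows. The actual and predicted durations below are made up for illustration and do not come from the test set.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative durations in seconds (not real test data).
actual = [600.0, 500.0, 1400.0]
predicted = [641.0, 538.0, 1349.0]
score = rmse(actual, predicted)
print("rmse:", score)
```

Because the errors are squared before averaging, a few badly mispredicted trips raise the rmse more than many small misses do.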


<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [66]:
feature_names = X_train.columns.values
ft_importances = utils.feature_importances(model, feature_names)
ft_importances[:20]

Unnamed: 0,feature_name,importance
6,pickup_latitude,4486.0
2,dropoff_longitude,3982.0
8,dropoff_latitude,3564.0
3,pickup_longitude,3381.0
7,trip_distance,2821.0
28,MINUTE(dropoff_datetime),2417.0
73,MINUTE(pickup_datetime),2267.0
72,HOUR(dropoff_datetime),2166.0
70,HOUR(pickup_datetime),1973.0
71,WEEKDAY(dropoff_datetime),1009.0
