<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2>An Advanced Featuretools Approach with Premium Primitives</h2>
<p>The following tutorial illustrates an advanced Featuretools model for the NYC Taxi Trip Duration competition on Kaggle, using premium primitives and our custom primitive API. You will need to download the following five files into a `data` folder in this repository.</p>

<h2>Step 1: Load EntitySet </h2>

For this example we will download a pre-built EntitySet.

In [None]:
import featuretools as ft
import pandas as pd
import numpy as np

In [None]:
es = ft.entityset.read_entityset("s3://featurelabs-static/nyc_taxi_entityset")

With [graphviz installed](https://docs.featuretools.com/getting_started/install.html#installing-graphviz), we can generate a visualization of the EntitySet.

In [None]:
es.plot()

## Featuretools with Premium Primitives and Custom Primitives

Some premium primitives we can apply to this problem:

* CityblockDistance - roughly, the distance from point A to point B when traveling only due North, East, South, or West. This can be a better distance metric than Haversine, since cars cannot travel diagonally through a city block.
* IsInGeoBox - returns True if a LatLong coordinate falls within the rectangle whose opposite corners are the two supplied coordinates. We use two boxes:
    * (40.62, -73.85), (40.70, -73.75) - Area around JFK Airport
    * (40.76, -73.89), (40.78, -73.85) - Area around LaGuardia Airport
* IsFederalHoliday - returns True if the input time falls on a federal holiday.
* PartOfDay - returns Morning, Afternoon, Evening, or Night based on the input time.
* Bearing - the angle (in degrees) between due North and the shortest path from point A to point B. This one is not imported; it is defined as a custom primitive.
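
To make the geographic primitives above concrete, here is a minimal NumPy sketch of what a city-block distance and a geo-box test compute. The helper names (`haversine_miles`, `cityblock_miles`, `in_geo_box`), the Earth-radius constant, and the approximate JFK and Times Square coordinates are assumptions made for illustration. This is not the premium primitives' actual implementation.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def cityblock_miles(lat1, lon1, lat2, lon2):
    """Travel east/west first, then north/south, and add the two legs."""
    east_west = haversine_miles(lat1, lon1, lat1, lon2)
    north_south = haversine_miles(lat1, lon2, lat2, lon2)
    return east_west + north_south


def in_geo_box(lat, lon, corner1, corner2):
    """True if (lat, lon) lies inside the rectangle spanned by two corners."""
    lat_lo, lat_hi = sorted((corner1[0], corner2[0]))
    lon_lo, lon_hi = sorted((corner1[1], corner2[1]))
    return bool(lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi)


# Approximate coordinates, used only as sample inputs.
jfk = (40.6413, -73.7781)
times_square = (40.7580, -73.9855)
jfk_box = ((40.62, -73.85), (40.70, -73.75))

direct = haversine_miles(*times_square, *jfk)
blocks = cityblock_miles(*times_square, *jfk)
print(direct, blocks)
print(in_geo_box(*jfk, *jfk_box), in_geo_box(*times_square, *jfk_box))
```

The city-block distance is never shorter than the great-circle distance, which is why it can track driving distance on a street grid more closely.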

In [None]:
from featuretools.primitives import CityblockDistance, IsInGeoBox, IsFederalHoliday, PartOfDay

In [None]:
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import LatLong, Numeric

def bearing(latlong1, latlong2):
    lat1 = np.array([x[0] for x in latlong1])
    lon1 = np.array([x[1] for x in latlong1])
    lat2 = np.array([x[0] for x in latlong2])
    lon2 = np.array([x[1] for x in latlong2])
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))

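# Note: bearing(A, B) is not equal to bearing(B, A). Here commutative=True
# only tells DFS to build one feature per unordered pair of inputs.
Bearing = make_trans_primitive(function=bearing,
                               input_types=[LatLong, LatLong],
                               commutative=True,
                               return_type=Numeric)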
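
As a quick sanity check, the `bearing` function can be called directly on a few hand-picked points. The function body is copied from the cell above so the snippet runs on its own; the coordinates are made-up sample inputs.

```python
import numpy as np


def bearing(latlong1, latlong2):
    lat1 = np.array([x[0] for x in latlong1])
    lon1 = np.array([x[1] for x in latlong1])
    lat2 = np.array([x[0] for x in latlong2])
    lon2 = np.array([x[1] for x in latlong2])
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))


# Sample points: one degree north and one degree east of an origin.
origin = [(40.0, -74.0)]
north = [(41.0, -74.0)]
east = [(40.0, -73.0)]

to_north = bearing(origin, north)[0]   # heading due north
to_east = bearing(origin, east)[0]     # heading roughly east
back_west = bearing(east, origin)[0]   # same two points, reversed
print(to_north, to_east, back_west)
```

Heading north gives a bearing of 0, heading east gives a value just under 90, and reversing the arguments flips the sign, so the argument order matters when reading the feature.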

Before calculating the features we need to specify the [cutoff time](https://docs.featuretools.com/automated_feature_engineering/handling_time.html#what-is-the-cutoff-time) for using data to calculate features for an individual trip. We'll use the time when the passenger was picked up.

In [None]:
cutoff_time = es['trips'].df[['id', 'pickup_datetime']]
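
To illustrate what a cutoff time protects against, here is a small pandas sketch on made-up data. The toy rows and the helper `vendor_trips_at_cutoff` are hypothetical, and the sketch assumes data stamped on or before the cutoff is allowed. It only mimics the idea; it is not how Featuretools computes features internally.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tutorial, rows are made up.
trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "vendor_id": [1, 1, 1, 2],
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 08:00", "2016-01-01 09:00",
        "2016-01-01 10:00", "2016-01-01 09:30"]),
})


def vendor_trips_at_cutoff(trip):
    """Count same-vendor trips picked up on or before this trip's cutoff."""
    visible = trips[(trips["vendor_id"] == trip["vendor_id"])
                    & (trips["pickup_datetime"] <= trip["pickup_datetime"])]
    return len(visible)


# Each row only "sees" trips up to its own pickup time, never later ones.
trips["vendor_trip_count_at_cutoff"] = trips.apply(vendor_trips_at_cutoff, axis=1)
print(trips)
```

Without the time filter, every vendor-1 trip would get a count of 3, leaking information about trips that had not yet happened.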

In [None]:
agg_primitives = ['Sum', 'Mean', 'Median', 'Std', 'Count', 'Min', 'Max', 'Num_Unique', 'Skew']
trans_primitives = ["cityblock_distance",
                    IsInGeoBox((40.62, -73.85), (40.70, -73.75)),
                    IsInGeoBox((40.76, -73.89), (40.78, -73.85)),
                    "is_federal_holiday",
                    "part_of_day",
                    Bearing,
                    "Haversine", "Latitude", "Longitude",
                    'Day', 'Hour', 'Minute', 'Month', 'Weekday', 'Week', 'Is_weekend']

# this allows us to create features that are conditioned on a second value before we calculate.
es.add_interesting_values()

# calculate feature_matrix using deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="trips",
                                  trans_primitives=trans_primitives,
                                  agg_primitives=agg_primitives,
                                  drop_contains=['trips.test_data'],
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  approximate='36d',
                                  max_depth=4)

feature_matrix.head()

We need to encode the PartOfDay features so that XGBoost can process them.

In [None]:
encoded_matrix, encoded_features = ft.encode_features(
    feature_matrix=feature_matrix,
    features=features,
    to_encode=["PART_OF_DAY(pickup_datetime)",
               "vendors.PART_OF_DAY(first_trips_time)",
               "passenger_cnt.PART_OF_DAY(first_trips_time)"])

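The encoding step turns each categorical column into numeric indicator columns. A rough equivalent on toy data, using `pandas.get_dummies` as a stand-in (an assumption for illustration: the column naming and options of `ft.encode_features` differ):

```python
import pandas as pd

# Made-up sample values for a single PartOfDay feature.
toy = pd.DataFrame({
    "PART_OF_DAY(pickup_datetime)": ["Morning", "Night", "Evening", "Morning"],
})

# One indicator column per distinct category.
encoded = pd.get_dummies(toy, columns=["PART_OF_DAY(pickup_datetime)"])
print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one active indicator, which is a representation a tree booster can split on.
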
## Step 5: Build the Model

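<p>We need the labels for the training set. The feature matrix still contains the `test_data` flag and the `trip_duration` column, so we keep the training rows and split `trip_duration` off as the label.</p>

<p>We train on the `log` of the trip duration (plus one). The competition is scored on RMSLE, so RMSE on the log-transformed target matches that metric, and the transform compresses the long tail of very long trips.</p>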

In [None]:
from sklearn.model_selection import train_test_split

X_train = encoded_matrix[encoded_matrix['test_data'] == False]
X_train = X_train.drop(['test_data'], axis=1)
labels = X_train['trip_duration']
X_train = X_train.drop(['trip_duration'], axis=1)
labels = np.log(labels.values + 1)

X_train, X_val, y_train, y_val = train_test_split(X_train.values,
                                                  labels,
                                                  test_size=0.2,
                                                  random_state=0)
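
The link between the log transform and the competition metric can be checked numerically. The durations below are made-up values in seconds; the sketch shows that RMSE on `log(duration + 1)` equals RMSLE on the raw durations, and that `np.expm1` undoes the transform when predictions need to be reported in seconds.

```python
import numpy as np

# Made-up trip durations in seconds.
actual = np.array([300.0, 900.0, 1800.0])
predicted = np.array([350.0, 800.0, 2000.0])

# RMSLE computed directly on raw durations.
rmsle = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# RMSE computed on the log-transformed target, as in the training cell.
y_log = np.log(actual + 1)
pred_log = np.log(predicted + 1)
rmse_on_log = np.sqrt(np.mean((pred_log - y_log) ** 2))

# Invert the transform to get back to seconds.
recovered = np.expm1(pred_log)
print(rmsle, rmse_on_log, recovered)
```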

In [None]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_val, label=y_val)

evals = [(dtrain, 'train'), (dvalid, 'valid')]

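# Newer XGBoost releases rename "silent" to "verbosity" and
# "reg:linear" to "reg:squarederror"; adjust these two keys if needed.
xgb_params = {
    "min_child_weight": 1, "eta": 0.166, "colsample_bytree": 0.4,
    "max_depth": 9, "subsample": 1.0, "lambda": 57.93, "booster": "gbtree",
    "gamma": 0.5, "silent": 1, "eval_metric": "rmse", "objective": "reg:linear"
}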

model = xgb.train(params=xgb_params, dtrain=dtrain, num_boost_round=227,
                  evals=evals, early_stopping_rounds=60, maximize=False,
                  verbose_eval=10)

print('Modeling RMSLE %.5f' % model.best_score)


In [None]:
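feature_importance_dict = model.get_fscore()

# X_train is a NumPy array after train_test_split, so recover the column
# names from the encoded matrix, minus the two columns dropped before training.
feature_names = encoded_matrix.drop(['test_data', 'trip_duration'], axis=1).columns.values

# XGBoost names unnamed columns f0, f1, ...; map them back to feature names.
fs = ['f%i' % i for i in range(len(feature_names))]
f1 = pd.DataFrame({'f': list(feature_importance_dict.keys()),
                   'importance': list(feature_importance_dict.values())})
f2 = pd.DataFrame({'f': fs, 'feature_name': feature_names})

# A right merge keeps features that were never used in a split (importance 0).
feature_importance = pd.merge(f1, f2, how='right', on='f')
feature_importance = feature_importance.fillna(0)
feature_importance = feature_importance[['feature_name', 'importance']].sort_values(
    by='importance', ascending=False)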
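
The name-mapping step above can be followed on a toy example. The score dictionary and feature names below are made up; they stand in for the output of `model.get_fscore()` and the real column names.

```python
import pandas as pd

# Hypothetical split counts keyed by positional names; f1 never appears.
toy_fscore = {"f0": 12, "f2": 5}
toy_names = ["haversine", "hour", "bearing"]

scores = pd.DataFrame({"f": list(toy_fscore.keys()),
                       "importance": list(toy_fscore.values())})
names = pd.DataFrame({"f": ["f%i" % i for i in range(len(toy_names))],
                      "feature_name": toy_names})

# Right merge keeps every feature; unused ones get importance 0.
toy_importance = (pd.merge(scores, names, how="right", on="f")
                  .fillna(0)[["feature_name", "importance"]]
                  .sort_values(by="importance", ascending=False))
print(toy_importance)
```

Features that the model never split on are missing from the score dictionary, which is why the merge direction and the `fillna(0)` matter.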


<p>
    <img src="https://www.featurelabs.com/wp-content/uploads/2017/12/logo.png" alt="Featuretools" />
</p>

Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/).