# Predict NYC Taxi Fares using ANN

The goal is to estimate the cost of a New York City cab ride from several inputs. The inspiration behind this code-along is a recent <a href='https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'>Kaggle competition</a>.

### STEPS:
1. Read the data: only a portion of the 55-million-record dataset is used (120,000 records from April 11 to April 24, 2010).
2. Feature engineering:
 - Calculate distance
 - Derive useful date and time statistics
3. Deal with categorical data:
 - Embedding
4. Use a TabularModel class to work with both continuous and categorical data:
 - Create a TabularModel class
 - Add a loss function and optimizer
 - Train/test split the data
 - Train the model
 - Evaluate on test data
 - Predict on brand new data

In [1]:
import torch 
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# read the data into a data frame
df = pd.read_csv("../Data/NYCTaxiFares.csv")
df.head(5)

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2010-04-19 08:17:56 UTC,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1
1,2010-04-17 15:43:53 UTC,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1
2,2010-04-17 11:23:26 UTC,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2
3,2010-04-11 21:25:03 UTC,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1
4,2010-04-17 02:19:01 UTC,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1


In [4]:
# descriptive statistics on the fare amount
df['fare_amount'].describe()

count    120000.000000
mean         10.040326
std           7.500134
min           2.500000
25%           5.700000
50%           7.700000
75%          11.300000
max          49.900000
Name: fare_amount, dtype: float64

So the average fare is about \\$10, with a minimum of \\$2.50, a maximum of \\$49.90 and a median of \\$7.70.

In [6]:
# copy the dataframe to do feature engineering
df1 = df.copy()
df1.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2010-04-19 08:17:56 UTC,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1
1,2010-04-17 15:43:53 UTC,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1
2,2010-04-17 11:23:26 UTC,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2
3,2010-04-11 21:25:03 UTC,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1
4,2010-04-17 02:19:01 UTC,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1


### Distance calculation
Our goal is to predict the fare for a taxi ride in New York City, which usually depends on the distance travelled. The dataset gives the longitude and latitude of the pickup and dropoff points, but the raw coordinate values all look very similar, so they say little about trip length on their own. To calculate the distance we use the <a href='https://en.wikipedia.org/wiki/Haversine_formula'>haversine formula</a>, which gives the distance on a sphere between two sets of GPS coordinates.<br>
Here we denote latitude by $\varphi$ (phi) and longitude by $\lambda$ (lambda).

The distance formula works out to

${\displaystyle d=2r\arcsin \left({\sqrt {\sin ^{2}\left({\frac {\varphi _{2}-\varphi _{1}}{2}}\right)+\cos(\varphi _{1})\:\cos(\varphi _{2})\:\sin ^{2}\left({\frac {\lambda _{2}-\lambda _{1}}{2}}\right)}}\right)}$

where

$\begin{split} r&: \textrm {radius of the sphere (Earth's radius averages 6371 km)}\\
\varphi_1, \varphi_2&: \textrm {latitudes of point 1 and point 2 in radians}\\
\lambda_1, \lambda_2&: \textrm {longitudes of point 1 and point 2 in radians}\end{split}$

In [10]:
# define the haversine formula in a function
def haversine(df, long1, lat1, long2, lat2):
    """
    Calculates the haversine distance in km between 2 sets of GPS coordinates in the dataframe df
    """
    # r = average radius of the Earth in km
    r = 6371

    # convert the latitudes and longitudes from degrees to radians
    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])

    lambda1 = np.radians(df[long1])
    lambda2 = np.radians(df[long2])

    # haversine terms: the half-angle is taken inside the sine for both latitude and longitude
    a = np.square(np.sin((phi2 - phi1) / 2))
    b = np.cos(phi1) * np.cos(phi2) * np.square(np.sin((lambda2 - lambda1) / 2))
    c = np.sqrt(a + b)
    d = 2 * r * np.arcsin(c)

    return d

In [11]:
# Feature engineering the distance in Kms
df1['distance_kms'] = haversine(df1,'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude')
df1.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_kms
0,2010-04-19 08:17:56 UTC,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312
1,2010-04-17 15:43:53 UTC,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307
2,2010-04-17 11:23:26 UTC,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763
3,2010-04-11 21:25:03 UTC,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129
4,2010-04-17 02:19:01 UTC,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231319
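
As a sanity check on the haversine formula, here is a minimal scalar sketch that uses only Python's `math` module. The function name `haversine_km` and its argument order are illustrative and not part of this notebook. Applied to the coordinates of the first record, it should give roughly the 2.13 km shown in the `distance_kms` column above.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # haversine of the central angle between the two points
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

# Pickup and dropoff coordinates of the first record in the dataset
d = haversine_km(-73.992365, 40.730521, -73.975499, 40.744746)
print(round(d, 3))
```

Because the trips are short, the result is not sensitive to the exact Earth radius used; 6371 km is the average value quoted earlier.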


### Engineering the day and time

Time can also be an important predictor of the fare, since fares may vary by day of the week and during peak hours. To evaluate this, we first have to convert the pickup time from a string into a datetime variable.

In [12]:
# The datetime column is stored as a string (object dtype)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 9 columns):
pickup_datetime      120000 non-null object
fare_amount          120000 non-null float64
fare_class           120000 non-null int64
pickup_longitude     120000 non-null float64
pickup_latitude      120000 non-null float64
dropoff_longitude    120000 non-null float64
dropoff_latitude     120000 non-null float64
passenger_count      120000 non-null int64
distance_kms         120000 non-null float64
dtypes: float64(6), int64(2), object(1)
memory usage: 8.2+ MB


In [14]:
df1['pickup_datetime'] = pd.to_datetime(df1['pickup_datetime'])
df1.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_kms
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231319


In [16]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 9 columns):
pickup_datetime      120000 non-null datetime64[ns, UTC]
fare_amount          120000 non-null float64
fare_class           120000 non-null int64
pickup_longitude     120000 non-null float64
pickup_latitude      120000 non-null float64
dropoff_longitude    120000 non-null float64
dropoff_latitude     120000 non-null float64
passenger_count      120000 non-null int64
distance_kms         120000 non-null float64
dtypes: datetime64[ns, UTC](1), float64(6), int64(2)
memory usage: 8.2 MB


Now pickup_datetime is in datetime format, expressed in UTC. New York is in the Eastern Time Zone, which is 5 hours behind UTC during standard time (EST). Daylight saving time was in effect from April 11 to April 24, 2010, so local time was Eastern Daylight Time (EDT), which is 4 hours behind UTC. We therefore subtract 4 hours to convert to EDT. The resulting column is still labelled +00:00, but its values are New York local times.

We will also extract the day of the week, the hour, and whether the ride took place in the AM or PM, to evaluate their impact on the fare.
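
An alternative to a fixed 4-hour shift, sketched here as an assumption rather than the approach used in this notebook, is to let pandas do the time zone conversion with `tz_convert`, which picks the correct offset for each timestamp. The January timestamp below is made up purely to show the standard-time case.

```python
import pandas as pd

# Two UTC pickup times: one in April (daylight saving time), one in January (standard time)
utc_times = pd.Series(pd.to_datetime(["2010-04-19 08:17:56", "2010-01-19 08:17:56"], utc=True))

# tz_convert applies the offset valid on each date (UTC-4 in April, UTC-5 in January)
ny_times = utc_times.dt.tz_convert("America/New_York")

local_hours = ny_times.dt.hour.tolist()
print(local_hours)  # [4, 3]
```

For a dataset that lies entirely within the daylight saving period, as this one does, both approaches give the same local hour.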

In [21]:
df1['EDT_time'] = df1['pickup_datetime'] - pd.Timedelta(hours=4)
df1.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_kms,EDT_time
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312,2010-04-19 04:17:56+00:00
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307,2010-04-17 11:43:53+00:00
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763,2010-04-17 07:23:26+00:00
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129,2010-04-11 17:25:03+00:00
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231319,2010-04-16 22:19:01+00:00


In [22]:
df1['day'] = df1['EDT_time'].dt.strftime("%a")

In [25]:
df1['hour'] = df1['EDT_time'].dt.hour

In [27]:
df1['AMorPM'] = np.where(df1['hour']<12,'am','pm')
df1.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_kms,EDT_time,day,hour,AMorPM
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312,2010-04-19 04:17:56+00:00,Mon,4,am
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307,2010-04-17 11:43:53+00:00,Sat,11,am
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763,2010-04-17 07:23:26+00:00,Sat,7,am
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129,2010-04-11 17:25:03+00:00,Sun,17,pm
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231319,2010-04-16 22:19:01+00:00,Fri,22,pm


### Separate categorical from continuous columns

In [29]:
# listed in the order used when the category codes are stacked: hour, AM/PM, weekday
cat_cols = ['hour','AMorPM','day']
cont_cols = ['pickup_longitude','pickup_latitude', 'dropoff_longitude', 
             'dropoff_latitude','passenger_count', 'distance_kms']
y_col = ['fare_amount']

In [28]:
# Categorify: converting the columns into categories
# first, list the available columns
df1.columns

Index(['pickup_datetime', 'fare_amount', 'fare_class', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'distance_kms', 'EDT_time', 'day', 'hour', 'AMorPM'],
      dtype='object')

In [30]:
for cat in cat_cols:
    df1[cat] = df1[cat].astype('category')

In [34]:
df1.dtypes

pickup_datetime      datetime64[ns, UTC]
fare_amount                      float64
fare_class                         int64
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
distance_kms                     float64
EDT_time             datetime64[ns, UTC]
day                             category
hour                            category
AMorPM                          category
dtype: object

The last three columns are now converted to categories.

In [35]:
df1['hour'].head()

0     4
1    11
2     7
3    17
4    22
Name: hour, dtype: category
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]

In [39]:
df1['day'].cat.codes

0         1
1         2
2         2
3         3
4         0
5         4
6         0
7         4
8         5
9         0
10        4
11        4
12        0
13        2
14        6
15        5
16        4
17        4
18        6
19        5
20        5
21        1
22        5
23        5
24        6
25        2
26        3
27        4
28        0
29        5
         ..
119970    3
119971    4
119972    0
119973    2
119974    2
119975    5
119976    5
119977    5
119978    3
119979    6
119980    6
119981    4
119982    6
119983    6
119984    0
119985    1
119986    0
119987    1
119988    3
119989    2
119990    3
119991    1
119992    0
119993    1
119994    3
119995    3
119996    0
119997    3
119998    5
119999    2
Length: 120000, dtype: int8

In [38]:
df1['AMorPM'].cat.categories

Index(['am', 'pm'], dtype='object')
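
The codes shown above are assigned by sorting the category labels, which is why Fri maps to 0 and Mon to 1. The sketch below reproduces that behaviour on five sample values and, as an optional variation that this notebook does not use, shows how an explicit category list would give calendar-ordered codes. The name `week_order` is illustrative.

```python
import pandas as pd

days = pd.Series(["Mon", "Sat", "Sat", "Sun", "Fri"])

# Default: categories are the sorted unique labels, so Fri=0, Mon=1, Sat=2, Sun=3
default_codes = days.astype("category").cat.codes.tolist()

# Variation: fix the category list explicitly so that Mon=0 ... Sun=6
week_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ordered = pd.Categorical(days, categories=week_order, ordered=True)
ordered_codes = ordered.codes.tolist()

print(default_codes)  # [1, 2, 2, 3, 0]
print(ordered_codes)  # [0, 5, 5, 6, 4]
```

Either coding works for an embedding layer, since the codes are only used as lookup indices. An explicit category list does have the advantage that the number of categories stays fixed even when some labels are missing from a sample.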

So we have hour divided into 24 categories, day into 7 and AMorPM into 2. All of these have been numerically coded as well.
Next we convert these categories to NumPy arrays so that we can later feed them to a neural network.

In [40]:
hr = df1['hour'].cat.codes.values
ampm = df1['AMorPM'].cat.codes.values
wkday = df1['day'].cat.codes.values

In [41]:
# Check one of the variables
hr

array([ 4, 11,  7, ..., 14,  4, 12], dtype=int8)

In [42]:
# stack these column-wise: hour, AM/PM, weekday
cats = np.stack([hr, ampm, wkday], axis=1)
cats

array([[ 4,  0,  1],
       [11,  0,  2],
       [ 7,  0,  2],
       ...,
       [14,  1,  3],
       [ 4,  0,  5],
       [12,  1,  2]], dtype=int8)

Now the three categorical variables are represented as a single NumPy array with one column per variable: hour, AM/PM and day of the week.

In [None]:
# Another way of doing it is with a list comprehension (the columns then follow the order of cat_cols)
# cats = np.stack([df1[col].cat.codes.values for col in cat_cols], 1)

In [43]:
# convert category array to tensor
cats = torch.tensor(cats,dtype = torch.int64)

In [49]:
# shape of category tensor
cats.shape

torch.Size([120000, 3])

In [53]:
# stack the continuous columns into a NumPy array
conts = np.stack([df1[col].values for col in cont_cols],axis =1)
conts[:5]

array([[-73.992365  ,  40.730521  , -73.975499  ,  40.744746  ,
          1.        ,   2.12631158],
       [-73.990078  ,  40.740558  , -73.974232  ,  40.744114  ,
          1.        ,   1.39230685],
       [-73.994149  ,  40.751118  , -73.960064  ,  40.766235  ,
          2.        ,   3.32676333],
       [-73.990485  ,  40.756422  , -73.971205  ,  40.748192  ,
          1.        ,   1.86412923],
       [-73.990976  ,  40.734202  , -73.905956  ,  40.743115  ,
          1.        ,   7.23131908]])

In [45]:
# convert continuous array to tensor
conts = torch.tensor(conts, dtype = torch.float)

In [48]:
# shape of the continuous tensor
conts.shape

torch.Size([120000, 6])

In [54]:
# convert label to tensor
y = torch.tensor(df1[y_col].values, dtype=torch.float).reshape(-1, 1)
y[:5]

tensor([[ 6.5000],
        [ 6.9000],
        [10.1000],
        [ 8.9000],
        [19.7000]])

In [55]:
# shape of the label tensor
y.shape

torch.Size([120000, 1])

### Embedding

PyTorch provides an embedding layer, `nn.Embedding`, which acts as a lookup table: it maps each category code to a learned dense vector of fixed size. We use it in place of one-hot encoding the categories.
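
To make the lookup concrete, here is a minimal sketch of passing category codes through one `nn.Embedding` table per column and concatenating the results. The embedding sizes and the three sample rows are illustrative assumptions, and the variable names `embeds` and `embedded` are not taken from this notebook's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

# One table per categorical column: (number of categories, embedding size).
# These sizes are assumed for illustration: hour, AM/PM, weekday.
emb_szs = [(24, 12), (2, 1), (7, 4)]
embeds = nn.ModuleList([nn.Embedding(n_cat, n_dim) for n_cat, n_dim in emb_szs])

# Three rows of category codes in the order hour, AM/PM, weekday
cats = torch.tensor([[4, 0, 1], [11, 0, 2], [17, 1, 3]], dtype=torch.int64)

# Look up each column in its own table, then concatenate the vectors row by row
embedded = torch.cat([emb(cats[:, i]) for i, emb in enumerate(embeds)], dim=1)

print(embedded.shape)  # torch.Size([3, 17])
```

Each row is now a vector of 12 + 1 + 4 = 17 learned values, which could be concatenated with the continuous features before the first linear layer.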

### Set an embedding size
The rule of thumb for determining the embedding size is to divide the number of unique entries in each column by 2 (rounding up), but not to exceed 50.

In [2]:
# Set the embedding size for each categorical column, in the order given by cat_cols
cat_szs = [len(df1[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs

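
As a worked example of the rule of thumb, the sketch below applies it to the category counts found earlier (24 hours, 2 AM/PM values, 7 weekdays). The helper name `embedding_size` and the `counts` dictionary are illustrative and hard-code those counts rather than reading them from `df1`.

```python
def embedding_size(n_categories, cap=50):
    """Half the number of categories, rounded up, but never more than `cap`."""
    return min(cap, (n_categories + 1) // 2)

# Category counts reported earlier in the notebook
counts = {"hour": 24, "AMorPM": 2, "day": 7}
emb_szs = [(n, embedding_size(n)) for n in counts.values()]

print(emb_szs)               # [(24, 12), (2, 1), (7, 4)]
print(embedding_size(1000))  # 50, because the cap applies
```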