**Exploratory data analysis for NY Taxi fare**

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv("NYC_sample.csv")
df.head().T

Handling missing values

In [None]:
# Determine the percentage of missing values in each column of the data

na_counts = pd.DataFrame(df.isna().sum() / len(df) * 100)
na_counts.columns = ["null_row_pct"]
na_counts[na_counts.null_row_pct > 0].sort_values(by = "null_row_pct", ascending=False)
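
As a minimal sketch of the same idea on a toy DataFrame (the values below are made up, and only the column names are borrowed from the dataset), the share of missing values per column can be expressed as a percentage like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns with gaps, one complete column
toy = pd.DataFrame({
    "Pickup_latitude": [40.7, np.nan, 40.8, np.nan],
    "PULocationID": [np.nan, 74.0, 41.0, 42.0],
    "Trip_distance": [1.2, 0.8, 3.4, 2.1],
})

# isna() marks missing cells; the mean of that boolean mask is the fraction missing
null_pct = toy.isna().mean() * 100

# Show only the columns that actually have gaps, worst first
print(null_pct[null_pct > 0].sort_values(ascending=False))
```

Taking the mean of the boolean mask is equivalent to dividing the count of missing cells by the number of rows.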

Removing rows that are missing both the dropoff location ID and the dropoff coordinates

In [None]:
df = df[~(
 (df.Dropoff_latitude.isna()) & (df.DOLocationID.isna())
)]

Handling time values

We convert the pickup and dropoff times into pandas datetime values to calculate the target value, which is the natural log of the time between pickup and dropoff, in seconds:

In [None]:
duration = pd.to_datetime(df.Lpep_dropoff_datetime) - pd.to_datetime(df.lpep_pickup_datetime)
df["trip_duration"] = np.log(duration.dt.total_seconds() + 1)

In the preceding code, we add 1 second to the trip duration so that a trip with a duration of zero does not produce log(0), which is undefined (NumPy returns -inf for it).
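
A small sketch of why the extra second matters, and of how pandas reports durations. The timestamps are made up for illustration. Note that `.dt.seconds` returns only the seconds component of a timedelta (whole days are dropped), while `.dt.total_seconds()` returns the full elapsed time; the two differ only for trips longer than a day.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup/dropoff pairs: a zero-length trip, a 10-minute trip,
# and a trip that runs slightly longer than one day
pickup = pd.to_datetime(["2016-01-01 10:00:00"] * 3)
dropoff = pd.to_datetime([
    "2016-01-01 10:00:00",
    "2016-01-01 10:10:00",
    "2016-01-02 10:00:10",
])
delta = pd.Series(dropoff - pickup)

seconds_component = delta.dt.seconds        # ignores whole days
total_seconds = delta.dt.total_seconds()    # full elapsed time in seconds

# log(x + 1) maps a zero-second trip to 0 instead of -inf
log_duration = np.log(total_seconds + 1)
print(pd.DataFrame({
    "seconds": seconds_component,
    "total_seconds": total_seconds,
    "log_duration": log_duration,
}))
```

`np.log1p(total_seconds)` computes the same quantity and is a common shorthand for this transform.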

But why are we using natural log transformation over the trip duration? There are three reasons for this, as follows:

For the Kaggle competition on New York taxi trip duration prediction, the evaluation metric is defined as the Root Mean Squared Logarithmic Error (RMSLE). When log transformation is applied and the RMSE is calculated over the target values, we get the RMSLE. This helps us compare our results with the best-performing teams. 

Errors on a log scale tell us by what factor we were wrong, for example, whether we were 10% off from the actual values or 70% off. We will discuss this in detail in the Error metric section.

After the log transformation, the target variable follows an approximately normal distribution, which suits models that assume normally distributed errors, such as linear regression. The plot of the trip duration values (on a log scale) looks as follows:

**Time values as a feature**

The pickup time (lpep_pickup_datetime) can be considered a feature and can be further deconstructed into separate components such as day of the week, month, hour, minute, and Boolean indicators, such as whether the datetime falls on a weekday or not, as shown in the following screenshot. add_datepart() is a convenience method in the fastai.structured module that adds most of these components for us:

In [None]:
from fastai.structured import add_datepart
add_datepart(df, 'lpep_pickup_datetime', time=True)
df.tail().T
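
For readers without fastai installed, the idea can be sketched with plain pandas datetime accessors. This is not the add_datepart() implementation; the new column names and the two timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical pickup timestamps
toy = pd.DataFrame({"lpep_pickup_datetime": pd.to_datetime([
    "2016-01-01 08:15:00",   # a Friday
    "2016-01-03 23:45:00",   # a Sunday
])})

ts = toy["lpep_pickup_datetime"].dt
toy["pickup_month"] = ts.month
toy["pickup_dayofweek"] = ts.dayofweek        # Monday=0 ... Sunday=6
toy["pickup_hour"] = ts.hour
toy["pickup_minute"] = ts.minute
toy["pickup_is_weekday"] = ts.dayofweek < 5   # Boolean indicator

print(toy.T)
```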

**Handling unrelated data**

There are a few features that are either unrelated to the value to be predicted or not available until the trip has ended, such as the fare, the vendor ID, and the dropoff time. Therefore, we will go ahead and drop these columns:

In [None]:
drop_columns = [
"Ehail_fee", "Extra", "Payment_type", "Total_amount", "improvement_surcharge", "Tolls_amount", "Tip_amount", "MTA_tax", "VendorID", "RateCodeID", "Store_and_fwd_flag", "Fare_amount", "Lpep_dropoff_datetime", 'Trip_type ', 'Passenger_count'
]
df.drop(columns=drop_columns, inplace=True)

**Spatial data processing**

This section covers three things: taxi zones, spatial joins, and the calculation of distances.

Taxi zones in New York

Analyzing and processing the taxi zone spatial data helps us achieve two objectives:

- Substitute the missing coordinates for pickup and dropoff locations with the taxi zone's centroid

- Use the taxi zone as a feature in the model

**Visualization of taxi zones**

We have provided the shapefile for the taxi zones in the data repository. Shapefiles can be read as (Geo)DataFrames with the Python library known as GeoPandas, like so:

In [None]:
import geopandas as gpd
taxi_zones = gpd.read_file("taxi_zones.shp")
taxi_zones.tail().T

In [None]:
# Visualize taxi zones with the plot method of the GeoDataFrame

# Project the taxi zones into the WGS84 coordinate system (EPSG:4326)
taxi_zones = taxi_zones.to_crs(epsg=4326)

# Plot the GeoDataFrame, colored by zone
ax = taxi_zones.plot(column="zone", figsize=(12, 12), alpha=0.4)

**Spatial joins**

If we can derive the pickup and dropoff taxi zone of each trip, we can add these as features to our machine learning model. For this, we need to perform an operation known as a spatial join, which is nothing but a Point-in-Polygon solution that's supported by the GeoDataFrame. The following code has an assign_taxi_zones() function, which takes our DataFrame as an input and returns a pandas series. Internally, it does three things: 

- Construct a GeoDataFrame using the input DataFrame's latitude and longitude values (point geometry)
- Perform a spatial join between the point and the taxi zones (polygon geometry)
- Return the location ID of the taxi zone for each coordinate

In [None]:
import traceback
from shapely.geometry import Point

def assign_taxi_zones(df, lon_var, lat_var, locid_var):
    try:
        # Construct a GeoDataFrame using the coordinates of each trip,
        # keeping df's index so that the result aligns with df on assignment
        local_gdf = gpd.GeoDataFrame(
            index=df.index,
            crs="EPSG:4326",
            geometry=[Point(xy) for xy in
                      zip(df[lon_var], df[lat_var])])

        # Perform a spatial join with the taxi zones
        # (GeoPandas older than 0.10 names this argument op= instead of predicate=)
        local_gdf = gpd.sjoin(local_gdf, taxi_zones, how='left', predicate='within')

        # Keep one zone per trip in case a point matches more than one polygon
        local_gdf = local_gdf[~local_gdf.index.duplicated(keep='first')]
        return local_gdf.LocationID.rename(locid_var)
    except ValueError as ve:
        print(ve)
        traceback.print_exc()
        # Fall back to a series of missing values aligned with df
        return pd.Series(np.nan, index=df.index, name=locid_var)

# Calculate pickup and dropoff taxi zone IDs
df['pickup_taxizone_id'] = assign_taxi_zones(df, "Pickup_longitude", "Pickup_latitude", "pickup_taxizone_id")

df['dropoff_taxizone_id'] = assign_taxi_zones(df, "Dropoff_longitude", "Dropoff_latitude", "dropoff_taxizone_id")

Remember, this operation can only assign a taxi zone to trips whose coordinates are known.
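
To make the point-in-polygon idea behind a spatial join concrete, here is a minimal pure-Python sketch using the classic ray-casting test. It is not how GeoPandas implements the join (which relies on a spatial index); the square "zones" and their IDs are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: return True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order. Points lying exactly
    on an edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical square "zones" keyed by a made-up location ID
zones = {
    1: [(0, 0), (1, 0), (1, 1), (0, 1)],
    2: [(1, 0), (2, 0), (2, 1), (1, 1)],
}

def assign_zone(x, y):
    """Return the ID of the first zone containing the point, else None."""
    for zone_id, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return zone_id
    return None

print(assign_zone(0.5, 0.5), assign_zone(1.5, 0.25), assign_zone(3, 3))
```

A point that falls outside every polygon gets no zone, which mirrors the missing values a left spatial join leaves behind.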

To backfill the missing coordinates with the centroid of the taxi zone, we can follow these steps: 

- Find the centroid of each taxi zone
- Join the DataFrame with the taxi zones based on the pickup location ID supplied with each trip record (PULocationID)
- Transfer the taxi zone's centroid to the DataFrame
- For all rows with missing pickup coordinates, substitute the centroid values
- Apply the same process in order to backfill missing dropoff coordinates

In [None]:
# 1. Find each taxi zone's centroid
taxi_zones["X"] = taxi_zones.centroid.x
taxi_zones["Y"] = taxi_zones.centroid.y

# 2 and 3. Join the DataFrame with the taxi zones on the pickup location ID,
# which transfers the centroid (X, Y) to each row
df = pd.merge(df, taxi_zones[["LocationID", "X", "Y"]], how="left", left_on="PULocationID", right_on="LocationID")

# 4. Substitute missing lat/long values with the taxi zone's centroid
df["Pickup_longitude"] = df["Pickup_longitude"].fillna(df["X"])
df["Pickup_latitude"] = df["Pickup_latitude"].fillna(df["Y"])

df.drop(columns=["LocationID", "X", "Y"], inplace=True)

# 5. Apply the same process for the dropoff zone
df = pd.merge(df, taxi_zones[["LocationID", "X", "Y"]], how="left", left_on="DOLocationID", right_on="LocationID")

df["Dropoff_longitude"] = df["Dropoff_longitude"].fillna(df["X"])
df["Dropoff_latitude"] = df["Dropoff_latitude"].fillna(df["Y"])

df.drop(columns=["LocationID", "X", "Y"], inplace=True)

df.tail().T
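
The merge-then-fill pattern can be checked on a toy example. The centroids and trips below are made up; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical zone centroids (X = longitude, Y = latitude)
zones = pd.DataFrame({
    "LocationID": [41, 74],
    "X": [-73.95, -73.94],
    "Y": [40.80, 40.79],
})

# Hypothetical trips: the second one has a zone ID but no coordinates
trips = pd.DataFrame({
    "PULocationID": [41, 74],
    "Pickup_longitude": [-73.951, np.nan],
    "Pickup_latitude": [40.801, np.nan],
})

merged = trips.merge(zones, how="left", left_on="PULocationID", right_on="LocationID")

# fillna with a Series fills each gap from the value at the same row label
merged["Pickup_longitude"] = merged["Pickup_longitude"].fillna(merged["X"])
merged["Pickup_latitude"] = merged["Pickup_latitude"].fillna(merged["Y"])
merged = merged.drop(columns=["LocationID", "X", "Y"])

print(merged)
```

Rows that already have coordinates keep them; only the gaps receive the centroid.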

We can use a very similar process to add the borough names from the taxi zone shapefile to each row in the DataFrame. In this process, we have added the following features:

- Pickup zone ID
- Dropoff zone ID
- Pickup borough
- Dropoff borough

We were also quite successful in backfilling many missing values in pickup and dropoff location coordinates, as well as taxi zone IDs.

**Calculating distances**

Two kinds of distance make sense in this context:

- Distance as the crow flies (or Haversine distance) 
- Distance as you drive in Manhattan (or Manhattan distance)

**Haversine distance**

Haversine distance is the great-circle distance (GCD) between two geographic coordinates, that is, the shortest distance between them along the surface of a sphere. It is similar to a Euclidean (straight-line) distance, except that it accounts for the spherical shape of the Earth (yes, we are approximating the Earth as a sphere with a radius of about 3,959 miles to make our lives easier). The Python code for calculating the Haversine distance is as follows:

In [None]:
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 3958.76 # Earth radius in miles
    dLat = np.radians(lat2 - lat1)
    dLon = np.radians(lon2 - lon1)
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)
    a = np.sin(dLat/2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon/2) ** 2
    c = 2*np.arcsin(np.sqrt(a))
    return R * c

The preceding function takes a pair of coordinates (the latitude and longitude values of the source and destination, respectively) and returns the Haversine distance between them. 
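
A quick sanity check of the function, repeated here so the snippet runs on its own. The test points are chosen for easy arithmetic rather than taken from the dataset: two points on the equator 90 degrees of longitude apart are a quarter of the circumference apart.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 3958.76  # Earth radius in miles
    dLat = np.radians(lat2 - lat1)
    dLon = np.radians(lon2 - lon1)
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)
    a = np.sin(dLat/2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon/2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# A quarter of the way around the equator
quarter = haversine(0.0, 0.0, 0.0, 90.0)

# Identical points are zero miles apart
same_point = haversine(40.7, -74.0, 40.7, -74.0)

# The function is vectorized: NumPy arrays (or pandas Series) work as inputs
lats = np.array([0.0, 40.7])
lons = np.array([0.0, -74.0])
batch = haversine(lats, lons, np.array([0.0, 40.7]), np.array([90.0, -74.0]))
print(quarter, same_point, batch)
```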

**Manhattan distance**

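Manhattan distance is a distance metric inspired by the near-rectangular street grid in Manhattan. The distance between two locations is calculated as the sum of the straight-line distance along the x axis and the straight-line distance along the y axis. Because Manhattan's street grid is tilted by roughly 29º relative to true north, the coordinates are first rotated so that the axes line up with the streets.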

The following operations are required to calculate the Manhattan distance when given pickup and dropoff coordinates: 

- Perform a matrix multiplication of the pickup and dropoff coordinates with the rotation matrix, where θ = -29º
- Derive the hinge point coordinates as per the formula provided
- Transform the hinge point back into the geographic coordinate system by performing a rotation of an equal amount in an anti-clockwise direction (θ = +29º)
- Use the preceding formula to calculate the Manhattan distance using the pickup, hinge, and dropoff coordinates

In [None]:
theta1 = np.radians(-28.904)
theta2 = np.radians(28.904)
R1 = np.array([[np.cos(theta1), np.sin(theta1)], [-np.sin(theta1), np.cos(theta1)]])
R2 = np.array([[np.cos(theta2), np.sin(theta2)], [-np.sin(theta2), np.cos(theta2)]])

def manhattan_dist(lat1, lon1, lat2, lon2):
    p = np.stack([lat1, lon1], axis = 1)
    d = np.stack([lat2, lon2], axis = 1)
    pT = R1 @ p.T 
    dT = R1 @ d.T 

    vT = np.stack((pT[0,:], dT[1,:]))
    v = R2 @ vT
    return (haversine(p.T[0], p.T[1], v[0], v[1]) + haversine(v[0], v[1], d.T[0], d.T[1]))
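
A usage sketch for the function above, with both functions repeated so that it runs on its own. The coordinates are made up. Because of np.stack(..., axis=1), the inputs are expected to be one-dimensional arrays or Series rather than scalars. The hinge point takes one rotated coordinate from the pickup and the other from the dropoff, so the path has a single right-angle turn in the rotated frame.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 3958.76  # Earth radius in miles
    dLat = np.radians(lat2 - lat1)
    dLon = np.radians(lon2 - lon1)
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)
    a = np.sin(dLat/2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon/2) ** 2
    return R * 2 * np.arcsin(np.sqrt(a))

theta1 = np.radians(-28.904)
theta2 = np.radians(28.904)
R1 = np.array([[np.cos(theta1), np.sin(theta1)], [-np.sin(theta1), np.cos(theta1)]])
R2 = np.array([[np.cos(theta2), np.sin(theta2)], [-np.sin(theta2), np.cos(theta2)]])

def manhattan_dist(lat1, lon1, lat2, lon2):
    p = np.stack([lat1, lon1], axis=1)
    d = np.stack([lat2, lon2], axis=1)
    pT = R1 @ p.T                          # pickup in the rotated frame
    dT = R1 @ d.T                          # dropoff in the rotated frame
    vT = np.stack((pT[0, :], dT[1, :]))    # hinge point in the rotated frame
    v = R2 @ vT                            # hinge point back in lat/lon
    return (haversine(p.T[0], p.T[1], v[0], v[1]) +
            haversine(v[0], v[1], d.T[0], d.T[1]))

# Hypothetical trips: one across town, one with identical pickup and dropoff
lat1 = np.array([40.75, 40.75]); lon1 = np.array([-73.99, -73.99])
lat2 = np.array([40.78, 40.75]); lon2 = np.array([-73.95, -73.99])

h = haversine(lat1, lon1, lat2, lon2)
m = manhattan_dist(lat1, lon1, lat2, lon2)
print(h, m)
```

Since the path through the hinge point is a detour, the Manhattan distance should never be shorter than the Haversine distance for the same pair of points.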

For now, let's just add these two types of distance as separate columns to the training DataFrame:

In [None]:
df["haversine_dist"] = haversine(df["Pickup_latitude"], df["Pickup_longitude"],
                                 df["Dropoff_latitude"], df["Dropoff_longitude"])

df["manhattan_dist"] = manhattan_dist(df["Pickup_latitude"], df["Pickup_longitude"],
                                      df["Dropoff_latitude"], df["Dropoff_longitude"])

**Error metric**

If we visit the evaluation section of the Kaggle competition, the evaluation metric is defined as the RMSLE. In the competition, the objective is to minimize this metric for the test data. An error is simply the difference between actual values and predicted values:

error = predicted value - actual value

The Root Mean Squared Error (RMSE) would literally be the square root applied over the mean of all the squared error terms for each observation.

However, our metric in the Kaggle competition needs to be a log error:

log_error = log(predicted value + 1) - log(actual value + 1)

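Therefore, it is important that the trip_duration column is log-transformed. This is the same transform that we applied when we created the column, so it must not be applied a second time; it is repeated here only as a reminder:

In [None]:
# Already applied when trip_duration was created; do not run it again
# df["trip_duration"] = np.log(df["trip_duration"] + 1)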

Now, we can use a function that calculates the RMSE rather than a function that calculates the RMSLE:

In [None]:
import math 
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
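
The equivalence can be verified numerically. In this sketch, the rmsle() helper and the sample durations are hypothetical; the point is only that the RMSE of log-transformed values equals the RMSLE of the raw values.

```python
import numpy as np

def rmse(x, y):
    return np.sqrt(((x - y) ** 2).mean())

def rmsle(pred, actual):
    # RMSLE computed directly on raw (untransformed) values
    return np.sqrt(((np.log(pred + 1) - np.log(actual + 1)) ** 2).mean())

# Hypothetical trip durations in seconds
actual = np.array([300.0, 600.0, 1200.0])
pred = np.array([330.0, 540.0, 1500.0])

direct = rmsle(pred, actual)
via_log = rmse(np.log(pred + 1), np.log(actual + 1))
print(direct, via_log)
```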

**Interpreting errors** 

What does an RMSLE of, say, 0.3 actually mean? Well, let's visit the formula for the log error again:

log_error = 0.3

log(predicted_value + 1) - log(actual_value + 1) = 0.3

log((predicted_value + 1) / (actual_value + 1)) = 0.3

(predicted_value + 1) / (actual_value + 1) = e^0.3 ≈ 1.350

predicted_value ≈ 1.350 * actual_value + 0.350

If the preceding derivation is hard to follow, it doesn't matter; it's just for math enthusiasts. What this means is that, on average, our naive model's prediction is off from the actual trip duration by a factor of about 1.35. This is not too bad, given that the best model in the Kaggle competition is off, on average, by a factor of about 1.3. We can arrive at this factor by using a single line of Python code:

In [None]:
np.exp(rmsle)

The response to the preceding line of code is the factor by which our predictive model is off from the actual values.
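
As a worked example under assumed numbers (an RMSLE of 0.3 and a hypothetical 10-minute trip), the factor and its effect on a single prediction look like this:

```python
import numpy as np

rmsle = 0.3                # hypothetical RMSLE value
factor = np.exp(rmsle)     # multiplicative error factor, about 1.35

actual_seconds = 600       # hypothetical 10-minute trip

# Undo the log(x + 1) transform for a prediction that is 0.3 too high on the log scale
predicted_seconds = np.exp(np.log(actual_seconds + 1) + rmsle) - 1

print(factor, predicted_seconds)
```

The prediction matches the last line of the derivation: the factor times the actual value, plus the factor minus one.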

**Building the model**

Let's build the final model using a random forest regressor. A random forest is a universal machine learning technique, that is, it can handle different kinds of data: the target could be a category (classification) or a continuous variable (regression), and the features can be of any kind, such as an image, price, time, post codes, and so on (that is, both structured and unstructured data). It doesn't generally overfit too much, and it is very easy to stop it from overfitting. For these reasons, random forest is a versatile ML technique that we can use effectively to solve our problem.

**Validation data and error metrics**
Our initial step is choosing a suitable size for validation data. Before delineating the validation dataset and defining accuracy metrics, we have just two more steps to take into account that will make our data ready for building models. These are two convenience functions that are provided by fastai to make our models more robust:

- train_cats(): Convert any string data into categorical data
- proc_df(): Replace categorical variables with numeric codes (one-hot encoding those with few levels when max_n_cat is set) and handle missing values

Let's have a look at the following code snippet:

In [None]:
from fastai.structured import train_cats, proc_df

train_cats(df)
tdf, y, nas = proc_df(df, 'trip_duration')

A validation data size of 20,000 will be enough to validate our model:

In [None]:
def split_vals(a,n): 
    return a[:n].copy(), a[n:].copy()

n_valid = 20000
n_trn = len(tdf)-n_valid
raw_train, raw_valid = split_vals(df, n_trn)
X_train, X_valid = split_vals(tdf, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape
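
A toy run of split_vals() shows what the split does. The feature and target values below are stand-ins, not the real data. Note that the split is positional rather than random: the validation set is simply the last n_valid rows.

```python
import numpy as np
import pandas as pd

def split_vals(a, n):
    # The first n rows go to training, the remainder to validation
    return a[:n].copy(), a[n:].copy()

# Hypothetical stand-ins for the processed features and the target
features = pd.DataFrame({"haversine_dist": np.arange(10.0)})
target = np.arange(10.0)

n_valid = 3
n_trn = len(features) - n_valid
X_train, X_valid = split_vals(features, n_trn)
y_train, y_valid = split_vals(target, n_trn)

print(X_train.shape, y_train.shape, X_valid.shape)
```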

We will use RMSE and R2 as accuracy metrics. RMSE is a very simple yet effective measure of error, and R2 is a very effective metric for evaluating the predictive power of a model. In the Kaggle competition, the top RMSLE values are around 0.28 at the time of writing.

In [None]:
def rmse(x,y): return np.sqrt(((x-y)**2).mean())

def print_score(m):
    res = f"""Train RMS : {rmse(m.predict(X_train), y_train)}, 
            Valid RMSE : {rmse(m.predict(X_valid), y_valid)},
            Train R2 score : {m.score(X_train, y_train)}, 
            Valid R2 score: {m.score(X_valid, y_valid)}
           """
    if hasattr(m, 'oob_score_'): res += f" OOB Score : {m.oob_score_}"
    print(res)

Let's go ahead and run our first model with 40 estimators (trees). We will use all the CPUs that are available to us to enable multiprocessing in the background (hence the n_jobs = -1 parameter):

In [None]:
from sklearn.ensemble import RandomForestRegressor
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

The preceding model gives us the following scores. A validation RMSE of about 0.26 would place us within the top 5% of competitors:

In [None]:
Train RMS : 0.09880475531046715, 
Valid RMSE : 0.2599229455143868,
Train R2 score : 0.9786976405479708, 
Valid R2 score: 0.8602128925971101
OOB Score : 0.8480746084914459

Let's see if we can make the model any better. There are some cool methods in fastai that let us understand the importance of the features that are used in the model, as well as the correlation among the features (multicollinearity). For example, we can easily look at the important features in our model by using the rf_feat_importance() method:

In [None]:
from fastai.structured import rf_feat_importance

# tdf is the processed feature DataFrame returned by proc_df
fi = rf_feat_importance(m, tdf); fi[:20]

def plot_fi(fi): return fi.plot('cols', 'imp', 'barh', figsize=(16,8), legend=False, grid=False)

plot_fi(fi[:20]);

We can also write our own code to assess conditions such as multicollinearity, in which two or more features are closely associated with one another. The following lines of code plot a dendrogram, which shows multicollinearity between features:

In [None]:
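import matplotlib.pyplot as plt
import scipy.stats
from scipy.cluster import hierarchy as hc

# df_keep: the feature DataFrame to analyze (not defined in this section;
# assumed to hold the features retained after the importance analysis)
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(12,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=12)
plt.show()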

The dendrogram shows highly correlated features, such as Dayofyear and datetimeElapsed, as well as datetimeMonth. The analysis also shows a high correlation between the Manhattan and Haversine distances.
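
To see what the Spearman correlation measures, here is a small sketch on synthetic data. The column names and values are made up. It relies on the fact that the Spearman correlation is the Pearson correlation of the ranks, so any monotone relationship, linear or not, scores 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 10.0, size=200)

# Hypothetical features: two that move together monotonically, one unrelated
toy = pd.DataFrame({
    "haversine_like": x,
    "manhattan_like": x ** 1.5,           # monotone transform of the first column
    "unrelated": rng.normal(size=200),
})

# Spearman correlation = Pearson correlation of the ranks
spearman = toy.rank().corr()

# 1 - correlation is the distance that hierarchical clustering works on:
# near 0 for redundant features, near 1 for unrelated ones
distance = 1 - spearman
print(spearman.round(3))
```

Features whose distance is close to zero end up merged near the leaves of the dendrogram, which is how redundant pairs are spotted.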

Once we remove such redundant features and introduce the regularization parameter (max_features = 0.5), our model's RMSE drops even further, placing it within the top 2% of the leaderboard!