# UNDERSTANDING DECISION TREES & ENSEMBLE METHODS USING NEW YORK CITY TAXI FARE PREDICTION DATASET
--- 

**PROBLEM STATEMENT** 

*You are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. While you can get a basic estimate based on just the distance between the two points, this will result in an RMSE of \$5-\$8, depending on the model used. Your challenge is to do better than this using Machine Learning techniques!*


In [1]:
from IPython.display import HTML
html1 = '<img src="https://images.unsplash.com/photo-1573225935973-40b81f6e39e6?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500&q=60" \
width="1200" height="900" align="center"/>'
HTML(html1)

### Download the data

Let's use the Kaggle API to download the data.

In [None]:
!kaggle competitions download -c new-york-city-taxi-fare-prediction

Downloading new-york-city-taxi-fare-prediction.zip to /Users/nikhilkashyap/Downloads/Learning/Github/11-Projects-to-DataScience
 25%|█████████▌                             | 392M/1.56G [01:37<03:33, 5.92MB/s]

Next, unzip the downloaded archive.

In [None]:
!unzip new-york-city-taxi-fare-prediction.zip

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
# None means "no truncation"; the old value -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)
plt.style.use('fivethirtyeight')

### Read the data

The taxi dataset contains about 55M rows. Let us read only the first 5M rows for faster computing.

In [None]:
%%time
import pandas as pd

taxi = pd.read_csv("train.csv",nrows=5000000)

In [None]:
taxi.head()

In [None]:
taxi.info()

Convert pickup_datetime from object (string) dtype to datetime.

In [None]:
%%time
taxi['pickup_datetime']=pd.to_datetime(taxi['pickup_datetime'],format='%Y-%m-%d %H:%M:%S UTC')
taxi.head()

In [None]:
taxi.describe()

We can observe that

1). Min fare amount is negative.

2). Min and Max longitude and latitude look unreal.

3). Min passenger count is 0.

We fix these as follows.

1). The initial charge is $2.5, so we remove rides with a fare below this amount.

2). New York City lies around longitude -74 and latitude 41, so we keep only rides whose pickup and dropoff fall inside a generous box around it (longitude between -78 and -70, latitude between 37 and 45).

3). Remove rides with a passenger count of 0.

In [None]:
taxi = taxi[((taxi['pickup_longitude'] > -78) & (taxi['pickup_longitude'] < -70)) & 
            ((taxi['dropoff_longitude'] > -78) & (taxi['dropoff_longitude'] < -70)) & 
            ((taxi['pickup_latitude'] > 37) & (taxi['pickup_latitude'] < 45)) & 
            ((taxi['dropoff_latitude'] > 37) & (taxi['dropoff_latitude'] < 45)) & 
            (taxi['passenger_count'] > 0) & (taxi['fare_amount'] >= 2.5)]

Check for Missing Values

In [None]:
# Count missing values per column
taxi.isnull().sum()

### EDA

##### Distribution of Trip Fare

In [None]:
plt.figure(figsize = (14, 4))
n, bins, patches = plt.hist(taxi.fare_amount, 1000, facecolor='blue', alpha=0.75)
plt.xlabel('Fare amount')
plt.title('Histogram of fare amount')
plt.xlim(0, 200)
plt.show();

The above graph shows that most fares are small.

In [None]:
taxi.groupby('fare_amount').size().nlargest(10)

Interestingly, the most common fare amounts are very small, at only 6.5 and 4.5, which suggests very short rides.

##### Passenger Count

In [None]:
taxi['passenger_count'].value_counts().plot.bar(color = 'b', edgecolor = 'k');
plt.title('Histogram of passenger counts'); plt.xlabel('Passenger counts'); plt.ylabel('Count');

In [None]:
taxi.groupby('passenger_count').size()

Based on the counts above, we remove taxi rides with passenger_count > 6.

In [None]:
taxi = taxi.loc[taxi['passenger_count'] <= 6]

In [None]:
taxi.groupby('passenger_count').size()

In [None]:
taxi.describe()

### Baseline Model

As a quick start, let's create a baseline model without Machine Learning: a simple rate calculation that multiplies the trip distance by the average fare per km.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(taxi, test_size=0.3, random_state=42)

In [None]:
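import numpy as np

def distance_between(lat1, lon1, lat2, lon2):
  # Great-circle distance in km via the spherical law of cosines:
  # central angle in degrees * 60 = nautical miles, * 1.1515 = statute miles, * 1.609344 = km
  cos_angle = (np.sin(np.radians(lat1)) * np.sin(np.radians(lat2)) +
               np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.cos(np.radians(lon2 - lon1)))
  # Clip to [-1, 1] so rounding error on identical points does not yield NaN
  dist = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))) * 60 * 1.1515 * 1.609344
  return dist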

def estimate_distance(df):
  return distance_between(df['pickup_latitude'], df['pickup_longitude'], df['dropoff_latitude'], df['dropoff_longitude'])

def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual - predicted)**2))

def print_rmse(df, rate, name):
  print("{1} RMSE = {0}".format(compute_rmse(df['fare_amount'], rate * estimate_distance(df)), name))


In [None]:
rate = train['fare_amount'].mean() / estimate_distance(train).mean()

print("Rate = ${0}/km".format(rate))
print_rmse(train, rate, 'Train')
print_rmse(test, rate, 'Test')

This baseline model gives a test-set RMSE of about $10.0. We expect the ML models to do better than this.

### Feature Engineering

1). Extract information from datetime (day of week, month, hour, day). Taxi fares change day/night or on weekdays/holidays.

2). The distance from pickup to dropoff. The longer the trip, the higher the price.

3). Add columns indicating distance from pickup or dropoff coordinates to JFK. Trips from/to JFK have a flat fare at $52.

We compute the distance between two points from their latitude and longitude using the haversine formula, following https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836
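
For reference, the haversine formula can be written as follows, where $\varphi$ is latitude and $\lambda$ is longitude (both in radians) and $R$ is the Earth's radius (about 6371 km). The symbols are chosen here for illustration; `a` and `c` in the function below correspond to $a$ and $d / R$.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\arcsin\!\left(\sqrt{a}\right)
$$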

In [None]:
taxi['year'] = taxi.pickup_datetime.dt.year
taxi['month'] = taxi.pickup_datetime.dt.month
taxi['day'] = taxi.pickup_datetime.dt.day
taxi['weekday'] = taxi.pickup_datetime.dt.weekday
taxi['hour'] = taxi.pickup_datetime.dt.hour

In [None]:
taxi.head()

In [None]:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth (specified in decimal degrees).

    Arguments are ordered (lon1, lat1, lon2, lat2); array arguments must
    be of equal length, and scalars are broadcast.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c  # 6371 is Radius of earth in kilometers. Use 3956 for miles
    return km

# haversine_np expects (lon, lat) order for each point
taxi['distance'] = haversine_np(taxi['pickup_longitude'], taxi['pickup_latitude'], taxi['dropoff_longitude'], taxi['dropoff_latitude'])
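
Because the haversine function takes longitude before latitude, it is easy to pass the arguments in the wrong order without getting an error. The following is a minimal sanity check using only the standard library. The helper name `haversine_km` and the Times Square coordinates are assumptions made for illustration; the JFK coordinates are the ones used later in this notebook.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; each point is given as (lon, lat) in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

jfk = (-73.7781, 40.6413)           # (lon, lat)
times_square = (-73.9855, 40.7580)  # (lon, lat), approximate, assumed for illustration

# Correct argument order: longitude first, then latitude
correct = haversine_km(jfk[0], jfk[1], times_square[0], times_square[1])
# Latitude and longitude swapped: still runs, but measures the wrong thing
swapped = haversine_km(jfk[1], jfk[0], times_square[1], times_square[0])

print(f"correct order: {correct:.1f} km, swapped order: {swapped:.1f} km")
```

The correct call should give roughly 22 km, while the swapped call silently returns a different number, which is why a check against a known distance is worth doing once.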

In [None]:
taxi.head()

In [None]:
plt.figure(figsize = (14, 4))
n, bins, patches = plt.hist(taxi.distance, 1000, facecolor='blue', alpha=0.75)
plt.xlabel('distance')
plt.title('Histogram of ride distance')
plt.show();

In [None]:
taxi['distance'].describe()

In [None]:
taxi = taxi.loc[taxi['distance'] > 0]

Official NYC yellow taxis charge a flat rate of $52 (plus tolls and tip) between JFK and Manhattan. We add a column indicating the distance from the pickup or dropoff coordinates to JFK, whichever is smaller.

In [None]:
JFK_coord = (40.6413, -73.7781)  # (latitude, longitude)

# haversine_np expects (lon, lat) order for each point
pickup_JFK = haversine_np(taxi['pickup_longitude'], taxi['pickup_latitude'], JFK_coord[1], JFK_coord[0])
dropoff_JFK = haversine_np(JFK_coord[1], JFK_coord[0], taxi['dropoff_longitude'], taxi['dropoff_latitude'])

In [None]:
taxi['JFK_distance'] = pd.concat([pickup_JFK, dropoff_JFK], axis=1).min(axis=1)

In [None]:
taxi['JFK_distance'].describe()

In [None]:
taxi.head()

In [None]:
del taxi['pickup_datetime']
del taxi['key']

### Model Building

**Linear Regression**

In [None]:
from sklearn.model_selection import train_test_split
y = taxi['fare_amount']
X = taxi.drop(columns=['fare_amount'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error

print("Test RMSE: %.3f" % mean_squared_error(y_test, y_pred) ** 0.5)

**Decision Trees**

In [None]:
%%time
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=2)
# Fit on the training split only, so the test RMSE is not inflated by leakage
dt.fit(X_train, y_train)

In [None]:
y_pred = dt.predict(X_test)

In [None]:
print("Test RMSE: %.3f" % mean_squared_error(y_test, y_pred) ** 0.5)

Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.

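Bagging (bootstrap aggregating) decreases the variance of the prediction by training each model on a bootstrap sample, i.e. a sample drawn with replacement from the training set, and then averaging the predictions of all models. Random Forest is bagging applied to decision trees, with a random subset of features considered at each split.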
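
To make the idea concrete, here is a minimal sketch of bagging written by hand on a small synthetic dataset (not the taxi data). The variable names and the sine-shaped toy problem are assumptions chosen for illustration; in practice `sklearn.ensemble.RandomForestRegressor` does this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem: a sine curve plus noise (illustrative only)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=500)
X_new = np.linspace(0, 10, 200).reshape(-1, 1)
y_true = np.sin(X_new[:, 0])  # noise-free target used for evaluation

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# A single fully grown tree tends to memorise the noise in its training set
single_tree = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
rmse_single = rmse(y_true, single_tree.predict(X_new))

# Bagging: fit one tree per bootstrap sample, then average the predictions
n_trees = 50
all_preds = []
for i in range(n_trees):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))  # row indices drawn with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X_toy[idx], y_toy[idx])
    all_preds.append(tree.predict(X_new))
bagged_pred = np.mean(all_preds, axis=0)
rmse_bagged = rmse(y_true, bagged_pred)

print(f"single tree RMSE: {rmse_single:.3f}, bagged RMSE: {rmse_bagged:.3f}")
```

Each individual tree is as noisy as the single tree, but their errors are only partly correlated, so the average is typically smoother and closer to the underlying curve.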

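Boosting is an iterative technique that builds models sequentially, with each new model focusing on the errors of the previous ones. In AdaBoost this is done by increasing the weight of observations that were predicted poorly. In gradient boosting, which LightGBM uses, each new tree is fit to the residual errors (more generally, the gradient of the loss) of the current ensemble.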
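
The residual-fitting flavour of boosting can be sketched in a few lines. This is a simplified illustration for squared error on a synthetic dataset, under the assumption of shallow trees and a fixed learning rate; it is not how LightGBM is implemented internally, and all names are chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem: a sine curve plus noise (illustrative only)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=500)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction (the mean of the target)
pred = np.full_like(y_toy, y_toy.mean())
train_rmse = [float(np.sqrt(np.mean((y_toy - pred) ** 2)))]
trees = []

for _ in range(n_rounds):
    residuals = y_toy - pred                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    pred += learning_rate * tree.predict(X_toy)    # take a small step towards the residuals
    trees.append(tree)
    train_rmse.append(float(np.sqrt(np.mean((y_toy - pred) ** 2))))

print(f"train RMSE after 0 rounds: {train_rmse[0]:.3f}, after {n_rounds} rounds: {train_rmse[-1]:.3f}")
```

The training error shrinks with every round, which is also why boosted models overfit if the number of rounds is not controlled, for example by monitoring a validation set.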

**Random Forest**

In [None]:
# %%time
# from sklearn.ensemble import RandomForestRegressor

# rf = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
# rf.fit(X_train, y_train)
# y_pred = rf.predict(X_test)

In [None]:
# print("Test RMSE: %.3f" % mean_squared_error(y_test, y_pred) ** 0.5)

**LightGBM**

In [None]:
import lightgbm as lgb

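params = {
        'learning_rate': 0.75,
        'objective': 'regression',  # 'application' is an alias of 'objective'
        'max_depth': 3,
        'num_leaves': 100,  # with max_depth=3 a tree can have at most 2**3 = 8 leaves
        'verbosity': -1,
        'metric': 'rmse',
    }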

In [None]:
# The `silent` argument is not accepted by recent LightGBM versions; verbosity is set in params
train_set = lgb.Dataset(X_train, label=y_train)

In [None]:
%%time
lb = lgb.train(params, train_set = train_set, num_boost_round=300)

In [None]:
y_pred = lb.predict(X_test, num_iteration = lb.best_iteration)

In [None]:
print("Test RMSE: %.3f" % mean_squared_error(y_test, y_pred) ** 0.5)

In [None]:
from explainx import *

In [None]:
# X_Data and Y_Data were never defined; use the held-out test split instead
explainx.ai(X_test, y_test, lb, model_name="xgboost")