# Predicting tip amounts for NYC Taxicabs

by *Ravi Ancil Persad* 


## 1. Introduction & Data overview
The New York Taxi and Limousine Commission has publicly made available [trip datasets](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml) for yellow and green taxis in the city. The data includes details such as passenger count, payment type (e.g., cash, credit card), total fare amount and geospatial variables such as pickup and dropoff locations. In this entry, I use these aforementioned predictor variables to determine the response variable, i.e., the tip amount for a taxi trip. The ***project objective*** is as follows:

* To design a machine learning framework for predicting taxi tip amounts in 4 categories (in US dollars): \$0-1 (very low), \$1-5 (low), \$5-10 (medium) and greater than \$10 (high).

A month of green taxi data, from September 1st 2013 to October 1st 2013, is used; it has ~50,000 rows of trip data. Additional [weather data](https://www.ncdc.noaa.gov/isd/data-access) for Sept-Oct 2013 from an external source, NOAA, is also used to support the predictive analytics. Here is an outline of the main steps undertaken:

* Load the Green taxi and NOAA surface hourly weather datasets as 2 tables into an SQL database.
* Perform preliminary data cleaning and then merge weather and taxi tables in SQL.
* Perform feature engineering and further data wrangling in Pandas in preparation for predictive modeling.
* Apply a 'multi-class' classification pipeline to predict the tip category of each taxi trip.



## 2. Data wrangling and geospatial visualization of taxi trips

***NOTE:*** The full data wrangling [***SQL Source Code is here***](https://github.com/RaviAnalytics/ravi_NYC_github_folder/blob/master/SQL_Data_Wrangling_Taxi_Ravi.ipynb).  

Processing begins by loading the taxi and weather data into the SQL database in 'chunks' to keep the computational overhead low. The first objective is to merge the taxi and weather data into a single SQL table. To facilitate this, some wrangling is carried out to ensure that the 'datetime' columns of the taxi and weather tables are in a consistent format. Once this is done, a left join merges the 2 tables on their matching dates. Thus, each taxi trip has an associated temperature value (in Fahrenheit) from the weather data. The intuition is that the weather in New York may influence taxi tip amounts and can therefore be useful for prediction.
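
The merge step can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table names (`taxi`, `weather`), the column names and the toy rows are hypothetical, and matching on the truncated hour is an assumption about how the hourly weather readings line up with trips, not necessarily the exact key used in the linked SQL notebook.

```python
import sqlite3

# In-memory database with two toy tables (names and values are illustrative).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE taxi (pickup_datetime TEXT, tip_amount REAL);
    CREATE TABLE weather (obs_datetime TEXT, temp_f REAL);
    INSERT INTO taxi VALUES ('2013-09-01 08:15:00', 2.5),
                            ('2013-09-01 09:40:00', 0.0),
                            ('2013-09-02 23:05:00', 6.0);
    INSERT INTO weather VALUES ('2013-09-01 08:00:00', 71.0),
                               ('2013-09-01 09:00:00', 74.0);
""")

# Truncate both timestamps to the hour so they share one format, then LEFT JOIN
# so that every taxi trip is kept even when no weather reading matches.
rows = con.execute("""
    SELECT t.pickup_datetime, t.tip_amount, w.temp_f
    FROM taxi AS t
    LEFT JOIN weather AS w
      ON strftime('%Y-%m-%d %H', t.pickup_datetime)
       = strftime('%Y-%m-%d %H', w.obs_datetime)
    ORDER BY t.pickup_datetime
""").fetchall()
con.close()
```

A left join is the natural choice here because a trip with no matching weather reading still appears in the result, with a missing temperature that can be handled later.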

Some exploratory analysis is also carried out directly from the SQL database. This gives some insight into the geospatial distribution of the data based on factors such as trip distance, and into how trips are distributed as tip amounts vary. The map below uses a CartoDB basemap and shows the New York region, where the differently color-coded areas indicate the ***5 different boroughs***. The borough shape [data](https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm/data) was accessed from the NYC OpenData site. Using SQL queries, we can also map taxi trips for 2 different tip categories (between \$5 and \$10, and greater than \$10). Larger circular markers indicate trips with greater distances travelled. From visual inspection, the trips with tips greater than \$10 have larger circles than those in the other category. This suggests that:

* **Drivers who undertake longer trips get higher tip amounts**.
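
The visual impression could also be checked numerically with a grouped SQL query. The sketch below reuses the `grt5` table name and the `Tip_amount` and `Trip_distance` columns from the notebook, but the rows are made up purely for illustration and the band boundaries are an assumption, so the numbers say nothing about the real data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grt5 (Tip_amount REAL, Trip_distance REAL)")
# Made-up rows for illustration only; they are not taken from the TLC data.
con.executemany(
    "INSERT INTO grt5 VALUES (?, ?)",
    [(0.0, 1.0), (2.0, 2.0), (6.0, 5.0), (7.5, 7.0), (12.0, 14.0), (15.0, 18.0)],
)

# Bucket trips by tip amount and compare the mean trip distance per bucket.
query = """
    SELECT CASE
             WHEN Tip_amount > 10 THEN 'over_10'
             WHEN Tip_amount > 5  THEN '5_to_10'
             ELSE 'up_to_5'
           END AS tip_band,
           COUNT(*) AS trips,
           AVG(Trip_distance) AS mean_distance
    FROM grt5
    GROUP BY tip_band
    ORDER BY mean_distance
"""
summary = con.execute(query).fetchall()
con.close()
```

Run against the real table, a mean distance that rises with the tip band would support the observation drawn from the map.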

In [1]:
import folium
import pandas as pd
import os
from IPython.display import HTML
import json
import geopandas

NY_COORDINATES = (40.7, -74)


shpfile = os.path.join('Borough Boundaries', 'geo_export_4c5d30a1-b759-453f-a5fa-3f83d09daf2f.shp')
geodf = geopandas.GeoDataFrame.from_file(shpfile)
geodf['style'] = [
    {'fillColor': '#ff0000', 'weight': 2, 'color': 'black'},
    {'fillColor': '#00ff00', 'weight': 2, 'color': 'black'},
    {'fillColor': '#0000ff', 'weight': 2, 'color': 'black'},
    {'fillColor': '#ffff00', 'weight': 2, 'color': 'black'},
    {'fillColor': '#00ffff', 'weight': 2, 'color': 'black'},
]

In [2]:
from shapely.geometry import Polygon, Point
from geopandas import GeoDataFrame
import sqlite3

con = sqlite3.connect('greentaxi4.db')

df = pd.read_sql_query('SELECT * FROM grt5 WHERE Tip_amount>10', con)
# print df.shape

df1 = pd.read_sql_query('SELECT * FROM grt5 WHERE Tip_amount>5 AND Tip_amount<10', con)
# print df1.shape

df2 = pd.read_sql_query('SELECT * FROM grt5 WHERE Tip_amount>=0 AND Tip_amount<5', con)
# print df2.shape

In [4]:
m = folium.Map([40.72, -74], zoom_start=10.5, tiles='cartodbpositron',
              attr = '© OpenStreetMap contributors, © CartoDB')

#----add legend----#
logo_url ='http://img.pixady.com/2017/01/241694_capture6_460x266.jpg'
icon = folium.features.CustomIcon(logo_url,\
                                  icon_size=(160, 100))
fg00=folium.FeatureGroup(name="Key")

fg00.add_child(folium.Marker([40.8419,-74.2056],
          popup='Key',
          icon=icon))


# m = folium.Map([40.7, -74], zoom_start=10, tiles='Stamen Terrain')
fg0=folium.FeatureGroup(name="Boroughs")

# folium.GeoJson(geodf).add_to(m)
fg0.add_child(folium.GeoJson(geodf))

fg1=folium.FeatureGroup(name="taxi trips (Tip greater than $10)")

# Loop through rows (one per trip) and compute marker color (Tip_amount) and size (Trip_distance)
for index, row in df.iterrows():
    fg1.add_child(folium.CircleMarker(location = [row['Pickup_latitude'],row['Pickup_longitude']],
                                    popup = row['lpep_pickup_datetime'], # Add labels 
                                    radius = row['Trip_distance']/2.5,
                                    fill_color='yellow'))

fg2=folium.FeatureGroup(name="taxi trips (Tips between $5 and $10)")

# Loop through rows (one per trip) and compute marker color (Tip_amount) and size (Trip_distance)
for index, row in df1.iterrows():
    fg2.add_child(folium.CircleMarker(location = [row['Pickup_latitude'],row['Pickup_longitude']],
                                    popup = row['lpep_pickup_datetime'], # Add labels 
                                    radius = row['Trip_distance']/2.5,
                                    fill_color='red'))
m.add_child(fg00)
m.add_child(fg0)
m.add_child(fg2)    
m.add_child(fg1)
m.add_child(folium.map.LayerControl(collapsed=False))  # boolean False; the string 'false' is truthy

## 3. Preparation of data for predictive modeling

After the initial data wrangling in SQL, I switched to Pandas to prepare the data for the predictive analytics phase. Firstly, using the '[data dictionary](http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf)' from the NYC Taxi site, I removed unwanted columns (i.e., columns populated mostly with missing values and columns that the data dictionary indicates are not useful for classification). Missing data in the remaining feature columns was not prevalent and did not call for imputation techniques; the few affected rows were dropped instead.

### 3.1. Feature engineering and feature re-generation
Afterwards, I generated some intuitive new predictive features. The raw data gives the pickup and dropoff times for each trip. From these, I computed the 'trip_duration' in minutes for every trip in the dataset. The dataset also has errors: for some trip entries, the given 'trip_distance' is zero even though pickup and dropoff coordinates are recorded. This is an obvious mis-entry. To avoid such errors, I re-generated the trip_distance column from the pickup and dropoff coordinates using the [Haversine formula](http://andrew.hedges.name/experiments/haversine/), which gives the great-circle (straight-line over the Earth's surface) distance between the two points.
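
For reference, the Haversine distance as implemented in the notebook can be written as follows, where $\varphi_1, \varphi_2$ are the pickup and dropoff latitudes and $\lambda_1, \lambda_2$ the longitudes (all in radians), and $R$ is the Earth's radius (3960 in the code, i.e. miles, so $d$ is in miles):

$$
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
$$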


### 3.2. One-hot encoding and data standardization
The predictive features from the dataset are a mix of categorical and numerical data types. For predictive modeling, the categorical variables must be transformed into numerical format. This is achieved using one-hot encoding, which transforms each category into its own binary (0/1) indicator column. For the numerical variables (i.e., non-encoded features), a good practice for machine learning is standardization. If one predictive feature has a variance that is orders of magnitude larger than the variances of the other features, it may undesirably dominate the learning phase. This reduces the influence of the other variables, as they carry less weight. Standardization of the numerical variables is achieved by subtracting the feature's mean and dividing by its standard deviation, which gives each feature zero mean and unit variance (it does not change the shape of the feature's distribution).
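
As a minimal sketch of both transformations in plain Python (the notebook itself uses `pd.get_dummies` and `StandardScaler`), with hypothetical helper names `one_hot` and `standardize` and toy values chosen only for illustration:

```python
import statistics

def one_hot(values):
    """Return (sorted category names, one 0/1 row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

payment = ["Cash", "Credit_Card", "Cash", "Dispute"]  # toy categorical column
fare = [5.0, 10.0, 15.0, 30.0]                         # toy numerical column

categories, encoded = one_hot(payment)
fare_std = standardize(fare)
```

Each row of `encoded` contains exactly one 1, and `fare_std` has mean 0 and standard deviation 1 up to floating-point error. In a real pipeline the mean and standard deviation should be estimated on the training data only and then reused for the test data, which is what the notebook does with `fit_transform` followed by `transform`.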

Afterwards, I proceeded to label the classes as C1, C2, C3 and C4, based on the following categorization:

<img src="Capture.jpg">
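
The labelling rule can be sketched in plain Python. The function name `tip_label` is hypothetical; the boundary handling (each bin closed on the right, with 0 included in C1) is meant to mirror the `pd.cut` call used in the notebook, and negative amounts are returned as `None` on the assumption that they are data errors.

```python
from bisect import bisect_left

UPPER_EDGES = [1, 5, 10]             # right-closed upper edges of C1, C2, C3
LABELS = ["C1", "C2", "C3", "C4"]

def tip_label(tip):
    """Map a tip amount in dollars to its category label."""
    if tip < 0:
        return None                  # outside every bin
    # bisect_left counts how many upper edges lie strictly below the tip.
    return LABELS[bisect_left(UPPER_EDGES, tip)]

examples = {tip: tip_label(tip) for tip in (0, 1, 1.5, 5, 7.25, 10, 12)}
```

With these edges a tip of exactly \$1 falls in C1, exactly \$5 in C2 and exactly \$10 in C3.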

### 3.3. Split into Training and Test datasets for machine learning
In the final step before classification, I split the dataset into ***70% training*** and ***30% testing*** sets. There are 34,194 rows of training data and 14,655 rows of test data.
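
The two row counts are consistent with a 30% test share of the combined rows. The short check below assumes that the test share is rounded up to a whole row, which is my understanding of how `train_test_split` treats a fractional `test_size`; treat that rounding rule as an assumption.

```python
import math

n_rows = 34194 + 14655                   # training + test rows reported above
test_size = 0.3

n_test = math.ceil(test_size * n_rows)   # assumed rule: test share rounded up
n_train = n_rows - n_test
```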



In [5]:
# More data cleaning
# In this section, we remove data which is missing and those which are not useful for the classification.

df = pd.read_sql_query("SELECT * from mergetable3", con)
# df.head()

# Missing values - let's see which columns have missing values
# df.isnull().sum()

# the single missing value here looks suspicious, there shouldnt be one actually
check1 = pd.DataFrame(df['date_cleaned_1'])
# type(check1)
null_data = check1[check1.isnull().any(axis=1)]
# null_data


# remove unwanted columns based on the columns with mostly missing values and those
# which are not useful for classification as provided in the data dictionary found at:
# http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf

df1 = df.drop(['Store_and_fwd_flag', 'date_cleaned_1', 'Ehail_fee', 'Trip_type','Date','date_cleaned'], axis=1)

pd.set_option('display.max_columns', None)


#summary statistics of Tip_amount column
df1['Tip_amount'].describe(include = 'all')

# Missing values - let's see which columns have missing values
# df1.isnull().sum()

# Single missing value here ,let's take a look
check2 = pd.DataFrame(df1['Temp'])
null_data2 = check2[check2.isnull().any(axis=1)]
# null_data2

df1.iloc[null_data2.index]


import math as mt
def trip_distance_calc(long1,lat1,long2,lat2):
     
    dlon = long2 - long1 
    dlat = lat2 - lat1 
    a = (mt.sin(mt.radians(dlat/2)))**2 + mt.cos(mt.radians(lat1)) * mt.cos(mt.radians(lat2)) * (mt.sin(mt.radians(dlon/2)))**2 
    c = 2 * mt.atan2( mt.sqrt(a), mt.sqrt(1-a) ) 
    d = 3960 * c # 3960 is the approximate radius of Earth in miles, so d is in miles
    return d

df1['Trip_dist_recalc'] = df1.apply(lambda row: trip_distance_calc(row['Pickup_longitude'], row['Pickup_latitude'],row['Dropoff_longitude'], row['Dropoff_latitude']), axis=1)


# Remove duplicate rows.
# Important Note: we have to reset the data frame index when we drop rows as they do not adjust automatically!
df2=df1.drop_duplicates(subset=['lpep_pickup_datetime', 'Lpep_dropoff_datetime','Trip_dist_recalc'], 
                        keep='last')
df2.reset_index(inplace=True) # this is critical as when we drop rows, the indices remain and do not adjust!
# df2.head()


# Feature engineering
# we compute the 'trip_duration' in Minutes by subtracting the pickup and dropoff times.
import warnings
warnings.filterwarnings("ignore")

df2["lpep_pickup_datetime"] = pd.to_datetime(df2["lpep_pickup_datetime"])
df2["Lpep_dropoff_datetime"] = pd.to_datetime(df2["Lpep_dropoff_datetime"])

# compute trip duration in whole minutes (abs guards against swapped pickup/dropoff times)
df2["trip_duration"] = (df2.Lpep_dropoff_datetime - df2.lpep_pickup_datetime).abs().dt.total_seconds() // 60
# df2.head()
df2[['lpep_pickup_datetime','Lpep_dropoff_datetime','trip_duration']].head()

df3 = df2.drop(['index','lpep_pickup_datetime', 'Lpep_dropoff_datetime'], axis=1)

# Missing values - let's see which columns have missing values
df3.isnull().sum()
check3 = pd.DataFrame(df3['Temp'])
null_data3 = check3[check3.isnull().any(axis=1)]
# null_data3

# drop row 48485
df4 = df3.drop(df3.index[[48485]])
# df4.isnull().sum()


In [7]:
import numpy as np
from numpy import inf
# The 'VendorID', 'RateCodeID', 'Payment_type' are given as '1,2,3,etc'. These are not ordinal variables by nature (e.g., Cash is not greater than creditcard, etc). They are simply nominal variables. 
# Therefore, we convert these to dummy variables via one-hot encoding. So, lets form a dataframe with these variables alone and apply the encoding.
# We also perform one-hot encoding on the dayofweek column as these are nominal.

# take a copy so the column assignments below do not write into a view of df4
df4_sub = df4[['VendorID', 'RateCodeID','Payment_type','dayofweek']].copy()

#--------prepare VendorID column -------------------------#
# print df4_sub.VendorID.unique() # map vendor codes 1 and 2 to their names
mapping0 = {1: 'CMT_and_LLC',2: 'VeriFone'}
df4_sub['VendorID'] = df4_sub['VendorID'].map(mapping0)
# print df4_sub.head()


#--------prepare RateCodeID column -------------------------#
# print df4_sub.RateCodeID.unique()  # print the unique values from this column to see what mapping we have to set 
# Map actual strings from the data dictionary (http://www.nyc.gov/html/tlc/downloads/pdf/
# data_dictionary_trip_records_green.pdf) for which the numbers represent
mapping = {1: 'Standard_rate',2: 'JFK',3: 'Newark',
               4: 'Nassau_or_Westchester',5: 'Negotiated_fare',
                6: 'Group_ride', 7: 'Undefined'}
df4_sub['RateCodeID'] = df4_sub['RateCodeID'].map(mapping)
# print df4_sub.head()

#--------prepare Payment column -------------------------#
# print df4_sub.Payment_type.unique() # we see that only 1,2,4 and 3 exists, so let's do the string mapping for these
mapping2 = {1: 'Credit_Card',2: 'Cash',3: 'No_Charge',
               4: 'Dispute'}
df4_sub['Payment_type'] = df4_sub['Payment_type'].map(mapping2)

#----- Note: the dayofweek column are strings already and do not need any mapping preparation ----#

# PERFORM ONE HOT ENCODING ON NOMINAL VARIABLES
df4_onehot = pd.get_dummies(df4_sub)

# subset the columns we want to join with the 'one-hot' dataframe 
df4_sub2 = df4[['Pickup_longitude', 'Pickup_latitude','Dropoff_longitude','Dropoff_latitude','Passenger_count',
                'Fare_amount','Extra','MTA_tax','Tip_amount','Tolls_amount','Total_amount','Temp','hour','Trip_dist_recalc',
                'trip_duration']]

# concatenate these with the one hot dataframe
df5 =  pd.concat([df4_onehot,df4_sub2], axis=1)
df5.head()

# We need to categorize the tip amount into the following ranges:
# 0-1, 1-5, 5-10, >10. This is a multi-class classification problem
df6 = df5.copy()
df6['Tip_label'] = pd.cut(df6['Tip_amount'], bins=[0, 1, 5, 10,inf], include_lowest=True, labels=['C1', 'C2', 'C3','C4'])
df6.head()

import plotly.plotly as py
import cufflinks as cf

cf.set_config_file(offline=False, world_readable=True, theme='ggplot')
df6['Tip_label'] = df6['Tip_label'].astype('string')
series1 = df6['Tip_label'].value_counts()
# print series1
series1.iplot(kind='bar', yTitle='Frequency', title='Categories for NYC Green Taxi Tips')






In [8]:
# drop the 'Tip_amount' column
df7 = df6.copy()
# print len(df7.columns)
df7 = df7.drop(['Tip_amount'], axis=1)
# print len(df7.columns)


# Convert the response variables as follows:
# 1 -> C1
# 2 -> C2
# 3 -> C3
# 4 -> C4
# do mapping for response variables into numerical
df7a = df7.copy()

# print df7a.Tip_label.unique() # the labels are the strings 'C1' to 'C4', so map them to integers
# print df7a.Tip_label.dtype
mapping3 = {'C1': 1,'C2': 2,'C3': 3,'C4': 4}
df7a['Tip_label'] = df7a['Tip_label'].map(mapping3)
df7a.head()

# Split training and test sets
# check all column types to ensure they are non-strings
df7b = df7a.copy()
# print df7b.dtypes

# Temp and hour are strings so convert to numeric:
df7b['Temp'] = pd.to_numeric(df7b['Temp'], errors='coerce')  # unparseable entries become NaN
df7b['hour'] = pd.to_numeric(df7b['hour'], errors='coerce')
# print df7b.dtypes

df7b.isnull().sum()
check3 = pd.DataFrame(df7b['Temp'])
null_data3 = check3[check3.isnull().any(axis=1)]
df7b.iloc[null_data3.index].head()
df7a.iloc[null_data3.index].head() # temp column has weird asterisks, probably missing data so let's remove these rows
df7c = df7b.copy()
# print df7c.shape
df7c1 = df7c.drop(null_data3.index)# temp column has weird asterisks, probably missing data so let's remove these rows
df7c1.reset_index(inplace=True) # this is critical as when we drop rows, the indices remain and do not adjust!

# print df7c1.shape
df7c1.head()

# print df7c1.dtypes

# print df7c1.isnull().sum()

from sklearn.model_selection import train_test_split

X = df7c1.loc[:, df7c1.columns != 'Tip_label']
# X.head()
# X = X.values

y = df7c1.loc[:, df7c1.columns == 'Tip_label']
# y = y.values

X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.3, random_state=8940)

# Standardization of the other variables (i.e., non-encoded features)
# If a feature's variance is orders of magnitude greater than the variances of the other features, 
# that feature may dominate the learning algorithm and prevent it from learning from the other variables. 
# Some learning algorithms also converge to the optimal parameter values more slowly when data is not 
# standardized. The value of an explanatory variable can be standardized by subtracting the variable's 
# mean and dividing the difference by the variable's standard deviation. 
# This gives each variable zero mean and unit variance; it does not change the shape of its distribution.

list1 = (range(len(df7c1.columns)))
list2 = list(df7c1.columns)
zip(list1, list2)

onehot_train = X_train[:,1:20]
non_onehot_train = X_train[:,20:34]

onehot_test = X_test[:,1:20]
non_onehot_test = X_test[:,20:34]

# let us now standardize the 'non_onehot' variables
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
non_onehot_train_std = (stdsc.fit_transform(non_onehot_train))
non_onehot_test_std = stdsc.transform(non_onehot_test)

# print non_onehot_train.shape
# print non_onehot_test_std.shape
# print onehot_train.shape

# re-join array columns with onehot and non_onehot variables
Xtrain_1 = np.column_stack((onehot_train,non_onehot_train_std))
Xtest_1 = np.column_stack((onehot_test,non_onehot_test_std))

# pd.DataFrame(Xtrain_1).head()


## 4. Classification of taxi tip categories

To classify the test data using the 4 tip amount categories, I applied 3 classification approaches:

i) ***Random Forest***  
ii) ***Multiclass SVM***   
iii) ***Blending/Stacking*** of both Random Forest & Multiclass SVM (figure below shows ensembling process) 

<img src="Capture2.jpg">
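
The core idea of the blending step is that the second-level model is trained on predictions made for rows the base models did not see during fitting. The sketch below illustrates this with plain Python. Everything in it is a stand-in: `MajorityClass` is a hypothetical toy model used in place of the Random Forest and SVM, the folds are simple and unstratified, and the data are made up.

```python
from collections import Counter

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; plain, unstratified folds."""
    for k in range(n_folds):
        valid = [i for i in range(n_samples) if i % n_folds == k]
        train = [i for i in range(n_samples) if i % n_folds != k]
        yield train, valid

class MajorityClass:
    """Stand-in base model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def out_of_fold_predictions(models, X, y, n_folds):
    """Return one column of held-out predictions per base model."""
    meta = [[None] * len(models) for _ in X]
    for j, model in enumerate(models):
        for train, valid in k_fold_indices(len(X), n_folds):
            # Fit on the other folds, predict only the held-out fold.
            model.fit([X[i] for i in train], [y[i] for i in train])
            preds = model.predict([X[i] for i in valid])
            for i, p in zip(valid, preds):
                meta[i][j] = p
    return meta

X = [[0.1], [0.4], [0.2], [0.9], [0.8], [0.7]]  # toy features
y = [1, 1, 1, 2, 2, 1]                           # toy labels
meta_features = out_of_fold_predictions(
    [MajorityClass(), MajorityClass()], X, y, n_folds=3
)
```

Each row of `meta_features` then holds one held-out prediction per base model, and a second-level classifier (logistic regression in the notebook) is fitted on these columns.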

### 4.1 Results

To assess the classification performance, I used an '**accuracy classification score**', expressed as a percentage. This metric is straightforward: it is the fraction of test samples whose predicted label matches the true label. Based on the results achieved, Blending/Stacking produced the highest accuracy compared to the individual classifiers. For the 14,655 entries of the test data sample, the table below summarizes the classification results:


<img src="Capture33.jpg">
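
The accuracy score reported above amounts to the following calculation, shown here as a small plain-Python sketch with made-up labels (the function name `accuracy` is hypothetical; the notebook uses `metrics.accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 2, 2, 3, 4]   # toy true categories
y_pred = [1, 2, 1, 3, 3]   # toy predictions
score = accuracy(y_true, y_pred)   # 3 of the 5 labels match
```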

To visually show the results, an interactive **confusion matrix** is used. The confusion matrix has 2 axes representing the predicted and the actual labels. By hovering over the diagonal of the matrix, the 'Z' value shows the number of correctly classified labels. Likewise, the off-diagonals indicate the incorrectly classified labels.
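
How the matrix is built can be sketched in a few lines of plain Python. The layout assumed here, rows for actual labels and columns for predicted labels, follows the convention of scikit-learn's `confusion_matrix`; the labels and predictions are made up.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = actual labels and columns = predicted labels."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = [1, 2, 3, 4]
y_true = [1, 2, 2, 3, 4, 4]   # toy actual categories
y_pred = [1, 2, 1, 3, 3, 4]   # toy predicted categories

cm = confusion_matrix(y_true, y_pred, labels)
correct = sum(cm[k][k] for k in range(len(labels)))  # diagonal = correct predictions
```

The diagonal sum divided by the number of samples is the accuracy, so the matrix carries the same information as the score plus a breakdown of which categories are confused with each other.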

In [10]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from numpy import array


clf = RandomForestClassifier(n_estimators=100)
clf.fit(Xtrain_1, y_train)


y_pred = clf.predict(Xtest_1)
# print y_pred

# list comprehension
my_list = [l[0] for l in y_test]
y_test_1 = array(my_list)

# print y_test_1

# print('Misclassified samples: %d' % (y_test_1 != y_pred).sum())

num_misclass = (y_test_1 != y_pred).sum()
# print num_misclass

Classification_Accuracy = 1 - np.divide(float(num_misclass),float(y_test_1.size))
# Classification_Accuracy

In [12]:
# Multiclass SVM
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
clf = OneVsRestClassifier(LinearSVC(random_state=0))
clf.fit(Xtrain_1, y_train)

y_pred = clf.predict(Xtest_1)
# print y_pred

# list comprehension
my_list = [l[0] for l in y_test]
y_test_1 = array(my_list)

# print y_test_1

# print('Misclassified samples: %d' % (y_test_1 != y_pred).sum())

num_misclass = (y_test_1 != y_pred).sum()
# print num_misclass
Classification_Accuracy = 1 - np.divide(float(num_misclass),float(y_test_1.size))
# Classification_Accuracy


In [16]:
# Blending classifiers

# # do list comprehension
# my_list = [l[0] for l in y_train]
# y_train = array(my_list)

n_trees = 100
n_folds = 20

clfs = [
    RandomForestClassifier(n_estimators = n_trees),
    OneVsRestClassifier(LinearSVC(random_state=0)),
]


from sklearn.model_selection import StratifiedKFold

# Ready for cross validation
# Split into training and validation sets
skf = list(StratifiedKFold(n_splits=n_folds).split(Xtrain_1, y_train.ravel()))

# Pre-allocate the data
blend_train = np.zeros((Xtrain_1.shape[0], len(clfs))) # Number of training data x Number of classifiers
blend_test = np.zeros((Xtest_1.shape[0], len(clfs))) # Number of testing data x Number of classifiers

# print 'Xtest_1.shape = %s' % (str(Xtest_1.shape))
# print 'blend_train.shape = %s' % (str(blend_train.shape))
# print 'blend_test.shape = %s' % (str(blend_test.shape))

# For each classifier, we train the number of fold times (=len(skf))
for j, clf in enumerate(clfs):
#     print 'Training classifier [%s]' % (j)
    blend_test_j = np.zeros((Xtest_1.shape[0], len(skf))) # Number of testing data x Number of folds , we will take the mean of the predictions later
    for i, (train_index, cv_index) in enumerate(skf):
#         print 'Fold [%s]' % (i)

        # This is the training and validation set
        X_train_cv = Xtrain_1[train_index]
        Y_train_cv = y_train[train_index]
        X_valid_cv = Xtrain_1[cv_index]
        Y_valid_cv = y_train[cv_index]

        clf.fit(X_train_cv, Y_train_cv)

        # This output will be the basis for our blended classifier to train against,
        # which is also the output of our classifiers
        blend_train[cv_index, j] = clf.predict(X_valid_cv)
        blend_test_j[:, i] = clf.predict(Xtest_1)
    # Take the mean of the predictions of the cross validation set
    blend_test[:, j] = blend_test_j.mean(1)

# print 'Y_dev.shape = %s' % (y_train.shape)

# Do blending....
from sklearn.linear_model import LogisticRegression
bclf = LogisticRegression(solver='lbfgs',multi_class='multinomial')
bclf.fit(blend_train, y_train)

from sklearn import metrics

# Predict now
Y_test_predict = bclf.predict(blend_test)
score = metrics.accuracy_score(y_test_1, Y_test_predict)
# print 'Accuracy = %s' % (score)

# print('Misclassified samples: %d' % (y_test_1 != Y_test_predict).sum())

num_misclass = (y_test_1 != Y_test_predict).sum()
# print num_misclass

Classification_Accuracy = 1 - np.divide(float(num_misclass),float(y_test_1.size))


In [18]:
#plot confusion matrix

from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test_1, Y_test_predict)
cnf_matrix


import plotly.plotly as py
import plotly.graph_objs as go
colorscale=[[0, 'rgb(166,206,227)'], [0.25, 'rgb(31,120,180)'], [0.45, 'rgb(178,223,138)'], [0.65, 'rgb(51,160,44)'], [0.85, 'rgb(251,154,153)'], [1, 'rgb(227,26,28)']]
data = [
    go.Heatmap(
        z=cnf_matrix,
        x=['C1($0-1)', 'C2($1-5)', 'C3($5-10)', 'C4( > $10)'],
        y=['C1($0-1)', 'C2($1-5)', 'C3($5-10)', 'C4( > $10)'],
        colorscale=  colorscale
    )
    
    
]

layout = go.Layout(
    title='NYC Green Taxi tip amount predictions',
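    # confusion_matrix rows are the actual labels (heatmap y axis);
    # its columns are the predicted labels (heatmap x axis)
    xaxis = dict(title='Predicted'),
    yaxis = dict(title='Actual')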

)

fig = go.Figure(data=data, layout=layout)
fig['layout'].update(width=700,
    height=700,
    autosize=False)

py.iplot(fig)

## 5. Conclusion
In this work, I used a combination of SQL and Python's pandas and sklearn libraries to predict the category (i.e., very low, low, medium or high) of expected taxi tips per trip in New York City. Datasets were acquired from 2 sources: the NYC Taxi and Limousine Commission and the NOAA weather website. The current analysis was performed on the 'Green' taxis in NYC. Future work will investigate a similar predictive pipeline for the 'Yellow' taxis in NYC.