# Project 2: Analyzing Uber Trips in NYC
## Step 3: Classification of Green and Uber Cabs
## Jose Oros, Annamalai Kathir

To answer our final question, classifying trips as either Uber or Green based on time and location, we used four different classifiers. We used 10-fold cross-validation with grid search to estimate the best parameters. The best model from the model selection phase was then evaluated on the held-out TEST data. To evaluate the models, we used the cross-validated F1 score (we also observed precision, recall, and accuracy).
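As a rough illustration of the metrics named above, the sketch below computes precision, recall, and F1 by hand on made-up labels (not project data). It also shows the difference between F1 for a single positive class and a support-weighted average over both classes. This matters later in the notebook, where `f1_score` is called with `average='weighted'` while `fbeta_score` is left at scikit-learn's default `average='binary'`. The helper names `precision_recall_f1` and `weighted_f1` are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == label)
        score += support / total * precision_recall_f1(y_true, y_pred, label)[2]
    return score


# Toy labels: 1 = Uber, 0 = Green (illustrative values only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

p, r, f1_uber = precision_recall_f1(y_true, y_pred, positive=1)
f1_weighted = weighted_f1(y_true, y_pred)
print(p, r, f1_uber, f1_weighted)
```

On these toy labels the two F1 variants differ, so they should not be expected to agree on real data either.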

In [1]:
#import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn import neighbors
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC  # needed if the SVM block below is uncommented
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, r2_score, mean_absolute_error, mean_squared_error

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, f1_score, classification_report, fbeta_score
from imblearn.over_sampling import RandomOverSampler


%matplotlib inline

Read cab data

In [2]:
green_trip = pd.read_csv('../Data/green_trips.csv')
#yellow_14q2= pd.read_csv('../Data/yellow_trips_2014Q2.csv')
#yellow_14q3= pd.read_csv('../Data/yellow_trips_2014Q3.csv')
#yellow_15q1= pd.read_csv('../Data/yellow_trips_2015Q1.csv')
#yellow_15q2= pd.read_csv('../Data/yellow_trips_2015Q2.csv')

Taking equal samples of Green and Uber cabs (50,000 each)

In [3]:
green_trip_sub = green_trip.sample(50000)
green_trip_sub = green_trip_sub[(green_trip_sub.pickup_longitude != 0) & (green_trip_sub.pickup_latitude != 0)]
green_trip_sub = green_trip_sub.drop(['dropoff_datetime','dropoff_longitude','passenger_count','trip_distance', 'total_amount', 'dropoff_latitude'],axis=1)

In [4]:
green_trip_sub.columns

Index(['pickup_datetime', 'pickup_longitude', 'pickup_latitude'], dtype='object')

Map the data for Green cabs (finding NTA codes from latitude and longitude information)

In [5]:
zones_df = pd.read_csv('../Data/zones.csv')
geo_df = pd.read_csv('../Data/geographic.csv')

In [6]:
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

In [7]:
#creating a polygon for each of the NTA CODES in a dictionary, having the NTA code as the key and the polygon as the value
geo_new = {}
for col in geo_df:
    count = 0
    long = []
    lat = []
    for coord in geo_df[col].dropna():
        if count % 2 == 0:
            long.append(coord)
        else:
            lat.append(coord)
        count += 1
    
    poly = Polygon(list(zip(lat,long)))
    geo_new[col] = poly
    

In [8]:
#Function to check coordinates and output what NTA code they belong to
def check_coords(point, geo_new):
    for key,area in geo_new.items():
        if area.contains(Point(point)):
            return key
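
The lookup above relies on shapely's `Polygon.contains`. As a dependency-free sketch of the underlying idea (a standard ray-casting test, not shapely's actual implementation), the code below checks a point against a dictionary of polygons. The zone names and coordinates are hypothetical. Vertices are given in (latitude, longitude) order to match the order used when building the polygons and points in this notebook.

```python
def point_in_polygon(point, vertices):
    """Ray-casting test: toggle 'inside' each time a ray from the point crosses an edge."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the line through the point's second coordinate?
        if (y1 > y) != (y2 > y):
            # First coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(point, zones):
    """Return the first zone key whose polygon contains the point, else None."""
    for key, vertices in zones.items():
        if point_in_polygon(point, vertices):
            return key
    return None


# Two hypothetical square zones, vertices in (latitude, longitude) order
zones = {
    "ZONE_A": [(40.70, -74.00), (40.70, -73.95), (40.75, -73.95), (40.75, -74.00)],
    "ZONE_B": [(40.75, -74.00), (40.75, -73.95), (40.80, -73.95), (40.80, -74.00)],
}

zone_hit = find_zone((40.72, -73.97), zones)   # falls inside ZONE_A
zone_miss = find_zone((41.00, -73.97), zones)  # outside every zone
print(zone_hit, zone_miss)
```

A point outside every polygon yields `None`, just as `check_coords` does. In the notebook such rows become missing values and are removed by the later `dropna` call.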
    

In [9]:
#run the function
green_trip_sub['zipped'] = list(zip(green_trip_sub['pickup_latitude'],green_trip_sub['pickup_longitude']))
green_trip_sub['nta_code'] = [check_coords(x,geo_new) for x in green_trip_sub['zipped']]

In [35]:
green_trip_sub.head()

Unnamed: 0,pickup_datetime,pickup_longitude,pickup_latitude,zipped,nta_code,time,hour,day_week,day_month,day_year,month,date_only,week_year,trip
0,2015-05-04 08:49:09,-73.961197,40.716579,"(40.7165794373, -73.9611968994)",BK73,08:49:09,8,0,4,124,5,2015-05-04,19,0
1,2015-02-11 15:21:41,-73.953438,40.82259,"(40.8225898743, -73.9534378052)",MN04,15:21:41,15,2,11,42,2,2015-02-11,7,0
2,2015-06-18 18:34:44,-73.912041,40.775227,"(40.775226593, -73.9120407104)",QN72,18:34:44,18,3,18,169,6,2015-06-18,25,0
3,2014-08-03 00:43:22,-73.885834,40.842884,"(40.8428840637, -73.8858337402)",BX17,00:43:22,0,6,3,215,8,2014-08-03,31,0
4,2015-01-15 13:36:28,-73.953346,40.787098,"(40.7870979309, -73.9533462524)",MN33,13:36:28,13,3,15,15,1,2015-01-15,3,0


In [36]:
green_trip_sub.dropna(inplace=True)
green_trip_sub.drop(['zipped','pickup_longitude','pickup_latitude'], axis=1, inplace=True)
green_trip_sub = green_trip_sub.reset_index(drop=True)

Now add time-based features

In [38]:
green_trip_sub['pickup_datetime'] = pd.to_datetime(green_trip_sub['pickup_datetime'])

In [39]:
green_trip_sub['time'] = green_trip_sub.pickup_datetime.dt.time
green_trip_sub['hour'] = green_trip_sub.pickup_datetime.dt.hour
green_trip_sub['day_week'] = green_trip_sub.pickup_datetime.dt.dayofweek
green_trip_sub['day_month'] = green_trip_sub.pickup_datetime.dt.day
green_trip_sub['day_year'] = green_trip_sub.pickup_datetime.dt.dayofyear
green_trip_sub['month'] = green_trip_sub.pickup_datetime.dt.month
green_trip_sub['date_only'] = green_trip_sub.pickup_datetime.dt.date
green_trip_sub['week_year'] = green_trip_sub.pickup_datetime.dt.isocalendar().week  # dt.weekofyear was removed in pandas 2.0
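
To make the derived columns concrete, here is a small sketch that computes the same features for a single timestamp using only the standard library. The function name `time_features` is hypothetical. The timestamp is the first Green cab row shown in the `head()` output earlier.

```python
from datetime import datetime


def time_features(timestamp):
    """Derive the notebook's time features from a 'YYYY-MM-DD HH:MM:SS' string."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,
        "day_week": dt.weekday(),            # Monday = 0, same convention as pandas dayofweek
        "day_month": dt.day,
        "day_year": dt.timetuple().tm_yday,  # day of the year, 1-based
        "month": dt.month,
        "week_year": dt.isocalendar()[1],    # ISO week number
    }


features = time_features("2015-05-04 08:49:09")
print(features)
```

The values agree with that row of the table: hour 8, day of week 0 (Monday), day of year 124, and week 19.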

# Classification Model

Import Uber trips and take 50,000 samples so that the data is balanced with the Green cab data

In [17]:
uber_trip = pd.read_csv('uber_trip.csv')
uber_trip_sub = uber_trip.sample(50000)

In [41]:
#all data
uber_trip_sub['pickup_datetime'] = pd.to_datetime(uber_trip_sub['pickup_datetime'])
uber_trip_sub['time'] = uber_trip_sub.pickup_datetime.dt.time
uber_trip_sub['hour'] =uber_trip_sub.pickup_datetime.dt.hour
uber_trip_sub['day_week'] = uber_trip_sub.pickup_datetime.dt.dayofweek
uber_trip_sub['day_month'] = uber_trip_sub.pickup_datetime.dt.day
uber_trip_sub['day_year'] = uber_trip_sub.pickup_datetime.dt.dayofyear
uber_trip_sub['month'] = uber_trip_sub.pickup_datetime.dt.month
uber_trip_sub['date_only'] = uber_trip_sub.pickup_datetime.dt.date
uber_trip_sub['week_year'] = uber_trip_sub.pickup_datetime.dt.isocalendar().week  # dt.weekofyear was removed in pandas 2.0

In [42]:
uber_trip_sub.head()

Unnamed: 0,pickup_datetime,nta_code,time,hour,day_week,day_month,day_year,month,date_only,trip,week_year
0,2015-03-09 15:17:00,MN24,15:17:00,15,0,9,68,3,2015-03-09,1,11
1,2015-06-13 12:09:00,BK21,12:09:00,12,5,13,164,6,2015-06-13,1,24
2,2015-03-24 23:17:00,MN34,23:17:00,23,1,24,83,3,2015-03-24,1,13
3,2015-04-25 14:40:00,BK09,14:40:00,14,5,25,115,4,2015-04-25,1,17
4,2015-05-15 16:22:00,QN01,16:22:00,16,4,15,135,5,2015-05-15,1,20


In [43]:
uber_trip_sub.shape

(50000, 11)

In [45]:
#uber_trip_sub.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1, inplace=True)

In [46]:
uber_trip_sub.reset_index(drop=True,inplace=True)

In [166]:
uber_trip_sub.head()

Unnamed: 0,pickup_datetime,nta_code,time,hour,day_week,day_month,day_year,month,date_only,trip,week_year
0,2015-03-09 15:17:00,MN24,15:17:00,15,0,9,68,3,2015-03-09,1,11
1,2015-06-13 12:09:00,BK21,12:09:00,12,5,13,164,6,2015-06-13,1,24
2,2015-03-24 23:17:00,MN34,23:17:00,23,1,24,83,3,2015-03-24,1,13
3,2015-04-25 14:40:00,BK09,14:40:00,14,5,25,115,4,2015-04-25,1,17
4,2015-05-15 16:22:00,QN01,16:22:00,16,4,15,135,5,2015-05-15,1,20


Now we label each trip as 1 for Uber and 0 for Green cabs so that the combined data can be passed to the classification models

In [167]:
uber_trip_sub['trip'] = 1
green_trip_sub['trip'] = 0

In [168]:
#green_trip_sub.drop(['pickup_longitude', 'pickup_latitude'], axis=1, inplace=True)

In [169]:
#Concatenate data sets
all_cabs = pd.concat([uber_trip_sub, green_trip_sub])

### Create training and testing data

In [170]:
all_cabs.head()

Unnamed: 0,date_only,day_month,day_week,day_year,hour,month,nta_code,pickup_datetime,time,trip,week_year
0,2015-03-09,9,0,68,15,3,MN24,2015-03-09 15:17:00,15:17:00,1,11
1,2015-06-13,13,5,164,12,6,BK21,2015-06-13 12:09:00,12:09:00,1,24
2,2015-03-24,24,1,83,23,3,MN34,2015-03-24 23:17:00,23:17:00,1,13
3,2015-04-25,25,5,115,14,4,BK09,2015-04-25 14:40:00,14:40:00,1,17
4,2015-05-15,15,4,135,16,5,QN01,2015-05-15 16:22:00,16:22:00,1,20


In [171]:
nta_dummies = pd.get_dummies(all_cabs['nta_code'])
all_cabs_cla = pd.concat([all_cabs,nta_dummies], axis=1)
all_cabs_cla = all_cabs_cla.drop(['nta_code'], axis=1)
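
The one-hot encoding step above can be seen on a tiny frame. This is a minimal sketch with made-up rows: `pd.get_dummies` creates one indicator column per distinct NTA code, and the original string column is then dropped.

```python
import pandas as pd

# Three illustrative trips with two distinct NTA codes
trips = pd.DataFrame({
    "hour": [15, 12, 23],
    "nta_code": ["MN24", "BK21", "MN24"],
})

# One indicator column per distinct code, in sorted order
dummies = pd.get_dummies(trips["nta_code"])

# Attach the indicators and drop the original categorical column
encoded = pd.concat([trips.drop(columns="nta_code"), dummies], axis=1)
print(encoded)
```

This is why the real feature matrix grows to well over a hundred columns: each NTA code in the data becomes its own feature.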

In [172]:
len(all_cabs_cla[all_cabs_cla['trip']==0])

48716

In [173]:
all_cabs_cla.columns

Index(['date_only', 'day_month', 'day_week', 'day_year', 'hour', 'month',
       'pickup_datetime', 'time', 'trip', 'week_year',
       ...
       'SI07', 'SI08', 'SI12', 'SI14', 'SI24', 'SI25', 'SI28', 'SI35', 'SI36',
       'SI37'],
      dtype='object', length=193)

In [174]:
#delete datetime objects - because we cannot use classification on datetime objects
all_cabs_cla.drop(['pickup_datetime','time', 'date_only'], axis=1, inplace=True)
all_cabs_cla.dropna(inplace=True)

In [175]:
all_cabs_cla.shape

(98716, 190)

In [176]:
#We removed the day-of-year, month and week-of-year features because we wanted the classifier to use only
#day of the month, day of the week, hour of the day and location information to classify the trip as either Uber or Green
all_cabs_cla.drop(['day_year','month', 'week_year'], axis=1, inplace=True)

In [177]:
#all_cabs_cla.drop(['day_month','day_week', 'hour'], axis=1, inplace=True)

In [178]:
x = all_cabs_cla.drop(['trip'], axis=1)
y = all_cabs_cla.trip

In [198]:
#splitting the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.20)
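
The split above draws a different random partition on every run. As a sketch, assuming a reproducible and class-balanced split is wanted (the notebook itself does neither), `train_test_split` also accepts `random_state` and `stratify`. The data below is synthetic.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, perfectly balanced between the two classes
X = [[i] for i in range(100)]
y = [0] * 50 + [1] * 50

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

test_counts = Counter(y_te)
print(len(X_tr), len(X_te), test_counts)
```

With stratification, the 20 test samples contain 10 of each class, mirroring the 50/50 ratio of the full set.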

## Classifiers

Each model has its own tuning parameters: for random forest it is the tree depth, for AdaBoost the number of estimators, for logistic regression (with a lasso, i.e. L1, penalty) the inverse regularization strength C, and for kNN the number of neighbours and the weighting scheme.

In [201]:


#Running different classification models. You can uncomment them to run. Currently only LogReg runs.
classifiers = {}
classifier_parameters = {}

##### Random Forest classifier
#classifiers['Random Forest'] = Pipeline([('clf', RandomForestClassifier())])
#classifier_parameters['Random Forest'] = {'clf__max_depth':(1, 3, 9, 12, 15)}

#AdaBoost classifier
#classifiers['AdaBoost'] = Pipeline([('clf', AdaBoostClassifier())])
#classifier_parameters['AdaBoost'] = {'clf__n_estimators':(30, 40, 50, 60, 70)}

##### SVM
#classifiers['SVM'] = Pipeline([('clf', SVC())])
#classifier_parameters['SVM'] = {'clf__C':(0.1, 1, 10), 'clf__kernel': ('poly', 'rbf'), 'clf__gamma': (0.1, 0.5, 1)}

#### Logistic Regression with Lasso
classifiers['LogReg'] = Pipeline([('clf', LogisticRegression(penalty='l1', solver='liblinear'))])  # liblinear supports the L1 penalty
classifier_parameters['LogReg'] = {'clf__C':(0.1, 1, 10)}

#### kNN
#classifiers['kNN'] = Pipeline([('clf', neighbors.KNeighborsClassifier())])
#classifier_parameters['kNN'] = {'clf__n_neighbors':(3,5,7), 'clf__weights': ('uniform', 'distance')}
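
The `clf__` prefix in the parameter grids routes each value to the pipeline step named `clf`. The sketch below shows this on synthetic data from `make_classification`. It passes `scoring='f1'` and `cv=5` explicitly, which is an assumption for illustration: the notebook's `GridSearchCV` call leaves both at their defaults, so its inner selection uses the estimator's default score (accuracy for classifiers).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification problem (not the cab data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Single-step pipeline; 'clf' is the step name referenced in the grid keys
pipe = Pipeline([("clf", LogisticRegression(penalty="l1", solver="liblinear"))])
param_grid = {"clf__C": (0.1, 1, 10)}

# Try every value of C with 5-fold CV and keep the one with the best mean F1
gs = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring="f1", cv=5)
gs.fit(X, y)

best_C = gs.best_params_["clf__C"]
best_f1 = gs.best_score_
print(best_C, best_f1)
```

After fitting, `gs` is refit on all the data passed to `fit` with the best parameters, so `gs.predict` uses the selected model.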

### Train Algorithm - Cross Validation

In [202]:
# Create a label encoder to transform output labels.
le = LabelEncoder() 

# Split features and class into two dataframes.
X_training = x_train.values
y_training = le.fit_transform(y_train.values)

# Initialize the scores DataFrame
scores = pd.DataFrame(columns=['fold', 'algorithm', 'parameters', 'accuracy', 'precision', 'recall', 'fbeta_score', 'f1_score'])

# 10 fold CV
kf = KFold(n_splits=10, shuffle=True)

# Outer Cross Validation
fold = 0
for train_index, test_index in kf.split(X_training):
    X_train, X_test = X_training[train_index], X_training[test_index]
    Y_train, Y_test = y_training[train_index], y_training[test_index]
    
    fold = fold + 1

    # Inner CV
    for name, clf in classifiers.items():
        print('Fold ' + str(fold) + ': ' + name)
        if name in classifier_parameters:
            gs = GridSearchCV(estimator=clf, param_grid=classifier_parameters[name])
            gs.fit(X_train, Y_train)
            y_pred = gs.predict(X_test)
            best_params = str(gs.best_params_)
        else:
            clf.fit(X_train, Y_train)
            y_pred = clf.predict(X_test)
            best_params = 'default'
        
        # collect the scores for printing out later
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        fold_scores = pd.DataFrame(data={'fold':[fold],
                                         'algorithm':[name],
                                         'parameters':[best_params],
                                         'accuracy':[accuracy_score(Y_test, y_pred)],
                                         'precision':[precision_score(Y_test, y_pred, average='weighted')],
                                         'recall':[recall_score(Y_test, y_pred, average='weighted')],
                                         'fbeta_score':[fbeta_score(Y_test, y_pred, beta=1)],
                                         'f1_score':[f1_score(Y_test, y_pred, average='weighted')]})
        scores = pd.concat([scores, fold_scores], ignore_index=True)
        

Fold 1: LogReg
Fold 2: LogReg
Fold 3: LogReg
Fold 4: LogReg
Fold 5: LogReg
Fold 6: LogReg
Fold 7: LogReg
Fold 8: LogReg
Fold 9: LogReg
Fold 10: LogReg


In [203]:
scores[['algorithm', 'accuracy', 'precision', 'recall', 'f1_score', 'fbeta_score']].groupby(['algorithm']).mean()

algorithm,accuracy,precision,recall,f1_score,fbeta_score
LogReg,0.878045,0.891402,0.878045,0.877149,0.867438


In [204]:
scores[['algorithm', 'f1_score', 'parameters']][scores['algorithm']=='LogReg']

Unnamed: 0,algorithm,f1_score,parameters
0,LogReg,0.871621,{'clf__C': 1}
1,LogReg,0.875391,{'clf__C': 10}
2,LogReg,0.877539,{'clf__C': 10}
3,LogReg,0.880505,{'clf__C': 10}
4,LogReg,0.879925,{'clf__C': 10}
5,LogReg,0.881631,{'clf__C': 10}
6,LogReg,0.875294,{'clf__C': 10}
7,LogReg,0.880645,{'clf__C': 10}
8,LogReg,0.869495,{'clf__C': 1}
9,LogReg,0.879439,{'clf__C': 1}
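
The per-fold table above can be summarized with a few lines of standard-library code. This sketch copies the ten F1 scores and selected `C` values from the table and reports the mean, the spread across folds, and how often each `C` was chosen.

```python
from collections import Counter
from statistics import mean, stdev

# (weighted F1, selected C) for folds 1 to 10, copied from the table above
fold_results = [
    (0.871621, 1), (0.875391, 10), (0.877539, 10), (0.880505, 10),
    (0.879925, 10), (0.881631, 10), (0.875294, 10), (0.880645, 10),
    (0.869495, 1), (0.879439, 1),
]

f1_values = [f1 for f1, _ in fold_results]
mean_f1 = mean(f1_values)    # matches the aggregated f1_score reported earlier
std_f1 = stdev(f1_values)    # sample standard deviation across folds
c_counts = Counter(c for _, c in fold_results)

print(round(mean_f1, 6), round(std_f1, 4), dict(c_counts))
```

The mean reproduces the aggregated score of about 0.877, and `C = 10` was selected in 7 of the 10 folds, with `C = 1` in the remaining 3.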


### Test in test data set

Finally, we evaluated the model with the best cross-validated performance on the held-out TEST data set.

In [205]:
le = LabelEncoder() 

# Split features and class into two dataframes.
X_training = x_train.values
y_training = le.fit_transform(y_train.values)

clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
clf.fit(X_training, y_training)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [206]:
y_pred = clf.predict(x_test.values)
y_test_le = le.transform(y_test.values)  # reuse the encoder fitted on the training labels

print(fbeta_score(y_test_le, y_pred, beta=1))

0.870257304841


Using 10-fold cross-validation with grid search over the tuning parameters to estimate classification scores, we found that pickup location had a greater effect than the time of day on identifying which service was used.

The best model was LogReg, which gave an F1 score of 0.87 on the test data.