In [1]:
from matplotlib import pyplot as plt
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn import svm

#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

# csv file manipulation
import csv

In [2]:
# read training data 
df = pd.read_csv('./data/train.csv')

# read testing data
test_df = pd.read_csv('./data/test.csv')


Check for columns with missing values

After loading the training and testing datasets as pandas dataframes, the training data is pre-processed. Records with missing values for duration or fare are removed first.



In [6]:
print(df.isnull().any())
df = df.dropna(subset=['duration', 'fare'])
print(df['duration'].isnull().values.sum())

0
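As a minimal sketch of what the filtering step does, the example below applies the same `isnull` and `dropna(subset=...)` calls to a small toy frame. The values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "duration": [600.0, np.nan, 900.0, 300.0],
    "fare": [250.0, 180.0, np.nan, 120.0],
    "additional_fare": [10.5, 10.5, np.nan, 10.5],
})

# Count missing values per column before filtering.
missing_before = toy.isnull().sum()

# Keep only the rows where both duration and fare are present.
cleaned = toy.dropna(subset=["duration", "fare"])

print(missing_before)
print(len(cleaned))
```

Note that `dropna(subset=...)` only looks at the listed columns, so missing values in other columns would survive this step.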


Feature extraction then derives two new features from the existing columns:

1. trip_duration = duration - (meter_waiting_till_pickup + meter_waiting)
2. trip_fare = fare - (additional_fare + meter_waiting_fare)


The same feature extraction is applied to both the training and the testing data.

In [8]:
df['trip_duration'] = df['duration']- (df['meter_waiting_till_pickup'] + df['meter_waiting'])

In [9]:
test_df['trip_duration'] = test_df['duration']- (test_df['meter_waiting_till_pickup'] + test_df['meter_waiting'])

In [10]:
df['trip_fare'] = df['fare'] -  (df['additional_fare'] + df['meter_waiting_fare'])
test_df['trip_fare'] = test_df['fare'] -  (test_df['additional_fare'] + test_df['meter_waiting_fare'])
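Since the same two formulas are applied to both dataframes, one possible way to avoid repeating them is a small helper. The function name `add_trip_features` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_trip_features(frame):
    """Return a copy of frame with trip_duration and trip_fare added."""
    out = frame.copy()
    # Time spent travelling, excluding both kinds of meter waiting.
    out["trip_duration"] = out["duration"] - (
        out["meter_waiting_till_pickup"] + out["meter_waiting"]
    )
    # Fare attributable to the trip itself, excluding extras.
    out["trip_fare"] = out["fare"] - (
        out["additional_fare"] + out["meter_waiting_fare"]
    )
    return out

# Hypothetical single-row sample to check the arithmetic.
sample = pd.DataFrame({
    "duration": [800.0],
    "meter_waiting": [50.0],
    "meter_waiting_till_pickup": [30.0],
    "fare": [300.0],
    "additional_fare": [10.0],
    "meter_waiting_fare": [5.0],
})
features = add_trip_features(sample)
print(features[["trip_duration", "trip_fare"]])
```

With such a helper, the training and testing frames could each be passed through the same function, which keeps the two derivations from drifting apart.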


The training dataset is then split into labels ‘Y’ and features ‘X’. To write the results against each tripid, the trip identification numbers are first saved from the testing set. After that, the tripid attribute is removed from both the training and testing datasets.

Because the models performed best without pickup_time and drop_time, those features are removed from the data as well.

In [11]:

#split into data and label
Y = df['label'] # label set


In [12]:


tripIds = test_df['tripid'] # separate tripid to use when the results are written back to the submission

#remove attributes
test_df = test_df.drop(columns=['tripid', 'pickup_time', 'drop_time'])

X = df.drop(columns=['tripid', 'label', 'pickup_time', 'drop_time'])

X_test = test_df



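After these steps, the data is standardized with StandardScaler from sklearn.preprocessing. The scaler is fitted on the training features only, and the same transformation is then applied to the testing features.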

In [13]:
#Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X)
X_test = sc.transform(X_test)
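The toy example below illustrates the fit-on-train, transform-on-test pattern used above. The arrays are made-up values: after scaling, each training column has mean 0 and unit standard deviation, while the test row is scaled with the training statistics and so need not fall in that range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test matrices with two features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from train.
train_scaled = scaler.fit_transform(train)
# transform reuses those training statistics; nothing is learned from test.
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(test_scaled)
```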

The RandomForest classifier is used as the classification algorithm. The function RFClassifier accepts three parameters:
    Training feature set
    Training labels
    Testing feature set

It returns the predicted labels for the testing feature set.

In [14]:


def RFClassifier(X_train, y_train, X_test):
   
    # Create the classifier with 100 trees
    clf = RandomForestClassifier(n_estimators=100)

    # Train the model on the training set
    clf.fit(X_train, y_train)

    # Predict labels for the test set
    y_pred = clf.predict(X_test)
    
    return y_pred
    

In [15]:
Y_pred = RFClassifier(X_train,Y, X_test)
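The notebook predicts directly on the unlabeled test set, so it does not show how a model could be checked locally. As a hedged sketch only, the example below runs a function of the same shape on synthetic data with a hold-out split. The data from `make_classification`, the name `rf_classifier`, the `seed` parameter, and the choice of F1 as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def rf_classifier(X_train, y_train, X_test, seed=0):
    """Fit a 100-tree random forest and return predictions for X_test."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 25% of the labeled rows to estimate performance.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

val_pred = rf_classifier(X_fit, y_fit, X_val)
score = f1_score(y_val, val_pred)
print(score)
```

Fixing `random_state` makes repeated runs comparable, which helps when judging whether dropping a feature actually changes the result.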

The function write_output writes the predictions in the format required by the Kaggle submission, using the tripid values extracted earlier.

In [16]:
def write_output(Y_prediction):
    # newline='' lets the csv module control line endings
    with open('sampleSubmission.csv', 'w', newline='') as csvfile:
        fieldnames = ['tripid', 'prediction']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()

        # Pair each saved tripid with its prediction by position
        for trip_id, w in zip(tripIds, Y_prediction):
            # Encode 'incorrect' as 0 and any other label as 1
            output = 0 if w == 'incorrect' else 1
            writer.writerow({'tripid': trip_id, 'prediction': output})


After calling RFClassifier, the returned predictions are passed to write_output, which writes them to sampleSubmission.csv in the current working directory.


In [None]:
write_output(Y_pred)
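As an alternative sketch, the same submission layout could be built with pandas rather than the csv module. The helper name `build_submission` and the sample trip ids are hypothetical; the label encoding follows write_output, where 'incorrect' maps to 0 and anything else to 1.

```python
import pandas as pd

def build_submission(trip_ids, predictions):
    """Return a dataframe with tripid and a 0/1 prediction column."""
    return pd.DataFrame({
        "tripid": list(trip_ids),
        # 'incorrect' becomes 0; every other label becomes 1.
        "prediction": [0 if p == "incorrect" else 1 for p in predictions],
    })

# Hypothetical ids and labels to show the resulting layout.
submission = build_submission([101, 102, 103], ["correct", "incorrect", "correct"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'sampleSubmission.csv' to `to_csv` would write the file instead of returning the text.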