# New York City Taxi Fare Prediction

<p> In this project we build a machine learning model to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations. </p>

Outline for the project:
- Download the dataset
- Explore & analyze the data
- Prepare the dataset for ML training
- Train hardcoded & baseline models
- Make predictions
- Perform feature engineering
- Train and evaluate different models
- Tune hyperparameters for best models
- Document & Publish

## Download the dataset & Import the libraries

In [1]:
!pip install opendatasets --quiet

We use the <b>opendatasets</b> library to download the data directly from Kaggle.

In [1]:
import opendatasets as od
dataset_url = "https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction"
od.download(dataset_url)

In [7]:
%cd C:\Users\apoor\OneDrive\Documents\PythonDataAnalysis\machine-learning-models

C:\Users\apoor\OneDrive\Documents\PythonDataAnalysis\machine-learning-models


In [20]:
!dir "C:\Users\apoor\OneDrive\Documents\PythonDataAnalysis\machine-learning-models\new-york-city-taxi-fare-prediction"

 Volume in drive C is Windows-SSD
 Volume Serial Number is 0658-F810

 Directory of C:\Users\apoor\OneDrive\Documents\PythonDataAnalysis\machine-learning-models\new-york-city-taxi-fare-prediction

01-10-2022  23:30    <DIR>          .
02-10-2022  00:53    <DIR>          ..
01-10-2022  23:30               486 GCP-Coupons-Instructions.rtf
01-10-2022  23:30           343,271 sample_submission.csv
01-10-2022  23:30           983,020 test.csv
01-10-2022  23:30     5,697,178,298 train.csv
               4 File(s)  5,698,505,075 bytes
               2 Dir(s)  227,666,272,256 bytes free


The dataset is about 5.7 GB (5.3 GiB), most of which is the training set. We will use this data to train our model and evaluate the results on the test data.

When working with a large amount of data, always start with a sample, for example 1% of the entire dataset.
Loading the full dataset into pandas would be slow, so we use the following optimisations:
 1. Ignore the key column (we do not need the unique identifier in the training set)
 2. Parse the pickup datetime while loading the data
 3. Specify the data type for each column: float32 for geo coordinates, float32 for fare amount and uint8 for passenger count
 
We apply these changes while loading the data with pandas.
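
To get a feel for what the smaller data types buy us, here is a minimal sketch on synthetic data. The column values and the row count are made up for illustration; only the dtype choices follow the list above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for some of the taxi columns (values are made up).
rng = np.random.default_rng(0)
n_rows = 100_000
df_default = pd.DataFrame({
    "fare_amount": rng.uniform(2.5, 60.0, n_rows),         # float64 by default
    "pickup_longitude": rng.uniform(-74.3, -72.9, n_rows),  # float64 by default
    "pickup_latitude": rng.uniform(40.5, 41.7, n_rows),     # float64 by default
    "passenger_count": rng.integers(1, 7, n_rows),          # int64 by default
})

# Downcast to the compact types listed above.
compact_dtypes = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "passenger_count": "uint8",
}
df_compact = df_default.astype(compact_dtypes)

# Compare the memory taken by the column data (index excluded).
bytes_default = df_default.memory_usage(index=False).sum()
bytes_compact = df_compact.memory_usage(index=False).sum()
print(bytes_default, bytes_compact)
```

Each row shrinks from 32 bytes to 13 bytes in this sketch. Passing the same mapping to the dtype argument of pd.read_csv avoids creating the larger columns in the first place.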

<h5> Importing the Libraries </h5>

In [1]:
import pandas as pd
import numpy as np
import random

<h5> Importing the training dataset</h5>
We import 1% of the dataset for our initial analysis.

In [2]:
# Read only the first 10 rows to inspect the columns and their types.
sample = pd.read_csv("train.csv", nrows=10)

In [3]:
sample

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
5,2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
6,2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
7,2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
8,2012-12-03 13:10:00.000000125,9.0,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.993078,40.731628,1
9,2009-09-02 01:11:00.00000083,8.9,2009-09-02 01:11:00 UTC,-73.980658,40.733873,-73.99154,40.758138,2


In [4]:
import_dtypes ={
    'fare_amount':'float32',
    'pickup_longitude':'float32',
    'pickup_latitude':'float32',
    'dropoff_longitude':'float32',
    'dropoff_latitude':'float32',
    'passenger_count':'uint8'    
}

In [5]:
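# Row filter for skiprows: keep the header and a random ~1% of the data rows
sample_fraction = 0.01 

random.seed(42)  # fixed seed so the same rows are sampled on every run
def skip_row(idx):
    # Row 0 is the header and must never be skipped
    if idx == 0:
        return False
    # Returning True skips the row, which happens about 99% of the time
    return random.random() > sample_fraction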
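
The row filter above can be tried out without touching the large file. The sketch below runs the same idea on a tiny in-memory CSV; the column names row_id and value and the row count are hypothetical and only serve the illustration.

```python
import io
import random

import pandas as pd

# Tiny in-memory CSV standing in for train.csv (made-up columns and values).
n_rows = 10_000
csv_text = "row_id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(n_rows))

sample_fraction = 0.01
random.seed(42)

def skip_row(idx):
    # idx 0 is the header line, which must always be kept.
    if idx == 0:
        return False
    # True means "skip this row"; about 99% of the data rows are skipped.
    return random.random() > sample_fraction

# pandas calls skip_row with each row index and drops the rows that return True.
sampled = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(len(sampled), "of", n_rows, "rows kept")
```

Because the filter is random, the sample is only approximately 1% of the rows, and the rows that are kept stay intact.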


In [6]:
# Import the actual sample data for the machine learning model
prediction_sample_data = pd.read_csv("train.csv", 
                                     usecols = sample.columns[1:], 
                                     parse_dates = ["pickup_datetime"],
                                     dtype = import_dtypes,
                                     skiprows = skip_row)

In [7]:
prediction_sample_data

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755482,1
1,8.0,2013-01-17 17:22:00+00:00,0.000000,0.000000,0.000000,0.000000,2
2,8.9,2011-06-15 18:07:00+00:00,-73.996330,40.753223,-73.978897,40.766964,3
3,6.9,2009-12-14 12:33:00+00:00,-73.982430,40.745747,-73.982430,40.745747,1
4,7.0,2013-11-06 11:26:54+00:00,-73.959061,40.781059,-73.962059,40.768604,1
...,...,...,...,...,...,...,...
552445,45.0,2014-02-06 23:59:45+00:00,-73.973587,40.747669,-73.999916,40.602894,1
552446,22.5,2015-01-05 15:29:08+00:00,-73.935928,40.799656,-73.985710,40.726952,2
552447,4.5,2013-02-17 22:27:00+00:00,-73.992531,40.748619,-73.998436,40.740143,1
552448,14.5,2013-01-27 12:41:00+00:00,-74.012115,40.706635,-73.988724,40.756218,1


<h5>Importing the Test Dataset</h5>
We keep the key column in this import because we need it for our submission.

In [8]:
test_data = pd.read_csv("test.csv", dtype = import_dtypes)

In [9]:
test_data

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320,40.763805,-73.981430,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982521,40.751259,-73.979652,40.746140,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.981163,40.767807,-73.990448,40.751637,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51 UTC,-73.968124,40.796997,-73.955643,40.780388,6
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51 UTC,-73.945511,40.803600,-73.960213,40.776371,6
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15 UTC,-73.991600,40.726608,-73.789742,40.647011,6
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19 UTC,-73.985573,40.735432,-73.939178,40.801731,6


<h2>Exploring the data sets </h2>

In [10]:
prediction_sample_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552450 entries, 0 to 552449
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        552450 non-null  float32            
 1   pickup_datetime    552450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   552450 non-null  float32            
 3   pickup_latitude    552450 non-null  float32            
 4   dropoff_longitude  552450 non-null  float32            
 5   dropoff_latitude   552450 non-null  float32            
 6   passenger_count    552450 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.3 MB


In [11]:
prediction_sample_data.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552450.0,552450.0,552450.0,552450.0,552450.0,552450.0
mean,11.354059,-72.497063,39.9105,-72.504326,39.934265,1.684983
std,9.811924,11.618246,8.061114,12.074346,9.255057,1.337664
min,-52.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967155,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


In [12]:
prediction_sample_data["pickup_datetime"].min(), prediction_sample_data["pickup_datetime"].max()

(Timestamp('2009-01-01 00:11:46+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

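The data we are looking at spans six and a half years (January 2009 to June 2015).

1. There are about 552k rows and 7 columns
2. There are no null values
3. The 1% sample uses 15.3 MB of memory, so the full dataset would need roughly 1.5 GB (we should be watchful)
4. The fare amount ranges from -52 to 499, with a mean of 11.35; negative fares are invalid
5. The passenger count ranges from 0 to 208, which is not plausible for a taxi
6. Some longitude and latitude values are invalid (for example, latitudes far outside the range -90 to 90)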
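
A quick way to quantify such problems is to count the rows that break simple sanity rules. The sketch below uses a few made-up rows that mimic the extremes seen in describe(); the rules themselves (positive fare, 1 to 6 passengers, valid degrees) are assumptions for illustration, not something the dataset documents.

```python
import pandas as pd

# Made-up rows that mimic the problems seen in describe() above.
rides = pd.DataFrame({
    "fare_amount": [8.5, -52.0, 12.0, 499.0],
    "pickup_longitude": [-73.98, -73.99, 0.0, 2420.21],
    "pickup_latitude": [40.75, 40.73, 0.0, 404.98],
    "passenger_count": [1, 2, 0, 208],
})

# Assumed sanity rules: a fare must be positive, a cab carries 1 to 6
# passengers, and coordinates must be valid degrees.
bad_fare = rides["fare_amount"] <= 0
bad_passengers = ~rides["passenger_count"].between(1, 6)
bad_coords = (
    ~rides["pickup_longitude"].between(-180, 180)
    | ~rides["pickup_latitude"].between(-90, 90)
)

# Number of rows violating each rule.
summary = pd.Series({
    "bad_fare": bad_fare.sum(),
    "bad_passengers": bad_passengers.sum(),
    "bad_coords": bad_coords.sum(),
})
print(summary)
```

Note that the row with coordinates (0, 0) passes the degree check even though it is nowhere near New York, so a tighter, city-specific range is needed to catch it.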
                                                      

In [13]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   key                9914 non-null   object 
 1   pickup_datetime    9914 non-null   object 
 2   pickup_longitude   9914 non-null   float32
 3   pickup_latitude    9914 non-null   float32
 4   dropoff_longitude  9914 non-null   float32
 5   dropoff_latitude   9914 non-null   float32
 6   passenger_count    9914 non-null   uint8  
dtypes: float32(4), object(2), uint8(1)
memory usage: 319.6+ KB


In [14]:
test_data.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


Since the test data represents the real-world data, we can safely restrict ourselves to the longitude and latitude ranges found in the test dataset. We will therefore remove the rows in the training dataset that fall outside these ranges. This drops the outliers and the invalid data points.
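
One possible way to apply this filter is sketched below. The helper name filter_to_reference_range is hypothetical, and the two small frames are made up; with the real data, the training sample and test_data would be passed instead.

```python
import pandas as pd

coord_cols = ["pickup_longitude", "pickup_latitude",
              "dropoff_longitude", "dropoff_latitude"]

def filter_to_reference_range(train_df, reference_df, cols=coord_cols):
    """Keep rows of train_df whose values in cols lie within reference_df's ranges."""
    lower = reference_df[cols].min()
    upper = reference_df[cols].max()
    # Compare every column against its own bound, then require all to pass.
    inside = (
        train_df[cols].ge(lower, axis="columns")
        & train_df[cols].le(upper, axis="columns")
    ).all(axis=1)
    return train_df[inside]

# Made-up frames for illustration only.
test_like = pd.DataFrame({
    "pickup_longitude": [-74.25, -72.99],
    "pickup_latitude": [40.57, 41.71],
    "dropoff_longitude": [-74.26, -72.99],
    "dropoff_latitude": [40.57, 41.70],
})
train_like = pd.DataFrame({
    "pickup_longitude": [-73.98, 0.0, -73.99],
    "pickup_latitude": [40.75, 0.0, 40.73],
    "dropoff_longitude": [-73.97, 0.0, 2467.75],
    "dropoff_latitude": [40.76, 0.0, 40.74],
})

cleaned = filter_to_reference_range(train_like, test_like)
print(len(train_like), "->", len(cleaned))
```

Only the first row survives here: the second has zero coordinates and the third has an impossible dropoff longitude.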

<h2> Exploratory Data Analysis</h2>

Create graphs (histograms, line charts, bar charts and scatter plots) to study the distribution of values in each column and the relationship of each input with the target.

1. What is the busiest day of the week?
2. What is the busiest time of day?
3. Is there a relationship between date/time and fare?
4. How does the fare vary by month?
5. Does the fare depend on the pickup location?
6. Does the fare depend on the dropoff location?
7. What is the average trip distance?



Asking more questions about the data will help you develop a deeper understanding of it.
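
The question about trip distance needs a distance column, which the dataset does not provide. A common choice, assumed here and not taken from the notebook, is the haversine (great-circle) distance between pickup and dropoff. The sketch below applies it to the first two rows of the sample shown earlier.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    earth_radius_km = 6371.0  # mean Earth radius
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# First two rows of the 10-row sample shown earlier.
pickup_lon = np.array([-73.844311, -74.016048])
pickup_lat = np.array([40.721319, 40.711303])
dropoff_lon = np.array([-73.84161, -73.979268])
dropoff_lat = np.array([40.712278, 40.782004])

distances = haversine_km(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat)
print(distances)
```

The function works on whole columns at once, so the average distance is a single mean() call on the result. It measures straight-line distance, which underestimates the distance actually driven.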

<h5> Adding detailed Date/time info </h5>
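
A minimal sketch of how the date and time could be split into numeric parts, which is what the questions about busiest day, time and month need. The helper name add_dateparts and the new column names are hypothetical; the three timestamps are taken from the sample shown earlier.

```python
import pandas as pd

def add_dateparts(df, col):
    """Return a copy of df with numeric date/time parts derived from col."""
    out = df.copy()
    out[col + "_year"] = out[col].dt.year
    out[col + "_month"] = out[col].dt.month
    out[col + "_day"] = out[col].dt.day
    out[col + "_weekday"] = out[col].dt.weekday  # Monday=0 ... Sunday=6
    out[col + "_hour"] = out[col].dt.hour
    return out

rides = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2009-06-15 17:26:21", "2010-01-05 16:52:16", "2011-08-18 00:35:00"],
        utc=True,
    ),
    "fare_amount": [4.5, 16.9, 5.7],
})

rides = add_dateparts(rides, "pickup_datetime")

# Example question: how many rides fall on each weekday?
rides_per_weekday = rides["pickup_datetime_weekday"].value_counts()
print(rides[["pickup_datetime_weekday", "pickup_datetime_hour"]])
```

The new columns are plain integers, so they can be grouped, plotted, and later passed to a model.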

<h2> Preparing the Dataset for Training </h2>

1. Split the training and validation data
2. Fill or remove the missing values
3. Extract the inputs and outputs

We will set aside 20% of the training dataset to validate our model, using random sampling. We set random_state so that we get the same validation set every time we run the notebook. Otherwise rows used for validation in one run could end up in the training set of another, which would leak validation data into training and make the scores incomparable.
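
The effect of random_state can be checked on a toy frame. This is a small sketch with made-up data, showing that two calls with the same seed produce the same split and that the two parts never share a row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 made-up rows.
rides = pd.DataFrame({"fare_amount": np.arange(100, dtype="float32")})

# Two independent calls with the same seed.
train_a, val_a = train_test_split(rides, test_size=0.2, random_state=42)
train_b, val_b = train_test_split(rides, test_size=0.2, random_state=42)

same_split = val_a.index.equals(val_b.index)        # identical validation rows
overlap = set(train_a.index) & set(val_a.index)     # rows present in both parts
print(same_split, len(train_a), len(val_a), len(overlap))
```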

<h5> Splitting the dataset</h5>

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train_data, validation_data = train_test_split(prediction_sample_data, test_size = 0.2,random_state=42 )

In [17]:
len(train_data), len(validation_data)

(441960, 110490)

<h5> Missing Values </h5>

The data we are looking at does not have any missing values. But since this is only 1% of the entire data, the rest may contain some, so we simply drop rows with missing values instead of trying to fill them. We do this for both the training data and the validation data.

In [18]:
train_data = train_data.dropna()
validation_data = validation_data.dropna()

<h5> Extracting the inputs and outputs </h5>

Before we train our model, we need to separate the inputs and outputs, because they are passed separately to machine learning models.

In [19]:
train_data.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

We cannot pass the datetime column to the machine learning model as it is, because it is a timestamp and not a number (models only accept numeric inputs).

In [20]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_col = ['fare_amount']

<h5> Training the data </h5>

In [22]:
train_input  = train_data[input_cols]
train_target = train_data[target_col]

In [23]:
train_input

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
353352,-73.993652,40.741543,-73.977974,40.742352,4
360070,-73.993805,40.724579,-73.993805,40.724579,1
372609,-73.959160,40.780750,-73.969116,40.761230,1
550895,-73.952187,40.783951,-73.978645,40.772602,1
444151,-73.977112,40.746834,-73.991104,40.750404,2
...,...,...,...,...,...
110268,-73.987152,40.750633,-73.979073,40.763168,1
259178,-73.972656,40.764042,-74.013176,40.707840,2
365838,-73.991982,40.749767,-73.989845,40.720551,3
131932,-73.969055,40.761398,-73.990814,40.751328,1


In [24]:
train_target

Unnamed: 0,fare_amount
353352,6.0
360070,3.7
372609,10.0
550895,8.9
444151,7.3
...,...
110268,9.3
259178,18.5
365838,10.1
131932,10.9


In [25]:
val_inputs = validation_data[input_cols]
val_target = validation_data[target_col]

In [26]:
test_inputs = test_data[input_cols]

<h2> Train Hardcoded & Baseline Models </h2>

We should always begin with a simple hardcoded or baseline model to establish the minimum score any ML model should beat. 
* Hardcoded model: always predicts the average fare
* Baseline model: linear regression

We will create a model that always predicts the average fare. It gives us the baseline to beat. 
* fit is used to train our simple model. Since this is a simple model, it ignores the inputs and only looks at the targets. 
* predict takes a batch of inputs and returns a prediction for each of them. 

In [34]:
# Defining the Model
class MeanRegressor:
    def fit(self, inputs, targets):
        self.mean = targets.mean()
    
    def predict(self, inputs):
        return np.full(inputs.shape[0], self.mean)

In [35]:
#Creating the mean regressor model
mean_model = MeanRegressor()
mean_model.fit(train_input, train_target)

In [36]:
mean_model.mean

fare_amount    11.354714
dtype: float32

In [37]:
train_predict = mean_model.predict(train_input)
train_predict

array([11.354714, 11.354714, 11.354714, ..., 11.354714, 11.354714,
       11.354714], dtype=float32)

To know how good our model is we need to compare the train_predict values with the train_target

In [38]:
from sklearn.metrics import mean_squared_error

In [39]:
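def rmse(targets, predictions):
    # Taking np.sqrt of the MSE works on every scikit-learn version;
    # the squared=False argument was removed in scikit-learn 1.6.
    return np.sqrt(mean_squared_error(targets, predictions))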

In [43]:
train_rmse = rmse(train_target, train_predict)
train_rmse

9.789782

Any model we train should have an RMSE lower than this one.
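
To make the metric concrete, here is a small worked sketch of the root mean squared error computed by hand on three made-up fares. The helper name rmse_manual is hypothetical; it is meant to mirror what the rmse function above returns.

```python
import numpy as np

def rmse_manual(targets, predictions):
    """Square the errors, average them, then take the square root."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.sqrt(np.mean((targets - predictions) ** 2))

# Three made-up fares and a "model" that always predicts their mean (10.0).
fares = np.array([6.0, 10.0, 14.0])
mean_prediction = np.full(len(fares), fares.mean())

# Errors are -4, 0 and 4, so the mean squared error is 32 / 3.
score = rmse_manual(fares, mean_prediction)
print(score)
```

For a model that always predicts the training mean, the training RMSE is exactly the standard deviation of the training targets. This is why the hardcoded model's score is so close to the std of fare_amount reported by describe().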

In [44]:
val_predict = mean_model.predict(val_inputs)
val_predict

array([11.354714, 11.354714, 11.354714, ..., 11.354714, 11.354714,
       11.354714], dtype=float32)

In [45]:
val_rmse = rmse(val_target, val_predict)
val_rmse

9.899954

Our hardcoded model has an RMSE of about \\$9.90, which is pretty bad considering the average fare is \\$11.35.

<h5> Training & Evaluating the Baseline Model </h5>

We will train a linear regression model as our baseline model, which tries to express the target as a weighted sum of the inputs. 
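
The "weighted sum" can be made visible by rebuilding the predictions from the fitted weights. The sketch below uses synthetic inputs and made-up true weights; coef_ and intercept_ are the attributes scikit-learn exposes after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is a known weighted sum plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.1, size=200)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + noise

model = LinearRegression().fit(X, y)

# A prediction is the inputs times the learned weights, plus the intercept.
manual = X @ model.coef_ + model.intercept_
matches = np.allclose(manual, model.predict(X))
print(model.coef_, model.intercept_, matches)
```

The learned weights land close to the ones used to generate the data, which works well here because the target really is linear in the inputs.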

In [46]:
from sklearn.linear_model import LinearRegression

In [47]:
linear_model = LinearRegression()

In [48]:
linear_model.fit(train_input,train_target)

In [49]:
train_linear_pred = linear_model.predict(train_input)

In [50]:
train_linear_pred

array([[11.546237],
       [11.28461 ],
       [11.28414 ],
       ...,
       [11.458918],
       [11.284281],
       [11.284448]], dtype=float32)

In [51]:
rmse(train_target, train_linear_pred)

9.788632

In [52]:
val_linear_pred = linear_model.predict(val_inputs)
val_linear_pred

array([[11.284328 ],
       [11.284496 ],
       [11.2847805],
       ...,
       [11.8045   ],
       [11.284433 ],
       [11.284133 ]], dtype=float32)

In [53]:
rmse(val_target, val_linear_pred)

9.898088

The linear regression model has an RMSE of about \\$9.90, which is barely better than simply predicting the average. 
This is mainly because the raw inputs are not in a form that is useful for prediction: for example, a linear model cannot derive the trip distance from raw coordinates. We are also not yet using an important column: the pickup date and time. 
    