## New York City Taxi Fare Prediction

Project Outline
1. Download the dataset
2. Explore and analyze the dataset
3. Prepare dataset for machine learing training
4. Train harcoded and baseline model
5. Make predictions
6. Perform feature engineering
7. Train and evaluate models
8. Tune hyperparameters for the best models
9. Train on GPU with entire dataset
10. Document and publish project outline

## 1. Download the Dataset

- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas

### Install Required Libraries

In [1]:
pip install pandas numpy scikit-learn xgboost --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install kaggle --quiet

Note: you may need to restart the kernel to use updated packages.


### Download data from Kaggle

!kaggle competitions download -c new-york-city-taxi-fare-prediction --force

I opted out of using the line above. The data was downloaded manually to circumvent access restrictions on my laptop

In [3]:
from pathlib import Path

data_dir = Path("C:/Users/JFADIPE/Downloads/John Fadipe - Software Engineering/Dataset/new-york-city-taxi-fare-prediction")

### View Dataset

In [4]:
!dir "{data_dir}"

 Volume in drive C has no label.
 Volume Serial Number is EEDD-BD71

 Directory of C:\Users\JFADIPE\Downloads\John Fadipe - Software Engineering\Dataset\new-york-city-taxi-fare-prediction

29-Jul-25  08:12 AM    <DIR>          .
29-Jul-25  08:12 AM    <DIR>          ..
29-Jul-25  08:11 AM               486 GCP-Coupons-Instructions.rtf
29-Jul-25  08:11 AM           343,271 sample_submission.csv
29-Jul-25  08:11 AM           983,020 test.csv
29-Jul-25  08:12 AM     5,697,178,298 train.csv
               4 File(s)  5,698,505,075 bytes
               2 Dir(s)  59,496,525,824 bytes free


In [5]:
file_path = fr"{data_dir}/train.csv"

line_count = 0
with open(file_path, 'r', encoding='utf-8') as file:
    for _ in file:
        line_count += 1

print(f"Total lines in train.csv: {line_count}")

Total lines in train.csv: 55423857


In [7]:
!find /v /c "" "{data_dir}/test.csv"


---------- C:\USERS\JFADIPE\DOWNLOADS\JOHN FADIPE - SOFTWARE ENGINEERING\DATASET\NEW-YORK-CITY-TAXI-FARE-PREDICTION/TEST.CSV: 9915


In [8]:
!find /v /c "" "{data_dir}/sample_submission.csv"


---------- C:\USERS\JFADIPE\DOWNLOADS\JOHN FADIPE - SOFTWARE ENGINEERING\DATASET\NEW-YORK-CITY-TAXI-FARE-PREDICTION/SAMPLE_SUBMISSION.CSV: 9915


In [9]:
!powershell -Command "Get-Content '{data_dir}/train.csv' -TotalCount 10"

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.99

In [10]:
!powershell -Command "Get-Content '{data_dir}/test.csv' -TotalCount 10"

key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320007324219,40.7638053894043,-73.981430053710938,40.74383544921875,1
2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862182617188,40.719383239746094,-73.998886108398438,40.739200592041016,1
2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1
2012-12-01 21:12:12.0000005,2012-12-01 21:12:12 UTC,-73.960983,40.765547,-73.979177,40.740053,1
2011-10-06 12:10:20.0000001,2011-10-06 12:10:20 UTC,-73.949013,40.773204,-73.959622,40.770893,1
2011-10-06 12:10:20.0000003,2011-10-06 12:10:20 UTC,-73.777282,40.646636,-73.985083,40.759368,1
2011-10-06 12:10:20.0000002,2011-10-06 12:10:20 UTC,-74.01409

In [11]:
!powershell -Command "Get-Content '{data_dir}/sample_submission.csv' -TotalCount 10"

key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35


#### Observations:

- This is a supervised learning regression problem
- Training data is 5.5 GB in size
- Training data has 5.5 million rows
- Test set is much smaller (< 10,000 rows)
- The training set has 8 columns:
    - `key` (a unique identifier)
    - `fare_amount` (target column)
    - `pickup_datetime`
    - `pickup_longitude`
    - `pickup_latitude`
    - `dropoff_longitude`
    - `dropoff_latitude`
    - `passenger_count`
- The test set has all columns except the target column `fare_amount`.
- The submission file should contain the `key` and `fare_amount` for each test sample.

### Loading Training Set

Loading the entire dataset into Pandas is not optimal, so the following optimizations were adopted:

- Ignored the `key` column
- Parsed pickup datetime while loading data 
- Specified data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Worked with a 1% sample of the data (~500k rows)

In [12]:
import pandas as pd

In [13]:
selected_columns = "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")

In [14]:
selected_columns

['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [15]:
data_types = {
     "fare_amount": "float32",
     "pickup_longitude": "float32",
     "pickup_latitude": "float32",
     "dropoff_longitude": "float32",
     "dropoff_latitude": "float32",
     "passenger_count": "uint8"
}

data_types

{'fare_amount': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'passenger_count': 'uint8'}

In [16]:
sample_fraction = 0.01

In [21]:
%%time
import random

def skip_row(row_index):
    if row_index == 0:
        return False
    return random.random() > sample_fraction

random.seed(42)
df = pd.read_csv(data_dir/"train.csv", 
                 usecols = selected_columns, 
                 dtype = data_types, 
                 parse_dates = ["pickup_datetime"], 
                 skiprows = skip_row)

CPU times: total: 38.6 s
Wall time: 38.8 s


### Load Test Set

In [22]:
test_df = pd.read_csv(data_dir/"train.csv", dtype = data_types)

In [23]:
df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755482,1
1,8.0,2013-01-17 17:22:00+00:00,0.000000,0.000000,0.000000,0.000000,2
2,8.9,2011-06-15 18:07:00+00:00,-73.996330,40.753223,-73.978897,40.766964,3
3,6.9,2009-12-14 12:33:00+00:00,-73.982430,40.745747,-73.982430,40.745747,1
4,7.0,2013-11-06 11:26:54+00:00,-73.959061,40.781059,-73.962059,40.768604,1
...,...,...,...,...,...,...,...
552445,45.0,2014-02-06 23:59:45+00:00,-73.973587,40.747669,-73.999916,40.602894,1
552446,22.5,2015-01-05 15:29:08+00:00,-73.935928,40.799656,-73.985710,40.726952,2
552447,4.5,2013-02-17 22:27:00+00:00,-73.992531,40.748619,-73.998436,40.740143,1
552448,14.5,2013-01-27 12:41:00+00:00,-74.012115,40.706635,-73.988724,40.756218,1


In [24]:
test_df

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844315,40.721317,-73.841614,40.712276,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016045,40.711304,-73.979271,40.782005,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982735,40.761269,-73.991241,40.750561,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.987129,40.733143,-73.991570,40.758091,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968094,40.768009,-73.956657,40.783764,1
...,...,...,...,...,...,...,...,...
55423851,2014-03-15 03:28:00.00000070,14.0,2014-03-15 03:28:00 UTC,-74.005272,40.740028,-73.963280,40.762554,1
55423852,2009-03-24 20:46:20.0000002,4.2,2009-03-24 20:46:20 UTC,-73.957787,40.765530,-73.951637,40.773960,1
55423853,2011-04-02 22:04:24.0000004,14.1,2011-04-02 22:04:24 UTC,-73.970505,40.752323,-73.960541,40.797340,1
55423854,2011-10-26 05:57:51.0000002,28.9,2011-10-26 05:57:51 UTC,-73.980904,40.764629,-73.870605,40.773964,1


## 2. Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization
- Ask & answer questions

### Training set

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552450 entries, 0 to 552449
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        552450 non-null  float32            
 1   pickup_datetime    552450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   552450 non-null  float32            
 3   pickup_latitude    552450 non-null  float32            
 4   dropoff_longitude  552450 non-null  float32            
 5   dropoff_latitude   552450 non-null  float32            
 6   passenger_count    552450 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.3 MB


In [41]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552450.0,552450.0,552450.0,552450.0,552450.0,552450.0
mean,11.354059,-72.497063,39.9105,-72.504326,39.934265,1.684983
std,9.811924,11.618246,8.061114,12.074346,9.255058,1.337664
min,-52.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967155,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


In [42]:
df["pickup_datetime"].min(), df["pickup_datetime"].max()

(Timestamp('2009-01-01 00:11:46+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

Observations about training data:

- 550k+ rows, as expected
- No missing data (in the sample)
- `fare_amount` ranges from \$-52.0 to \$499.0 
- `passenger_count` ranges from 0 to 208 
- There seem to be some errors in the latitude & longitude values
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up 15 MB of space in the RAM

We may need to deal with outliers and data entry errors before we train our model.

### Test Set

In [45]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   key                9914 non-null   object 
 1   pickup_datetime    9914 non-null   object 
 2   pickup_longitude   9914 non-null   float32
 3   pickup_latitude    9914 non-null   float32
 4   dropoff_longitude  9914 non-null   float32
 5   dropoff_latitude   9914 non-null   float32
 6   passenger_count    9914 non-null   uint8  
dtypes: float32(4), object(2), uint8(1)
memory usage: 319.6+ KB


In [43]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


In [44]:
test_df["pickup_datetime"].min(), test_df["pickup_datetime"].max()

('2009-01-01 11:04:24 UTC', '2015-06-30 20:03:50 UTC')