# Information about the data

Link to download the data: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

This data dictionary describes the yellow taxi trip data.

Column Name | Description
--- | ---
VendorID | A code indicating the TPEP provider that provided the record.<br>1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
tpep_pickup_datetime | The date and time when the meter was engaged.
tpep_dropoff_datetime | The date and time when the meter was disengaged.
Passenger_count | The number of passengers in the vehicle.<br>This is a driver-entered value.
Trip_distance | The elapsed trip distance in miles reported by the taximeter.
PULocationID | TLC Taxi Zone in which the taximeter was engaged.
DOLocationID | TLC Taxi Zone in which the taximeter was disengaged.
RateCodeID | The final rate code in effect at the end of the trip.<br>1= Standard rate.<br>2=JFK<br>3=Newark<br>4=Nassau or Westchester<br>5=Negotiated fare<br>6=Group ride
Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. <br>Y= store and forward trip <br>N= not a store and forward trip
Payment_type | A numeric code signifying how the passenger paid for the trip. <br>1= Credit card <br>2= Cash <br>3= No charge <br>4= Dispute <br>5= Unknown<br>6= Voided trip
Fare_amount | The time-and-distance fare calculated by the meter.
Extra | Miscellaneous extras and surcharges. Currently, this only includes the 0.50 and 1.00 USD rush hour and overnight charges.
MTA_tax | 0.50 USD MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge | 0.30 USD improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount | This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount | Total amount of all tolls paid in trip.
Total_amount | The total amount charged to passengers. Does not include cash tips.
Congestion_Surcharge | Total amount collected in trip for NYS congestion surcharge.
Airport_fee | $1.25 for pickups only at LaGuardia and John F. Kennedy Airports.
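
The coded columns in the table above are easier to read once the codes are mapped to labels. The sketch below does this for `Payment_type`, using the code list from the table. The three-row `trips` frame and the `payment_label` column name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Code-to-label mapping copied from the Payment_type row of the data dictionary.
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# Hypothetical sample rows; the real data comes from the parquet files.
trips = pd.DataFrame({"payment_type": [1, 2, 1], "tip_amount": [4.0, 0.0, 3.28]})

# Add a readable label next to the numeric code.
trips["payment_label"] = trips["payment_type"].map(PAYMENT_TYPES)
print(trips)
```

The same pattern would apply to `RateCodeID` with its own code list.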

# Importing libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge

import pickle

# Reading and Cleaning the data

In [3]:
def read_datafram(filename):
    """
    Read a yellow taxi trip parquet file, add a `duration` feature (in minutes),
    keep trips lasting 1 to 60 minutes, drop unused columns and rows with
    missing values, and return the cleaned DataFrame.
    """
    df = pd.read_parquet(filename)

    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

    # Trip duration in minutes
    df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy to avoid modifying a slice of the original
    df = df[(df['duration'] >= 1) & (df['duration'] <= 60)].copy()

    # Drop columns that are not used as features
    df = df.drop(columns=['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'store_and_fwd_flag'])

    df = df.dropna()

    return df
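
To see what the duration filter keeps and discards, here is a small sketch on made-up timestamps. The `toy` frame is an assumption for illustration only; the column names match the ones used in the cleaning function.

```python
import pandas as pd

# Three hypothetical trips lasting 30 seconds, 15 minutes and 2 hours.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 00:10:00", "2023-01-01 01:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:00:30", "2023-01-01 00:25:00", "2023-01-01 03:00:00"]
    ),
})

# Duration in minutes, computed on the whole column at once.
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Same bounds as the cleaning step: 1 to 60 minutes, both ends included.
kept = toy[toy["duration"].between(1, 60)]
print(kept)
```

Only the 15-minute trip survives; the very short and very long trips are treated as outliers.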

In [4]:
train_df = read_datafram('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
val_df = read_datafram('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet')

In [5]:
train_df.head()

Unnamed: 0,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,1.0,0.97,1.0,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0,8.433333
1,1.0,1.1,1.0,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0,6.316667
2,1.0,2.51,1.0,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0,12.75
3,0.0,1.9,1.0,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25,9.616667
4,1.0,1.43,1.0,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0,10.833333


In [6]:
val_df.head()

Unnamed: 0,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration
0,2.0,0.3,1.0,142,163,2,4.4,3.5,0.5,0.0,0.0,1.0,9.4,2.5,0.0,1.683333
3,0.0,18.8,1.0,132,26,1,70.9,2.25,0.5,0.0,0.0,1.0,74.65,0.0,1.25,32.083333
4,1.0,3.22,1.0,161,145,1,17.0,1.0,0.5,3.3,0.0,1.0,25.3,2.5,0.0,13.3
5,1.0,5.1,1.0,148,236,1,21.9,3.5,0.5,5.35,0.0,1.0,32.25,2.5,0.0,14.633333
6,1.0,8.9,1.0,137,244,1,41.5,3.5,0.5,3.5,0.0,1.0,50.0,2.5,0.0,27.95


In [7]:
# make the feature names the same (the February file uses `Airport_fee` instead of `airport_fee`)
val_df.columns = train_df.columns
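
Overwriting column names positionally is safe only when both frames have the same columns in the same order. A slightly more defensive version could check that first. This is a minimal sketch; `align_columns` is a hypothetical helper and the two small frames are made up.

```python
import pandas as pd


def align_columns(reference: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `other` with `reference`'s column names.

    Raises if the two frames differ by more than letter case.
    """
    ref_lower = [c.lower() for c in reference.columns]
    other_lower = [c.lower() for c in other.columns]
    if ref_lower != other_lower:
        raise ValueError("column names differ by more than letter case")
    aligned = other.copy()
    aligned.columns = reference.columns
    return aligned


# Hypothetical frames that mimic the `airport_fee` / `Airport_fee` mismatch.
jan = pd.DataFrame({"airport_fee": [0.0], "duration": [8.4]})
feb = pd.DataFrame({"Airport_fee": [1.25], "duration": [32.1]})

feb_aligned = align_columns(jan, feb)
print(list(feb_aligned.columns))
```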

# Splitting the data

In [8]:
# select some features for the model
features = ['passenger_count', 'trip_distance', 'PULocationID', 'DOLocationID', 'total_amount']

In [9]:
# splitting the data into features and target
X_train = train_df[features]
y_train = train_df['duration'].values

X_val_full = val_df[features]
y_val_full = val_df['duration'].values

In [10]:
# split the val_full data into test and validation
X_test, X_val, y_test, y_val = train_test_split(X_val_full, y_val_full,
                                                test_size=0.5, shuffle=True)
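
Because the split is shuffled without a fixed seed, each run puts different rows in the test and validation halves. Passing `random_state` makes the split repeatable. The sketch below shows this on made-up arrays; the seed value 42 is arbitrary, and adding a seed to the notebook would change which rows land in each half, so the scores reported here would not be reproduced exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed give the same halves.
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=42
)
X_a2, X_b2, y_a2, y_b2 = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=42
)

print(np.array_equal(y_a, y_a2), np.array_equal(y_b, y_b2))
```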

# Building the model

In [12]:
re = Ridge(alpha=0.8)
re.fit(X_train, y_train)

y_pred = re.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)

print('MAE: ', mae)

MAE:  4.020721125685651


In [13]:
# Evaluate the model on test data
y_pred = re.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("MAE: ", mae)

MAE:  4.022835436956832
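
The metric used in this notebook is the mean absolute error, which is expressed in the same unit as the target (minutes). It is easy to confuse with the mean squared error, so here is a small comparison on made-up numbers. The arrays are hypothetical and unrelated to the taxi data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted durations in minutes.
y_true_toy = np.array([10.0, 20.0, 30.0])
y_pred_toy = np.array([12.0, 18.0, 36.0])

# Absolute errors are 2, 2 and 6 minutes.
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)  # (2 + 2 + 6) / 3
mse_toy = mean_squared_error(y_true_toy, y_pred_toy)   # (4 + 4 + 36) / 3

print("MAE (minutes):", mae_toy)
print("MSE (minutes squared):", mse_toy)
```

MSE penalises the single 6-minute miss much more heavily than MAE does, which is why the two values should not be reported under the same label.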


In [16]:
# Saving the model
with open('../models/duration-prediction.sav', 'wb') as f_out:
    pickle.dump(re, f_out)

In [2]:
# Load model
with open('../models/duration-prediction.sav', 'rb') as f_in:
    duration_model = pickle.load(f_in)

In [11]:
# Select a sample from test data
sample = X_test.sample(1)
sample

Unnamed: 0,passenger_count,trip_distance,PULocationID,DOLocationID,total_amount
2484935,1.0,4.11,239,234,34.7


In [14]:
# Make prediction with loaded model
prediction = duration_model.predict(sample)
prediction[0]

16.923742992078264
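
The save and load steps can be checked end to end by confirming that a restored model gives the same predictions as the original. The sketch below does this with a tiny Ridge model fitted on made-up data and a temporary directory; the file name `toy-model.sav` is hypothetical. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge

# Made-up data (y = 2 * x), just to have something to fit.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])
model = Ridge(alpha=0.8).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy-model.sav"  # hypothetical file name

    # Context managers close the file handles even if an error occurs.
    with open(model_path, "wb") as f_out:
        pickle.dump(model, f_out)

    with open(model_path, "rb") as f_in:
        restored = pickle.load(f_in)

# The restored model should predict exactly like the original.
same = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same)
```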