
# Taxi Trip Duration Prediction

This notebook predicts NYC yellow taxi trip durations with linear regression. The model is trained on the January 2023 trip records and validated on the February 2023 records, in the following steps:

1. Import necessary libraries
2. Load and preprocess the January dataset
3. Feature extraction and transformation
4. Train a linear regression model
5. Validate the model on February data

## Import Libraries

We begin by importing the necessary libraries for data manipulation, visualization, and machine learning.


In [26]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Load January Data

Next, we load the January data from a Parquet file and check how many columns it has. The trip duration is computed in the following section.


In [2]:
current_path = Path.cwd()
print(f"Current path: {current_path}")
# Join path components with "/" so no backslash escape (e.g. "\y") ends up in the string
data_path = current_path / "Data" / "yellow_tripdata_2023-01.parquet"

Current path: d:\mLOPS\2024-MLOPS-ZOOMCAMP\HW1


In [3]:
df_jan = pd.read_parquet(data_path)
print("The total columns in January datasets :", len(df_jan.columns))


The total columns in January datasets : 19


## Calculate Trip Duration

We calculate the trip duration in minutes, check its spread, and then keep only trips lasting between 1 and 60 minutes (inclusive), treating everything else as an outlier.


In [4]:
df_jan["duration_datetime"] = df_jan["tpep_dropoff_datetime"] - df_jan["tpep_pickup_datetime"]

In [5]:
df_jan['trip_duration_minutes'] = df_jan['duration_datetime'].dt.total_seconds() / 60

In [6]:
trip_duration_std = df_jan['trip_duration_minutes'].std()
print(f"Standard Deviation of Trip Durations (in minutes): {trip_duration_std:.2f}")

Standard Deviation of Trip Durations (in minutes): 42.59


In [7]:
filtered_df = df_jan[(df_jan["trip_duration_minutes"]>=1) & (df_jan["trip_duration_minutes"]<=60)]

In [8]:
fraction_left = len(filtered_df) / len(df_jan)

# Display the fraction
print(f"Fraction of records left after dropping outliers: {fraction_left:.2%}")

Fraction of records left after dropping outliers: 98.12%
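
To make the duration calculation and the outlier filter concrete, here is a minimal sketch on a tiny hand-made DataFrame. The three trips are hypothetical and not taken from the taxi data; only the column names match the notebook. It uses `Series.between`, which is inclusive on both ends by default, as an equivalent of the `>= 1` and `<= 60` comparison above.

```python
import pandas as pd

# Hypothetical toy data: three trips lasting 30 seconds, 15 minutes, and 2 hours.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-01 00:00:00", "2023-01-01 08:00:00", "2023-01-01 12:00:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-01 00:00:30", "2023-01-01 08:15:00", "2023-01-01 14:00:00",
    ]),
})

# Subtracting two datetime columns gives a timedelta column,
# which .dt.total_seconds() converts to seconds.
toy["trip_duration_minutes"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# between() is inclusive on both ends by default, matching >= 1 and <= 60.
kept = toy[toy["trip_duration_minutes"].between(1, 60)]
print(kept["trip_duration_minutes"].tolist())  # [15.0]
```

Only the 15-minute trip survives: the 30-second trip is too short and the 2-hour trip is too long.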


In [9]:
filtered_df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee',
       'duration_datetime', 'trip_duration_minutes'],
      dtype='object')

## Feature Extraction

We use the pickup and dropoff location IDs as the only features. The IDs are cast to strings so that `DictVectorizer` one-hot encodes them as categories instead of treating them as numeric values.


In [10]:
features = filtered_df[['PULocationID', 'DOLocationID']]
features = features.astype('str')

In [None]:
data_dicts = features.to_dict(orient='records')

# Fit a DictVectorizer and encode the location IDs as a sparse matrix
dict_vectorizer = DictVectorizer()
feature_matrix = dict_vectorizer.fit_transform(data_dicts)

In [12]:
print("The dimensionality of this matrix (number of columns) : ",feature_matrix.shape[1])

The dimensionality of this matrix (number of columns) :  515


In [13]:
target = filtered_df['trip_duration_minutes'].values

## Train Linear Regression Model

We train a linear regression model on the transformed feature matrix and the target variable (trip duration).


In [15]:
model = LinearRegression()
model.fit(feature_matrix, target)

# Predict target variable on the training data
predictions = model.predict(feature_matrix)

## Evaluate Model Performance on Training Data

We evaluate the model's performance on the training data by calculating the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).


In [17]:


# Calculate the MSE
mse = mean_squared_error(target, predictions)

# Display the MSE
print(f"Mean Squared Error (MSE) on the training data: {mse:.2f}")

Mean Squared Error (MSE) on the training data: 58.51


In [18]:
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE) on the training data: {rmse:.2f}")

Root Mean Squared Error (RMSE) on the training data: 7.65
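
As a sanity check on the two metrics, the sketch below computes MSE and RMSE by hand on three hypothetical values (not taken from the taxi data) and compares the MSE with scikit-learn's `mean_squared_error`. RMSE is the square root of MSE, so it is expressed in the same unit as the target, here minutes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical durations in minutes, chosen only for illustration.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# MSE is the mean of the squared errors: (4 + 4 + 9) / 3.
mse_manual = np.mean((y_true - y_pred) ** 2)

# RMSE is the square root of the MSE.
rmse_manual = np.sqrt(mse_manual)

# The manual MSE should agree with scikit-learn's implementation.
print(mse_manual, mean_squared_error(y_true, y_pred), rmse_manual)
```

The same relationship holds for the training results above: the square root of 58.51 is about 7.65.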


## Validation on February Data

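We validate the model on February data. The same preprocessing is applied, and the features are encoded with the `DictVectorizer` already fitted on January (`transform`, not `fit_transform`), so the columns match those the model was trained on. We then predict trip durations and calculate the RMSE.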


In [19]:
def data_preprocessing(df):
    # Compute the trip duration in minutes (adds two columns to df in place)
    df["duration_datetime"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df['trip_duration_minutes'] = df['duration_datetime'].dt.total_seconds() / 60
    # Keep only trips lasting between 1 and 60 minutes (inclusive)
    filtered_df = df[(df["trip_duration_minutes"] >= 1) & (df["trip_duration_minutes"] <= 60)]
    return filtered_df

In [20]:
def features_extraction(df):
    features = df[['PULocationID', 'DOLocationID']]
    features = features.astype('str')
    data_dicts = features.to_dict(orient='records')
    # Reuse the module-level dict_vectorizer fitted on January (transform only, no refit)
    feature_matrix = dict_vectorizer.transform(data_dicts)
    target = df['trip_duration_minutes'].values
    return feature_matrix, target

In [21]:
# Join path components with "/" so no backslash escape (e.g. "\y") ends up in the string
data_path = current_path / "Data" / "yellow_tripdata_2023-02.parquet"
df_val = pd.read_parquet(data_path)
df_val_filtered = data_preprocessing(df_val)
X_test, y_test = features_extraction(df_val_filtered)

In [22]:
X_test.shape

(2855951, 515)

In [23]:
predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Root Mean Squared Error (RMSE) on the validation data: {rmse:.2f}")

Root Mean Squared Error (RMSE) on the validation data: 7.81
