
# 🚕 NYC Taxi Fare Prediction

This notebook builds a multiple linear regression model that predicts NYC taxi fares from trip distance and trip duration, using a cleaned sample dataset. It covers preprocessing, feature engineering, model training, and performance evaluation.
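
As a sketch of what such a model fits, a multiple linear regression on these two features has the form below. The symbols $\beta_0$, $\beta_1$, $\beta_2$, and $\varepsilon$ are notation introduced here for illustration, not names used in the notebook.

$$
\text{fare} = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{duration} + \varepsilon
$$

Here $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the per-unit contributions of distance and duration, and $\varepsilon$ is the residual error. scikit-learn's `LinearRegression` estimates the coefficients by ordinary least squares.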


## 📂 Load Dataset (Colab Compatible)

In [None]:

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

# Download the dataset when running in Colab (-nc skips the download if the file already exists)
if "google.colab" in str(get_ipython()):
    !wget -nc https://raw.githubusercontent.com/Rafsun-Chowdhury/NYC-Taxi-Fare-Prediction/main/taxi_fare_data.csv

# Load the dataset
df = pd.read_csv("taxi_fare_data.csv")
df.head()


## 🧮 Feature Engineering

In [None]:

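# Derive trip duration in minutes from the pickup and dropoff timestamps, if present
if 'tpep_pickup_datetime' in df.columns and 'tpep_dropoff_datetime' in df.columns:
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
    df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60

# The filters and the model below need a 'duration' column; fail early with a clear message
if 'duration' not in df.columns:
    raise KeyError("Expected a 'duration' column or both tpep_* datetime columns")

# Drop implausible rows: non-positive values and extreme outliers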
df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 200)]
df = df[(df['trip_distance'] > 0) & (df['trip_distance'] < 100)]
df = df[(df['duration'] > 0) & (df['duration'] < 180)]

df[['trip_distance', 'duration', 'fare_amount']].describe()
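
To see how many rows each filter rule removes, the bounds can be applied one column at a time. The following is a minimal sketch on made-up rows (the `demo` frame and the `bounds` dictionary are illustrative names, not part of the notebook); it reuses the same exclusive limits as the filters above.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the taxi dataset)
demo = pd.DataFrame({
    "fare_amount":   [12.5, -3.0, 250.0, 8.0, 15.0],
    "trip_distance": [2.1, 1.0, 30.0, 0.0, 3.4],
    "duration":      [11.0, 5.0, 60.0, 4.0, 200.0],
})

# Same limits as the notebook's filters, written as exclusive (low, high) pairs
bounds = {
    "fare_amount": (0, 200),
    "trip_distance": (0, 100),
    "duration": (0, 180),
}

mask = pd.Series(True, index=demo.index)
for col, (low, high) in bounds.items():
    ok = demo[col].between(low, high, inclusive="neither")
    print(f"{col}: {(~ok).sum()} row(s) outside ({low}, {high})")
    mask &= ok  # a row survives only if it passes every rule

clean = demo[mask]
print(f"kept {len(clean)} of {len(demo)} rows")
```

Reporting the per-column counts makes it easier to notice when a single rule discards an unexpectedly large share of the data.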


## 🤖 Model Training

In [None]:

features = ['trip_distance', 'duration']
X = df[features]
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
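
A fitted `LinearRegression` exposes its parameters through `intercept_` and `coef_`, which helps when interpreting the model. The sketch below uses synthetic, noise-free data generated from an assumed pricing rule (2.5 base, 2.0 per distance unit, 0.5 per minute); these numbers are invented for illustration and say nothing about real NYC fares.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the notebook's dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20, 500),
    "duration": rng.uniform(2, 60, 500),
})
# Assumed rule used only to generate the toy target
demo["fare_amount"] = 2.5 + 2.0 * demo["trip_distance"] + 0.5 * demo["duration"]

feature_names = ["trip_distance", "duration"]
demo_model = LinearRegression().fit(demo[feature_names], demo["fare_amount"])

# Pair each coefficient with its feature name for readability
coefs = pd.Series(demo_model.coef_, index=feature_names)
print(f"intercept: {demo_model.intercept_:.2f}")
print(coefs.round(2))
```

Because the toy target has no noise, the fit should recover the assumed intercept and slopes up to floating-point error. On the notebook's `model`, the same attributes (`model.intercept_`, `model.coef_`) give the estimated base fare and the per-unit contributions of distance and duration.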


## 📈 Model Evaluation

In [None]:

mae = mean_absolute_error(y_test, y_pred)
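# Take the square root explicitly: the `squared` argument is not available in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))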
r2 = r2_score(y_test, y_pred)

print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"R² Score: {r2:.2f}")
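
For reference, the three metrics can be computed by hand from the residuals. This is a small sketch on made-up numbers (the `actual` and `predicted` arrays are illustrative, not model output), cross-checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

errors = actual - predicted

# MAE: average absolute error, in the same units as the target
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses more
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE:  {mae_manual:.2f}")   # 2.50
print(f"RMSE: {rmse_manual:.2f}")  # 2.55
print(f"R²:   {r2_manual:.3f}")    # 0.948

# Cross-check against scikit-learn
assert np.isclose(mae_manual, mean_absolute_error(actual, predicted))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(actual, predicted)))
assert np.isclose(r2_manual, r2_score(actual, predicted))
```

MAE and RMSE are in dollars when the target is a fare, while R² is unitless and compares the model against always predicting the mean fare.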


## 📊 Prediction Visualization

In [None]:

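plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.3)

# Reference line y = x: points on it are predicted exactly
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, color="red", linestyle="--", label="Perfect prediction")

plt.xlabel("Actual Fare")
plt.ylabel("Predicted Fare")
plt.title("Actual vs Predicted Fare")
plt.legend()
plt.grid(True)
plt.show()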



## ✅ Conclusion

The linear regression model captures the relationship between trip distance, trip duration, and fare amount; the MAE, RMSE, and R² reported above summarize how well these two features explain the fare. With additional feature engineering (e.g., pickup locations, time-of-day effects), performance could likely be improved further.
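
As a starting point for the time-of-day idea, features can be derived from the pickup timestamp with pandas' `.dt` accessor. The sketch below uses made-up timestamps; the column names `pickup_hour`, `is_weekend`, and `is_rush_hour`, and the rush-hour window itself, are assumptions for illustration rather than anything defined in this notebook.

```python
import pandas as pd

# Made-up pickup times for illustration only
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-02 08:15:00",  # Monday morning
        "2023-01-07 23:40:00",  # Saturday night
        "2023-01-04 17:05:00",  # Wednesday evening
    ])
})

pickup = trips["tpep_pickup_datetime"]

# Hour of day (0-23) and a weekend flag (Monday=0 ... Sunday=6)
trips["pickup_hour"] = pickup.dt.hour
trips["is_weekend"] = pickup.dt.dayofweek >= 5

# Hypothetical rush-hour window: weekdays, 07:00-09:59 and 16:00-18:59
rush_hours = [7, 8, 9, 16, 17, 18]
trips["is_rush_hour"] = ~trips["is_weekend"] & pickup.dt.hour.isin(rush_hours)

print(trips)
```

These columns could then be appended to the `features` list before the train/test split. Since hour of day is cyclical, treating it as a plain number in a linear model is a simplification; one-hot encoding the hour is a common alternative.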
