## 1. Loading the Dataset

We'll use the NYC Yellow Taxi Trip data again, this time the January 2020 file. Our goal is to predict the fare amount for a trip from features such as passenger count and trip distance. If the file isn't already in your working directory, the code below downloads it first.

In [1]:
import os
import urllib.request

import pandas as pd

# Source archive and the local filename it is cached under
nyc_url = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2020-01.csv.gz'
local_path = 'yellow_tripdata_2020-01.csv.gz'

# Download only if the file is not already present
if not os.path.exists(local_path):
    print('Downloading NYC taxi data...')
    urllib.request.urlretrieve(nyc_url, local_path)
    print('Download complete.')

# low_memory=False reads the whole file at once so column dtypes are inferred consistently
df = pd.read_csv(local_path, compression='gzip', low_memory=False)
df.head()

Downloading NYC taxi data...
Download complete.


|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1.0 | 6.0 | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 |
| 1 | 1.0 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1.0 | 7.0 | 3.0 | 0.5 | 1.5 | 0.0 | 0.3 | 12.3 | 2.5 |
| 2 | 1.0 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1.0 | 6.0 | 3.0 | 0.5 | 1.0 | 0.0 | 0.3 | 10.8 | 2.5 |
| 3 | 1.0 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1.0 | 5.5 | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 |
| 4 | 2.0 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2.0 | 3.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 4.8 | 0.0 |


## 2. Data Cleaning and Feature Selection

Next, we clean the data and select the columns used for prediction: `passenger_count` and `trip_distance` as input features, with `fare_amount` as the target. We drop rows with missing values, along with rows where any of these three columns is zero or negative, which we treat as invalid.

In [2]:
# Select relevant columns and drop missing/invalid values
ml_df = df[['passenger_count', 'trip_distance', 'fare_amount']].copy()
ml_df = ml_df.dropna()
ml_df = ml_df[(ml_df['passenger_count'] > 0) & (ml_df['trip_distance'] > 0) & (ml_df['fare_amount'] > 0)]
ml_df.head()

|   | passenger_count | trip_distance | fare_amount |
|---|---|---|---|
| 0 | 1.0 | 1.2 | 6.0 |
| 1 | 1.0 | 1.2 | 7.0 |
| 2 | 1.0 | 0.6 | 6.0 |
| 3 | 1.0 | 0.8 | 5.5 |
| 5 | 1.0 | 0.03 | 2.5 |
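To make the two cleaning steps concrete, here is a minimal sketch on a toy DataFrame. The rows are hypothetical (they are not taken from the taxi file) and are chosen so that each one fails a different check; the names `toy`, `no_missing`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not from the taxi data), one per failure case
toy = pd.DataFrame({
    'passenger_count': [1.0, 0.0, 2.0, np.nan, 1.0],
    'trip_distance':   [1.2, 3.0, 0.0, 2.5, 0.8],
    'fare_amount':     [6.0, 12.0, 3.5, 9.0, -4.0],
})

# Step 1: drop rows with any missing value (removes row 3)
no_missing = toy.dropna()

# Step 2: keep only rows where all three columns are strictly positive
valid = (
    (no_missing['passenger_count'] > 0)
    & (no_missing['trip_distance'] > 0)
    & (no_missing['fare_amount'] > 0)
)
cleaned = no_missing[valid]

print(f'{len(toy)} rows -> {len(no_missing)} after dropna -> {len(cleaned)} after filtering')
print('surviving index labels:', cleaned.index.tolist())
```

Filtering keeps the original index labels rather than renumbering. That is why the cleaned taxi data above jumps from index 3 to index 5: row 4 of the raw file has a `trip_distance` of 0.0 and is removed.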


## 3. Training a Regression Model

We'll use scikit-learn to fit a linear regression model that predicts the fare amount from passenger count and trip distance. We split the data into training and test sets, fit the model on the training set, and keep the test set aside for evaluation.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Features and target

In [4]:

X = ml_df[['passenger_count', 'trip_distance']]
y = ml_df['fare_amount']
X.head()

|   | passenger_count | trip_distance |
|---|---|---|
| 0 | 1.0 | 1.2 |
| 1 | 1.0 | 1.2 |
| 2 | 1.0 | 0.6 |
| 3 | 1.0 | 0.8 |
| 5 | 1.0 | 0.03 |


In [5]:
y.head()

|   | fare_amount |
|---|---|
| 0 | 6.0 |
| 1 | 7.0 |
| 2 | 6.0 |
| 3 | 5.5 |
| 5 | 2.5 |


## Split data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((4913155, 2), (1228289, 2))
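The behaviour of `test_size` and `random_state` is easier to see on a small example. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) and an illustrative helper `split`; none of these names come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns.
# Row i is [2*i, 2*i + 1] and its target is i, so alignment is easy to check.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

def split(seed):
    """Illustrative helper: 80/20 split with a fixed seed."""
    return train_test_split(X_toy, y_toy, test_size=0.2, random_state=seed)

X_tr, X_te, y_tr, y_te = split(42)
X_tr2, X_te2, y_tr2, y_te2 = split(42)

print('train shape:', X_tr.shape, 'test shape:', X_te.shape)
print('same seed gives same split:', np.array_equal(X_te, X_te2))
print('features still match targets:', np.array_equal(X_te[:, 0], 2 * y_te))
```

The same arithmetic explains the shapes above: 6,141,444 cleaned rows times 0.2 is 1,228,288.8, which scikit-learn rounds up to 1,228,289 test rows, leaving 4,913,155 for training.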

## Train model

In [7]:
model = LinearRegression()
model.fit(X_train, y_train)

## Predict

In [8]:
y_pred = model.predict(X_test)
y_pred

array([11.05147839,  5.5618292 , 50.80315924, ..., 15.60319757,
       15.46113235, 44.42661805])
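A raw array of predictions says little about what the model learned. A fitted `LinearRegression` exposes its weights through `coef_` and `intercept_`, and the sketch below shows how to read them. It uses synthetic data generated from a made-up fare rule (a 2.50 base plus 2.50 per mile, with no passenger effect); that rule is an assumption for illustration, not the real taxi tariff, and `toy_X`, `toy_y`, and `toy_model` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (hypothetical, not the taxi data)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'passenger_count': rng.integers(1, 5, size=50).astype(float),
    'trip_distance': rng.uniform(0.5, 10.0, size=50),
})

# Made-up fare rule: base 2.50 plus 2.50 per mile, independent of passengers
toy_y = 2.5 + 2.5 * toy_X['trip_distance']

toy_model = LinearRegression().fit(toy_X, toy_y)

# One coefficient per feature column, in column order
for name, coef in zip(toy_X.columns, toy_model.coef_):
    print(f'{name}: {coef:.3f}')
print(f'intercept: {toy_model.intercept_:.3f}')

# Predict a single new trip; column names must match the training frame
new_trip = pd.DataFrame({'passenger_count': [2.0], 'trip_distance': [3.0]})
toy_fare = toy_model.predict(new_trip)[0]
print(f'predicted fare for a 3-mile trip: {toy_fare:.2f}')
```

Because the synthetic fares follow the rule exactly, the fit recovers it: a distance coefficient of about 2.5, a passenger coefficient of about 0, and an intercept of about 2.5. Running the same loop over `model.coef_` with `X.columns` would show the corresponding weights for the taxi model.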

## 4. Interpreting Results

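Our regression model gives us a quick way to predict taxi fares from trip details. The Mean Squared Error and R² score computed below tell us how well the model fits the held-out test set. MSE is measured in squared dollars, so its square root is easier to read: an MSE of 16.02 corresponds to a typical error of roughly $4 per trip. An R² of 0.87 means the model accounts for about 87% of the variance in test-set fares.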

In [9]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R^2 Score: {r2:.2f}')

Mean Squared Error: 16.02
R^2 Score: 0.87
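Both metrics are simple enough to compute by hand, which helps when interpreting them. The sketch below uses a handful of hypothetical fares and predictions (not values from the model above) and plain Python: MSE is the mean of the squared errors, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import math

# Hypothetical fares and predictions in dollars (not from the taxi model)
actual = [6.0, 7.0, 6.0, 5.5, 12.0]
predicted = [6.5, 6.0, 6.5, 6.0, 10.0]

n = len(actual)
mean_actual = sum(actual) / n

# Residual sum of squares: error of the model's predictions
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error of always predicting the mean fare
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

toy_mse = ss_res / n
toy_rmse = math.sqrt(toy_mse)   # back in dollars, unlike MSE
toy_r2 = 1 - ss_res / ss_tot

print(f'MSE:  {toy_mse:.2f}')
print(f'RMSE: {toy_rmse:.2f}')
print(f'R^2:  {toy_r2:.2f}')
```

Seen this way, R² compares the model against a baseline that always predicts the mean fare: a score of 0 means no better than that baseline, and 1 means perfect predictions.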
