# MLOps Zoomcamp 2025 - Homework 1

This notebook contains the solution for Homework 1 of the MLOps Zoomcamp 2025 course. The goal is to train a model for predicting taxi ride durations using Yellow Taxi Trip Records data.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use the Yellow Taxi Trip Records for January and February 2023.

In [None]:
# Download January data
!wget -P data/ https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet

# Download February data
!wget -P data/ https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet

## Q2. Computing duration

Let's compute each trip's duration in minutes from the pickup and dropoff timestamps, then calculate its standard deviation over the January data.

In [None]:
# Load January data
df_jan = pd.read_parquet('data/yellow_tripdata_2023-01.parquet')

# Compute duration
df_jan['duration'] = df_jan['tpep_dropoff_datetime'] - df_jan['tpep_pickup_datetime']
df_jan['duration'] = df_jan['duration'].dt.total_seconds() / 60

# Calculate standard deviation
std_duration = df_jan['duration'].std()
std_duration
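
As a small illustration of the computation above, here is a sketch on a hypothetical three-trip frame (the `toy` name and its timestamps are made up for this example; only the two column names come from the dataset). It shows that subtracting the datetime columns gives timedeltas, which `.dt.total_seconds() / 60` turns into minutes.

```python
import pandas as pd

# Hypothetical three-trip frame that mimics the two timestamp columns used above
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 00:10:00", "2023-01-01 01:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:05:00", "2023-01-01 00:40:00", "2023-01-01 01:00:30"]
    ),
})

# Subtracting two datetime columns yields a timedelta column
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# Convert the timedeltas to minutes (floats)
toy["duration"] = delta.dt.total_seconds() / 60

toy_durations = toy["duration"].tolist()  # [5.0, 30.0, 0.5]
toy_std = toy["duration"].std()           # sample standard deviation (ddof=1)
```

Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`); on millions of rows the difference from the population version is negligible.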

## Q3. Dropping outliers

Let's keep only the records whose duration is between 1 and 60 minutes (inclusive) and compute the fraction of records that remain.

In [None]:
# Filter out outliers
df_filtered = df_jan[(df_jan['duration'] >= 1) & (df_jan['duration'] <= 60)]

# Calculate fraction of remaining records
fraction_remaining = len(df_filtered) / len(df_jan)
fraction_remaining
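
The filter above can also be written with `Series.between`, which is inclusive on both ends by default. The following is a minimal sketch on made-up durations (the `toy` frame and its values are assumptions for illustration), showing that the boundary values 1 and 60 are kept.

```python
import pandas as pd

# Hypothetical durations in minutes, including both boundary values
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.3, 60.0, 75.0]})

# between() is inclusive on both ends by default, matching (>= 1) & (<= 60)
mask = toy["duration"].between(1, 60)

# Keep the in-range rows and compute the fraction that survives the filter
kept = toy[mask].copy()
fraction = len(kept) / len(toy)
```

Here three of the five rows survive, so `fraction` is 0.6; the mean of the boolean mask gives the same number.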

## Q4. One-hot encoding

Let's prepare the feature matrix using one-hot encoding for pickup and dropoff location IDs.

In [None]:
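# Categorical features: pickup and dropoff location IDs
categorical = ['PULocationID', 'DOLocationID']

# Cast the IDs to strings so DictVectorizer one-hot encodes them, then convert
# to a list of dictionaries. Casting on a new frame (as in Q6) avoids assigning
# into a slice of df_jan, which triggers pandas' SettingWithCopyWarning.
train_dicts = df_filtered[categorical].astype(str).to_dict(orient='records')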

# Create and fit DictVectorizer
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# Get feature matrix dimensions
X_train.shape

## Q5. Training a model

Let's train a linear regression model and calculate the RMSE on the training data.

In [None]:
# Prepare target variable
y_train = df_filtered['duration'].values

# Train the model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions
y_pred = lr.predict(X_train)

# Calculate RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred))
rmse_train
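
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below, on made-up arrays (`y_true` and `y_hat` are hypothetical), checks that a manual NumPy computation agrees with `np.sqrt(mean_squared_error(...))` as used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
```

Because RMSE is in the same units as the target, the value can be read directly as a typical error in minutes.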

## Q6. Evaluating the model

Let's evaluate the model on the validation data (February 2023).

In [None]:
# Load February data
df_feb = pd.read_parquet('data/yellow_tripdata_2023-02.parquet')

# Compute duration
df_feb['duration'] = df_feb['tpep_dropoff_datetime'] - df_feb['tpep_pickup_datetime']
df_feb['duration'] = df_feb['duration'].dt.total_seconds() / 60

# Filter out outliers
df_feb = df_feb[(df_feb['duration'] >= 1) & (df_feb['duration'] <= 60)]

# Prepare validation data
val_dicts = df_feb[categorical].astype(str).to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_val = df_feb['duration'].values

# Make predictions
y_pred_val = lr.predict(X_val)

# Calculate RMSE
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
rmse_val
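
One detail worth keeping in mind for validation: `dv.transform` reuses the columns learned from the training data, so a location ID that appears only in February gets no column of its own. A minimal sketch with made-up IDs (`toy_dv` and the values are assumptions for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on hypothetical "training" IDs only
toy_dv = DictVectorizer()
toy_dv.fit([{"PULocationID": "10"}, {"PULocationID": "20"}])

# Transform a record whose ID was never seen during fit
row = toy_dv.transform([{"PULocationID": "99"}])

# The unseen value is silently ignored: same two columns, all zeros
row_dense = row.toarray().tolist()
```

The column count stays at two and the row is all zeros, which is why `X_val` has the same number of columns as `X_train` and can be passed to `lr.predict` directly.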