# MLOps Zoomcamp 2025 - Homework 1

This notebook contains the solution for Homework 1 of the MLOps Zoomcamp 2025 course. The goal is to train a model for predicting taxi ride durations using Yellow Taxi Trip Records data.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use the Yellow Taxi Trip Records for January and February 2023: January for training and February for validation.

In [2]:
import os

# Define base URL and local download directory
base_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/'
download_dir = './data'

# List of files to download specified by year-month
# Add or remove entries here based on which months you need
files_to_download = ['2023-01', '2023-02']

# --- Download Files ---

# Create the download directory if it doesn't exist
os.makedirs(download_dir, exist_ok=True)
print(f"Ensured directory '{download_dir}' exists.")

# Loop through the list and download each file if it doesn't exist
downloaded_local_paths = {}

for year_month in files_to_download:
    # Construct the filename and full local path
    file_name = f'yellow_tripdata_{year_month}.parquet'
    local_file_path = os.path.join(download_dir, file_name)
    file_url = f'{base_url}{file_name}'

    # Check if the file already exists locally
    if os.path.exists(local_file_path):
        print(f"File already exists: {local_file_path}. Skipping download.")
    else:
        # If file does not exist, download it
        print(f"Downloading {file_name} from {file_url} to {local_file_path}...")
        # Use !wget command to download with output path specified
        !wget {file_url} -O {local_file_path}
        print(f"Downloaded {file_name}.")

    # Store the local path for later loading
    downloaded_local_paths[year_month] = local_file_path
    print(f"Downloaded files: {downloaded_local_paths}")

Ensured directory './data' exists.
Downloading yellow_tripdata_2023-01.parquet from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet to ./data/yellow_tripdata_2023-01.parquet...
--2025-05-09 23:28:42--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.160.226.85, 3.160.226.111, 3.160.226.161, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.160.226.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47673370 (45M) [application/x-www-form-urlencoded]
Saving to: ‘./data/yellow_tripdata_2023-01.parquet’


2025-05-09 23:28:44 (20.9 MB/s) - ‘./data/yellow_tripdata_2023-01.parquet’ saved [47673370/47673370]

Downloaded yellow_tripdata_2023-01.parquet.
Downloaded files: {'2023-01': './data/yellow_tripdata_2023-01.parquet'}
Downloading yellow_tripdata_2023-02.parquet from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
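
The download cell relies on the `!wget` shell escape, which only works inside IPython/Jupyter and needs `wget` installed. As a rough alternative, the same "skip if already present" logic can be sketched with the standard library. The function name `download_if_missing` and the `.part` temporary suffix are illustrative choices, not part of the original notebook, and this sketch has not been verified against the CDN used above (some servers reject the default `urllib` user agent).

```python
import os
import urllib.request


def download_if_missing(file_url: str, local_path: str) -> bool:
    """Download file_url to local_path unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    (Hypothetical helper for illustration only.)
    """
    if os.path.exists(local_path):
        return False

    # Make sure the target directory exists
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)

    # Download to a temporary name first, so an interrupted transfer
    # does not leave a partial file that later runs would skip over
    tmp_path = local_path + ".part"
    urllib.request.urlretrieve(file_url, tmp_path)
    os.replace(tmp_path, local_path)
    return True
```

Inside the loop above it could be called as `download_if_missing(file_url, local_file_path)` in place of the `!wget` line.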

In [None]:
# Check the file size and metadata

!ls -lh data/yellow_tripdata_2023-01.parquet

-rw-r--r-- 1 dan dan 46M Mar 20  2023 data/yellow_tripdata_2023-01.parquet


In [5]:
# Load January data
df_jan = pd.read_parquet('data/yellow_tripdata_2023-01.parquet')
# Get the number of columns in the January DataFrame
num_columns_jan = df_jan.shape[1]

print(f"Number of columns in January 2023 data: {num_columns_jan}")

Number of columns in January 2023 data: 19


## Q2. Computing duration

Let's compute each trip's duration in minutes (dropoff time minus pickup time) and calculate its standard deviation.

In [10]:
# Reload the January data so this cell starts from a fresh DataFrame
df_jan = pd.read_parquet('data/yellow_tripdata_2023-01.parquet')

# Compute duration
df_jan['duration'] = df_jan['tpep_dropoff_datetime'] - df_jan['tpep_pickup_datetime']
df_jan['duration'] = df_jan['duration'].dt.total_seconds() / 60

# Calculate standard deviation
std_duration = df_jan['duration'].std()
std_duration

np.float64(42.59435124195458)
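
To make the duration arithmetic concrete, here is a small sketch on a made-up two-row DataFrame (the timestamps are illustrative, not taken from the dataset). It also shows that pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`), so the two can differ.

```python
import numpy as np
import pandas as pd

# Toy data: illustrative timestamps, not real trip records
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 01:00:00']),
    'tpep_dropoff_datetime': pd.to_datetime(['2023-01-01 00:08:30', '2023-01-01 01:20:00']),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() / 60 converts it to minutes as a float
delta = toy['tpep_dropoff_datetime'] - toy['tpep_pickup_datetime']
toy['duration'] = delta.dt.total_seconds() / 60

durations = toy['duration'].tolist()                  # 8.5 and 20.0 minutes
sample_std = toy['duration'].std()                    # pandas default: ddof=1
population_std = np.std(toy['duration'].to_numpy())   # NumPy default: ddof=0
```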

## Q3. Dropping outliers

Let's keep only the records whose duration is between 1 and 60 minutes (inclusive) and compute the fraction of records that remain.

In [11]:
# Filter out outliers
df_filtered = df_jan[(df_jan['duration'] >= 1) & (df_jan['duration'] <= 60)]

# Calculate fraction of remaining records
fraction_remaining = len(df_filtered) / len(df_jan)
fraction_remaining

0.9812202822125979
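
The same inclusive filter can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses made-up durations to show which boundary values survive; taking the mean of the boolean mask gives the fraction kept directly.

```python
import pandas as pd

# Toy durations in minutes (illustrative values only)
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 61.0])

# between() is inclusive on both ends by default,
# so exactly 1.0 and exactly 60.0 are kept
mask = durations.between(1, 60)

# The mean of a boolean mask is the fraction of True values
fraction_kept = mask.mean()
```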

## Q4. One-hot encoding

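Let's prepare the feature matrix using one-hot encoding for pickup and dropoff location IDs. The IDs are cast to strings so that DictVectorizer treats them as categories rather than as numeric values.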

In [12]:
# Select relevant columns
categorical = ['PULocationID', 'DOLocationID']

# Cast the IDs to strings on the fly and convert to a list of dictionaries.
# Assigning back into the filtered slice would trigger pandas'
# SettingWithCopyWarning, so the cast is not stored on df_filtered.
train_dicts = df_filtered[categorical].astype(str).to_dict(orient='records')

# Create and fit DictVectorizer
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# Get feature matrix dimensions
X_train.shape

(3009173, 515)

## Q5. Training a model

Let's train a linear regression model and calculate the RMSE on the training data.

In [14]:
# Prepare target variable
y_train = df_filtered['duration'].values

# Train the model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions
y_pred = lr.predict(X_train)

# Calculate RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred))
rmse_train

np.float64(7.649261936284003)
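
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The sketch below computes it by hand on three made-up numbers and checks that it matches the `np.sqrt(mean_squared_error(...))` expression used in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions in minutes (illustrative values only)
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3, so the squared errors are 4, 4, 9 with mean 17/3
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
```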

## Q6. Evaluating the model

Let's evaluate the model on the validation data (February 2023), applying the same duration filter and reusing the DictVectorizer fitted on the January data.

In [15]:
# Load February data
df_feb = pd.read_parquet('data/yellow_tripdata_2023-02.parquet')

# Compute duration
df_feb['duration'] = df_feb['tpep_dropoff_datetime'] - df_feb['tpep_pickup_datetime']
df_feb['duration'] = df_feb['duration'].dt.total_seconds() / 60

# Filter out outliers
df_feb = df_feb[(df_feb['duration'] >= 1) & (df_feb['duration'] <= 60)]

# Prepare validation data
val_dicts = df_feb[categorical].astype(str).to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_val = df_feb['duration'].values

# Make predictions
y_pred_val = lr.predict(X_val)

# Calculate RMSE
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
rmse_val

np.float64(7.811818654341152)
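
The January and February cells repeat the same preprocessing steps (compute duration, filter to 1 to 60 minutes, cast IDs to strings). One way to keep them in sync is a small helper like the sketch below. The name `prepare_dataframe` and the constant `CATEGORICAL` are hypothetical and not part of the original notebook; the column names are the ones used above.

```python
import pandas as pd

CATEGORICAL = ['PULocationID', 'DOLocationID']


def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Return a preprocessed copy of a Yellow Taxi trips DataFrame.

    Hypothetical helper: adds 'duration' in minutes, keeps trips of
    1 to 60 minutes (inclusive), and casts the location IDs to strings.
    """
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    delta = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
    df['duration'] = delta.dt.total_seconds() / 60

    # Explicit copy after filtering avoids SettingWithCopyWarning below
    df = df[df['duration'].between(1, 60)].copy()

    df[CATEGORICAL] = df[CATEGORICAL].astype(str)
    return df
```

With this in place, both months could be loaded the same way, for example `df_train = prepare_dataframe(pd.read_parquet('data/yellow_tripdata_2023-01.parquet'))`, before building the dictionaries for `dv`.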