https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2024/05-monitoring/homework.md

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.

In [1]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [4]:
def read_data(color, year, month):
    # build the download URL from the arguments (no hard-coded overrides)
    url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/{color}_tripdata_{year:04d}-{month:02d}.parquet'
    df = pd.read_parquet(url)
    print(df.shape)
    return df

In [7]:
def prepare_data(df):
    # create target: trip duration in minutes
    df["duration_min"] = (df.lpep_dropoff_datetime - df.lpep_pickup_datetime).dt.total_seconds() / 60
    
    # filter out outliers
    df = df[(df.duration_min >= 0) & (df.duration_min <= 60)]
    df = df[(df.passenger_count > 0) & (df.passenger_count <= 8)]
    
    return df

# Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

In [5]:
df_mar = read_data('green', 2024, 3)
df_mar.head()

(57457, 20)


| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2024-03-01 00:10:52 | 2024-03-01 00:26:12 | N | 1.0 | 129 | 226 | 1.0 | 1.72 | 12.8 | 1.0 | 0.5 | 3.06 | 0.0 | | 1.0 | 18.36 | 1.0 | 1.0 | 0.0 |
| 1 | 2 | 2024-03-01 00:22:21 | 2024-03-01 00:35:15 | N | 1.0 | 130 | 218 | 1.0 | 3.25 | 17.7 | 1.0 | 0.5 | 0.0 | 0.0 | | 1.0 | 20.2 | 2.0 | 1.0 | 0.0 |
| 2 | 2 | 2024-03-01 00:45:27 | 2024-03-01 01:04:32 | N | 1.0 | 255 | 107 | 2.0 | 4.58 | 23.3 | 1.0 | 0.5 | 3.5 | 0.0 | | 1.0 | 32.05 | 1.0 | 1.0 | 2.75 |
| 3 | 1 | 2024-03-01 00:02:00 | 2024-03-01 00:23:45 | N | 1.0 | 181 | 71 | 1.0 | 0.0 | 22.5 | 0.0 | 1.5 | 0.0 | 0.0 | | 1.0 | 24.0 | 1.0 | 1.0 | 0.0 |
| 4 | 2 | 2024-03-01 00:16:45 | 2024-03-01 00:23:25 | N | 1.0 | 95 | 135 | 1.0 | 1.15 | 8.6 | 1.0 | 0.5 | 1.0 | 0.0 | | 1.0 | 12.1 | 1.0 | 1.0 | 0.0 |


Answer: __57457__

# Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the "fare_amount" column (quantile=0.5).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

Answer: __ColumnCorrelationsMetric__

# Q3. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2024).

What is the maximum value of metric quantile = 0.5 on the "fare_amount" column during March 2024 (calculated daily)?
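
As a rough, dependency-free sketch of what this question asks: group the fares by pickup date, take the median (the 0.5 quantile) of each day, and report the largest daily value. The helper name `max_daily_median` and the toy rows are hypothetical and only for illustration; the homework itself computes the daily value with Evidently's `ColumnQuantileMetric` on the real March 2024 data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def max_daily_median(rows):
    """Return (day, median_fare) for the day with the highest median fare.

    `rows` is an iterable of (pickup_datetime, fare_amount) pairs.
    """
    fares_by_day = defaultdict(list)
    for pickup, fare in rows:
        fares_by_day[pickup.date()].append(fare)
    # The 0.5 quantile with linear interpolation is the median.
    daily = {day: median(fares) for day, fares in fares_by_day.items()}
    best_day = max(daily, key=daily.get)
    return best_day, daily[best_day]


# Toy rows (made up for illustration), not the real taxi data.
toy_rows = [
    (datetime(2024, 3, 1, 0, 10), 12.8),
    (datetime(2024, 3, 1, 0, 22), 17.7),
    (datetime(2024, 3, 1, 0, 45), 23.3),
    (datetime(2024, 3, 2, 9, 5), 8.6),
    (datetime(2024, 3, 2, 9, 30), 22.5),
]
best_day, best_median = max_daily_median(toy_rows)
print(best_day, best_median)
```

With pandas, the same idea would presumably be a group-by on the pickup date followed by a quantile and a max; the plain-Python version just makes each step explicit.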

In [8]:
df_mar = prepare_data(df_mar)
print(df_mar.shape)

(54135, 21)


In [9]:
# data labeling
target = "duration_min"
num_features = ["passenger_count", "trip_distance", "fare_amount", "total_amount"]
cat_features = ["PULocationID", "DOLocationID"]

In [15]:
# positional split: first 30,000 rows for training, the rest for validation
split_idx = 30_000
train_data = df_mar[:split_idx].copy()
val_data = df_mar[split_idx:].copy()

In [16]:
model = LinearRegression()
model.fit(train_data[num_features + cat_features], train_data[target])

In [17]:
train_preds = model.predict(train_data[num_features + cat_features])
train_data['prediction'] = train_preds

In [18]:
val_preds = model.predict(val_data[num_features + cat_features])
val_data['prediction'] = val_preds

In [21]:
print(mean_absolute_error(train_data.duration_min, train_data.prediction))
print(mean_absolute_error(val_data.duration_min, val_data.prediction))

3.772473239359444
3.716814567929365
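
For reference, the two numbers printed above are mean absolute errors: the average of |actual - predicted| over all rows, in the same unit as the target (minutes here). A minimal plain-Python sketch of that calculation, using made-up values and a hypothetical helper name `mae` (the notebook itself uses scikit-learn's `mean_absolute_error`):

```python
def mae(y_true, y_pred):
    """Mean absolute error: average absolute gap between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up durations in minutes, not values from the taxi data.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 17.0, 30.0]
print(mae(actual, predicted))  # (2 + 3 + 0) / 3
```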


In [22]:
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump(model, f_out)
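
A later monitoring step would typically need to read this file back. Below is a minimal sketch of the save/load round trip, assuming a stand-in dictionary in place of the fitted model and a temporary directory in place of `models/` (both are illustrative choices, not part of the homework). It also shows that the target folder has to exist before `open(..., 'wb')` is called.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object is handled the same way.
stand_in_model = {"coef": [0.5, 1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    os.makedirs(model_dir, exist_ok=True)  # open() does not create missing folders
    model_path = os.path.join(model_dir, "lin_reg.bin")

    # Save the object in binary mode.
    with open(model_path, "wb") as f_out:
        pickle.dump(stand_in_model, f_out)

    # Load it back, also in binary mode.
    with open(model_path, "rb") as f_in:
        loaded_model = pickle.load(f_in)

print(loaded_model == stand_in_model)
```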

In [24]:
val_data.to_parquet('data/reference.parquet')