
## Homework

The goal of this homework is to get familiar with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.




In [1]:
import requests
import datetime
import pandas as pd

from evidently import DataDefinition
from evidently import Dataset
from evidently import Report
from evidently.metrics import ValueDrift, DriftedColumnsCount, MissingValueCount, QuantileValue

from joblib import load, dump
from tqdm import tqdm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

In [2]:
!mkdir -p data



## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate a production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* **57457**
* 54396

In [3]:
files = [('green_tripdata_2024-03.parquet', './data')]

print("Download files:")
for file, path in files:
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp = requests.get(url, stream=True)
    save_path = f"{path}/{file}"
    with open(save_path, "wb") as handle:
        for data in tqdm(resp.iter_content(),
                         desc=f"{file}",
                         postfix=f"save to {save_path}",
                         total=int(resp.headers["Content-Length"])):
            handle.write(data)

Download files:


green_tripdata_2024-03.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1372372/1372372 [00:08<00:00, 167684.60it/s, save to ./data/green_tripdata_2024-03.parquet]
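
The download loop calls `iter_content()` with its default chunk size of one byte, which is why the progress bar counts 1,372,372 iterations. A minimal sketch of a chunked alternative is shown here; the helper name `save_chunks` and the 64 KiB chunk size are illustrative assumptions, not part of the homework.

```python
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], save_path: str) -> int:
    """Write an iterable of byte chunks to save_path and return the bytes written."""
    written = 0
    # Create the target directory if it does not exist yet.
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    with open(save_path, "wb") as handle:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += handle.write(chunk)
    return written


# Hypothetical usage with requests (not executed here, needs network access):
#   resp = requests.get(url, stream=True)
#   resp.raise_for_status()
#   save_chunks(resp.iter_content(chunk_size=64 * 1024), save_path)
```

Passing a larger `chunk_size` to `iter_content` reduces the number of Python-level iterations and write calls; the file contents are the same.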


In [4]:
march_data = pd.read_parquet('data/green_tripdata_2024-03.parquet')

In [5]:
march_data.shape

(57457, 20)




## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

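Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`). This notebook uses `QuantileValue`, imported above from `evidently.metrics`, for the same purpose.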

What metric did you choose?

**I chose the `QuantileValue` metric on the `fare_amount` column with `quantile=0.5`.**


In [6]:
report = Report(metrics=[
    ValueDrift(column='prediction'),
    DriftedColumnsCount(),
    MissingValueCount(column='prediction'),
    QuantileValue(column="fare_amount", quantile=0.5),
])
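
For intuition, the 0.5 quantile is the median of the column. The sketch below checks this with plain pandas on a few made-up fares; the values are illustrative and do not come from the taxi data.

```python
import pandas as pd

# Toy fares, purely illustrative (not taken from the taxi data).
fares = pd.Series([7.2, 12.8, 13.5, 14.2, 40.0])

# The 0.5 quantile is the middle value of the sorted series.
q50 = fares.quantile(0.5)

# It coincides with the median and is not pulled up by the 40.0 outlier.
assert q50 == fares.median()
print(q50)
```

Because the median ignores how extreme the largest fares are, it tends to be a steadier daily signal than the mean for a skewed column such as `fare_amount`.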


## Q3. Monitoring

Let’s start monitoring. Run the expanded monitoring for a new batch of data (March 2024).

What is the maximum value of the `quantile = 0.5` metric on the `"fare_amount"` column during March 2024 (calculated daily)?

* 10
* 12.5
* **14.2**
* 14.8



In [7]:
# create target: trip duration in minutes
march_data["duration_min"] = (
    march_data.lpep_dropoff_datetime - march_data.lpep_pickup_datetime
).dt.total_seconds() / 60

In [8]:
# filter out outliers
march_data = march_data[(march_data.duration_min >= 0) & (march_data.duration_min <= 60)]
march_data = march_data[(march_data.passenger_count > 0) & (march_data.passenger_count <= 8)]

In [9]:
# data labeling
num_features = ["passenger_count", "trip_distance", "fare_amount", "total_amount"]
cat_features = ["PULocationID", "DOLocationID"]

In [10]:
data_definition = DataDefinition(
    numerical_columns=num_features + ['prediction'],
    categorical_columns=cat_features
)

In [11]:
q50_results = []

for day in range(1, 32):
    day_start = datetime.datetime(2024, 3, day)
    day_end = day_start + datetime.timedelta(days=1)

    day_data = march_data[
        (march_data.lpep_pickup_datetime >= day_start) &
        (march_data.lpep_pickup_datetime < day_end)
    ]

    if len(day_data) == 0:
        continue

    dataset = Dataset.from_pandas(day_data, data_definition=data_definition)

    report = Report(metrics=[
        QuantileValue(column="fare_amount", quantile=0.5)
    ])

    result = report.run(current_data=dataset).dict()
    q50 = result['metrics'][0]['value']
    q50_results.append(q50)

    print(f"{day_start.date()} -> Q50 = {q50}")

2024-03-01 -> Q50 = 13.5
2024-03-02 -> Q50 = 12.8
2024-03-03 -> Q50 = 14.2
2024-03-04 -> Q50 = 12.8
2024-03-05 -> Q50 = 12.8
2024-03-06 -> Q50 = 12.8
2024-03-07 -> Q50 = 13.5
2024-03-08 -> Q50 = 12.8
2024-03-09 -> Q50 = 13.5
2024-03-10 -> Q50 = 14.2
2024-03-11 -> Q50 = 12.8
2024-03-12 -> Q50 = 12.8
2024-03-13 -> Q50 = 13.5
2024-03-14 -> Q50 = 14.2
2024-03-15 -> Q50 = 13.5
2024-03-16 -> Q50 = 13.5
2024-03-17 -> Q50 = 13.5
2024-03-18 -> Q50 = 12.8
2024-03-19 -> Q50 = 13.5
2024-03-20 -> Q50 = 12.8
2024-03-21 -> Q50 = 13.5
2024-03-22 -> Q50 = 12.8
2024-03-23 -> Q50 = 12.8
2024-03-24 -> Q50 = 14.015
2024-03-25 -> Q50 = 13.5
2024-03-26 -> Q50 = 13.5
2024-03-27 -> Q50 = 12.8
2024-03-28 -> Q50 = 13.25
2024-03-29 -> Q50 = 12.8
2024-03-30 -> Q50 = 14.0
2024-03-31 -> Q50 = 13.5


In [12]:
max_q50 = max(q50_results)
print(f"\n🔎 Max quantile 0.5 of fare_amount in March 2024: {max_q50}")


🔎 Max quantile 0.5 of fare_amount in March 2024: 14.2
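
As a sanity check, the same daily statistic can be approximated with plain pandas. This is a sketch on a synthetic stand-in for `march_data` (the rows are made up); it assumes that linear interpolation between neighbouring values, the pandas default, is an acceptable match for what Evidently reports.

```python
import pandas as pd

# Synthetic stand-in for march_data; values are illustrative only.
toy = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2024-03-01 08:00", "2024-03-01 09:30", "2024-03-01 18:15",
        "2024-03-02 07:45", "2024-03-02 12:00",
    ]),
    "fare_amount": [10.0, 12.8, 20.0, 13.5, 14.9],
})

# Group rows by pickup date and take the 0.5 quantile of fare_amount per day.
daily_q50 = (
    toy.groupby(toy["lpep_pickup_datetime"].dt.date)["fare_amount"]
    .quantile(0.5)
)

# The answer to the question is the largest of the daily medians.
max_q50 = daily_q50.max()
```

On the real data, replacing `toy` with the filtered `march_data` should give values comparable to the per-day output above, which makes it a cheap way to spot a slicing mistake in the monitoring loop.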



## Q4. Dashboard


Finally, let’s add panels for the newly added metrics to the dashboard. After customizing the dashboard, save its config so that we can access it later. Hint: click “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* **`project_folder/config` (05-monitoring/config)**
* `project_folder/dashboards` (05-monitoring/dashboards)
* `project_folder/data` (05-monitoring/data)



In [13]:
!ls config

grafana_dashboards.yaml  grafana_datasources.yaml




## Submit the results

* Submit your answers here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw5