## Homework

In this homework, we’ll deploy the ride duration model in batch mode.
Like in homework 1, we’ll use the Yellow Taxi Trip Records dataset.

You’ll find the starter code in the [homework](homework) directory.



In [None]:
import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

#sklearn
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

## Q1. Notebook

We’ll start with the same notebook we ended up with in homework 1. We
cleaned it a little bit and kept only the scoring part. You can find the
initial notebook [here](homework/starter.ipynb).

Run this notebook for the February 2022 data.

What’s the standard deviation of the predicted duration for this
dataset?

-   5.28
-   10.28
-   15.28
-   20.28
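
When comparing your number with the options, keep in mind that there are two common definitions of the standard deviation. NumPy's `ndarray.std()` (what `y_pred.std()` calls on a NumPy array) defaults to the population formula (`ddof=0`), while pandas' `Series.std()` defaults to the sample formula (`ddof=1`). The sketch below shows the difference using only the standard library. The values are made up for illustration and are not real model output.

```python
import statistics

# Made-up predicted durations in minutes (illustrative only, not real model output).
predictions = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation: divides by n (NumPy's default, ddof=0).
population_std = statistics.pstdev(predictions)

# Sample standard deviation: divides by n - 1 (pandas' default, ddof=1).
sample_std = statistics.stdev(predictions)

print(f"population std: {population_std:.3f}")
print(f"sample std:     {sample_std:.3f}")
```

With millions of rides, dividing by `n` or `n - 1` changes the result only marginally, so both definitions should point to the same option.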



In [None]:
def fetch(dataset_url: str) -> pd.DataFrame:
    """Download one month of trip records from a parquet URL."""
    print(dataset_url)
    return pd.read_parquet(dataset_url)

In [None]:
def save_to_file(path):
    # Relies on the notebook-level globals `df`, `months`, and `years`.
    path = Path(f"{path}/trips_data_{months[0]:02}-{years[0]}_{months[-1]:02}-{years[-1]}.parquet")
    # Create the target directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path = path.as_posix()
    df.to_parquet(path, compression="gzip")
    print(f"data saved to file: {path}")

In [None]:
colors = ["green"]
months = [3]
years = [2023]

In [None]:
frames = []
for color in colors:
    for month in months:
        for year in years:
            url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{color}_tripdata_{year}-{month:02}.parquet"
            frames.append(fetch(url))

# Concatenate once instead of growing the dataframe inside the loop.
df = pd.concat(frames, ignore_index=True)

save_to_file("data")


In [None]:
df = pd.read_parquet(f"data/trips_data_{months[0]:02}-{years[0]}_{months[-1]:02}-{years[-1]}.parquet")

#### Q2.

In [None]:
df.describe()



## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

`ColumnQuantileMetric(column_name='fare_amount', quantile=0.5)`

Signature: `class ColumnQuantileMetric(column_name: str, quantile: float)`
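
For intuition, the 0.5 quantile is simply the median of the column. The sketch below is independent of Evidently and uses only the standard library. The fare values are made up for illustration.

```python
import statistics

# Made-up fare amounts in dollars (illustrative only).
fare_amounts = [7.0, 9.5, 12.0, 14.5, 52.0]

# The 0.5 quantile is the median: half of the values lie at or below it.
median_fare = statistics.median(fare_amounts)
mean_fare = statistics.mean(fare_amounts)

print(median_fare)  # 12.0
print(mean_fare)    # 19.0
```

The single large fare pulls the mean up to 19.0 but leaves the median at 12.0, which is one reason a quantile is a convenient value to monitor for a skewed column such as `fare_amount`.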


## Q3. Prefect flow

Let’s update the Prefect tasks by giving them meaningful names and specifying the number of retries and the delay between them.

Hint: use the `evidently_metrics_calculation.py` script as a starting point to implement your solution. Check the Prefect docs for the available task parameters.

What is the correct way of doing that?

* `@task(retries_num=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries_num=2, retry_delay_seconds=5, name="calculate metrics")`
* `@task(retries=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`     *


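Solution: `retries`, `retry_delay_seconds`, and `name` are the parameter names accepted by `@task`.
See the Prefect documentation on [tasks](https://docs.prefect.io/2.10.18/concepts/tasks/).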
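
To make the meaning of `retries=2` and `retry_delay_seconds=5` concrete, here is a minimal standard-library sketch of retry behaviour. It is not Prefect's implementation: `with_retries` and `flaky_metrics_calculation` are hypothetical names invented for this illustration, and the delay is set to 0 so the example finishes instantly.

```python
import functools
import time


def with_retries(retries: int, retry_delay_seconds: float):
    """Hypothetical decorator: re-run a function up to `retries` extra times on failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):  # first try plus `retries` retries
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries: surface the last error
                    time.sleep(retry_delay_seconds)  # wait before trying again

        return wrapper

    return decorator


attempts = []


@with_retries(retries=2, retry_delay_seconds=0)
def flaky_metrics_calculation():
    """Fails twice, then succeeds (stand-in for a task that hits a flaky database)."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("database not ready")
    return "metrics calculated"


result = flaky_metrics_calculation()
print(result, "after", len(attempts), "attempts")
```

With two retries, the function is attempted at most three times in total: the initial run plus two retries.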


## Q4. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2023).

What is the maximum value of the `quantile = 0.5` metric on the `"fare_amount"` column during March 2023 (calculated daily)?

* 10
* 12.5
* 14   *
* 14.8
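
"Calculated daily" means the metric is computed once per calendar day, and the question asks for the largest of those daily values. A minimal standard-library sketch of that aggregation follows. The rides are made up, so the printed number is not the homework answer.

```python
import statistics
from collections import defaultdict
from datetime import datetime

# Made-up (pickup time, fare amount) pairs, illustrative only.
rides = [
    (datetime(2023, 3, 1, 8, 15), 10.0),
    (datetime(2023, 3, 1, 9, 40), 14.0),
    (datetime(2023, 3, 1, 18, 5), 12.0),
    (datetime(2023, 3, 2, 7, 30), 13.5),
    (datetime(2023, 3, 2, 22, 10), 15.5),
]

# Group the fares by calendar day of the pickup time.
fares_by_day = defaultdict(list)
for pickup_time, fare in rides:
    fares_by_day[pickup_time.date()].append(fare)

# Median (0.5 quantile) per day, then the maximum over all days.
daily_median = {day: statistics.median(fares) for day, fares in fares_by_day.items()}
max_daily_median = max(daily_median.values())

print(max_daily_median)  # 14.5
```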

## Q5. Dashboard


Finally, let’s add panels with the newly added metrics to the dashboard. After we customize the dashboard, let’s save the dashboard config so that we can access it later. Hint: click on “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)    *
* `project_folder/dashboards`  (05-monitoring/dashboards)
* `project_folder/data`  (05-monitoring/data)

Solution:
The YAML config files are saved under `config`. The JSON files for dashboards and panels are saved under `dashboards`.

## Submit the results

* Submit your results here: https://forms.gle/PJaYeWsnWShAEBF79
* You can submit your solution multiple times. In this case, only the last submission will be used
* If your answer doesn't match options exactly, select the closest one


## Deadline

The deadline for submitting is 7 July (Friday), 23:00 CEST (Berlin time).

After that, the form will be closed.


In [None]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [None]:
df = pd.read_parquet(f"data/trips_data_{months[0]:02}-{years[0]}_{months[-1]:02}-{years[-1]}.parquet")

categorical = ['PULocationID', 'DOLocationID']

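# Yellow Taxi files name these columns tpep_*; Green Taxi files use lpep_* instead.
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
# Convert the timedelta to minutes.
df['duration'] = df.duration.dt.total_seconds() / 60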

df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

In [None]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

print(y_pred)

print(y_pred.std())


## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the
output.

First, let’s create an artificial `ride_id` column:

``` python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:

``` python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What’s the size of the output file?

-   28M
-   38M
-   48M
-   58M

**Note:** Make sure you use the snippet above for saving the file. It
should contain only these two columns. For this question, don’t change
the dtypes of the columns and use pyarrow, not fastparquet.
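
The `ride_id` expression above concatenates a zero-padded `year/month_` prefix with each row's index. The sketch below reproduces the formatting in plain Python. `make_ride_ids` is a hypothetical helper, not part of the starter code.

```python
def make_ride_ids(year: int, month: int, row_indices) -> list:
    """Build ride ids of the form 'YYYY/MM_<row index>' (hypothetical helper)."""
    prefix = f"{year:04d}/{month:02d}_"
    return [prefix + str(index) for index in row_indices]


# Row indices can have gaps because rows were filtered out by duration earlier.
ride_ids = make_ride_ids(2022, 2, [0, 1, 5])
print(ride_ids)  # ['2022/02_0', '2022/02_1', '2022/02_5']
```

Because the ids are strings, they are stored as text in the parquet file and can make up a large part of the output size.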



In [None]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [None]:
df_result = pd.DataFrame()
df_result['ride_id'] = df['ride_id']
df_result['predicted_duration'] = y_pred

df_result.to_parquet(
    "output.parquet",
    engine='pyarrow',
    compression=None,
    index=False
)

In [None]:
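import os

file_stats = os.stat("output.parquet")

# st_size is in bytes; also print it in megabytes to compare with the options.
print(file_stats.st_size, "bytes")
print(f"{file_stats.st_size / 1024 ** 2:.1f} MB")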

## Q3. Creating the scoring script

Now let’s turn the notebook into a script.

Which command do you need to execute for that?

    jupyter nbconvert --to script starter.ipynb

## Q4. Virtual environment

Now let’s put everything into a virtual environment. We’ll use pipenv
for that.

Install all the required libraries. Pay attention to the Scikit-Learn
version: it should be `scikit-learn==1.2.2`.

After installing the libraries, pipenv creates two files: `Pipfile` and
`Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What’s the first hash for the Scikit-Learn dependency?
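
`Pipfile.lock` is a JSON file. Packages installed with a plain `pipenv install` are listed under its `default` key, each with a `hashes` list, so you can read the first hash programmatically instead of searching the file by eye. The sketch below assumes that layout. `first_hash` is a hypothetical helper, and the demo uses a tiny stand-in lock file with placeholder values rather than real hashes.

```python
import json
import tempfile
from pathlib import Path


def first_hash(lock_path, package: str) -> str:
    """Return the first hash listed for `package` in the lock file's default section."""
    lock = json.loads(Path(lock_path).read_text())
    return lock["default"][package]["hashes"][0]


# Stand-in lock file: the hash strings are placeholders, not real hashes.
sample_lock = {
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:placeholder-1", "sha256:placeholder-2"],
            "version": "==1.2.2",
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "Pipfile.lock"
    sample_path.write_text(json.dumps(sample_lock))
    demo_hash = first_hash(sample_path, "scikit-learn")

print(demo_hash)
```

Once pipenv has generated the real lock file, `first_hash("Pipfile.lock", "scikit-learn")` would return the value the question asks for.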



In [None]:
# Shell commands: run them in a terminal, or prefix each line with `!` in a notebook cell.
pip install pipenv
pipenv --version
pipenv --python=3.9
pipenv install scikit-learn==1.2.2 pandas

## Q5. Parametrize the script

Let’s now make the script configurable via CLI. We’ll create two
parameters: year and month.

Run the script for March 2022.

What’s the mean predicted duration?

-   7.76
-   12.76
-   17.76
-   22.76

Hint: just add a print statement to your script.
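
One way to add the two parameters is the standard-library `argparse` module. The following is a minimal sketch, not the reference solution: the flag names `--year` and `--month`, the helper names, and the default `yellow` color are assumptions. The URL pattern is the one used for downloading the data earlier in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the year and month from the command line (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Score taxi rides for one month.")
    parser.add_argument("--year", type=int, required=True, help="four-digit year, e.g. 2022")
    parser.add_argument("--month", type=int, required=True, help="month number, 1-12")
    return parser.parse_args(argv)


def input_url(year: int, month: int, color: str = "yellow") -> str:
    """Build the download URL for one month of trip records."""
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"{color}_tripdata_{year:04d}-{month:02d}.parquet"
    )


# Passing an explicit list keeps the demo self-contained;
# in a real script, call parse_args() to read sys.argv instead.
args = parse_args(["--year", "2022", "--month", "3"])
print(input_url(args.year, args.month))
```

Under these assumptions, a script saved as `starter.py` would be invoked as `python starter.py --year 2022 --month 3`.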



## Q6. Docker container

Finally, we’ll package the script in the docker container. For that,
you’ll need to use a base image that we prepared.

This is what it looks like:

    FROM python:3.10.0-slim

    WORKDIR /app
    COPY [ "model2.bin", "model.bin" ]

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to
[`svizor/zoomcamp-model:mlops-3.10.0-slim`](https://hub.docker.com/layers/svizor/zoomcamp-model/mlops-3.10.0-slim/images/sha256-595bf690875f5b9075550b61c609be10f05e6915609ef4ea4ce9797116c99eff?context=repo),
which you should use as your base image.

That is, this is how your Dockerfile should start:

``` docker
FROM svizor/zoomcamp-model:mlops-3.10.0-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer and a
model. You will need to use them.

Important: don’t copy the model to the docker image. You will need to
use the pickle file already in the image.

Now run the script with docker. What’s the mean predicted duration for
April 2022?

-   7.92
-   12.83
-   17.92
-   22.83

## Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the docker image doesn’t seem
very practical. Typically, after creating the output file, we upload it
to the cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.

## Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

``` bash
docker build -t mlops-zoomcamp-model:v1 .
docker tag mlops-zoomcamp-model:v1 svizor/zoomcamp-model:mlops-3.10.0-slim
docker push svizor/zoomcamp-model:mlops-3.10.0-slim
```