# Homework

In [2]:
from pathlib import Path
import pickle
import pandas as pd
import numpy as np

In [3]:
path_data = Path("../../data")
assert path_data.exists()

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1. We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2025/04-deployment/homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

- 1.24
- 6.24
- 12.28
- 18.28

In [4]:
!uv pip freeze | grep scikit-learn

Using Python 3.10.13 environment at: /home/calmscout/Projects/PythonProjects/mlops-zoomcamp-2025/.venv
scikit-learn==1.5.0


In [5]:
!python -V

Python 3.10.13


In [6]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [7]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [9]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

In [10]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [11]:
np.std(y_pred)

np.float64(6.247488852238703)

❓: `6.24`

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

In [12]:
year = 2023
month = 3
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
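To make the id layout explicit, here is a minimal pure-Python sketch of the same formatting applied to a single row. The helper name `make_ride_id` is hypothetical and only illustrates the f-string used above.

```python
def make_ride_id(year: int, month: int, row_index: int) -> str:
    """Build one ride id in the 'YYYY/MM_index' layout used for the column above."""
    return f"{year:04d}/{month:02d}_{row_index}"


print(make_ride_id(2023, 3, 0))   # 2023/03_0
print(make_ride_id(2023, 3, 42))  # 2023/03_42
```

Because `read_data` filters rows without resetting the index, the index part of the ids is not necessarily contiguous.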

Next, write the ride id and the predictions to a dataframe with results.

In [13]:
df_result = pd.DataFrame()
df_result['ride_id'] = df['ride_id']
df_result['predicted_duration'] = y_pred

Save it as parquet:

In [14]:
df_result.to_parquet(
    'predictions.parquet',
    engine='pyarrow',
    compression=None,
    index=False
)

What's the size of the output file?

- 36M
- 46M
- 56M
- 66M

**Note**: Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use `pyarrow`, not `fastparquet`.

In [15]:
!du -h predictions.parquet

66M	predictions.parquet


❓: `66M`
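If `du` is not available (for example on Windows), the size can also be checked from Python. This is a minimal sketch using only the standard library; the helper name `size_in_mib` is hypothetical. Note that `du` reports disk usage rather than the exact byte count, so the two figures may differ slightly.

```python
from pathlib import Path


def size_in_mib(path: str) -> float:
    """Return the file size in mebibytes (1 MiB = 2**20 bytes)."""
    return Path(path).stat().st_size / 2**20


# Only report the size if the file from the previous step is present.
if Path("predictions.parquet").exists():
    print(f"{size_in_mib('predictions.parquet'):.1f} MiB")
```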

## Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?

```bash
jupyter nbconvert --to script your_notebook.ipynb
```

In [16]:
!jupyter nbconvert --to script 04.ipynb --TemplateExporter.exclude_markdown=True

[NbConvertApp] Converting notebook 04.ipynb to script
[NbConvertApp] Writing 1621 bytes to 04.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: `Pipfile` and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

```bash
❯ grep -A 5 '"scikit-learn"' Pipfile.lock
        "scikit-learn": {
            "hashes": [
                "sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c",
                "sha256:118a8d229a41158c9f90093e46b3737120a165181a1b58c03461447aa4657415",
                "sha256:12e40ac48555e6b551f0a0a5743cc94cc5a765c9513fe708e01f0aa001da2801",
                "sha256:174beb56e3e881c90424e21f576fa69c4ffcf5174632a79ab4461c4c960315ac",
```

❓: `sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c`
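The same value can be read programmatically, since `Pipfile.lock` is a JSON file. The sketch below assumes the package is listed under the `default` section (regular, non-dev dependencies); the helper name `first_hash` is hypothetical.

```python
import json
import os


def first_hash(lock_path: str, package: str, section: str = "default") -> str:
    """Return the first hash recorded for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    return lock[section][package]["hashes"][0]


# Only query the lock file if it exists in the current directory.
if os.path.exists("Pipfile.lock"):
    print(first_hash("Pipfile.lock", "scikit-learn"))
```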

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

- 7.29
- 14.29
- 21.29
- 28.29

Hint: just add a print statement to your script.
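One possible way to add the two parameters is `argparse` from the standard library. This is a minimal sketch, not the course's reference solution: the function names `parse_args` and `input_url` are hypothetical, and the URL pattern is taken from the March 2023 link used earlier in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse --year and --month from the command line (or from an explicit list)."""
    parser = argparse.ArgumentParser(description="Score one month of yellow taxi trips.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    return parser.parse_args(argv)


def input_url(year: int, month: int) -> str:
    """Build the parquet URL for the given year and month."""
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{year:04d}-{month:02d}.parquet"
    )


# Example with an explicit argument list; in a script, call parse_args() instead.
args = parse_args(["--year", "2023", "--month", "4"])
print(input_url(args.year, args.month))
```

In the script itself, the resulting URL would be passed to `read_data`, followed by a `print` of `y_pred.mean()`.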

```bash
❯ python 04.py --year 2023 --month 4
Mean predicted duration: 14.292282936862449
```

❓: `14.29`

## Q6. Docker container

Finally, we'll package the script in a Docker container. For that, you'll need to use a base image that we prepared.

This is the Dockerfile that was used to build that image:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to build this image yourself. We have already done it.

It is pushed to [agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo), which you need to use as your base image.

That is, your Dockerfile should start with:

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them.

Important: don't copy the model to the Docker image. You will need to use the pickle file already in the image.

Now run the script with Docker. What's the mean predicted duration for May 2023?

- 0.19
- 7.24
- 14.24
- 21.19

### 🧱 Step 1: Prepare Your Files

Your directory should have:
```
.
├── Dockerfile
├── predict.py         ← your CLI Python script
└── requirements.txt   ← for Python deps (like pandas, scikit-learn)
```

### 📜 Step 2: requirements.txt

```
pandas
scikit-learn==1.5.0
pyarrow
```


### 🐳 Step 3: Dockerfile

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

WORKDIR /app

# Copy your script and dependencies
COPY predict.py .
COPY requirements.txt .

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Run prediction script by default (can be overridden at runtime)
ENTRYPOINT ["python", "predict.py"]
```

### 🏗️ Step 4: Build the Docker Image

```bash
docker build -t duration-predictor .
```

### 🚀 Step 5: Run the Container

We're running the script inside Docker for May 2023:

```bash
❯ docker run duration-predictor --year 2023 --month 5

Mean predicted duration: 0.19174419265916945
```

❓: `0.19`