In [1]:
import pickle
import pandas as pd
import sklearn

In [2]:
sklearn.__version__

'1.0.2'

In [3]:
!python -V

Python 3.9.12


In [7]:
!pip install --upgrade scikit-learn==1.5.0

Collecting scikit-learn==1.5.0
  Downloading scikit_learn-1.5.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
  Attempting uninstall: threadpoolctl
    Found existing installation: threadpoolctl 2.2.0
    Uninstalling threadpoolctl-2.2.0:
      Successfully uninstalled threadpoolctl-2.2.0
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.0
    Uninstalling joblib-1.1.0:
      Successfully uninstalled joblib-1.1.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninst

In [8]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.5.0
Summary: A set of python modules for machine learning and data mining
Home-page: https://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /home/codespace/anaconda3/lib/python3.9/site-packages
Requires: numpy, threadpoolctl, scipy, joblib
Required-by: scikit-learn-intelex


In [9]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [10]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [11]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

In [12]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [13]:
y_pred

array([16.24590642, 26.1347962 , 11.88426424, ..., 11.59533603,
       13.11317847, 12.89999218])

In [14]:
y_pred.std()

6.247488852238703

Q1. Notebook
We'll start with the same notebook we ended up with in homework 1. We cleaned it a little bit and kept only the scoring part. You can find the initial notebook here.

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

1.24
6.24
12.28
18.28

![Screenshot%202024-06-28%20115405.png](attachment:Screenshot%202024-06-28%20115405.png)

Answer: 6.247

In [16]:
year = 2023
month = 3

input_file = f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year:04d}-{month:02d}.parquet'
output_file = f'output/yellow_tripdata_{year:04d}-{month:02d}.parquet'

In [17]:
!mkdir -p output

In [18]:
df = read_data(input_file)
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [19]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration,ride_id
0,2,2023-03-01 00:06:43,2023-03-01 00:16:43,1.0,0.0,1.0,N,238,42,2,...,1.0,0.5,0.0,0.0,1.0,11.1,0.0,0.0,10.0,2023/03_0
1,2,2023-03-01 00:08:25,2023-03-01 00:39:30,2.0,12.4,1.0,N,138,231,1,...,6.0,0.5,12.54,0.0,1.0,76.49,2.5,1.25,31.083333,2023/03_1
2,1,2023-03-01 00:15:04,2023-03-01 00:29:26,0.0,3.3,1.0,N,140,186,1,...,3.5,0.5,4.65,0.0,1.0,28.05,2.5,0.0,14.366667,2023/03_2
3,1,2023-03-01 00:49:37,2023-03-01 01:01:05,1.0,2.9,1.0,N,140,43,1,...,3.5,0.5,4.1,0.0,1.0,24.7,2.5,0.0,11.466667,2023/03_3
4,2,2023-03-01 00:08:04,2023-03-01 00:11:06,1.0,1.23,1.0,N,79,137,1,...,1.0,0.5,2.44,0.0,1.0,14.64,2.5,0.0,3.033333,2023/03_4


In [20]:
df_result = pd.DataFrame()
df_result['ride_id'] = df['ride_id']
df_result['predicted_duration'] = y_pred

In [21]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [22]:
!ls -lh output

total 66M
-rw-rw-rw- 1 codespace codespace 66M Jun 28 06:30 yellow_tripdata_2023-03.parquet


Q2. Preparing the output
Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial ride_id column:

df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

Next, write the ride ID and the predictions to a dataframe with the results.

Save it as parquet:

df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

What's the size of the output file?

36M
46M
56M
66M

Note: Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use pyarrow, not fastparquet.

![Screenshot%202024-06-28%20120107.png](attachment:Screenshot%202024-06-28%20120107.png)

Answer: 66M
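
The size can also be checked from Python instead of `ls -lh`. The cell below is a minimal sketch: `size_in_mib` is a hypothetical helper name (not part of the homework), and it is demonstrated on a throwaway temporary file rather than on the real output. It divides by 1024 * 1024 because `ls -h` reports sizes in powers of 1024.

```python
import os
import tempfile


def size_in_mib(path):
    """Return the size of the file at `path` in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Demo on a temporary 2 MiB file so the cell runs without the real output file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    demo_path = tmp.name

demo_size = size_in_mib(demo_path)
print(f"{demo_size:.1f} MiB")

# Clean up the temporary file.
os.remove(demo_path)
```

For the homework file, the same helper would be called as `size_in_mib(output_file)`, assuming `output_file` is defined as in the cells above.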

Q3. Creating the scoring script
Now let's turn the notebook into a script.

Which command do you need to execute for that?

In [25]:
!jupyter nbconvert --to script 04-homework.ipynb

[NbConvertApp] Converting notebook 04-homework.ipynb to script
[NbConvertApp] Writing 3413 bytes to 04-homework.py


![Screenshot%202024-06-28%20120808.png](attachment:Screenshot%202024-06-28%20120808.png)

Answer: jupyter nbconvert --to script 04-homework.ipynb

Q4. Virtual environment
Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

![Screenshot%202024-07-01%20102320.png](attachment:Screenshot%202024-07-01%20102320.png)

Answer: sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c
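
Pipfile.lock is a JSON file, so the hash can also be read programmatically rather than by eye. The sketch below assumes the usual layout, where runtime packages sit under the `default` section and each entry carries a `hashes` list. `first_hash` is a hypothetical helper, and the hash strings in the sample are placeholders, not real values.

```python
import json


def first_hash(lock, package, section="default"):
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock[section][package]["hashes"][0]


# Illustrative fragment only: the hash values below are placeholders.
sample_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:aaaa", "sha256:bbbb"],
      "version": "==1.5.0"
    }
  }
}
""")

print(first_hash(sample_lock, "scikit-learn"))
```

With a real lock file, `sample_lock` would be replaced by the result of `json.load` on an opened `Pipfile.lock`.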

Q5. Parametrize the script
Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

7.29
14.29
21.29
28.29

Hint: just add a print statement to your script.

![Screenshot%202024-07-01%20103824.png](attachment:Screenshot%202024-07-01%20103824.png)

Answer: 14.29
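
One way to expose `year` and `month` on the command line is `argparse`. This is a sketch under assumptions, not the exact script used for the answer: `parse_args` and `build_paths` are hypothetical names, and positional arguments are only one possible design (options such as `--year` would work equally well). The URL and output path patterns are the ones used earlier in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse year and month; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Score yellow taxi trips for one month.")
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2023")
    parser.add_argument("month", type=int, choices=range(1, 13), help="month number, 1-12")
    return parser.parse_args(argv)


def build_paths(year, month):
    """Build the input URL and output path for the given year and month."""
    input_file = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{year:04d}-{month:02d}.parquet"
    )
    output_file = f"output/yellow_tripdata_{year:04d}-{month:02d}.parquet"
    return input_file, output_file


# Explicit argument list for demonstration; a real script would call parse_args().
args = parse_args(["2023", "4"])
input_file, output_file = build_paths(args.year, args.month)
print(input_file)
print(output_file)
```

In the converted script, the rest of the scoring code would follow, ending with something like `print(y_pred.mean())` as the hint suggests.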

Q6. Docker container
Finally, we'll package the script in a Docker container. For that, you'll need to use a base image that we prepared.

This is the content of that image:

FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]

Note: you don't need to run it. We have already done it.

It is pushed to agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim, which you need to use as your base image.

That is, your Dockerfile should start with:

FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here

This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them.

Important: don't copy the model to the Docker image. You will need to use the pickle file already in the image.

Now run the script with Docker. What's the mean predicted duration for May 2023?

0.19
7.24
14.24
21.19

![Screenshot%202024-07-01%20110520.png](attachment:Screenshot%202024-07-01%20110520.png)

Answer: 0.19