In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.6.1


In [2]:
!python -V


Python 3.12.3


In [3]:
import pickle
import pandas as pd
import numpy as np

In [4]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [5]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [6]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')


In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)


## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?


In [9]:
std_dev = np.std(y_pred)
print(std_dev)

6.247488852238703
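Note that `np.std` computes the population standard deviation by default (`ddof=0`), while pandas' `Series.std` defaults to the sample version (`ddof=1`). On millions of rows the two are nearly identical, but the distinction matters on small inputs. A minimal sketch of the difference, using only the standard library and made-up values rather than the real predictions:

```python
import statistics

# Made-up durations in minutes, for illustration only (not the real predictions).
durations = [10.0, 12.0, 14.0, 16.0, 18.0]

# Population standard deviation: divides by n, like np.std(x).
population_std = statistics.pstdev(durations)

# Sample standard deviation: divides by n - 1, like np.std(x, ddof=1).
sample_std = statistics.stdev(durations)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
```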


## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

In [11]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'Airport_fee', 'duration'],
      dtype='object')

In [12]:
df['prediction'] = y_pred
df['year'] = df.tpep_pickup_datetime.dt.year
df['month'] = df.tpep_pickup_datetime.dt.month
df['ride_id'] = df['year'].astype(str).str.zfill(4) + '/' + df['month'].astype(str).str.zfill(2) + '_' + df.index.astype('str')
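To make the resulting ID format concrete, here is a plain-Python sketch of the same string layout for a single row. The helper name `make_ride_id` is hypothetical and is not part of the notebook; it only mirrors the expression above.

```python
def make_ride_id(year, month, row_index):
    """Build an ID of the form 'YYYY/MM_<index>' for one row.

    Hypothetical helper: it mirrors the zero-padded year and month
    plus the dataframe index used in the pandas expression above.
    """
    return f"{year:04d}/{month:02d}_{row_index}"


# The index value comes from the original dataframe, so after the
# duration filter in read_data the IDs are not necessarily consecutive.
example_id = make_ride_id(2023, 3, 0)
print(example_id)
```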

In [13]:
output_file = 'result.parquet'

In [19]:
df_result = df[['ride_id', 'prediction']].copy()
df_result.to_parquet(
        output_file,
        engine='pyarrow',
        compression=None,
        index=False
    )
print(f"Predictions saved to {output_file}")
print(f"Mean predicted duration: {np.mean(y_pred)}") 

Predictions saved to result.parquet
Mean predicted duration: 14.203865642696083


In [22]:
import os

size_bytes = os.path.getsize(output_file)
size_MB = size_bytes / (1024 * 1024)
print(f'Size of the output file: {size_MB:.2f} MB')

Size of the output file: 65.46 MB


## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

In [28]:
# Turn notebook to script
!jupyter nbconvert --to=script starter.ipynb

[NbConvertApp] Converting notebook starter.ipynb to script
[NbConvertApp] Writing 3057 bytes to starter.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

In [33]:
cat Pipfile.lock | grep -A 5 '"scikit-learn"' | head -n 3


        "scikit-learn": {
            "hashes": [
                "sha256:0650e730afb87402baa88afbf31c07b84c98272622aaba002559b614600ca691",


## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for April 2023. 

What's the mean predicted duration? 

In [15]:
!python script_cli.py 2023 4

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Mean predicted duration for 04/2023: 14.29 minutes
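The contents of `script_cli.py` are not shown in this notebook. As a rough sketch only, the argument handling could look like the following, assuming two positional integers as in the command above and assuming the input URL follows the same pattern as the March 2023 file loaded earlier. The names `parse_args`, `build_input_url` and `URL_TEMPLATE` are hypothetical.

```python
import argparse

# Assumed pattern, inferred from the March 2023 URL used earlier in the notebook.
URL_TEMPLATE = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{year:04d}-{month:02d}.parquet"
)


def parse_args(argv=None):
    """Parse two positional integers: year and month."""
    parser = argparse.ArgumentParser(
        description="Score one month of yellow taxi trips."
    )
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2023")
    parser.add_argument(
        "month",
        type=int,
        choices=range(1, 13),
        metavar="month",
        help="month number from 1 to 12",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


def build_input_url(year, month):
    """Fill the zero-padded year and month into the assumed URL pattern."""
    return URL_TEMPLATE.format(year=year, month=month)
```

In a script, the parsed `args.year` and `args.month` would then be passed to `read_data(build_input_url(args.year, args.month))` before predicting and printing the mean.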


## Q6. Docker container 

Finally, we'll package the script in a Docker container. 
For that, you'll need to use a base image that we prepared. 

This is the content of that image:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with `FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`.

In [26]:
!docker build -t ride-duration-prediction .

[1A[1B[0G[?25l
[?25h[1A[0G[?25l[+] Building 0.0s (0/1)                                          docker:default
 => [internal] load build definition from dockerfile                       0.0s
[?25h[1A[1A[0G[?25l[+] Building 0.2s (1/1)                                          docker:default
[34m => [internal] load build definition from dockerfile                       0.2s
[0m[34m => => transferring dockerfile: 386B                                       0.0s
[0m[?25h[1A[1A[1A[0G[?25l[+] Building 0.4s (1/2)                                          docker:default
[34m => [internal] load build definition from dockerfile                       0.2s
[0m[34m => => transferring dockerfile: 386B                                       0.0s
[0m => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops  0.2s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.6s (1/2)                                          docker:default
[34m => [internal] load build definition

In [27]:
!docker run --rm ride-duration-prediction


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Mean predicted duration for 05/2023: 0.19 minutes
