In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [2]:
!python -V

Python 3.10.18


In [3]:
import sklearn 
sklearn.__version__

'1.5.0'


## Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset. 

You'll find the starter code in the [homework](homework) directory.

Solution: [homework_solution/](homework_solution/)


In [9]:
import pickle
import pandas as pd
import os

In [10]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [11]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [12]:
DATASET_DIR = "Data"
OUTPUT_DIR = "output"
DATA_URL = "https://d37ci6vzurychx.cloudfront.net"
taxi_type = "yellow"
year = 2023
month = 3
input_file = f'{DATA_URL}/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
print(f"input_file: {input_file}")

input_file: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet


In [13]:
for directory in [DATASET_DIR, OUTPUT_DIR]:
    if os.path.isdir(directory):
        print(f"The {directory} directory exists")
        continue
    # Create the directory if it is not present
    os.makedirs(directory, exist_ok=True)
    print(f"The {directory} directory is created")

The Data directory exists
The output directory exists


In [14]:
df = read_data(input_file)

In [15]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* 6.24
* 12.28
* 18.28


In [16]:
# Calculate the standard deviation of the predictions
y_pred.std()

np.float64(6.247488852238703)

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* **6.24** <span style="font-size: 1.2em;">⬅️</span>
* 12.28
* 18.28
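
`y_pred.std()` is NumPy's `ndarray.std`, which defaults to `ddof=0`, i.e. the population standard deviation. As a minimal sketch of what that computes, the snippet below reproduces it with the standard library; the durations are made-up values, not model output:

```python
import math
import statistics

# Hypothetical predicted durations in minutes (illustrative only)
preds = [10.0, 12.0, 14.0, 16.0, 18.0]

mean = sum(preds) / len(preds)
# Population std: divide the squared deviations by n (like NumPy's ddof=0)
pop_std = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
# Sample std divides by n - 1 instead (like ddof=1)
sample_std = statistics.stdev(preds)

print(round(pop_std, 4), round(sample_std, 4))
```

With millions of rows the two variants are practically identical, so the choice does not affect which answer option is closest.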

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 36M
* 46M
* 56M
* 66M

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`. 
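
To see what the `ride_id` values look like, here is a small plain-Python sketch of the same f-string pattern. The index labels are hypothetical; they only illustrate that, since `read_data` filters rows without resetting the index, the labels may have gaps:

```python
year, month = 2023, 3

# Hypothetical row labels that survived the duration filter; note the gaps
index = [0, 1, 4, 7]

# Same format as in the homework snippet: zero-padded year/month plus the row label
prefix = f'{year:04d}/{month:02d}_'
ride_ids = [prefix + str(i) for i in index]

print(ride_ids)
```

The IDs stay unique as long as the dataframe index is unique, which should hold for the default index produced when reading a single parquet file.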

In [17]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [18]:
df_result = df[['ride_id']].copy()
df_result['prediction'] = y_pred

In [19]:
output_file = f'{OUTPUT_DIR}/pred_yellow_tripdata_{year:04}-{month:02}.parquet'
print(f'output_file {output_file}...')

output_file output/pred_yellow_tripdata_2023-03.parquet...


In [20]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [21]:
!ls -lh {OUTPUT_DIR}

total 196M
-rw-rw-rw- 1 codespace codespace 66M Jun 15 17:03 pred_yellow_tripdata_2023-03.parquet
-rw-rw-rw- 1 codespace codespace 64M Jun 15 15:49 pred_yellow_tripdata_2023-04.parquet
-rw-rw-rw- 1 codespace codespace 68M Jun 15 15:50 pred_yellow_tripdata_2023-05.parquet


In [22]:
# Get the file size in bytes and convert it to MB
file_stats = os.stat(output_file)
file_size = file_stats.st_size / (1024 * 1024)
# Print the size in MB
print(f"File size is {file_size:.2f} MB")

File size is 65.46 MB


What's the size of the output file?

* 36M
* 46M
* 56M
* **66M** <span style="font-size: 1.2em;">⬅️</span>
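
The two measurements above agree: 65.46 MB computed from `os.stat` is shown as `66M` by `ls -lh`, most likely because GNU `ls` rounds human-readable sizes up. A minimal, self-contained sketch of the size calculation, using a throwaway temporary file rather than the real output:

```python
import os
import tempfile


def size_in_mib(path):
    """Return the file size in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Write a throwaway 1.5 MiB file so the example does not need the parquet output
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * (3 * 512 * 1024))
    tmp_path = tmp.name

size = size_in_mib(tmp_path)
print(f'{size:.2f} MiB')

# Clean up the temporary file
os.remove(tmp_path)
```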


## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

In [23]:
!jupyter nbconvert --to python starter.ipynb

[NbConvertApp] Converting notebook starter.ipynb to python
[NbConvertApp] Writing 13024 bytes to starter.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.
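
Since the Scikit-Learn version matters here, it can help to check what the lock file actually pins. The sketch below assumes the usual `Pipfile.lock` layout (a `default` section whose entries carry `version` and `hashes`); the embedded JSON is a trimmed, hypothetical example and the hash is a placeholder:

```python
import json

# Hypothetical, heavily trimmed Pipfile.lock content; the hash is a placeholder
lock_text = '''
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:<placeholder>"],
      "version": "==1.5.0"
    }
  }
}
'''


def locked_version(lock, package):
    """Return the pinned version of a package without the leading '=='."""
    entry = lock['default'].get(package)
    if entry is None:
        return None
    return entry['version'].lstrip('=')


lock = json.loads(lock_text)
print(locked_version(lock, 'scikit-learn'))
```

For the real file, replace `json.loads(lock_text)` with `json.load` on the opened `Pipfile.lock`.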

What's the first hash for the Scikit-Learn dependency?

In [24]:
import json

# Path to the Pipfile.lock file
lock_file_path = 'Pipfile.lock'

# Open and read Pipfile.lock
with open(lock_file_path, 'r') as f:
    pipenv_lock = json.load(f)

# Look up scikit-learn among the default dependencies
scikit_learn_hashes = pipenv_lock['default'].get('scikit-learn', {}).get('hashes')

# Show the first hash
if scikit_learn_hashes:
    print(f"Scikit-Learn hash: {scikit_learn_hashes[0]}")
else:
    print("Scikit-Learn not found in Pipfile.lock.")


Scikit-Learn hash: sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c


**057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c**

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for April 2023. 

What's the mean predicted duration? 

* 7.29
* 14.29
* 21.29
* 28.29

Hint: just add a print statement to your script.
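
One possible way to expose the two parameters is with named flags, sketched below. This is an alternative to the positional arguments used in the script in this notebook, not the required interface; the helper names `parse_args` and `build_input_url` are made up for illustration:

```python
import argparse

# Base URL of the trip data, as used elsewhere in this notebook
DATA_URL = 'https://d37ci6vzurychx.cloudfront.net/trip-data'


def parse_args(argv=None):
    """Parse --year and --month; argv=None means 'read from sys.argv'."""
    parser = argparse.ArgumentParser(description='Batch ride duration prediction.')
    parser.add_argument('--year', type=int, required=True, help='four-digit year, e.g. 2023')
    parser.add_argument('--month', type=int, required=True, choices=range(1, 13),
                        help='month from 1 to 12')
    return parser.parse_args(argv)


def build_input_url(year, month, taxi_type='yellow'):
    """Build the parquet URL for a given taxi type, year, and month."""
    return f'{DATA_URL}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'


# Passing an explicit list makes the sketch runnable outside the command line
args = parse_args(['--year', '2023', '--month', '4'])
print(build_input_url(args.year, args.month))
```

Using `choices` lets argparse reject an out-of-range month before any data is downloaded.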

In [25]:
%%writefile starter.py
import pickle
import pandas as pd
import argparse
import os

DATASET_DIR = "Data"
OUTPUT_DIR = "output"
DATA_URL = 'https://d37ci6vzurychx.cloudfront.net/trip-data'
categorical = ['PULocationID', 'DOLocationID']
DEFAULT_MODEL_FILE = 'model.bin'

for directory in [DATASET_DIR, OUTPUT_DIR]:
    if os.path.isdir(directory):
        print(f"The {directory} directory exists")
        continue
    # Create the directory if it is not present
    os.makedirs(directory, exist_ok=True)
    print(f"The {directory} directory is created")


def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df


def save_results(df, y_pred, output_file, year, month):
    # Build an artificial ride_id from the year, month, and row index
    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
    df_result = df[['ride_id']].copy()
    df_result['prediction'] = y_pred

    df_result.to_parquet(
        output_file,
        engine='pyarrow',
        compression=None,
        index=False
    )


def load_model(model_file: str):
    print(f'Loading model {model_file}...')
    with open(model_file, 'rb') as f_in:
        dv, model = pickle.load(f_in)
    return dv, model


def apply_model(model_file: str, input_file: str, output_file: str, year: int, month: int):
    print(f'reading data {input_file}...')
    df = read_data(input_file)
    dicts = df[categorical].to_dict(orient='records')

    print('predicting...')
    dv, model = load_model(model_file)
    X_val = dv.transform(dicts)
    y_pred = model.predict(X_val)
    print(f'the mean of prediction is {y_pred.mean()}')

    print(f'save results {output_file}...')
    save_results(df, y_pred, output_file, year, month)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process ride duration prediction.')
    # Positional arguments are always required, so a default would be ignored here
    parser.add_argument('taxi_type', type=str, help='taxi type, e.g. yellow')
    parser.add_argument('year', type=int, help='four-digit year, e.g. 2023')
    parser.add_argument('month', type=int, help='month from 1 to 12')
    args = parser.parse_args()

    taxi_type = args.taxi_type
    year = args.year
    month = args.month

    # List the working directory (handy for checking what is inside the container)
    path = "./"
    dirs = os.listdir(path)
    for file in dirs:
        print(file)

    model_file = DEFAULT_MODEL_FILE
    input_file = f'{DATA_URL}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
    output_file = f'{OUTPUT_DIR}/pred_{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'

    apply_model(model_file, input_file, output_file, year, month)

Overwriting starter.py


In [26]:
taxi_type = 'yellow'
year = 2023
month = 4

In [27]:
!python starter.py {taxi_type} {year} {month}

The Data directory exists
The output directory exists
output
Pipfile.lock
model.bin
starter.ipynb
starter.py
.ipynb_checkpoints
Data
Pipfile
Dockerfile
Untitled.ipynb
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 14.292282936862449
save results output/pred_yellow_tripdata_2023-04.parquet...


What's the mean predicted duration? 

* 7.29
* **14.29** <span style="font-size: 1.2em;">⬅️</span>
* 21.29
* 28.29


## Q6. Docker container 

Finally, we'll package the script in a Docker container. 
For that, you'll need to use a base image that we prepared. 

This is the content of that image:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with:

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the Docker image. You will need
to use the pickle file already in the image. 

Now run the script with Docker. What's the mean predicted duration
for May 2023? 

* 0.19
* 7.24
* 14.24
* 21.19

In [31]:
%%writefile Dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

WORKDIR /app

# RUN apt-get update && apt-get install -y awscli

COPY [ "Pipfile", "Pipfile.lock", "./" ]

RUN pip install pipenv 

# RUN pip install s3fs fsspec
#RUN pipenv install --deploy --ignore-pipfile
RUN pipenv install --system --deploy

COPY  "starter.py" "./"

ENTRYPOINT [ "python", "starter.py" ]

Overwriting Dockerfile


In [29]:
# Set parameters
taxi_type = 'yellow'
year = 2023
month = 5

In [30]:
# Test without docker
!python starter.py {taxi_type} {year} {month}

The Data directory exists
The output directory exists
output
Pipfile.lock
model.bin
starter.ipynb
starter.py
.ipynb_checkpoints
Data
Pipfile
Dockerfile
Untitled.ipynb
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 14.242595513316317
save results output/pred_yellow_tripdata_2023-05.parquet...


In [32]:
# Build image
!docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .

[+] Building 0.3s (1/3)                                          docker:default
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 403B                                       0.0s
 => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops  0.2s
 => [auth] agrigorev/zoomcamp-model:pull token for registry-1.docker.io    0.0s
[+] Building 0.4s (2/3)

In [33]:
# Show image
!docker image ls

REPOSITORY                 TAG                 IMAGE ID       CREATED             SIZE
mlops-zoomcamp-model       2024-3.10.13-slim   556e6f1da836   15 seconds ago      759MB
<none>                     <none>              1d4c795aaf20   About an hour ago   675MB
zenml-mlflow               latest              488268938e7f   6 days ago          1.08GB
zenmldocker/zenml-server   0.83.0              c5bc14d167c9   2 weeks ago         787MB
mysql                      8.0                 43006ac274fd   2 months ago        772MB


In [34]:
# Run the scoring script inside the container
!docker run -i -t mlops-zoomcamp-model:2024-3.10.13-slim {taxi_type} {year} {month}

The Data directory is created
The output directory is created
output
Data
starter.py
Pipfile.lock
Pipfile
model.bin
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 0.19174419265916945
save results output/pred_yellow_tripdata_2023-05.parquet...


Now run the script with Docker. What's the mean predicted duration
for May 2023? 

* **0.19** <span style="font-size: 1.2em;">⬅️</span>
* 7.24
* 14.24
* 21.19

## Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the Docker container 
doesn't seem very practical. Typically, after creating the output 
file, we upload it to cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.
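
A minimal sketch of the S3 variant, assuming boto3 is used: its S3 client exposes `upload_file(Filename, Bucket, Key)`. The helper name `upload_result`, the bucket name, and the object key are hypothetical. A small fake client stands in for `boto3.client('s3')` so the example runs without AWS credentials:

```python
def upload_result(s3_client, local_path, bucket, key):
    """Upload one local file to S3 and return its s3:// URI."""
    s3_client.upload_file(local_path, bucket, key)
    return f's3://{bucket}/{key}'


class FakeS3Client:
    """Stand-in that records calls instead of talking to AWS."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


client = FakeS3Client()  # in real use: boto3.client('s3')
uri = upload_result(
    client,
    'output/pred_yellow_tripdata_2023-05.parquet',
    'my-predictions-bucket',  # hypothetical bucket name
    'taxi/pred_yellow_tripdata_2023-05.parquet',  # hypothetical object key
)
print(uri)
```

Passing the client in as an argument keeps the upload step easy to test and lets the same function work with any object that offers a compatible `upload_file` method.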

## Bonus: Use an orchestrator for batch inference

Here we didn't use any orchestration. In practice, we usually do.

* Split the code into logical code blocks
* Use a workflow orchestrator for the code execution
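
The split into logical blocks could look roughly like the sketch below. It does not use any real orchestrator: `run_pipeline` is a hypothetical stand-in for one, and the `score` step returns placeholder numbers instead of calling the model, so that the example stays self-contained:

```python
def build_paths(ctx):
    """Derive input and output locations from the run parameters."""
    name = f"{ctx['taxi_type']}_tripdata_{ctx['year']:04d}-{ctx['month']:02d}.parquet"
    ctx['input_file'] = f"{ctx['data_url']}/{name}"
    ctx['output_file'] = f"output/pred_{name}"
    return ctx


def score(ctx):
    # Placeholder: the real step would call read_data, dv.transform, and model.predict
    ctx['predictions'] = [10.0, 20.0]
    return ctx


def summarize(ctx):
    """Compute the mean of the predictions."""
    preds = ctx['predictions']
    ctx['mean_prediction'] = sum(preds) / len(preds)
    return ctx


def run_pipeline(steps, ctx):
    """Run the steps in order, passing a shared context dict along."""
    for step in steps:
        print(f'running {step.__name__}...')
        ctx = step(ctx)
    return ctx


result = run_pipeline(
    [build_paths, score, summarize],
    {
        'taxi_type': 'yellow',
        'year': 2023,
        'month': 5,
        'data_url': 'https://d37ci6vzurychx.cloudfront.net/trip-data',
    },
)
print(result['output_file'], result['mean_prediction'])
```

With a real orchestrator, each function would typically become a task and the orchestrator would handle scheduling, retries, and logging in place of `run_pipeline`.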

## Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

```bash
docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .
docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

docker login --username USERNAME
docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
```

This is just for your reference; you don't need to do it.


## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw4
* It's possible that your answers won't match exactly. If that's the case, select the closest one.