In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [2]:
!python -V

Python 3.10.13


In [9]:
!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-16.1.0


In [2]:
import pickle
import pandas as pd
import os

In [3]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [4]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [5]:
DATASET_DIR = "Data"
OUTPUT_DIR = "output"
DATA_URL = "https://d37ci6vzurychx.cloudfront.net"
taxi_type = "yellow"
year = 2023
month = 3
input_file = f'{DATA_URL}/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
print(f"input_file: {input_file}")

input_file: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet


In [6]:
for directory in [DATASET_DIR, OUTPUT_DIR]:  # avoid shadowing the built-in dir()
    if os.path.isdir(directory):
        print(f"The {directory} directory exists")
        continue
    # Create the directory if it is not present.
    os.makedirs(directory, exist_ok=True)
    print(f"The {directory} directory is created")

The Data directory exists
The output directory exists


In [40]:
df = read_data(input_file)

In [41]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

## Q1. Notebook



We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.



In [42]:
# Calculate Std.
y_pred.std()

6.247488852238703

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* **6.24**
* 12.28
* 18.28
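
`y_pred` is presumably a NumPy array here, and its `.std()` defaults to the population standard deviation (`ddof=0`). A minimal sketch of how that differs from the sample version, using only the standard library and made-up toy values rather than the real predictions:

```python
import statistics

# Toy predicted durations in minutes (illustrative values, not real model output).
toy_predictions = [10.0, 12.0, 14.0, 16.0, 18.0]

# Population standard deviation: divides by n, like NumPy's default ddof=0.
population_std = statistics.pstdev(toy_predictions)

# Sample standard deviation: divides by n - 1, like ddof=1.
sample_std = statistics.stdev(toy_predictions)

print(f"population std: {population_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```

With millions of rows the two conventions give nearly identical values, so either one points to the same answer option.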

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
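
A plain-Python sketch of what this expression produces, assuming toy index values. Because `read_data` filters rows and keeps the original index labels, the resulting ids need not be consecutive:

```python
year = 2023
month = 3

# Hypothetical index labels left after filtering (note the gaps).
remaining_index = [0, 1, 4, 7]

# Same prefix format as the pandas expression: zero-padded year and month.
prefix = f"{year:04d}/{month:02d}_"
ride_ids = [prefix + str(i) for i in remaining_index]

print(ride_ids)
```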

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```



In [43]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [44]:
df_result = df[['ride_id']].copy()
df_result['prediction'] = y_pred

In [45]:
output_file = f'{OUTPUT_DIR}/pred_yellow_tripdata_{year:04}-{month:02}.parquet'
print(f'output_file {output_file}...')

output_file output/pred_yellow_tripdata_2023-03.parquet...


In [46]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [47]:
!ls -lh {OUTPUT_DIR}

total 199M
-rw-rw-rw- 1 codespace codespace 66M Jun 17 11:31 pred_yellow_tripdata_2023-03.parquet
-rw-rw-rw- 1 codespace codespace 66M Jun 17 11:30 pred_yellow_tripdata_2023-04.parquet
-rw-rw-rw- 1 codespace codespace 68M Jun 17 10:59 pred_yellow_tripdata_2023-05.parquet


In [48]:
# Get the size of the output file and convert bytes to MB
file_stats = os.stat(output_file)
file_size = file_stats.st_size / (1024 * 1024)
# Print the size in MB
print(f"File size is {file_size:.2f} MB")

File size is 65.46 MB


What's the size of the output file?

* 36M
* 46M
* 56M
* **66M**

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`. 
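
The size cell reports 65.46 MB while `ls -lh` shows 66M. A small sketch of one likely explanation, assuming `ls -lh` uses 1024-based units and rounds up to a whole unit (as GNU `ls` does for values of this magnitude). The byte count below is a hypothetical value chosen to land at about 65.46 MiB, not the real file size:

```python
import math

# Hypothetical byte count close to the size reported above.
size_bytes = 68_639_785

# Convert bytes to mebibytes (1024 * 1024 bytes).
size_mib = size_bytes / (1024 * 1024)

# Two decimals, as printed by the notebook cell.
print(f"{size_mib:.2f} MB")

# Rounded up to a whole unit, which is how `ls -lh` is assumed to display it.
print(f"{math.ceil(size_mib)}M")
```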

## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?


**`jupyter nbconvert --to script starter.ipynb`**

## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c
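
The hash can also be looked up programmatically rather than by eye. This is a sketch that assumes the usual `Pipfile.lock` layout (a JSON file with packages under `default`, each carrying a `hashes` list of `sha256:`-prefixed strings). `first_hash` is a hypothetical helper, and the demo writes a tiny stand-in lock file so the snippet runs on its own:

```python
import json
import tempfile
from pathlib import Path


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for a package, without the 'sha256:' prefix."""
    lock = json.loads(Path(lock_path).read_text())
    hashes = lock[section][package]["hashes"]
    return hashes[0].removeprefix("sha256:")


# Stand-in lock file for demonstration; a real one has many more entries.
demo_lock = {
    "default": {
        "scikit-learn": {
            "hashes": [
                "sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c",
            ],
            "version": "==1.5.0",
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    lock_path = Path(tmp_dir) / "Pipfile.lock"
    lock_path.write_text(json.dumps(demo_lock))
    result = first_hash(lock_path, "scikit-learn")

print(result)
```

Against a real project, pass the path to the generated `Pipfile.lock` instead of the temporary file.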

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for April 2023. 
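
A minimal sketch of one way to read these parameters, assuming three positional arguments in the order used by the run command in this notebook (`taxi_type`, `year`, `month`). `parse_args` and `build_paths` are hypothetical helper names, not the contents of the actual `starter.py`:

```python
import argparse

DATA_URL = "https://d37ci6vzurychx.cloudfront.net"
OUTPUT_DIR = "output"


def parse_args(argv=None):
    """Parse taxi type, year, and month from the command line."""
    parser = argparse.ArgumentParser(description="Score taxi trips for one month.")
    parser.add_argument("taxi_type", help="for example: yellow")
    parser.add_argument("year", type=int, help="four-digit year, for example 2023")
    parser.add_argument(
        "month",
        type=int,
        choices=range(1, 13),
        metavar="month",
        help="month number, 1-12",
    )
    return parser.parse_args(argv)


def build_paths(taxi_type, year, month):
    """Build the input URL and output path with the same naming as the notebook."""
    name = f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
    input_file = f"{DATA_URL}/trip-data/{name}"
    output_file = f"{OUTPUT_DIR}/pred_{name}"
    return input_file, output_file


# An explicit list is passed here for demonstration;
# in a script, call parse_args() with no arguments to read sys.argv.
args = parse_args(["yellow", "2023", "4"])
input_file, output_file = build_paths(args.taxi_type, args.year, args.month)
print(input_file)
print(output_file)
```

Converting `year` and `month` to `int` at parse time keeps the zero-padded format specifiers (`:04d`, `:02d`) working without extra casts.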



In [7]:
taxi_type = 'yellow'
year = 2023
month = 4

In [8]:
!python starter.py {taxi_type} {year} {month}

The Data directory exists
The output directory exists
starter.py
Dockerfile
Data
output
Pipfile
model.bin
Pipfile.lock
.ipynb_checkpoints
starter.ipynb
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 14.292282936862449
save results output/pred_yellow_tripdata_2023-04.parquet...


In [9]:
!ls -lh {OUTPUT_DIR}

total 196M
-rw-rw-rw- 1 codespace codespace 66M Jun 17 11:31 pred_yellow_tripdata_2023-03.parquet
-rw-rw-rw- 1 codespace codespace 64M Jun 17 11:32 pred_yellow_tripdata_2023-04.parquet
-rw-rw-rw- 1 codespace codespace 68M Jun 17 10:59 pred_yellow_tripdata_2023-05.parquet


What's the mean predicted duration? 

* 7.29
* **14.29**
* 21.29
* 28.29

Hint: just add a print statement to your script.

## Q6. Docker container 

Finally, we'll package the script in a Docker container.
For that, you'll need to use a base image that we prepared.

This is the Dockerfile of that image:
```
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with:

```docker
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image. 



Now run the script with docker. What's the mean predicted duration
for **May 2023**? 

In [10]:
taxi_type = 'yellow'
year = 2023
month = 5

In [11]:
# Test 
!python starter.py {taxi_type} {year} {month}

The Data directory exists
The output directory exists
starter.py
Dockerfile
Data
output
Pipfile
model.bin
Pipfile.lock
.ipynb_checkpoints
starter.ipynb
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 14.242595513316317
save results output/pred_yellow_tripdata_2023-05.parquet...


In [None]:
# Build image
!docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .

In [15]:
# Run the image
!docker run -i -t mlops-zoomcamp-model:2024-3.10.13-slim {taxi_type} {year} {month}

The Data directory is created
The output directory is created
Data
output
starter.py
Pipfile
Pipfile.lock
model.bin
reading data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet...
predicting...
Loading model model.bin...
the mean of prediction is 0.19174419265916945
save results output/pred_yellow_tripdata_2023-05.parquet...

The mean differs from the local test run (14.24), presumably because the container loads the `model.bin` that ships with the base image (copied there from `model2.bin`) rather than the local model file.



* **0.19**
* 7.24
* 14.24
* 21.19