Name: Isaac Ndirangu Muturi
Email: ndirangumuturi749@gmail.com

## Homework

In this homework, we'll deploy the ride duration model in batch mode.


## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

In [1]:
!pip install -q pandas scikit-learn pyarrow numpy


In [2]:
import pickle
import pandas as pd
import numpy as np


In [3]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [4]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df


In [5]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')
df.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-03-01 00:06:43 | 2023-03-01 00:16:43 | 1.0 | 0.0 | 1.0 | N | 238 | 42 | 2 | 8.6 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 11.1 | 0.0 | 0.0 | 10.0 |
| 1 | 2 | 2023-03-01 00:08:25 | 2023-03-01 00:39:30 | 2.0 | 12.4 | 1.0 | N | 138 | 231 | 1 | 52.7 | 6.0 | 0.5 | 12.54 | 0.0 | 1.0 | 76.49 | 2.5 | 1.25 | 31.083333 |
| 2 | 1 | 2023-03-01 00:15:04 | 2023-03-01 00:29:26 | 0.0 | 3.3 | 1.0 | N | 140 | 186 | 1 | 18.4 | 3.5 | 0.5 | 4.65 | 0.0 | 1.0 | 28.05 | 2.5 | 0.0 | 14.366667 |
| 3 | 1 | 2023-03-01 00:49:37 | 2023-03-01 01:01:05 | 1.0 | 2.9 | 1.0 | N | 140 | 43 | 1 | 15.6 | 3.5 | 0.5 | 4.1 | 0.0 | 1.0 | 24.7 | 2.5 | 0.0 | 11.466667 |
| 4 | 2 | 2023-03-01 00:08:04 | 2023-03-01 00:11:06 | 1.0 | 1.23 | 1.0 | N | 79 | 137 | 1 | 7.2 | 1.0 | 0.5 | 2.44 | 0.0 | 1.0 | 14.64 | 2.5 | 0.0 | 3.033333 |
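As a sanity check on the `duration` column, the value for row 1 can be recomputed by hand. The sketch below uses only the standard library and the two timestamps shown in the preview. The variable names are mine and not part of the notebook.

```python
from datetime import datetime

# Pickup and dropoff timestamps copied from row 1 of the preview above.
pickup = datetime.fromisoformat("2023-03-01 00:08:25")
dropoff = datetime.fromisoformat("2023-03-01 00:39:30")

# Same arithmetic as read_data: a time difference converted to minutes.
duration_minutes = (dropoff - pickup).total_seconds() / 60

# Same bounds as the filter in read_data (1 to 60 minutes, inclusive).
is_kept = 1 <= duration_minutes <= 60

print(round(duration_minutes, 6), is_kept)
```

The trip lasts 31 minutes and 5 seconds, which agrees with the 31.083333 shown for that row, and it falls inside the 1 to 60 minute window that `read_data` keeps.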


In [6]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)


In [7]:
std_dev = np.std(y_pred)
print(f'Standard Deviation of the predicted duration: {std_dev}')


Standard Deviation of the predicted duration: 6.247488852238703
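One detail worth keeping in mind: `np.std` defaults to `ddof=0`, so the figure above is the population standard deviation, whereas pandas' `Series.std` defaults to `ddof=1`. The sketch below shows the difference with the standard library on a tiny, made-up list of predictions. The numbers are purely illustrative and not taken from the model.

```python
import statistics

# Hypothetical predicted durations in minutes, for illustration only.
sample_predictions = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation: divides by n (what np.std does by default).
population_std = statistics.pstdev(sample_predictions)

# Sample standard deviation: divides by n - 1 (what pandas .std() does by default).
sample_std = statistics.stdev(sample_predictions)

print(population_std, sample_std)
```

On a large dataset the two values are nearly identical, so either convention should lead to the same answer here.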



## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 36M
* 46M
* 56M
* 66M

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`.



In [8]:
# Add ride_id column
year, month = 2023, 3
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

# Create dataframe with results
df_result = pd.DataFrame({
    'ride_id': df['ride_id'],
    'predicted_duration': y_pred
})

# Save the result as a parquet file
output_file = 'result.parquet'

df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
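To see what the `ride_id` values look like, here is a small sketch that mimics the pandas expression with a plain list. The index labels are hypothetical. The gap in them reflects the fact that `read_data` filters rows without resetting the index, so the labels in the real dataframe are not necessarily contiguous.

```python
year, month = 2023, 3

# Hypothetical index labels; the gap mimics rows dropped by the duration filter.
index_labels = [0, 1, 2, 5]

# Zero-padded year and month, matching the f-string used in the notebook.
prefix = f"{year:04d}/{month:02d}_"

# Equivalent of: prefix + df.index.astype('str')
ride_ids = [prefix + str(label) for label in index_labels]

print(ride_ids)
```

The ids stay unique because the original row labels are unique, even though some numbers are skipped.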


In [9]:
!ls -lh result.parquet


-rw-rw-rw- 1 codespace codespace 66M Jun 17 14:26 result.parquet


In [10]:
import os
file_size = os.path.getsize(output_file) / (1024 * 1024)  # Convert bytes to megabytes
file_size

65.46199798583984
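The two size readings (66M from `ls -lh`, 65.46 from `os.path.getsize`) are consistent if we assume that GNU `ls -h` rounds human-readable sizes up rather than to the nearest unit. The sketch below checks that arithmetic. The byte count is back-computed from the value printed above, not read from the file.

```python
import math

# Byte count back-computed from the 65.46199798583984 MiB value printed above.
size_bytes = 68_641_880

# Same conversion as in the notebook cell: bytes to mebibytes.
size_mib = size_bytes / (1024 * 1024)

# Assumption: ls -lh rounds up to the next whole unit for sizes this large.
rounded_up = math.ceil(size_mib)

print(size_mib, rounded_up)
```

Either way, the closest option in the list is 66M.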


## Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?



In [11]:
!jupyter nbconvert --to script homework_04_deployment.ipynb


[NbConvertApp] Converting notebook homework_04_deployment.ipynb to script
[NbConvertApp] Writing 6212 bytes to homework_04_deployment.py



## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?
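`Pipfile.lock` is a JSON file, and the runtime dependencies sit under the `default` key, each with a `hashes` list. A minimal sketch of reading the first hash programmatically is shown below. The helper name `first_hash` is my own, and the sample content uses placeholder hashes, not the real ones.

```python
import json


def first_hash(lock_text, package, section="default"):
    """Return the first hash listed for a package in Pipfile.lock content."""
    lock = json.loads(lock_text)
    return lock[section][package]["hashes"][0]


# Hypothetical lock content with placeholder hashes, for illustration only.
sample_lock = json.dumps({
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:PLACEHOLDER_A", "sha256:PLACEHOLDER_B"],
            "version": "==1.5.0",
        }
    }
})

print(first_hash(sample_lock, "scikit-learn"))
```

With a real environment, the same helper could be called on the text of the generated `Pipfile.lock` once `pipenv install` has finished.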


In [13]:
!pipenv --python 3.9


Creating a virtualenv for this project...
Pipfile: /workspaces/MLOps-zoomcamp-2024/04-deployment/homework/Pipfile
Using /usr/bin/python3.9 (3.9.19) to create virtualenv...
created virtual environment CPython3.9.19.final.0-64 in 1394ms
  creator CPython3Posix(dest=/home/codespace/.local/share/virtualenvs/homework-C3PGYoNw, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/codespace/.local/share/virtualenv)
    added seed packages: pip==24.0, setuptools==69.5.1, wheel==0.43.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

✔ Successfully created virtual environment!
Virtualenv location: /home/codespace/.local/share/virtualenvs/homework-C3PGYoNw

In [14]:
!pipenv install boto3 mlflow pyarrow numpy pandas scikit-learn==1.5.0


Installing boto3...
Resolving boto3...
Added boto3 to Pipfile's [packages] ...
✔ Installation Succeeded
Installing mlflow...
Resolving mlflow...
Added mlflow to Pipfile's [packages] ...
✔ Installation Succeeded
Installing pyarrow...
Resolving pyarrow...
Added pyarrow to Pipfile's [packages] ...
✔ Installation Succeeded
Installing numpy...
Resolving numpy...
Added numpy to Pipfile's [packages] ...
✔ Installation Succeeded



## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two
parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

* 7.29
* 14.29
* 21.29
* 28.29

Hint: just add a print statement to your script.



In [16]:
!python homework_04_deployment.py --year 2023 --month 4

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Mean predicted duration for 2023-04: 14.29
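The converted script itself is not shown here, so the following is only a sketch of how the `--year` and `--month` parameters might be wired up with `argparse`. The function names `parse_args`, `input_url`, and `format_report` are hypothetical. The URL pattern and the message format are taken from the cells and output above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two CLI parameters; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Batch-score taxi ride durations.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    return parser.parse_args(argv)


def input_url(year, month):
    """Build the input parquet URL, following the pattern used in the notebook."""
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{year:04d}-{month:02d}.parquet"
    )


def format_report(year, month, mean_prediction):
    """Format the summary line in the same shape as the output above."""
    return f"Mean predicted duration for {year:04d}-{month:02d}: {mean_prediction:.2f}"


# Explicit argument list so the sketch runs without a real command line.
args = parse_args(["--year", "2023", "--month", "4"])
print(input_url(args.year, args.month))
```

In the actual script, `parse_args()` would be called without arguments, and the mean passed to `format_report` would come from `y_pred.mean()`.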



## Q6. Docker container

Finally, we'll package the script in a Docker container.
For that, you'll need to use a base image that we prepared.

This is what the content of this image is:
```
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with:

```docker
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image.

Now run the script with docker. What's the mean predicted duration
for May 2023?

* 0.19
* 7.24
* 14.24
* 21.19



In [20]:
!docker build -t predict_duration_image .


[1A[1B[0G[?25l[+] Building 0.0s (0/1)                                          docker:default
[?25h[1A[0G[?25l[+] Building 0.2s (2/3)                                          docker:default
[34m => [internal] load build definition from Dockerfile                       0.1s
[0m[34m => => transferring dockerfile: 616B                                       0.0s
[0m => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops  0.2s
[34m => [auth] agrigorev/zoomcamp-model:pull token for registry-1.docker.io    0.0s
[0m[?25h[1A[1A[1A[1A[1A[0G[?25l[+] Building 0.4s (2/3)                                          docker:default
[34m => [internal] load build definition from Dockerfile                       0.1s
[0m[34m => => transferring dockerfile: 616B                                       0.0s
[0m => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops  0.3s
[34m => [auth] agrigorev/zoomcamp-model:pull token for registry-1.docker.io    0

In [19]:
!docker run --rm predict_duration_image


Mean predicted duration for 2023-05: 0.19


Follow me on Twitter 🐦, connect with me on LinkedIn 🔗, and check out my GitHub 🐙. You won't be disappointed!

🐦 Twitter: https://twitter.com/NdiranguMuturi1  
💼 LinkedIn: https://www.linkedin.com/in/isaac-muturi-3b6b2b237  
🔗 GitHub: https://github.com/Isaac-Ndirangu-Muturi-749