# MLOps Zoomcamp Homework 4 

The goal of this homework is to get familiar with deploying models in batch mode.

- Module 4 introduction: https://github.com/DataTalksClub/mlops-zoomcamp/tree/main/04-deployment

#### Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.

We cleaned it a little bit and kept only the scoring part. Now it's in [homework/starter.ipynb](homework/starter.ipynb).

Run this notebook for the February 2021 FHV data.

What's the mean predicted duration for this dataset?

* 11.19
* 16.19
* 21.19
* 26.19

In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.0.2


In [2]:
import pickle
import numpy as np
import pandas as pd

In [3]:
with open('/home/rodrigoperes/mlops-zoomcamp/04-deployment/homework/model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

In [4]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [5]:
df = read_data('/home/rodrigoperes/notebooks/data/fhv_tripdata_2021-02.parquet')

In [6]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

In [7]:
y_pred

array([14.53986486, 13.74042222, 15.59333908, ..., 15.83492293,
       16.78317605, 19.65462607])

In [8]:
np.mean(y_pred)

16.191691679979066

My answer: 16.19

#### Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 9M
* 19M
* 29M
* 39M

Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use pyarrow, not fastparquet. 

In [12]:
year = 2021
month = 2

df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
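
To make the `ride_id` format concrete, here is a minimal plain-Python sketch of the same string construction. `make_ride_ids` is a hypothetical helper, not part of the homework code, and the row labels are made up. Because `read_data` filters rows without resetting the index, the labels appended to the prefix may have gaps.

```python
def make_ride_ids(year, month, index):
    """Build ride ids like '2021/02_0' from a year, a month and row labels."""
    prefix = f'{year:04d}/{month:02d}_'
    return [prefix + str(label) for label in index]


# Made-up row labels with gaps, as a boolean filter without reset_index would leave them
print(make_ride_ids(2021, 2, [0, 1, 3, 7]))
# prints ['2021/02_0', '2021/02_1', '2021/02_3', '2021/02_7']
```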

In [13]:
df['predictions'] = y_pred

In [14]:
df_results = df[['ride_id', 'predictions']]

In [16]:
output_file = "hw04_files/ride_duration_predictions.parquet"

df_results.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [19]:
!cd hw04_files/

In [20]:
!ls -l *

-rw-rw-r-- 1 rodrigoperes rodrigoperes  2384 Jun 12 19:23 homework.py
-rw-rw-r-- 1 rodrigoperes rodrigoperes 15676 May 22 22:32 hw01_intro.ipynb
-rw-rw-r-- 1 rodrigoperes rodrigoperes 10743 May 29 22:59 hw02_exp_tracking.ipynb
-rw-rw-r-- 1 rodrigoperes rodrigoperes 11302 Jun 26 15:34 hw03_orchestration.ipynb
-rw-rw-r-- 1 rodrigoperes rodrigoperes 11302 Jun 27 00:00 hw04_deployment.ipynb

hw04_data:
total 19252
-rw-rw-r-- 1 rodrigoperes rodrigoperes 19711440 Jun 27 00:19 ride_duration_predictions.parquet

mlruns:
total 4
drwxrwxr-x 2 rodrigoperes rodrigoperes 4096 May 29 22:38 0/


My answer: 19M
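
A note on the cells above: each `!` command in a notebook runs in its own subshell, so `!cd` does not change the working directory for the `!ls` that follows. The size can also be read directly from Python. The sketch below uses a hypothetical helper, `size_in_mib`, and demonstrates it on a throwaway file; in the notebook you would pass `output_file` instead.

```python
import os
import tempfile


def size_in_mib(path):
    """Return the size of the file at `path` in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Demonstration on a throwaway 2 MiB file (assumption: any readable file path works the same way)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * (2 * 1024 * 1024))

print(f'{size_in_mib(tmp.name):.1f} MiB')  # prints 2.0 MiB
os.remove(tmp.name)
```

For the 19711440-byte file in the listing above, this works out to roughly 18.8 MiB, which is closest to the 19M option.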

#### Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

My answer: `jupyter nbconvert --to script starter.ipynb`

#### Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version:
check the starter notebook for details. 

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

My answer: sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b
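
`Pipfile.lock` is a JSON file, so the hash can also be looked up programmatically instead of by eye. The sketch below assumes the usual layout, with packages under a `default` section and each one carrying a `hashes` list. `first_hash` is a hypothetical helper, and the hash strings in the example are placeholders, not real values.

```python
import json


def first_hash(lock, package, section='default'):
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock[section][package]['hashes'][0]


# Tiny stand-in for a real lock file; the hash values are placeholders
example_lock = json.loads('''
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:aaaa", "sha256:bbbb"],
      "version": "==1.0.2"
    }
  }
}
''')

print(first_hash(example_lock, 'scikit-learn'))  # prints sha256:aaaa
```

With a real project, `lock` would come from `json.load(open('Pipfile.lock'))` run in the directory where pipenv created the file.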

#### Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for March 2021. 

What's the mean predicted duration? 

* 11.29
* 16.29
* 21.29
* 26.29

Hint: just add a print statement to your script.

!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-03.parquet

!python starter.py 2021 03

Mean ride duration predictions 16.298821614015107

My answer: 16.29
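
The parametrized `starter.py` is not shown here. As a minimal sketch of how the two positional arguments might be read, assuming the script is called as `python starter.py <year> <month>`: `parse_args` and `input_path` are hypothetical names, and the file name pattern simply follows the downloaded file above.

```python
def parse_args(argv):
    """Parse `<year> <month>` from the command-line arguments (script name excluded)."""
    if len(argv) != 2:
        raise SystemExit('usage: python starter.py <year> <month>')
    year, month = int(argv[0]), int(argv[1])
    if not 1 <= month <= 12:
        raise SystemExit(f'month must be between 1 and 12, got {month}')
    return year, month


def input_path(year, month):
    """Hypothetical local file name, following the pattern of the downloaded file."""
    return f'fhv_tripdata_{year:04d}-{month:02d}.parquet'


# In the script itself, pass sys.argv[1:] here instead of a literal list
year, month = parse_args(['2021', '03'])
print(input_path(year, month))  # prints fhv_tripdata_2021-03.parquet
```

Converting to `int` first means that `03` and `3` are treated the same, and the zero padding is restored when the file name is built.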

#### Q6. Docker container

Finally, we'll package the script in a Docker container.
For that, you'll need to use a base image that we prepared. 

This is what it looks like:

```
FROM python:3.9.7-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to [`agrigorev/zoomcamp-model:mlops-3.9.7-slim`](https://hub.docker.com/layers/zoomcamp-model/agrigorev/zoomcamp-model/mlops-3.9.7-slim/images/sha256-7fac33c783cc6018356ce16a4b408f6c977b55a4df52bdb6c4d0215edf83af5d?context=explore),
which you should use as your base image.

That is, this is how your Dockerfile should start:

```docker
FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image. 

Now run the script with docker. What's the mean predicted duration
for April 2021? 


* 9.96
* 16.55
* 25.96
* 36.55

FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

RUN pip install -U pip
RUN pip install pipenv 

WORKDIR /app

COPY [ "Pipfile", "Pipfile.lock", "./" ]

RUN pipenv install --system --deploy

COPY [ "starter.py", "./" ]

COPY [ "fhv_tripdata_2021-04.parquet", "./" ]

RUN python starter.py 2021 04

Mean ride duration predictions 9.967573179784523

My answer: 9.96