In [2]:
import pickle
import pandas as pd

In [158]:
!pip freeze | grep flow

mlflow==1.26.0


In [9]:
!pip freeze | grep scikit-learn

scikit-learn==1.0.2


In [10]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

In [11]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df

In [5]:
df = read_data('https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet')

In [17]:
def predict(df, dv, lr):
    dicts = df[categorical].to_dict(orient='records')
    X_val = dv.transform(dicts)
    y_pred = lr.predict(X_val)

    return y_pred

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.

We cleaned it a little bit and kept only the scoring part. Now it's in [homework/starter.ipynb](homework/starter.ipynb).

Run this notebook for the February 2021 FHV data.

What's the mean predicted duration for this dataset?

* 11.19
* 16.19
* 21.19
* 26.19

In [18]:
y_pred = predict(df, dv, lr)
print(f'Mean predicted duration = {y_pred.mean():.2f}')

Mean predicted duration = 16.19


Answer:
B) 16.19

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
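
To make the id format concrete, here is a minimal pandas-free sketch of what that expression produces. The helper name `make_ride_ids` and the sample index values are hypothetical and only serve as an illustration.

```python
def make_ride_ids(year, month, index_values):
    """Build ride ids of the form 'YYYY/MM_<index>' (hypothetical helper)."""
    # Zero-padded year and month, followed by an underscore
    prefix = f'{year:04d}/{month:02d}_'
    return [prefix + str(i) for i in index_values]


ride_ids = make_ride_ids(2021, 2, [1, 2, 3])
print(ride_ids)  # ['2021/02_1', '2021/02_2', '2021/02_3']
```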

Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 9M
* 19M
* 29M
* 39M

Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use pyarrow, not fastparquet.


In [132]:
def preprocess_data(df):
    """Return a copy of df with an artificial ride_id column ('YYYY/MM_<index>')."""
    df_copied = df.copy()
    # Take year and month from each ride's pickup time instead of passing them in
    df_copied["pickup_yyyy_mm"] = pd.to_datetime(df_copied['pickup_datetime']).dt.strftime('%Y/%m')
    # Build the id row by row; row.name is the row's index label
    df_copied['ride_id'] = df_copied.apply(lambda row: row["pickup_yyyy_mm"] + "_" + str(row.name), axis=1)

    return df_copied


In [148]:
df_copied = preprocess_data(df)

# Take an explicit copy so that adding a column does not raise SettingWithCopyWarning
df_result = df_copied[['ride_id']].copy()
df_result['y_pred'] = y_pred.tolist()

df_result.head()


|   | ride_id   | y_pred    |
|---|-----------|-----------|
| 1 | 2021/02_1 | 14.539865 |
| 2 | 2021/02_2 | 13.740422 |
| 3 | 2021/02_3 | 15.593339 |
| 4 | 2021/02_4 | 15.188118 |
| 5 | 2021/02_5 | 13.817206 |


In [142]:
df_result.to_parquet(
    'df_result.parquet',
    engine='pyarrow',
    compression=None,
    index=False
)

In [147]:
!du -BM df_result.parquet

19M	df_result.parquet
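
The same check can be done from Python, which helps where `du` is not available. This is a minimal sketch: `size_in_mib` is a hypothetical helper, and it reports the apparent file size, which can differ slightly from the disk usage that `du` prints.

```python
import math
import os


def size_in_mib(path):
    """Apparent file size in MiB, rounded up to a whole number (hypothetical helper)."""
    return math.ceil(os.path.getsize(path) / 1024 ** 2)


# Only report when the file written by the notebook is present
if os.path.exists('df_result.parquet'):
    print(f"{size_in_mib('df_result.parquet')}M")
```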


Answer:
B) 19M

## Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?



## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: check the starter notebook for details.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

Answer:
08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b
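
Instead of reading the hash off by eye, it can be pulled out of the lock file programmatically, since `Pipfile.lock` is plain JSON. The sketch below assumes the package sits in the `default` section; `first_hash` is a hypothetical helper, and the sample content is trimmed to a single hash for illustration (a real lock file lists many per package).

```python
import json


def first_hash(lock, package, section='default'):
    """Return the first hash listed for `package`, without the 'sha256:' prefix."""
    hashes = lock[section][package]['hashes']
    return hashes[0].split(':', 1)[-1]


# Trimmed, illustrative lock content; not a complete Pipfile.lock
sample_lock = json.loads('''
{
  "default": {
    "scikit-learn": {
      "hashes": [
        "sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b"
      ],
      "version": "==1.0.2"
    }
  }
}
''')

print(first_hash(sample_lock, 'scikit-learn'))
```

For a real project, load the file with `json.load(open('Pipfile.lock'))` and pass the result to the same function.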

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two
parameters: year and month.

Run the script for March 2021.

What's the mean predicted duration?

* 11.29
* 16.29
* 21.29
* 26.29

Hint: just add a print statement to your script.
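
One possible way to accept the two parameters is `argparse` from the standard library. This is a sketch, not the actual `hw4-MG.py`: `parse_args` and `input_url` are hypothetical names, and the URL pattern is inferred from the February file used earlier in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the positional year and month arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='Score FHV trips for one month.')
    parser.add_argument('year', type=int)
    parser.add_argument('month', type=int)
    return parser.parse_args(argv)


def input_url(year, month):
    """Build the input file URL; the pattern is an assumption based on the URL above."""
    return (
        'https://nyc-tlc.s3.amazonaws.com/trip+data/'
        f'fhv_tripdata_{year:04d}-{month:02d}.parquet'
    )


# Passing a list stands in for `python script.py 2021 3`
args = parse_args(['2021', '3'])
print(input_url(args.year, args.month))
```

In the script itself, calling `parse_args()` with no argument reads the values from the command line.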



In [161]:
!python hw4-MG.py 2021 3

23:44:05.686 | INFO    | prefect.engine - Created flow run 'innocent-macaque' for flow 'ride-duration-prediction'
23:44:05.686 | INFO    | Flow run 'innocent-macaque' - Using task runner 'ConcurrentTaskRunner'
23:44:05.728 | INFO    | Flow run 'innocent-macaque' - Created task run 'apply_model-b21fdc82-0' for task 'apply_model'
23:44:05.753 | INFO    | Task run 'apply_model-b21fdc82-0' - reading the data from https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-03.parquet...
23:44:21.152 | INFO    | Task run 'apply_model-b21fdc82-0' - df columns: Index(['dispatching_base_num', 'pickup_datetime', 'dropOff_datetime',
       'PUlocationID', 'DOlocationID', 'SR_Flag', 'Affiliated_base_number',
       'duration', 'pickup_yyyy_mm', 'ride_id'],
      dtype='object') 
23:44:24.918 | INFO    | Task run 'apply_model-b21fdc82-0' - loading the model
23:44:25.571 | INFO    | Task run 'apply_model-b21fdc82-0' - applying the model...
23:44:27.181 | INFO    | Task run 'apply_model-b21fdc82-0' - Mean predicted duration = 16.30

Answer:
B) 16.29 (the closest option to the printed 16.30)

## Q6. Docker container

Finally, we'll package the script in the docker container.
For that, you'll need to use a base image that we prepared.

This is what it looks like:

```
FROM python:3.9.7-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to [`agrigorev/zoomcamp-model:mlops-3.9.7-slim`](https://hub.docker.com/layers/zoomcamp-model/agrigorev/zoomcamp-model/mlops-3.9.7-slim/images/sha256-7fac33c783cc6018356ce16a4b408f6c977b55a4df52bdb6c4d0215edf83af5d?context=explore),
which you should use as your base image.

That is, this is how your Dockerfile should start:

```docker
FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image.

Now run the script with docker. What's the mean predicted duration
for April 2021?


* 9.96
* 16.55
* 25.96
* 36.55



Updated Dockerfile:

In [160]:
!cat Dockerfile

FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

RUN pip install -U pip
RUN pip install pipenv

WORKDIR /app

COPY [ "Pipfile", "Pipfile.lock", "./" ]

RUN pipenv install --system --deploy

COPY [ "hw4-MG.py", "./" ]

ENTRYPOINT ["python", "hw4-MG.py"]


To run it and pass arguments, mount the local AWS credentials directory as a volume so that the Python script can access the credentials inside the container:

`docker run -it --rm -v /home/michal/.aws:/root/.aws mlops-zoomcamp-enkidupal:v1 2021 4`

This runs the flow, which logs the mean prediction:

21:29:54.266 | INFO    | Task run 'apply_model-b21fdc82-0' - Mean predicted duration = 9.97

Answer:
A) 9.96 (the closest option to the logged 9.97)

To upload the result to the cloud (S3 here):

1. Create a bucket in S3.
2. Update the script so that `output_file` is parametrized by `run_id` and points at the S3 bucket:

   `output_file = f's3://nyc-duration-prediction-enkidupal/taxi_type={taxi_type}/year={year:04d}/month={month:02d}/{run_id}.parquet'`

The result is then saved to S3 with:

`df_result.to_parquet(output_file, index=False)`
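
As an illustration of how the partitioned path is put together, the sketch below builds it with a small function. `build_output_file` is a hypothetical helper, `'fhv'` as the taxi type and a UUID as the run id are assumptions, and nothing is written to S3 here.

```python
import uuid


def build_output_file(bucket, taxi_type, year, month, run_id):
    """Build the S3 output path for one scoring run (hypothetical helper)."""
    return (
        f's3://{bucket}/taxi_type={taxi_type}'
        f'/year={year:04d}/month={month:02d}/{run_id}.parquet'
    )


run_id = str(uuid.uuid4())  # assumption: any unique string can serve as the run id
output_file = build_output_file(
    'nyc-duration-prediction-enkidupal', 'fhv', 2021, 4, run_id
)
print(output_file)
```

Writing to an `s3://` path with pandas typically requires the `s3fs` package in the environment, in addition to valid AWS credentials.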