In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.2.2


In [15]:
import pickle
import pandas as pd
import os

In [3]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [4]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

## Q1

The question asks for data from **February 2022**, so **year** is **2022** and **month** is **02**.

In [5]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet')

In [6]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [8]:
pred_std = y_pred.std()

In [9]:
print(f"Standard deviation of the predicted duration is {pred_std:.2f}")

Standard deviation of the predicted duration is 5.28


## Q2

In [10]:
year = 2022
month = 2
taxi_type = "yellow"

In [11]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

Only the **ride id** and the **predictions** are needed, so the new dataframe **df_result** is built from these two columns.
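
As a minimal sketch of the ride id format used above (`YYYY/MM_` followed by the row label), the helper below rebuilds the same strings without pandas. The name `make_ride_ids` and the sample row labels are hypothetical, chosen only for illustration.

```python
def make_ride_ids(year: int, month: int, row_labels) -> list[str]:
    """Build ride ids of the form 'YYYY/MM_<row label>'."""
    prefix = f"{year:04d}/{month:02d}_"
    return [prefix + str(label) for label in row_labels]


# The labels stand in for df.index. read_data filters rows without
# resetting the index, so the labels are not necessarily consecutive.
ride_ids = make_ride_ids(2022, 2, [0, 1, 3])
print(ride_ids)  # ['2022/02_0', '2022/02_1', '2022/02_3']
```

Because the ids are derived from the index, they stay unique as long as the index itself is unique.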

In [12]:
df_result = pd.DataFrame({"ride_id": df['ride_id'].values, "predictions": y_pred})

In [13]:
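output_file = f"output/output-{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # to_parquet does not create missing directories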

In [14]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [22]:
def get_filesize(file: str) -> float:
    """Return the size of `file` in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    file_stats = os.stat(file)
    return file_stats.st_size / (1024 * 1024)

In [23]:
print(f'File Size in MegaBytes is {get_filesize(output_file):.2f}')

File Size in MegaBytes is 57.22


## Q3

In [24]:
!jupyter nbconvert --to script deployment_assignment.ipynb

[NbConvertApp] Converting notebook deployment_assignment.ipynb to script
[NbConvertApp] Writing 2027 bytes to deployment_assignment.py


## Q4

A **specific version** of **scikit-learn** can be installed with **pipenv** from the command line:

<img src="img/q4-a.png">

These are the hashes of **scikit-learn** in **Pipfile.lock**:

<img src="img/q4-b.png" width=600>

The hash is **065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233**

## Q5

In [2]:
!python deployment_assignment.py 3 2022

Reading the data for 03-2022 (month/year)...
Making dictionaries...
Loading the model...
Predicting the duration...
The mean predicted duration for 03-2022 is 12.76


So the mean predicted duration for March 2022 is **12.76**.
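
The command above passes the month first and the year second. A minimal sketch of how a script could read those two arguments and build the input URL is shown below; the function names `parse_args` and `input_url` are hypothetical and not taken from the actual `deployment_assignment.py`, and the URL pattern is assumed from the February file read in Q1.

```python
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def parse_args(argv):
    """Return (month, year) from arguments given as: <month> <year>."""
    if len(argv) != 2:
        raise SystemExit("usage: deployment_assignment.py <month> <year>")
    month, year = int(argv[0]), int(argv[1])
    if not 1 <= month <= 12:
        raise SystemExit(f"month must be between 1 and 12, got {month}")
    return month, year


def input_url(taxi_type, year, month):
    """Build the trip-data URL, zero-padding the month to two digits."""
    return f"{BASE_URL}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"


# In a real script the list would be sys.argv[1:].
month, year = parse_args(["3", "2022"])
url = input_url("yellow", year, month)
print(url)
```

Validating the month up front gives a clear error before any download is attempted.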

## Q6

I built the Docker image **mlops-zoomcamp-hw4:v1** from this [Dockerfile](Dockerfile).

In [1]:
!docker run -it --rm --name april_duration_predict mlops-zoomcamp-hw4:v1 4 2022

Reading the data for 04-2022 (month/year)...
Making dictionaries...
Loading the model...
Predicting the duration...
The mean predicted duration for 04-2022 is 12.83


So the mean predicted duration for April 2022 is **12.83**.

## Bonus

After adding code to save the output data to S3, this [script](./deployment_assignment.py) can also push the output to S3.

- boto3 and s3fs need to be installed to work with S3.
- The **AWS CLI** needs to be configured so that AWS credentials such as **aws_access_key_id** and **aws_secret_access_key** are available.
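
With those two requirements in place, writing to S3 can look roughly like the sketch below. It is an illustration under assumptions, not the code from the script: the bucket name `my-example-bucket` and the helper names are hypothetical, and the key layout simply mirrors the local `output/` path used in Q2. pandas passes `s3://` paths to s3fs, which reads the credentials stored by the AWS CLI.

```python
def s3_output_path(bucket: str, taxi_type: str, year: int, month: int) -> str:
    """Build the S3 URI for the predictions file (bucket name is an assumption)."""
    return (
        f"s3://{bucket}/output/"
        f"output-{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
    )


def save_results(df_result, path: str) -> None:
    """Write the predictions dataframe; an s3:// path requires s3fs."""
    df_result.to_parquet(path, engine="pyarrow", compression=None, index=False)


path = s3_output_path("my-example-bucket", "yellow", 2022, 2)
print(path)
```

Calling `save_results(df_result, path)` would then upload the file, provided the bucket exists and the configured credentials have write access to it.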

The result of the run is shown below:

<img src="img/bonus-1.png">

The output is saved in S3:

<img src="img/bonus-2.png">

## Publishing the image to Docker Hub

- The Docker image had to be rebuilt so that the script pushes the output file to S3.
- To publish the image to Docker Hub, `docker login` must first be run from the command line.
- More detailed instructions are in my [readme file](README.md).

Pushing to Docker Hub:

<img src="img/bonus-3.png">

On Docker Hub:

<img src="img/bonus-4.png">