In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.1.1


In [2]:
import pickle
import pandas as pd

In [3]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [4]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # Keep only trips that lasted between 1 and 60 minutes
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Encode missing location IDs as -1, then cast the IDs to strings
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df
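
To see what the preprocessing in `read_data` keeps, here is a minimal sketch on a hand-made frame. The three rows and their timestamps are invented for illustration; only the column names and the transformation steps come from the cell above.

```python
import pandas as pd

# Toy frame: timestamps and location IDs are invented for illustration.
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2021-02-01 00:00:00'] * 3),
    'dropOff_datetime': pd.to_datetime([
        '2021-02-01 00:00:30',  # 0.5 min, dropped (shorter than 1 min)
        '2021-02-01 00:10:00',  # 10 min, kept
        '2021-02-01 02:00:00',  # 120 min, dropped (longer than 60 min)
    ]),
    'PUlocationID': [None, 12.0, 7.0],
    'DOlocationID': [5.0, None, 7.0],
})

cols = ['PUlocationID', 'DOlocationID']

# Same steps as read_data, applied to the toy frame.
toy['duration'] = (toy.dropOff_datetime - toy.pickup_datetime).dt.total_seconds() / 60
toy = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()
toy[cols] = toy[cols].fillna(-1).astype('int').astype('str')

print(toy[cols + ['duration']])
```

Only the 10-minute trip survives the filter, and its missing drop-off ID becomes the string `'-1'`.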

In [5]:
year = 2021
month = 2

df = read_data(f'https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_{year}-{month:02d}.parquet')

In [6]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)
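
The `dv` and `lr` objects come from `model.bin`, so their training is not visible here. As a rough sketch of what the transform-then-predict step does, the example below fits a throwaway `DictVectorizer` and `LinearRegression` on two invented records. The names `dv_demo` and `lr_demo`, the records, and the targets are all assumptions made for illustration, not the homework model.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Invented training records and durations, for illustration only.
train_dicts = [
    {'PUlocationID': '12', 'DOlocationID': '-1'},
    {'PUlocationID': '-1', 'DOlocationID': '7'},
]
y_train = [10.0, 20.0]

dv_demo = DictVectorizer()
X_train = dv_demo.fit_transform(train_dicts)  # one-hot encodes the string values
lr_demo = LinearRegression().fit(X_train, y_train)

# At scoring time only transform() is called; the unseen value '99'
# has no column in the fitted vocabulary and is silently ignored.
val_dicts = [{'PUlocationID': '12', 'DOlocationID': '99'}]
X_demo = dv_demo.transform(val_dicts)
y_demo = lr_demo.predict(X_demo)

print(dv_demo.feature_names_)
print(X_demo.toarray())
```

Casting the IDs to strings beforehand is what makes `DictVectorizer` treat them as categories rather than as numeric features.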

# Q1. Notebook

What's the mean predicted duration for this dataset?

In [8]:
y_pred.mean()

16.191691679979066

# Q2. Preparing the output

In [14]:
output_file = f'result_fhv_tripdata_{year}-{month:02d}.parquet'

In [9]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
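
A small sketch of how the `ride_id` expression behaves, using an invented three-row frame. Because the duration filter keeps the original row labels, the IDs stay unique but are not necessarily consecutive.

```python
import pandas as pd

year, month = 2021, 2

# Invented durations; the first row is filtered out, so the index starts at 1.
demo = pd.DataFrame({'duration': [0.5, 10.0, 30.0]})
demo = demo[demo.duration >= 1].copy()

# Same expression as in the cell above: a year/month prefix plus the row label.
demo['ride_id'] = f'{year:04d}/{month:02d}_' + demo.index.astype('str')

print(demo['ride_id'].tolist())
```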

In [12]:
df_result = pd.DataFrame()
df_result['ride_id'] = df['ride_id']
df_result['predictions'] = y_pred

In [15]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

What's the size of the output file?

In [19]:
!ls -l --block-size=M | grep result

-rw-rw-r-- 1 tim tim 19M Jun 26 11:22 result_fhv_tripdata_2021-02.parquet
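
The same check can be done from Python instead of `ls`. The sketch below is one way to do it; `size_in_mib` is a hypothetical helper, and rounding up is an assumption chosen to mimic how GNU `ls --block-size=M` appears to report sizes. The demo runs on a throwaway temporary file, not on the real predictions file.

```python
import math
import os
import tempfile


def size_in_mib(path):
    """Return the file size in MiB, rounded up to a whole number."""
    return math.ceil(os.path.getsize(path) / (1024 * 1024))


# Demo on a throwaway 1.5 MiB file of zero bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * (3 * 512 * 1024))

demo_size = size_in_mib(tmp.name)
os.remove(tmp.name)

print(f'{demo_size}M')
```

For the homework file, the call would be `size_in_mib(output_file)`.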


# Q3. Creating the scoring script

Which command do you need to execute for that?

In [21]:
!jupyter nbconvert --to script starter.ipynb

[NbConvertApp] Converting notebook starter.ipynb to script
[NbConvertApp] Writing 1575 bytes to starter.py


# Q4. Virtual environment

!pipenv --python=3.9 install scikit-learn==1.0.2 pandas pyarrow

What's the first hash for the Scikit-Learn dependency?

"scikit-learn": {
            "hashes": [
                "sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b",
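
The hash can also be read programmatically, since `Pipfile.lock` is a JSON file with packages listed under the `default` key. The sketch below parses a trimmed, hand-written stand-in for the lock file; only the hash value is taken from the output above, and a real lock file contains many more entries.

```python
import json

# Trimmed stand-in for Pipfile.lock; a real file has more hashes and packages.
lock_text = '''
{
  "default": {
    "scikit-learn": {
      "hashes": [
        "sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b"
      ],
      "version": "==1.0.2"
    }
  }
}
'''

lock = json.loads(lock_text)
first_hash = lock['default']['scikit-learn']['hashes'][0]

print(first_hash)
```

With the real file, `json.load(open('Pipfile.lock'))` would replace the inline string.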

# Q5. Parametrize the script

In [23]:
!python starter.py 2021 3

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
mean predicted duration 16.298821614015107
wrote predictions to result_fhv_tripdata_2021-03.parquet


What's the mean predicted duration? (The result is the same when the script is run in the correct environment.)

mean predicted duration 16.298821614015107
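
The contents of `starter.py` are not shown here, so the following is only a sketch of how the year and month might be read from the command line. The function names `parse_args` and `build_paths` are hypothetical; the URL and file-name patterns are the ones used in the notebook cells above.

```python
import sys


def parse_args(argv):
    """Read year and month from a command line like: starter.py 2021 3"""
    year = int(argv[1])
    month = int(argv[2])
    return year, month


def build_paths(year, month):
    """Build the input URL and output file name for the given period."""
    input_file = (
        'https://nyc-tlc.s3.amazonaws.com/trip+data/'
        f'fhv_tripdata_{year}-{month:02d}.parquet'
    )
    output_file = f'result_fhv_tripdata_{year}-{month:02d}.parquet'
    return input_file, output_file


# In the script itself this would be parse_args(sys.argv);
# an explicit list is used here so the sketch runs anywhere.
year, month = parse_args(['starter.py', '2021', '3'])
input_file, output_file = build_paths(year, month)

print(output_file)
```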

# Q6. Docker container

In [28]:
!docker build -t ride-duration:v1 .

Sending build context to Docker daemon  55.33MB
Step 1/6 : FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim
 ---> 8cffad87c549
Step 2/6 : RUN pip install -U pip
 ---> Using cache
 ---> 72446bb858f0
Step 3/6 : RUN pip install pipenv
 ---> Using cache
 ---> afb6542751a1
Step 4/6 : COPY [ "Pipfile", "Pipfile.lock", "starter.py", "./"]
 ---> 36bbee97e18d
Step 5/6 : RUN pipenv install --system --deploy
 ---> Running in 0856a5d75f36
Installing dependencies from Pipfile.lock (4076db)...
Removing intermediate container 0856a5d75f36
 ---> c24d9d7f3bae
Step 6/6 : ENTRYPOINT [ "python", "starter.py", "2021", "4" ]
 ---> Running in 165191e99164
Removing

In [29]:
!docker run -it --rm ride-duration:v1

mean predicted duration 9.967573179784523
wrote predictions to result_fhv_tripdata_2021-04.parquet
