In [33]:
!pip freeze | grep scikit-learn
!python -V

scikit-learn==1.6.1
Python 3.9.7


In [34]:
import pickle
import pandas as pd
import numpy as np
import os

# Ignore warnings...
import warnings
warnings.filterwarnings("ignore")

In [35]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [36]:
year = 2023
month = 3
taxi_type = 'yellow'

categorical = ['PULocationID', 'DOLocationID']

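def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes, from the pickup and dropoff timestamps.
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # Keep only trips lasting between 1 and 60 minutes (inclusive).
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Encode missing location IDs as -1, then cast the IDs to strings.
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    # Build a ride ID from the module-level year/month and the row index.
    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

    return df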

def prepare_dictionaries(df: pd.DataFrame):
    # Cast on a selection instead of assigning back, so the caller's
    # DataFrame is not modified; uses the module-level `categorical` list.
    dicts = df[categorical].astype(str).to_dict(orient='records')

    return dicts

input_file = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
output_file = f'output/{taxi_type}/{year:04d}-{month:02d}.parquet'

In [37]:
def apply_model(input_file, output_file):

    df = read_dataframe(input_file)
    dicts = prepare_dictionaries(df)

    X_val = dv.transform(dicts)
    y_pred = model.predict(X_val)
    
    # What's the standard deviation of the predicted duration for this dataset?
    print("The value of the standard deviation of the predicted duration for this dataset is: ", np.std(y_pred)) 

    df_result = pd.DataFrame()
    df_result['ride_id'] = df['ride_id']
    df_result['predicted_duration'] = y_pred
    
    # Ensure the output directory exists
    output_dir = os.path.dirname(output_file)
    os.makedirs(output_dir, exist_ok=True)

    df_result.to_parquet(
        output_file,
        engine='pyarrow',
        compression=None,
        index=False
    )

In [38]:
apply_model(input_file=input_file, output_file=output_file)

The value of the standard deviation of the predicted duration for this dataset is:  6.247488852238703


**Question 1** <br>
What's the standard deviation of the predicted duration for this dataset? <br>
Answer - about 6.25 (the run above prints 6.2475)
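
The figure printed by the notebook comes from `np.std`, which by default divides by N (population standard deviation, `ddof=0`), whereas pandas' `Series.std()` defaults to N-1. The two definitions can give slightly different numbers, so it helps to know which one is being reported. The sketch below illustrates the difference with the standard library only; the `preds` values are made up for illustration and are not model output.

```python
import statistics

# Hypothetical predictions, used only to illustrate the two definitions.
preds = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation: divides by N (what np.std does by default).
population_std = statistics.pstdev(preds)

# Sample standard deviation: divides by N - 1 (what pandas' .std() does by default).
sample_std = statistics.stdev(preds)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
```

On a dataset with millions of rows the gap between the two is negligible, but it can matter when comparing against a value computed with a different library.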

**Question 2** <br>
What's the size of the output file?

In [39]:
import os

file_size_bytes = os.path.getsize(output_file)
file_size_mb = file_size_bytes / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")

File size: 65.46 MB


**Q3. Creating the scoring script** <br>
Now let's turn the notebook into a script. Which command do you need to execute for that?

Answer - jupyter nbconvert --to script starter.ipynb

**Q4. Virtual environment**

Now let's put everything into a virtual environment. We'll use pipenv for that.

- Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

- After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency? <br>
Answer - sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b
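
Since Pipfile.lock is a JSON file, the hash can also be read programmatically instead of by scrolling through the file. The following is a minimal sketch: the helper name `first_hash` is hypothetical, and it assumes scikit-learn was installed as a regular (non-dev) dependency, so that it appears under the `default` section.

```python
import json


def first_hash(lock_path, package="scikit-learn", section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock.

    `first_hash` is an illustrative helper, not part of pipenv.
    """
    with open(lock_path) as f_in:
        lock = json.load(f_in)

    # Each locked package carries a "hashes" list and a pinned "version".
    return lock[section][package]["hashes"][0]
```

Calling `first_hash("Pipfile.lock")` from the project directory should return the same value as the first entry under `scikit-learn` in the lock file.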

**Q5. Parametrize the script** <br>
Let's now make the script configurable via CLI. We'll create two parameters: year and month.

- Run the script for April 2023.

What's the mean predicted duration? <br>
Answer - 14.29
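
One way to expose year and month on the command line is `argparse` from the standard library. This is a sketch under assumptions, not the script that produced the answer above: the function names `parse_args` and `build_paths` are hypothetical, and the path templates are copied from the notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse --year and --month from the command line (or from `argv`)."""
    parser = argparse.ArgumentParser(description="Score one month of taxi trips.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13))
    return parser.parse_args(argv)


def build_paths(year, month, taxi_type="yellow"):
    """Build the input URL and output path using the notebook's templates."""
    input_file = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
    )
    output_file = f"output/{taxi_type}/{year:04d}-{month:02d}.parquet"
    return input_file, output_file


# Example: simulate `python starter.py --year 2023 --month 4`.
args = parse_args(["--year", "2023", "--month", "4"])
input_file, output_file = build_paths(args.year, args.month)
print(input_file)
print(output_file)
```

In the converted script, `parse_args()` would be called without arguments so that it reads `sys.argv`, and `apply_model` would need to print `y_pred.mean()` to report the value asked for here.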