In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.6.1


In [2]:
!python -V

Python 3.12.1


In [3]:
import pickle
import pandas as pd

In [4]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [6]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename, year, month):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Generate ride_id
    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

    # Assuming 'categorical' is a predefined list of columns
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df


In [7]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet', year=2023, month=3)

In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [9]:
df['duration'].std()

np.float64(10.604646619926864)

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* **6.24**
* 12.28
* 18.28


In [10]:
y_pred.std()

np.float64(6.247488852238703)

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 36M
* 46M
* 56M
* **66M**

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`. 


In [14]:
df_result = pd.DataFrame({
    'ride_id': df['ride_id'].values,
    'prediction': y_pred
})


In [16]:
output_file = 'results.parquet'

In [18]:
import os

df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

# Get file size in bytes
file_size_bytes = os.path.getsize(output_file)

# Convert to megabytes (optional)
file_size_mb = file_size_bytes / (1024 * 1024)

print(f"Output file size: {file_size_mb:.2f} MB")


Output file size: 65.46 MB
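
The cell above divides by `1024 * 1024`, so the 65.46 figure is in mebibytes (MiB), the same 1024-based unit that tools such as `ls -lh` report, which is why it lines up with the 66M option. A minimal sketch of the two conventions, using a made-up byte count (`n_bytes` is a hypothetical value, not the size of the real output file):

```python
def size_in_mib(n_bytes: int) -> float:
    """Size in mebibytes: 1 MiB = 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


def size_in_mb(n_bytes: int) -> float:
    """Size in decimal megabytes: 1 MB = 1,000,000 bytes."""
    return n_bytes / 1_000_000


# Hypothetical file size, chosen only to make the difference visible
n_bytes = 10 * 1024 * 1024

print(f"{size_in_mib(n_bytes):.2f} MiB")
print(f"{size_in_mb(n_bytes):.2f} MB")
```

The same file therefore shows a slightly larger number when measured in decimal megabytes.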


## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

prepare_features  
predict
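
One common way to do the conversion is `jupyter nbconvert --to script starter.ipynb`; the generated script can then take the year and month as parameters instead of hard-coding them. The sketch below shows only that parameter handling. The names `parse_args` and `input_url` are hypothetical, and the URL pattern is copied from the `read_data` call in the notebook.

```python
import argparse

# URL pattern taken from the notebook's read_data call
URL_TEMPLATE = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{year:04d}-{month:02d}.parquet"
)


def input_url(year: int, month: int) -> str:
    """Build the download URL for one month of yellow taxi data."""
    return URL_TEMPLATE.format(year=year, month=month)


def parse_args(argv=None):
    """Parse --year and --month (hypothetical flag names for this sketch)."""
    parser = argparse.ArgumentParser(description="Score one month of taxi trips")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    return parser.parse_args(argv)


# Explicit argv here so the sketch runs anywhere; a real script would
# call parse_args() with no arguments to read from the command line.
args = parse_args(["--year", "2023", "--month", "3"])
print(input_url(args.year, args.month))
```

The rest of the script would reuse the notebook code unchanged: load the model, call `read_data`, predict, and write the results.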

## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?



In [None]:
sha256:0650e730afb87402baa88afbf31c07b84c98272622aaba002559b614600ca691
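
`Pipfile.lock` is a JSON file, so the hash can also be read programmatically rather than by eye. The sketch below assumes the package sits in the `default` section under the key `scikit-learn`; `first_hash` is a hypothetical helper, and `sample_lock` is a trimmed-down stand-in that contains only the hash and version shown in this notebook, not a full lock file.

```python
import json


def first_hash(lock: dict, package: str = "scikit-learn") -> str:
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock["default"][package]["hashes"][0]


# Trimmed-down stand-in for a real Pipfile.lock (a real file lists many
# hashes per package plus a "_meta" section).
sample_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": [
        "sha256:0650e730afb87402baa88afbf31c07b84c98272622aaba002559b614600ca691"
      ],
      "version": "==1.6.1"
    }
  }
}
""")

print(first_hash(sample_lock))

# With a real file:
# with open("Pipfile.lock") as f:
#     print(first_hash(json.load(f)))
```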