In [2]:
import pickle
import pandas as pd

In [9]:
!pip freeze | grep scikit-learn

scikit-learn==1.0.2


In [10]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

In [11]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df

In [5]:
df = read_data('https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet')

In [17]:
def predict(df, dv, lr):
    dicts = df[categorical].to_dict(orient='records')
    X_val = dv.transform(dicts)
    y_pred = lr.predict(X_val)

    return y_pred

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.

We cleaned it a little bit and kept only the scoring part. Now it's in [homework/starter.ipynb](homework/starter.ipynb).

Run this notebook for the February 2021 FHV data.

What's the mean predicted duration for this dataset?

* 11.19
* 16.19
* 21.19
* 26.19

In [18]:
y_pred = predict(df, dv, lr)
print(f'Mean predicted duration = {y_pred.mean():.2f}')

Mean predicted duration = 16.19


Answer:
B) 16.19

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
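
As a small illustration of what the snippet above produces, here is a sketch on a toy dataframe. The values of `year` and `month` are assumptions matching the February 2021 file, and the dataframe contents are made up.

```python
import pandas as pd

# Assumed values: the February 2021 file used in this homework.
year = 2021
month = 2

# Toy dataframe standing in for the real trip data (made-up rows and index).
df = pd.DataFrame({'PUlocationID': ['173', '-1', '82']}, index=[1, 4, 7])

# Prefix every index label with the zero-padded year and month.
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

print(df['ride_id'].tolist())
```

Each id is built from the row's index label, so the ids are unique as long as the index is.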

Next, write the ride id and the predictions to a dataframe with results.
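
One possible way to build that results dataframe is sketched below. The toy `df` and `y_pred` are made-up stand-ins for the notebook's variables, and the column name `y_pred` is only a naming choice.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the notebook's `df` and `y_pred`.
df = pd.DataFrame({'ride_id': ['2021/02_1', '2021/02_2', '2021/02_3']})
y_pred = np.array([14.5, 13.7, 15.6])

# Build a fresh frame rather than assigning into a slice of `df`;
# this sidesteps pandas' SettingWithCopyWarning.
df_result = pd.DataFrame({
    'ride_id': df['ride_id'].to_numpy(),
    'y_pred': y_pred,
})

print(df_result)
```

Constructing the frame from plain arrays keeps only the two required columns and leaves their dtypes as they were.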

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 9M
* 19M
* 29M
* 39M

Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use pyarrow, not fastparquet.


In [125]:
df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
1,B00021,2021-02-01 00:55:40,2021-02-01 01:06:20,173,82,,B00021,10.666667
2,B00021,2021-02-01 00:14:03,2021-02-01 00:28:37,173,56,,B00021,14.566667
3,B00021,2021-02-01 00:27:48,2021-02-01 00:35:45,82,129,,B00021,7.95
4,B00037,2021-02-01 00:12:50,2021-02-01 00:26:38,-1,225,,B00037,13.8
5,B00037,2021-02-01 00:00:37,2021-02-01 00:09:35,-1,61,,B00037,8.966667


In [132]:
def preprocess_data(df):
    # Work on a copy so the caller's dataframe is left untouched.
    df_copied = df.copy()

    # Derive the 'YYYY/MM' prefix from each pickup timestamp instead of
    # passing year and month in as parameters.
    df_copied['pickup_yyyy_mm'] = pd.to_datetime(df_copied['pickup_datetime']).dt.strftime('%Y/%m')

    # Vectorised string concatenation; a row-wise apply is not required here.
    df_copied['ride_id'] = df_copied['pickup_yyyy_mm'] + '_' + df_copied.index.astype('str')

    return df_copied


In [148]:
df_copied = preprocess_data(df)

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning.
df_result = df_copied[['ride_id']].copy()
df_result['y_pred'] = y_pred

df_result.head()


Unnamed: 0,ride_id,y_pred
1,2021/02_1,14.539865
2,2021/02_2,13.740422
3,2021/02_3,15.593339
4,2021/02_4,15.188118
5,2021/02_5,13.817206


In [142]:
df_result.to_parquet(
    'df_result.parquet',
    engine='pyarrow',
    compression=None,
    index=False
)

In [147]:
!du -BM df_result.parquet

19M	df_result.parquet
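
The same check can be done from Python. This is a minimal sketch in which `file_size_mib` is a hypothetical helper, not part of the original notebook. Note that `du -BM` reports disk usage rounded up to whole mebibytes, while `os.path.getsize` returns the exact byte count, so the two can differ slightly.

```python
import os
import tempfile

def file_size_mib(path):
    """Return the size of `path` in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)

# Demonstration on a throwaway 3 MiB file; in the notebook the path
# would be 'df_result.parquet'.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bin')
    with open(demo_path, 'wb') as f_out:
        f_out.write(b'\0' * (3 * 1024 * 1024))
    demo_size = file_size_mib(demo_path)

print(f'{demo_size:.1f} MiB')
```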


Answer:
B) 19M

## Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?



## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: check the starter notebook for details.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

Answer:
08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b
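
Rather than reading the lock file by eye, the hash can be looked up programmatically. The sketch below assumes the usual `Pipfile.lock` layout, where each package under `default` carries a `hashes` list of `sha256:`-prefixed strings. `first_hash` is a hypothetical helper, and the sample lock content is made up for illustration.

```python
import json

def first_hash(lock_data, package, section='default'):
    """Return the first hash listed for `package`, without the 'sha256:' prefix."""
    hashes = lock_data[section][package]['hashes']
    return hashes[0].split(':', 1)[-1]

# Made-up miniature lock file; real hashes are 64 hex characters long.
sample_lock = json.loads('''
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:aaaa", "sha256:bbbb"],
      "version": "==1.0.2"
    }
  }
}
''')

print(first_hash(sample_lock, 'scikit-learn'))
```

With a real project, the same function could be given the result of `json.load` on the opened `Pipfile.lock`.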