## Homework

The goal of this homework is to train a simple model that predicts the duration of a taxi ride, similar to what we did in this module.

In [1]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), but instead of "Green Taxi Trip Records", we'll use "Yellow Taxi Trip Records".

Download the data for January and February 2022.

In [None]:
!wget -P ../data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
!wget -P ../data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [79]:
df_train = pd.read_parquet('../data/yellow_tripdata_2022-01.parquet')

In [80]:
print(f'There are {len(df_train.columns)} columns')

There are 19 columns


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 41.45
* 46.45
* 51.45
* 56.45

In [81]:
# create field duration of taxi rides in minutes
df_train['duration'] = (df_train.tpep_dropoff_datetime - df_train.tpep_pickup_datetime).dt.total_seconds().div(60)

In [82]:
print(f'The standard deviation of the trip duration in January is {df_train.duration.std():,.2f} minutes.')

The standard deviation of the trip duration in January is 46.45 minutes.
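To make the duration computation concrete, here is a minimal sketch on a toy frame. The two timestamps are made up for illustration and do not come from the TLC data; only the column names match the dataset.

```python
import pandas as pd

# Toy data with hypothetical timestamps (not taken from the TLC dataset)
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2022-01-01 00:00:00', '2022-01-01 10:00:00']),
    'tpep_dropoff_datetime': pd.to_datetime(['2022-01-01 00:10:30', '2022-01-01 10:45:00']),
})

# Subtracting two datetime columns yields a timedelta column
delta = toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime

# Convert the timedelta to minutes, same as in the cell above
toy['duration'] = delta.dt.total_seconds() / 60

print(toy.duration.tolist())  # [10.5, 45.0]
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default, which is what the answer above is based on.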


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

In [83]:
df_train_filtered = df_train[df_train.duration.between(1, 60)]

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%

In [84]:
f'Records left: {df_train_filtered.shape[0] / df_train.shape[0]:.0%}'

'Records left: 98%'
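The filter relies on `Series.between` including both endpoints by default. A small sketch with made-up durations (an assumption for illustration, not real trip data) shows which values survive:

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to sit on and around the bounds
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 60.5])

# between() keeps both endpoints by default, so 1.0 and 60.0 are retained
mask = durations.between(1, 60)
kept = durations[mask]

fraction = len(kept) / len(durations)
print(kept.tolist())       # [1.0, 30.0, 60.0]
print(f'{fraction:.0%}')   # 60%
```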

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [88]:
categorical = ['PULocationID', 'DOLocationID']

train_dicts = df_train_filtered[categorical].astype(str).to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [89]:
print(f'The matrix has {X_train.shape[1]} columns.')

The matrix has 515 columns.


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [90]:
target = 'duration'
y_train = df_train_filtered[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

In [91]:
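# RMSE is the square root of the MSE; this avoids the `squared=False` argument, which newer scikit-learn releases deprecate
f'Train RMSE: {mean_squared_error(y_train, y_pred) ** 0.5:.2f}'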

'Train RMSE: 6.99'
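As a sanity check on what the metric measures, RMSE can be computed by hand as the square root of the mean squared residual. The sketch below uses made-up targets and predictions (not model output) and compares the manual value with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Manual RMSE: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(f'{rmse_manual:.2f}')   # 2.38
print(f'{rmse_sklearn:.2f}')  # 2.38
```

Because the target is in minutes, the RMSE is also in minutes, so a train RMSE of 6.99 means the typical error is roughly seven minutes.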

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [98]:
df_val = pd.read_parquet('../data/yellow_tripdata_2022-02.parquet')

In [99]:
target = 'duration'
df_val[target] = (df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime).dt.total_seconds().div(60)
df_val_filtered = df_val[df_val[target].between(1, 60)]

y_val = df_val_filtered[target].values

val_dicts = df_val_filtered[categorical].astype(str).to_dict(orient='records')
X_val = dv.transform(val_dicts)

y_pred = lr.predict(X_val)

In [100]:
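# RMSE is the square root of the MSE; this avoids the `squared=False` argument, which newer scikit-learn releases deprecate
f'Validation RMSE: {mean_squared_error(y_val, y_pred) ** 0.5:.2f}'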

'Validation RMSE: 7.79'
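The validation cells repeat the same preprocessing as the training cells. One way to avoid the duplication is a small helper; the function name `prepare` and the constant `CATEGORICAL` below are hypothetical and not part of the original notebook, and the toy frame uses made-up rides.

```python
import pandas as pd

CATEGORICAL = ['PULocationID', 'DOLocationID']  # hypothetical constant name


def prepare(df):
    """Compute duration, drop outliers, and return (feature dicts, target array)."""
    df = df.copy()  # leave the caller's frame untouched
    delta = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = delta.dt.total_seconds() / 60
    df = df[df.duration.between(1, 60)]
    dicts = df[CATEGORICAL].astype(str).to_dict(orient='records')
    return dicts, df.duration.values


# Toy frame with made-up rides lasting 15, 90 and 0.5 minutes
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(
        ['2022-02-01 08:00:00', '2022-02-01 09:00:00', '2022-02-01 10:00:00']),
    'tpep_dropoff_datetime': pd.to_datetime(
        ['2022-02-01 08:15:00', '2022-02-01 10:30:00', '2022-02-01 10:00:30']),
    'PULocationID': [1, 3, 5],
    'DOLocationID': [2, 4, 6],
})

dicts, y = prepare(toy)
print(dicts)       # [{'PULocationID': '1', 'DOLocationID': '2'}]
print(y.tolist())  # [15.0]
```

With such a helper, the training set would go through `dv.fit_transform(dicts)` and the validation set through `dv.transform(dicts)`, so that both share the columns learned from the training data.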