In [1]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Homework 1

The goal of this homework is to train a simple model that predicts the duration of a ride, similar to what we did in this module.

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

* 1054112
* 1154112 &check;
* 1254112
* 1354112

In [2]:
df = pd.read_parquet('./data/fhv_tripdata_2021-01.parquet')
df.shape

(1154112, 7)

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the average trip duration in January?

* 15.16
* 19.16 &check;
* 24.16
* 29.16

In [3]:
df['duration'] = df.dropOff_datetime - df.pickup_datetime
df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
df.duration.mean()

19.167224093791006
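
As a possible alternative to the row-wise `apply`, the same conversion can be sketched with the vectorized `.dt` accessor. The toy frame below only borrows the column names from the FHV data; its timestamps are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the FHV data (values are invented).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 08:15:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:17:00", "2021-01-01 08:45:30"]),
})

# Subtracting two datetime columns yields a timedelta Series;
# .dt.total_seconds() converts it to float seconds without a Python-level loop.
delta = trips["dropOff_datetime"] - trips["pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

durations = trips["duration"].tolist()
print(durations)
```

On a frame with over a million rows this usually runs noticeably faster than calling a lambda per row, while producing the same minutes.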

## Data preparation

Check the distribution of the duration variable. There are some outliers. 

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop? 

In [4]:
df = df[(df.duration >= 1) & (df.duration <= 60)]
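
The cell above filters the frame but does not report how many rows were removed. One way to inspect the tails and count the dropped records is sketched below on a small made-up frame (`toy` and its values are assumptions, not the real data).

```python
import pandas as pd

# Invented durations in minutes, including values outside the 1..60 range.
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.3, 60.0, 75.0, 423.0]})

# Upper percentiles give a quick view of how heavy the tail is.
summary = toy["duration"].describe(percentiles=[0.95, 0.98, 0.99])
print(summary)

# Series.between is inclusive on both ends by default, matching the 1..60 rule.
mask = toy["duration"].between(1, 60)
n_dropped = int((~mask).sum())   # rows that fall outside the range
share_kept = float(mask.mean())  # fraction of rows that survive the filter

print(n_dropped, share_kept)
```

Applied to the real data, the same count is the difference between `len(df)` before and after the filter.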

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs. 

Both columns have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? That is, the fraction of "-1"s after you fill the NAs.

* 53%
* 63%
* 73%
* 83%  &check;

In [5]:
df.loc[:, ['PUlocationID', 'DOlocationID']] = df[['PUlocationID', 'DOlocationID']].fillna(value=-1)
df.PUlocationID.value_counts(normalize=True)[-1]

0.8352732770722617
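
The same fraction can also be read off before filling, since the mean of a boolean mask is the share of `True` values. A minimal sketch on an invented column (the `toy` frame is an assumption):

```python
import numpy as np
import pandas as pd

# Invented pickup IDs: three of four are missing.
toy = pd.DataFrame({"PUlocationID": [np.nan, 7.0, np.nan, np.nan]})

# Share of missing values before filling.
frac_missing = float(toy["PUlocationID"].isna().mean())

# Share of the -1 sentinel after filling; it should agree with the value above
# as long as -1 never occurs as a genuine location ID.
filled = toy["PUlocationID"].fillna(-1)
frac_sentinel = float((filled == -1).mean())

print(frac_missing, frac_sentinel)
```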

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (the number of columns)?

* 2
* 152
* 352
* 525 &check;
* 725


In [6]:
dicts = df[['PUlocationID', 'DOlocationID']].astype(str).to_dict(orient='records')
dv = DictVectorizer()
X = dv.fit_transform(dicts)
y = df.duration.values
X.shape

(1109826, 525)
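
To see where the column count comes from, here is a small sketch on three invented records. `DictVectorizer` creates one column per distinct `key=value` pair when the values are strings, which is why the IDs are cast with `astype(str)` first; numeric values would instead be kept as a single numeric column per key.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; IDs are strings so that each distinct value gets its own column.
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "7.0"},
    {"PUlocationID": "12.0", "DOlocationID": "7.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "33.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)  # sparse matrix, one row per record

# Two distinct pickup IDs plus two distinct dropoff IDs give four columns.
print(X_toy.shape)
print(dv_toy.feature_names_)
```

By the same logic, the 525 columns in the real matrix correspond to the distinct pickup values plus the distinct dropoff values seen in January.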

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 5.52
* 10.52 &check;
* 15.52
* 20.52

In [7]:
lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict(X)
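# RMSE as the square root of MSE; this avoids the squared=False argument,
# which newer scikit-learn releases no longer accept.
mean_squared_error(y, y_pred) ** 0.5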

10.528519429348535
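
For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as the target (minutes here). A minimal NumPy sketch with invented numbers:

```python
import numpy as np

# Invented targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3.
rmse = float(np.sqrt(np.mean((y_true - y_hat) ** 2)))
print(rmse)
```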

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?

* 6.01
* 11.01  &check;
* 16.01
* 21.01

In [8]:
df_valid = pd.read_parquet('./data/fhv_tripdata_2021-02.parquet')
df_valid['duration'] = df_valid.dropOff_datetime - df_valid.pickup_datetime
df_valid.duration = df_valid.duration.apply(lambda td: td.total_seconds() / 60)
df_valid = df_valid[(df_valid.duration >= 1) & (df_valid.duration <= 60)]
df_valid.loc[:, ['PUlocationID', 'DOlocationID']] = df_valid[['PUlocationID', 'DOlocationID']].fillna(value=-1)
dicts_valid = df_valid[['PUlocationID', 'DOlocationID']].astype(str).to_dict(orient='records')

X_valid = dv.transform(dicts_valid)
y_valid = df_valid.duration.values
y_pred = lr.predict(X_valid)
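# RMSE as the square root of MSE; this avoids the squared=False argument,
# which newer scikit-learn releases no longer accept.
mean_squared_error(y_valid, y_pred) ** 0.5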

11.014286123189528
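
The validation cell repeats the January preprocessing line by line. One way to keep the two in sync is to wrap the steps in a helper; the sketch below uses a hypothetical `prepare_features` function and an invented in-memory frame instead of the parquet files.

```python
import numpy as np
import pandas as pd

CATEGORICAL = ["PUlocationID", "DOlocationID"]


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the duration, outlier, and NA steps to any month."""
    df = df.copy()
    delta = df["dropOff_datetime"] - df["pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60
    # Keep rides between 1 and 60 minutes, inclusive.
    df = df[df["duration"].between(1, 60)].copy()
    # Fill missing IDs with -1, then cast to str so they are one-hot encoded later.
    df[CATEGORICAL] = df[CATEGORICAL].fillna(-1).astype(str)
    return df


# Invented rides: 20 minutes (kept), 0 minutes (dropped), 120 minutes (dropped).
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-02-01 10:00", "2021-02-01 11:00", "2021-02-01 12:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-02-01 10:20", "2021-02-01 11:00", "2021-02-01 14:00"]),
    "PUlocationID": [np.nan, 5.0, 9.0],
    "DOlocationID": [3.0, np.nan, 9.0],
})

prepared = prepare_features(toy)
dicts_toy = prepared[CATEGORICAL].to_dict(orient="records")
print(dicts_toy)
```

With such a helper, the training and validation months would go through identical steps, and only `dv.fit_transform` versus `dv.transform` would differ between them.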