## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [30]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt

In [8]:
df_yel_jan_2022_train = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet')
df_yel_feb_2022_val = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet')


In [9]:
df_yel_jan_2022_train.shape

(2463931, 19)

In [11]:
df_yel_jan_2022_train.shape[1]  # number of columns

19

ans: 19
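
The two files above follow the same URL pattern, so the download step can be parameterized. A minimal sketch, assuming the pattern seen in the January and February 2022 URLs also holds for other months; `trip_data_url` is a hypothetical helper, not part of any library:

```python
def trip_data_url(color: str, year: int, month: int) -> str:
    """Build the parquet URL for one month of trip data.

    Assumes the URL pattern used for the two files in this homework.
    """
    base = "https://d37ci6vzurychx.cloudfront.net/trip-data"
    # Months are zero-padded to two digits, e.g. 2022-01.
    return f"{base}/{color}_tripdata_{year}-{month:02d}.parquet"


train_url = trip_data_url("yellow", 2022, 1)
val_url = trip_data_url("yellow", 2022, 2)
print(train_url)
print(val_url)
```

Passing either URL to `pd.read_parquet` loads the corresponding month.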




## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 41.45
* 46.45
* 51.45
* 56.45

In [15]:
# Duration in minutes: dropoff minus pickup gives a timedelta, converted via total_seconds()
df_yel_jan_2022_train['duration'] = (df_yel_jan_2022_train.tpep_dropoff_datetime - df_yel_jan_2022_train.tpep_pickup_datetime).dt.total_seconds() / 60

# Same computation for the validation month
df_yel_feb_2022_val['duration'] = (df_yel_feb_2022_val.tpep_dropoff_datetime - df_yel_feb_2022_val.tpep_pickup_datetime).dt.total_seconds() / 60


In [16]:
df_yel_jan_2022_train['duration'].std()

46.44530513776499

ans: 46.45
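
To see what the conversion does, here is a small sketch on made-up timestamps (the three rows are illustrative, not taken from the dataset). Subtracting two datetime columns gives a timedelta column, and `.dt.total_seconds()` turns it into seconds:

```python
import pandas as pd

# Toy data: three made-up rides, not rows from the real dataset.
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 00:10:00", "2022-01-01 01:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:15:00", "2022-01-01 00:40:30", "2022-01-01 01:00:30"]
    ),
})

# Datetime subtraction yields timedeltas; convert them to minutes.
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]
rides["duration"] = delta.dt.total_seconds() / 60

print(rides["duration"].tolist())  # [15.0, 30.5, 0.5]
```

Note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default.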

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 3.80 | 1.0 | N | 142 | 236 | 1 | 14.50 | 3.0 | 0.5 | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 | 0.0 |
| 1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 2.10 | 1.0 | N | 236 | 42 | 1 | 8.00 | 0.5 | 0.5 | 4.00 | 0.0 | 0.3 | 13.30 | 0.0 | 0.0 |
| 2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 0.97 | 1.0 | N | 166 | 166 | 1 | 7.50 | 0.5 | 0.5 | 1.76 | 0.0 | 0.3 | 10.56 | 0.0 | 0.0 |
| 3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 1.09 | 1.0 | N | 114 | 68 | 2 | 8.00 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 11.80 | 2.5 | 0.0 |
| 4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 4.30 | 1.0 | N | 68 | 163 | 1 | 23.50 | 0.5 | 0.5 | 3.00 | 0.0 | 0.3 | 30.30 | 2.5 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2463926 | 2 | 2022-01-31 23:36:53 | 2022-01-31 23:42:51 | | 1.32 | | | 90 | 170 | 0 | 8.00 | 0.0 | 0.5 | 2.39 | 0.0 | 0.3 | 13.69 | | |
| 2463927 | 2 | 2022-01-31 23:44:22 | 2022-01-31 23:55:01 | | 4.19 | | | 107 | 75 | 0 | 16.80 | 0.0 | 0.5 | 4.35 | 0.0 | 0.3 | 24.45 | | |
| 2463928 | 2 | 2022-01-31 23:39:00 | 2022-01-31 23:50:00 | | 2.10 | | | 113 | 246 | 0 | 11.22 | 0.0 | 0.5 | 2.00 | 0.0 | 0.3 | 16.52 | | |
| 2463929 | 2 | 2022-01-31 23:36:42 | 2022-01-31 23:48:45 | | 2.92 | | | 148 | 164 | 0 | 12.40 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 15.70 | | |





## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%



In [17]:
df_yel_jan_2022_train = df_yel_jan_2022_train[(df_yel_jan_2022_train.duration >= 1) & (df_yel_jan_2022_train.duration <= 60)]
df_yel_feb_2022_val = df_yel_feb_2022_val[(df_yel_feb_2022_val.duration >= 1) & (df_yel_feb_2022_val.duration <= 60)]


In [21]:
original_len = 2463931  # number of January rows before filtering (see Q1)
print(f"fraction {(len(df_yel_jan_2022_train)/original_len)*100}")

fraction 98.27547930522405


ans: 98%
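
The same filter can also be written with `Series.between`, which includes both endpoints by default, and the kept fraction is then the mean of the boolean mask. A sketch on made-up durations (the values are illustrative only):

```python
import pandas as pd

# Made-up durations in minutes, including values outside the valid range.
durations = pd.Series([0.5, 1.0, 12.3, 45.0, 60.0, 75.2, 300.0, 8.8])

# between() includes both endpoints by default, matching the 1..60 rule.
mask = durations.between(1, 60)

# The mean of a boolean mask is the fraction of True values.
fraction_kept = mask.mean()
print(f"kept {mask.sum()} of {len(durations)} rides ({fraction_kept:.1%})")
```

Computing the fraction from the mask before filtering avoids hard-coding the original row count.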


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715


In [28]:
categorical = ['PULocationID', 'DOLocationID']

target = 'duration'
y_train = df_yel_jan_2022_train[target].values
y_val = df_yel_feb_2022_val[target].values


dv = DictVectorizer()

# Cast the IDs to strings so DictVectorizer one-hot encodes them.
# Doing it here, rather than assigning back into the filtered DataFrames,
# avoids pandas' SettingWithCopyWarning.
train_dicts = df_yel_jan_2022_train[categorical].astype(str).to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

# Reuse the vectorizer fitted on January so both matrices share the same columns
val_dicts = df_yel_feb_2022_val[categorical].astype(str).to_dict(orient='records')
X_val = dv.transform(val_dicts)


In [29]:
X_train.shape

(2421440, 515)

In [36]:
X_train.shape[1]

515

ans: 515



## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [34]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in newer scikit-learn versions
np.sqrt(mean_squared_error(y_train, y_pred))

6.986190686400816

ans: 6.99
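
RMSE is the square root of the mean squared difference between predictions and targets. A sketch that computes it by hand on made-up numbers and checks the result against scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE by definition: sqrt(mean((y - y_hat)^2)).
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same value via scikit-learn's MSE followed by a square root.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the target (minutes here) and penalizes large misses more heavily than small ones.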




## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [35]:

y_pred = lr.predict(X_val)

# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in newer scikit-learn versions
np.sqrt(mean_squared_error(y_val, y_pred))

7.7864076631030095

ans: 7.79
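
One detail worth knowing when applying the fitted vectorizer to February: `DictVectorizer.transform` silently ignores feature values it did not see during `fit`, so the validation matrix keeps the same columns as the training matrix. A sketch with made-up IDs:

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on made-up "training" records.
dv_demo = DictVectorizer()
dv_demo.fit([
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "2", "DOLocationID": "1"},
])

# "Validation" record with a pickup ID ("99") never seen during fit.
X_new = dv_demo.transform([{"PULocationID": "99", "DOLocationID": "2"}])

print(X_new.shape)      # (1, 4): same columns as at fit time
print(X_new.toarray())  # only the DOLocationID=2 column is set
```

A ride whose pickup or dropoff ID is new in February therefore gets an all-zero encoding for that feature rather than raising an error.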



## Submit the results

* Submit your results here: https://forms.gle/uYTnWrcsubi2gdGV7
* You can submit your solution multiple times; only the last submission will be used
* If your answer doesn't match the options exactly, select the closest one


## Deadline

The deadline for submitting is 23 May 2023 (Tuesday), 23:00 CEST (Berlin time). 

After that, the form will be closed.