## Q1. Downloading the data
- We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

- Download the data for January and February 2021.

- Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

- Read the data for January. How many records are there?

In [69]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [70]:
df = pd.read_parquet('data/fhv_tripdata_2021-01.parquet')

In [71]:
initial_length = len(df)

In [72]:
initial_length

1154112

## Q2. Computing duration
- Now let's compute the duration variable. It should contain the duration of a ride in minutes.

- What's the average trip duration in January?

In [73]:
df['duration'] = (df.dropOff_datetime - df.pickup_datetime).dt.total_seconds() / 60

In [74]:
round(df.duration.mean(),3)

19.167
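
As a small illustration of the duration computation, the sketch below builds a toy frame with the same column names as the FHV data (the timestamps are invented) and converts the timedelta to minutes with the `.dt` accessor.

```python
import pandas as pd

# Toy data: invented timestamps, same column names as the FHV files.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 10:15:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:30:00", "2021-01-01 10:20:30"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts it to float seconds, which we divide by 60.
toy["duration"] = (toy["dropOff_datetime"] - toy["pickup_datetime"]).dt.total_seconds() / 60

print(toy["duration"].tolist())  # [30.0, 5.5]
```

The vectorized `.dt.total_seconds()` avoids calling a Python lambda once per row, which tends to matter on a frame with over a million records.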

## Data preparation
- Check the distribution of the duration variable. There are some outliers.

- Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

- How many records did you drop?

In [75]:
# Keep rides lasting between 1 and 60 minutes (inclusive)
df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
cleaned_length = len(df)
# Number of records dropped by the filter
initial_length - cleaned_length

44286
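
One way to check the distribution before filtering is to look at summary statistics with a few upper percentiles. The sketch below uses made-up durations rather than the real data, and applies the same inclusive 1 to 60 minute filter.

```python
import pandas as pd

# Made-up durations in minutes, including obvious outliers.
durations = pd.Series([0.2, 1.0, 7.5, 12.0, 25.0, 60.0, 61.0, 1440.0], name="duration")

# describe() with extra percentiles gives a quick view of the tails.
print(durations.describe(percentiles=[0.5, 0.95, 0.99]))

# Series.between is inclusive on both ends by default,
# matching the (duration >= 1) & (duration <= 60) mask.
mask = durations.between(1, 60)
kept = durations[mask]
dropped = len(durations) - len(kept)

print(f"kept={len(kept)}, dropped={dropped}, share kept={mask.mean():.3f}")
```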

## Q3. Missing values
- The features we'll use for our model are the pickup and dropoff location IDs.

- Many of these values are missing, so let's replace the missing ones with "-1".

- What's the fraction of missing values for the pickup location ID, i.e. the fraction of "-1"s after filling the NAs?

In [100]:
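# Replace missing location IDs with -1 and store them as integers
df.PUlocationID = df.PUlocationID.fillna(-1).astype('int')
df.DOlocationID = df.DOlocationID.fillna(-1).astype('int')
# Share of pickup IDs equal to -1 (.loc makes the label-based lookup explicit)
df.PUlocationID.value_counts().loc[-1] / len(df)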

0.8352732770722617
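
The fraction of "-1"s equals the fraction of missing values only if -1 did not already appear in the column. A minimal sketch on invented IDs computes both quantities so they can be compared:

```python
import numpy as np
import pandas as pd

# Toy pickup IDs: NaN marks a missing location (values are invented).
pu = pd.Series([np.nan, 12.0, np.nan, 7.0, np.nan], name="PUlocationID")

# Fraction missing before filling: mean of a boolean mask.
frac_na = pu.isna().mean()

# Fill with -1, cast to int, then measure the share of -1 values.
pu_filled = pu.fillna(-1).astype("int")
frac_minus_one = (pu_filled == -1).mean()

print(frac_na, frac_minus_one)
```

Taking the mean of a boolean mask is an alternative to dividing a `value_counts()` entry by `len(df)`; both give the same share.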

## Q4. One-hot encoding
- Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it
- What's the dimensionality of this matrix (the number of columns)?

In [101]:
dv = DictVectorizer()

categorical = ['PUlocationID','DOlocationID']

# Cast IDs to strings so DictVectorizer treats them as categories, not numbers
df_dict = df[categorical].astype('str').to_dict(orient='records')
X_train = dv.fit_transform(df_dict)

In [102]:
X_train.shape  # (rows, columns): the second number is the number of features

(1109826, 525)
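
The cast to strings matters here: `DictVectorizer` one-hot encodes string values, while numeric values are passed through as a single numeric column per key. The sketch below, on a few invented records, shows how each distinct `key=value` pair becomes its own column.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; IDs are strings so each distinct value gets its own column.
records = [
    {"PUlocationID": "-1", "DOlocationID": "10"},
    {"PUlocationID": "5", "DOlocationID": "10"},
    {"PUlocationID": "-1", "DOlocationID": "20"},
]

toy_dv = DictVectorizer()
X_toy = toy_dv.fit_transform(records)  # sparse matrix by default

print(X_toy.shape)            # (3, 4)
print(toy_dv.feature_names_)  # one "key=value" name per column, sorted
```

Two distinct pickup values plus two distinct dropoff values give four columns. The 525 columns above arise the same way, from the distinct pickup and dropoff IDs in the January data.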

## Q5. Training a model
- Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data
- What's the RMSE on train?

In [103]:
lr = LinearRegression()
target = 'duration'
y_train = df[target].values
lr.fit(X_train,y_train)
y_pred = lr.predict(X_train)

In [104]:
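# squared=False returns the RMSE; the argument was deprecated in scikit-learn 1.4,
# where root_mean_squared_error was added as its replacement
mean_squared_error(y_train, y_pred, squared=False)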

10.528519424941802
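
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A version-independent sketch with NumPy, using invented numbers, looks like this:

```python
import numpy as np

# Invented targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE = sqrt(mean((y - y_hat)^2)); squared errors here are 4, 4 and 9.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(float(rmse), 4))
```

Because the target is in minutes, the RMSE is also in minutes, so a training RMSE of about 10.5 is the model's root-mean-square error on January rides, in minutes.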

## Q6. Evaluating the model
- Now let's apply this model to the validation dataset (Feb 2021).

- What's the RMSE on validation?

In [105]:
df_val = pd.read_parquet('data/fhv_tripdata_2021-02.parquet')

In [106]:
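def clean_data(df):
    """Compute duration, drop outliers, and fill missing location IDs."""
    df = df.copy()  # avoid modifying the caller's dataframe
    df['duration'] = (df.dropOff_datetime - df.pickup_datetime).dt.total_seconds() / 60

    # Keep rides between 1 and 60 minutes (inclusive)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Missing IDs become "-1"; strings so DictVectorizer one-hot encodes them
    df.PUlocationID = df.PUlocationID.fillna(-1).astype('int').astype('str')
    df.DOlocationID = df.DOlocationID.fillna(-1).astype('int').astype('str')

    return df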

In [107]:
df_val = clean_data(df_val)

In [108]:
val_dict = df_val[categorical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [109]:
y_val = df_val.duration.values
y_pred = lr.predict(X_val)

In [110]:
mean_squared_error(y_val,y_pred,squared=False)

11.014287281120989
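
Note that the validation data is passed through `dv.transform`, not `fit_transform`, so the columns stay aligned with what the model was trained on. The toy sketch below (invented IDs) shows what happens to a category that only appears in validation: `DictVectorizer` silently ignores it, and the row gets zeros for that feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; "999" appears only in the validation set.
train_records = [{"PUlocationID": "-1"}, {"PUlocationID": "5"}]
val_records = [{"PUlocationID": "5"}, {"PUlocationID": "999"}]

toy_dv = DictVectorizer(sparse=False)
toy_dv.fit(train_records)

# transform() reuses the columns learned in fit(); unseen values are dropped.
X_val_toy = toy_dv.transform(val_records)

print(toy_dv.feature_names_)  # ['PUlocationID=-1', 'PUlocationID=5']
print(X_val_toy)              # second row is all zeros
```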