Download the Yellow Taxi Trip Records data for January and February 2022.

In [199]:
# EDA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Model persistence
import pickle

In [200]:
# read in the January parquet (training data)

df_jan = pd.read_parquet('./data/yellow_tripdata_2022-01.parquet')

# February is loaded further down via read_dataframe() and used for validation
# df_feb = pd.read_parquet('./data/yellow_tripdata_2022-02.parquet')

# dfs = [df_jan, df_feb]

# df_main = pd.concat(dfs)

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

**What's the standard deviation of the trip durations in January?**

1) 41.45

2) 46.45 <--- CORRECT ANSWER, SEE BELOW

3) 51.45

4) 56.45

In [202]:
# create duration field
df_jan['duration'] = df_jan.tpep_dropoff_datetime - df_jan.tpep_pickup_datetime

# convert duration into minutes as requested (vectorized via the .dt accessor)
df_jan['duration'] = df_jan['duration'].dt.total_seconds() / 60

In [203]:
round(df_jan.duration.std(), 2)

46.45

Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

**What fraction of the records is left after you drop the outliers?**

1) 90%

2) 92%

3) 95%

4) 98% <-- CORRECT ANSWER, SEE BELOW
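Before filtering, it can help to look at the tails of the distribution. The following is a minimal sketch on made-up durations (not the taxi data) that prints summary percentiles and computes the share of values inside the 1 to 60 minute window.

```python
import pandas as pd

# Made-up durations in minutes, including a few extreme values
durations = pd.Series([0.2, 0.9, 1.0, 5.0, 12.5, 30.0, 60.0, 61.0, 240.0, 1440.0])

# Summary statistics with extra percentiles to expose the tails
summary = durations.describe(percentiles=[0.01, 0.5, 0.95, 0.99])
print(summary)

# between() includes both endpoints by default, matching the
# "between 1 and 60 minutes (inclusive)" requirement
in_range = durations.between(1, 60)

# The mean of a boolean mask is the fraction of True values
fraction_kept = in_range.mean()
print(f"Fraction within [1, 60] minutes: {fraction_kept:.2f}")
```

The same pattern should apply to `df_jan.duration`; taking the mean of the mask avoids building the filtered DataFrame just to count its rows.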

In [204]:
df_jan_1_to_60 = df_jan[(df_jan.duration >= 1) & (df_jan.duration <= 60)]

In [205]:
# compute the fraction of records kept after filtering to 1-60 minutes
print("Fraction of data left after removing outliers: ", round(len(df_jan_1_to_60) / len(df_jan), 2))

Fraction of data left after removing outliers:  0.98


Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

**What's the dimensionality of this matrix (number of columns)?**

1) 2

2) 155

3) 345

4) 515 <-- CORRECT ANSWER, SEE BELOW

5) 715

In [206]:
# use only the pickup and dropoff location IDs as model features
categorical = ['PULocationID', 'DOLocationID']
# duration is the prediction target, not an input feature
numerical = ['duration']

In [207]:
# take an explicit copy of the filtered frame to avoid SettingWithCopyWarning,
# then cast the location IDs to strings so DictVectorizer one-hot encodes them
df_jan_1_to_60 = df_jan_1_to_60.copy()
df_jan_1_to_60[categorical] = df_jan_1_to_60[categorical].astype(str)


In [208]:
# Turn the dataframe into a list of dictionaries
train_dicts = df_jan_1_to_60[categorical].to_dict(orient='records')

In [209]:
# Fit a dictionary vectorizer
# Get a feature matrix from it
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [220]:
print("The dimensionality of this matrix (number of columns) is: ", X_train.shape[1])

The dimensionality of this matrix (number of columns) is:  515


In [210]:
# get labels for training
y_train = df_jan_1_to_60['duration'].values

Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

**What's the RMSE on train?**

1) 6.99 <-- CORRECT ANSWER, SEE BELOW

2) 11.99

3) 16.99

4) 21.99

In [211]:
# create and train LR model on training data
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

In [212]:
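# take the square root of the MSE explicitly; the squared=False argument
# is deprecated and no longer available in recent scikit-learn releases
print("The RMSE on train is:", round(mean_squared_error(y_train, y_pred) ** 0.5, 2))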

The RMSE on train is: 6.99
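For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers; the helper name `rmse` is illustrative and not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up durations (minutes) and predictions
y_true_toy = [10.0, 20.0, 30.0]
y_pred_toy = [12.0, 18.0, 33.0]

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3
toy_rmse = rmse(y_true_toy, y_pred_toy)
print(round(toy_rmse, 2))  # 2.38
```

Because RMSE is in the same units as the target, a value of 6.99 means the model's predictions are off by roughly seven minutes in a root-mean-square sense.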


Now let's apply this model to the validation dataset (February 2022).

**What's the RMSE on validation?**

1) 7.79 <-- CORRECT ANSWER, SEE BOTTOM

2) 12.79

3) 17.79

4) 22.79

In [213]:
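# helper that loads a month of data and applies the same preprocessing as the training set
def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)

        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
    else:
        # fail early instead of hitting an undefined `df` below
        raise ValueError(f'Unsupported file type: {filename}')

    # trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df['duration'].dt.total_seconds() / 60

    # keep trips between 1 and 60 minutes (inclusive); copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # cast location IDs to strings so DictVectorizer one-hot encodes them
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df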

In [214]:
df_feb_val = read_dataframe('./data/yellow_tripdata_2022-02.parquet')
df_feb_val.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'duration'],
      dtype='object')

In [215]:
# convert into dicts, then transform with the vectorizer already fitted on January (do not refit)
val_dicts = df_feb_val[categorical].to_dict(orient='records')

X_val = dv.transform(val_dicts)
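One consequence of reusing the fitted vectorizer is worth illustrating: `DictVectorizer.transform` silently ignores feature values it did not see during fitting, so an unseen location ID becomes an all-zero row for that feature. A minimal sketch with made-up IDs:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up training and validation records; "99" never appears in training
train_records = [{"PULocationID": "10"}, {"PULocationID": "20"}]
val_records = [{"PULocationID": "10"}, {"PULocationID": "99"}]

dv_demo = DictVectorizer()
X_tr = dv_demo.fit_transform(train_records)

# transform() keeps the column layout learned during fit
X_va = dv_demo.transform(val_records)

print(X_va.shape)      # same number of columns as X_tr
print(X_va.toarray())  # second row is all zeros: "99" was not seen in fit
```

This keeps `X_val` column-compatible with the trained model, at the cost of the model falling back to its intercept contribution for any location ID that appears only in February.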

In [216]:
# get target variable from the validation set
y_val = df_feb_val['duration'].values

In [217]:
# predict on the validation set with the model trained on January
y_pred_val = lr.predict(X_val)

In [218]:
# observe RMSE (square root of the MSE, taken explicitly for compatibility with recent scikit-learn)
print("The RMSE on the February validation set is:", round(mean_squared_error(y_val, y_pred_val) ** 0.5, 2))

The RMSE on the February validation set is: 7.79
