This is the homework notebook for module 01 (intro) of the MLOps Zoomcamp 2024 cohort.
The questions are available at 'https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2024/01-intro/homework.md'

To understand the fields, see the data dictionary at 'https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf'

----------------------------------------------------------------------------------------

In [17]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer # for one hot encoding
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [18]:
# !pip install pyarrow

Now we load the New York City Yellow Taxi Trip Records for January 2023 (used for training, as df) and February 2023 (used for evaluation, as df_test):
'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'
'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet'

In [19]:
df = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
df_test = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet')

df.head(1)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0


In [20]:
df.shape, df_test.shape

((3066766, 19), (2913955, 19))

In [21]:
# Trip duration in minutes, computed with the vectorized .dt accessor instead of a per-row apply
df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60

df_test['duration'] = (df_test['tpep_dropoff_datetime'] - df_test['tpep_pickup_datetime']).dt.total_seconds() / 60

----------------------------------------------------------------------------------------

Q3. Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

In [22]:
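# Keep the duration only where it lies in [1, 60] minutes (inclusive).
# Rows outside that range are not dropped here: they get NaN in 'duration_1to60'.
df['duration_1to60'] = df['duration'].where(df['duration'].between(1, 60))

df_test['duration_1to60'] = df_test['duration'].where(df_test['duration'].between(1, 60))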
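
The cell above marks the out-of-range durations but does not print the fraction that Q3 asks for. A minimal sketch of that calculation follows, using a small made-up DataFrame in place of the taxi data (the `toy` frame and its values are illustrative assumptions, not real trips):

```python
import pandas as pd

# Hypothetical stand-in for df['duration'] (minutes); not real taxi data
toy = pd.DataFrame({'duration': [0.5, 1.0, 12.3, 60.0, 75.0]})

# between() is inclusive on both ends by default, matching ">= 1 and <= 60"
in_range = toy['duration'].between(1, 60)

# The mean of a boolean Series is the fraction of True values
fraction_kept = in_range.mean()
print(f"{fraction_kept:.2%} of records kept")
```

On the real data the same idea would read `df['duration'].between(1, 60).mean()`.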

In [23]:
df['PU_DO'] = df['PULocationID'].astype(str)+'_'+df['DOLocationID'].astype(str)
df_test['PU_DO'] = df_test['PULocationID'].astype(str)+'_'+df_test['DOLocationID'].astype(str)

In [24]:
df_test[:1]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration,duration_1to60,PU_DO
0,1,2023-02-01 00:32:53,2023-02-01 00:34:34,2.0,0.3,1.0,N,142,163,2,...,0.5,0.0,0.0,1.0,9.4,2.5,0.0,1.683333,1.683333,142_163


In [25]:
categorical = ['PU_DO']
numerical = ['trip_distance'] 
target = 'duration_1to60'

dv = DictVectorizer()  # for one hot encoding

train_dict = df[categorical+numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
y_train = df[target].values


test_dict = df_test[categorical+numerical].to_dict(orient='records')
X_test = dv.transform(test_dict) # reuse the fitted vectorizer; don't call fit_transform again
y_test = df_test[target].values


A4. The dimensionality (number of columns) of the X_train matrix is 519

-----------------------------------------------------------------------------------------

Q5. Training a model
Now let's use the feature matrix from the previous step to train a model.

Train a plain linear regression model with default parameters, where duration is the response variable
Calculate the RMSE of the model on the training data
What's the RMSE on train?

In [26]:
print(pd.isna(y_train).sum())

mask = ~np.isnan(y_train)
X_train = X_train[mask]
y_train = y_train[mask]

print(pd.isna(y_train).sum())

print(pd.isna(y_test).sum())

mask = ~np.isnan(y_test)
X_test = X_test[mask]
y_test = y_test[mask]

print(pd.isna(y_test).sum())

57593
0
58004
0


In [27]:
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)

y_pred = model_linear.predict(X_test)

In [28]:
# sns.distplot(y_pred, label= 'pred')
# sns.distplot(y_test, label = 'actual')
# plt.legend()

In [29]:
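# y_pred holds predictions for the February data, so this is the RMSE on df_test, not on the training set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)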

5.247427968159032
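
Q5 asks for the RMSE on the training data, which means scoring `model.predict(X_train)` against `y_train`. The sketch below shows both the training and the held-out calculation on synthetic data. The `rmse` helper, the variable names and the data are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data (assumption): duration roughly proportional to distance
rng = np.random.default_rng(0)
X_tr = rng.uniform(0.5, 10, size=(200, 1))
y_tr = 3.0 * X_tr[:, 0] + rng.normal(0, 1, size=200)
X_val = rng.uniform(0.5, 10, size=(100, 1))
y_val = 3.0 * X_val[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X_tr, y_tr)

train_rmse = rmse(y_tr, model.predict(X_tr))    # error on the data the model was fit on
val_rmse = rmse(y_val, model.predict(X_val))    # error on held-out data

# The same training value via scikit-learn's mean_squared_error
sk_train_rmse = float(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))))
print(train_rmse, sk_train_rmse, val_rmse)
```

In the notebook's terms, the training figure would come from `model_linear.predict(X_train)` compared with `y_train`.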


We try different models to see what may work better

In [30]:
model_1 = Lasso()
model_1.fit(X_train, y_train)

y_pred_1 = model_1.predict(X_test)

# Score this model's own predictions (y_pred_1); scoring y_pred only repeats the LinearRegression RMSE (5.247427968159032)
np.sqrt(mean_squared_error(y_test, y_pred_1))

In [31]:
model_2 = Lasso(alpha=0.001)
model_2.fit(X_train, y_train)

y_pred_2 = model_2.predict(X_test)

# Score this model's own predictions (y_pred_2); scoring y_pred only repeats the LinearRegression RMSE (5.247427968159032)
np.sqrt(mean_squared_error(y_test, y_pred_2))

In [32]:
model_3 = Ridge(alpha=0.001)
model_3.fit(X_train, y_train)

y_pred_3 = model_3.predict(X_test)

# Score this model's own predictions (y_pred_3); scoring y_pred only repeats the LinearRegression RMSE (5.247427968159032)
np.sqrt(mean_squared_error(y_test, y_pred_3))
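
When several models are compared cell by cell, it is easy to score the wrong prediction variable. One possible way to avoid that is a loop that keeps each model's predictions local. The sketch below runs on synthetic data, and the names `candidates`, `scores`, `X_tr` and `X_te` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data (assumption), standing in for X_train / X_test
rng = np.random.default_rng(42)
X_tr = rng.normal(size=(300, 5))
X_te = rng.normal(size=(100, 5))
coef = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y_tr = X_tr @ coef + rng.normal(0, 0.5, size=300)
y_te = X_te @ coef + rng.normal(0, 0.5, size=100)

candidates = {
    'LinearRegression': LinearRegression(),
    'Lasso(alpha=1.0)': Lasso(),
    'Lasso(alpha=0.001)': Lasso(alpha=0.001),
    'Ridge(alpha=0.001)': Ridge(alpha=0.001),
}

scores = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    preds = est.predict(X_te)  # predictions from this model only
    scores[name] = float(np.sqrt(mean_squared_error(y_te, preds)))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because `preds` is reassigned inside the loop, each RMSE is tied to the model that produced it.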

We save the vectorizer and the linear model together so they can be reused later in the course and for deployment

In [33]:
import pickle

In [34]:
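# The 'models' directory must already exist; the vectorizer and model are pickled together as one tuple
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, model_linear), f_out)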