# MLOps Zoomcamp Homework 1

The goal of this homework is to train a simple model for predicting the duration of a ride.

In [1]:
#!pip install pyarrow

In [2]:
import pandas as pd
import pickle
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [3]:
df1 = pd.read_parquet('/home/rodrigoperes/notebooks/data/fhv_tripdata_2021-01.parquet')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1154112 entries, 0 to 1154111
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1154112 non-null  object        
 1   pickup_datetime         1154112 non-null  datetime64[ns]
 2   dropOff_datetime        1154112 non-null  datetime64[ns]
 3   PUlocationID            195845 non-null   float64       
 4   DOlocationID            991892 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1153227 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 61.6+ MB


#### Q1. Downloading the data

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?
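
A minimal sketch of loading one month, assuming the parquet files have already been downloaded into a local directory. The `data_dir` argument and the helper names `fhv_path` and `read_month` are hypothetical; only the file name pattern comes from the files used in this notebook.

```python
from pathlib import Path


def fhv_path(data_dir, year, month):
    """Build the local path of one monthly FHV parquet file."""
    return Path(data_dir) / f"fhv_tripdata_{year}-{month:02d}.parquet"


def read_month(data_dir, year, month):
    """Read one month of FHV trips; needs pandas and pyarrow installed."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(fhv_path(data_dir, year, month))


# Only the path helper is exercised here; no file is read.
print(fhv_path("data", 2021, 1).name)
```

With the files in place, `read_month("data", 2021, 1).shape` would give the record count asked for above.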

In [4]:
df1.shape

(1154112, 7)

#### Q2. Computing duration

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?
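
As a small worked example, the conversion to minutes can be checked on toy data. The two rides below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data standing in for the real trips (values are made up).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 10:00:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:17:00", "2021-01-01 10:45:30"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts it to seconds in one vectorized call.
delta = trips["dropOff_datetime"] - trips["pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

print(trips["duration"].tolist())  # [17.0, 45.5]
print(trips["duration"].mean())    # 31.25
```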

In [5]:
# Trip duration in minutes, computed with the vectorized .dt accessor
df1['duration'] = (df1['dropOff_datetime'] - df1['pickup_datetime']).dt.total_seconds() / 60

In [6]:
df1['duration'].mean()

19.1672240937939

In [7]:
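# Keep rides lasting between 1 and 60 minutes (inclusive);
# .copy() avoids SettingWithCopyWarning on later column assignments
df1 = df1[(df1.duration >= 1) & (df1.duration <= 60)].copy()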
df1.shape

(1109826, 8)

In [8]:
1154112 - 1109826

44286

#### Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

However, both columns have many missing values. Let's replace them with "-1".

What fraction of the pickup location IDs is missing? I.e., the fraction of "-1"s after filling the NAs.
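
A minimal sketch on made-up data, showing that the share of "-1" entries after filling equals the share of NAs before filling. The explicit cast to `object` is an assumption added here so that the string sentinel can be stored next to the float IDs.

```python
import pandas as pd

# Toy pickup IDs; None marks a missing value (made-up data).
pickup = pd.Series([221.0, None, None, 206.0, None], name="PUlocationID")

# Fraction of NAs before filling: the mean of a boolean mask.
frac_missing = pickup.isna().mean()

# Cast to object so the string "-1" can sit next to the float IDs, then fill.
filled = pickup.astype(object).fillna("-1")
frac_minus_one = (filled == "-1").mean()

print(frac_missing, frac_minus_one)
```

Both numbers are 3 out of 5 here, so either expression answers the question.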

In [9]:
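# Assign the result back instead of using inplace=True on a column selection,
# which does not reliably modify the DataFrame in recent pandas versions
df1['PUlocationID'] = df1['PUlocationID'].fillna("-1")
df1['DOlocationID'] = df1['DOlocationID'].fillna("-1")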

In [10]:
df1.shape

(1109826, 8)

In [11]:
df1[df1['PUlocationID'] == '-1'].shape

(927008, 8)

In [12]:
100 * df1[df1['PUlocationID'] == '-1'].shape[0] / df1.shape[0]

83.52732770722618

In [13]:
categorical = ['PUlocationID', 'DOlocationID']

In [14]:
df1['PUlocationID'].value_counts(dropna=False)

-1       927008
221.0      8330
206.0      6797
129.0      5379
115.0      4082
          ...  
111.0         5
27.0          4
34.0          3
2.0           2
110.0         1
Name: PUlocationID, Length: 262, dtype: int64

In [15]:
df1['DOlocationID'].value_counts(dropna=False)

-1       147907
76.0      26375
217.0     19488
265.0     18628
17.0      18422
          ...  
27.0         18
30.0         13
2.0          11
105.0         4
199.0         1
Name: DOlocationID, Length: 263, dtype: int64

#### Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (the number of columns)?
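
The steps above can be sketched on three made-up records. The names `records`, `dv_toy` and `X_toy` are illustrative and chosen so they do not clash with the variables used in the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records with string-valued location IDs (made-up data).
records = [
    {"PUlocationID": "-1", "DOlocationID": "76.0"},
    {"PUlocationID": "221.0", "DOlocationID": "-1"},
    {"PUlocationID": "-1", "DOlocationID": "217.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)

# One column per distinct (feature, value) pair:
# 3 drop-off values + 2 pickup values = 5 columns.
print(X_toy.shape)  # (3, 5)
print(dv_toy.feature_names_)
```

The values must be strings for this to work: `DictVectorizer` one-hot encodes string values but passes numeric values through as a single numeric column, which is why the IDs are cast with `astype(str)` first.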

In [16]:
df1[categorical] = df1[categorical].astype(str)

In [17]:
train_dicts = df1[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [18]:
X_train

<1109826x525 sparse matrix of type '<class 'numpy.float64'>'
	with 2219652 stored elements in Compressed Sparse Row format>

In [19]:
dv.feature_names_

['DOlocationID=-1',
 'DOlocationID=1.0',
 'DOlocationID=10.0',
 'DOlocationID=100.0',
 'DOlocationID=101.0',
 'DOlocationID=102.0',
 'DOlocationID=105.0',
 'DOlocationID=106.0',
 'DOlocationID=107.0',
 'DOlocationID=108.0',
 'DOlocationID=109.0',
 'DOlocationID=11.0',
 'DOlocationID=111.0',
 'DOlocationID=112.0',
 'DOlocationID=113.0',
 'DOlocationID=114.0',
 'DOlocationID=115.0',
 'DOlocationID=116.0',
 'DOlocationID=117.0',
 'DOlocationID=118.0',
 'DOlocationID=119.0',
 'DOlocationID=12.0',
 'DOlocationID=120.0',
 'DOlocationID=121.0',
 'DOlocationID=122.0',
 'DOlocationID=123.0',
 'DOlocationID=124.0',
 'DOlocationID=125.0',
 'DOlocationID=126.0',
 'DOlocationID=127.0',
 'DOlocationID=128.0',
 'DOlocationID=129.0',
 'DOlocationID=13.0',
 'DOlocationID=130.0',
 'DOlocationID=131.0',
 'DOlocationID=132.0',
 'DOlocationID=133.0',
 'DOlocationID=134.0',
 'DOlocationID=135.0',
 'DOlocationID=136.0',
 'DOlocationID=137.0',
 'DOlocationID=138.0',
 'DOlocationID=139.0',
 'DOlocationID=14.0',
 ...]

In [20]:
len(dv.feature_names_)

525

#### Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?
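
For reference, RMSE is the square root of the mean squared residual. A minimal sketch with NumPy, where the helper name `rmse` and the toy durations are assumptions for illustration:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy durations in minutes (made-up values): residuals are -2, 2 and -3,
# so the mean squared error is (4 + 4 + 9) / 3.
print(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```

Because the error is expressed in the same unit as the target, the result reads directly as minutes.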

In [21]:
target = 'duration'
y_train = df1[target].values

In [22]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

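# RMSE as the square root of the MSE; the squared=False argument
# is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_train, y_pred))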

10.528519107206316

#### Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?
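
One detail worth keeping in mind when reusing the fitted vectorizer on February data: `transform` keeps the columns learned during `fit` and ignores feature values it has not seen. A minimal sketch on made-up records (the names `dv_demo` and `X_val_demo` and the ID "999.0" are illustrative):

```python
from sklearn.feature_extraction import DictVectorizer

train = [{"PUlocationID": "-1"}, {"PUlocationID": "221.0"}]
# "999.0" never appears in the training records.
val = [{"PUlocationID": "221.0"}, {"PUlocationID": "999.0"}]

dv_demo = DictVectorizer()
dv_demo.fit(train)

# transform() reuses the two columns learned in fit(); it does not add new ones.
X_val_demo = dv_demo.transform(val).toarray()
print(dv_demo.feature_names_)
print(X_val_demo)
```

The second validation row comes out as all zeros, so the model falls back to its intercept for that feature. This is also why the validation set goes through `dv.transform` rather than `dv.fit_transform`: the column layout has to match the one the model was trained on.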

In [23]:
df2 = pd.read_parquet('/home/rodrigoperes/notebooks/data/fhv_tripdata_2021-02.parquet')
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1037692 entries, 0 to 1037691
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1037692 non-null  object        
 1   pickup_datetime         1037692 non-null  datetime64[ns]
 2   dropOff_datetime        1037692 non-null  datetime64[ns]
 3   PUlocationID            153001 non-null   float64       
 4   DOlocationID            885340 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1037692 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 55.4+ MB


In [24]:
# Trip duration in minutes, computed with the vectorized .dt accessor
df2['duration'] = (df2['dropOff_datetime'] - df2['pickup_datetime']).dt.total_seconds() / 60
df2['duration'].mean()

20.70698622520125

In [25]:
# Same 1 to 60 minute filter as for January; .copy() avoids SettingWithCopyWarning
df2 = df2[(df2.duration >= 1) & (df2.duration <= 60)].copy()
df2.shape

(990113, 8)

In [26]:
# Assign the result back instead of using inplace=True on a column selection
df2['PUlocationID'] = df2['PUlocationID'].fillna("-1")
df2['DOlocationID'] = df2['DOlocationID'].fillna("-1")

In [27]:
100 * df2[df2['PUlocationID'] == '-1'].shape[0] / df2.shape[0]

85.71354986754037

In [28]:
df2[categorical] = df2[categorical].astype(str)

In [29]:
val_dicts = df2[categorical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [30]:
target = 'duration'
y_val = df2[target].values

In [31]:
y_pred = lr.predict(X_val)

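# RMSE as the square root of the MSE; the squared=False argument
# is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_val, y_pred))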

11.014283149347039
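
The January and February preprocessing steps are identical, so they could be factored into one helper. This is a sketch under assumptions, not the notebook's own code: the names `prepare` and `CATEGORICAL` are hypothetical, and the toy frame is made up to exercise the function.

```python
import pandas as pd

CATEGORICAL = ["PUlocationID", "DOlocationID"]


def prepare(df):
    """Add duration in minutes, keep 1-60 minute rides, encode location IDs as strings."""
    df = df.copy()
    delta = df["dropOff_datetime"] - df["pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60
    df = df[(df["duration"] >= 1) & (df["duration"] <= 60)].copy()
    for col in CATEGORICAL:
        # Missing IDs become "-1"; present IDs keep their float text form, e.g. "76.0".
        df[col] = df[col].map(lambda v: "-1" if pd.isna(v) else str(v))
    return df


# Toy frame: a 30-second ride, a 10-minute ride and a 90-minute ride.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00"] * 3),
    "dropOff_datetime": pd.to_datetime(
        ["2021-01-01 00:00:30", "2021-01-01 00:10:00", "2021-01-01 01:30:00"]
    ),
    "PUlocationID": [1.0, None, 3.0],
    "DOlocationID": [4.0, 76.0, None],
})
out = prepare(toy)
print(out[["duration"] + CATEGORICAL])
```

Only the 10-minute ride survives the filter, with its missing pickup ID replaced by "-1".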