Week 1 Homework
https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/01-intro/homework.md

In [23]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Q1. Downloading the data

In [24]:
import os
os.path.dirname(os.getcwd())

'/home/ubuntu'

In [25]:
df = pd.read_parquet('/home/ubuntu/data/homework_week1/fhv_tripdata_2021-01.parquet')
df.shape

(1154112, 7)

### Q2. Computing duration

In [26]:
df.dtypes

dispatching_base_num              object
pickup_datetime           datetime64[ns]
dropOff_datetime          datetime64[ns]
PUlocationID                     float64
DOlocationID                     float64
SR_Flag                           object
Affiliated_base_number            object
dtype: object

In [27]:
# Trip duration in minutes, computed with the vectorized .dt accessor instead of a row-wise apply
df['duration'] = (df['dropOff_datetime'] - df['pickup_datetime']).dt.total_seconds() / 60

In [28]:
df['duration'].isnull().sum()

0

In [29]:
df['duration'].mean()

19.1672240937939

### Data preparation

In [30]:
#sns.displot(df['duration'], kind = 'hist')

In [31]:
df.shape

(1154112, 8)

In [32]:
df = df.loc[(df['duration'] >=1) & (df['duration'] <=60), :]

In [33]:
df.shape

(1109826, 8)

Number of records dropped by the 1 to 60 minute duration filter:

In [34]:
1154112 - 1109826

44286
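The subtraction above hard-codes the two row counts. A minimal sketch of deriving the same figure from the filter mask itself is shown below; the `toy` dataframe and its values are hypothetical and only stand in for the real `df`.

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to include values on and outside the bounds
toy = pd.DataFrame({'duration': [0.5, 1.0, 17.0, 60.0, 75.0]})

# Same inclusive 1-60 minute filter as above (between() includes both ends by default)
mask = toy['duration'].between(1, 60)

n_dropped = int((~mask).sum())        # rows that fail the filter
share_dropped = n_dropped / len(toy)  # fraction of all rows

print(n_dropped, share_dropped)
```

Applied to the full dataframe before filtering, the same mask should reproduce the 44286 dropped rows, which is roughly 3.8% of the 1154112 trips.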

### Q3. Missing values

In [36]:
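# Reassign instead of using inplace=True: df is a filtered slice, and an in-place fill can raise SettingWithCopyWarning
df = df.fillna({'PUlocationID': -1, 'DOlocationID': -1})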

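`df[column].count()` returns the number of non-NaN values in a column. Because the NaNs were already replaced with -1, the count here equals the total number of rows.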

In [47]:
(df['PUlocationID'] == -1).sum()*100/df['PUlocationID'].count()

83.52732770722618
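As a cross-check, the same percentage can be computed directly from the missing values, without relying on the -1 sentinel. This is a small sketch on a hypothetical `toy` column, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical pickup IDs: 3 of 5 values are missing
toy = pd.DataFrame({'PUlocationID': [np.nan, np.nan, 72.0, np.nan, 61.0]})

# Share of missing values before filling: the mean of a boolean mask is a fraction
share_missing = toy['PUlocationID'].isna().mean() * 100

# Share of sentinel values after filling, as done in the cell above
filled = toy['PUlocationID'].fillna(-1)
share_sentinel = (filled == -1).mean() * 100

print(share_missing, share_sentinel)
```

Both approaches agree as long as -1 never occurs as a genuine location ID, which is an assumption here.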

### Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

* 2
* 152
* 352
* 525
* 725

In [53]:
df.head()

| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | -1.0 | -1.0 | | B00009 | 17.0 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | -1.0 | -1.0 | | B00009 | 17.0 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | -1.0 | 72.0 | | B00037 | 8.283333 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | -1.0 | 61.0 | | B00037 | 15.216667 |
| 5 | B00037 | 2021-01-01 00:59:02 | 2021-01-01 01:08:05 | -1.0 | 71.0 | | B00037 | 9.05 |


In [54]:
categorical = ['PUlocationID', 'DOlocationID']
df[categorical] = df[categorical].astype(str)

In [55]:
from sklearn.feature_extraction import DictVectorizer
train_dict = df[categorical].to_dict(orient = 'records')

In [57]:
dv = DictVectorizer()
X_train = dv.fit_transform(train_dict)

In [60]:
len(dv.feature_names_)

525
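The column count follows from how `DictVectorizer` treats values: each distinct string value of a key becomes its own 0/1 column, while numeric values are kept as a single numeric column per key. That is why the IDs are cast to `str` first. A minimal sketch on hypothetical records:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records with string-valued location IDs
records = [
    {'PUlocationID': '-1.0', 'DOlocationID': '72.0'},
    {'PUlocationID': '-1.0', 'DOlocationID': '61.0'},
    {'PUlocationID': '5.0', 'DOlocationID': '72.0'},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)

# One column per distinct (key, value) pair: 2 drop-off IDs + 2 pickup IDs
print(dv_toy.feature_names_)
print(X_toy.shape)

# With numeric values the IDs are NOT one-hot encoded: one column per key
numeric_records = [{'PUlocationID': -1.0, 'DOlocationID': 72.0},
                   {'PUlocationID': 5.0, 'DOlocationID': 61.0}]
n_numeric_cols = DictVectorizer().fit_transform(numeric_records).shape[1]
print(n_numeric_cols)
```

On the real data, the 525 columns are therefore the number of distinct pickup IDs plus the number of distinct drop-off IDs, with the -1 placeholder counted in each.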

### Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 5.52
* 10.52
* 15.52
* 20.52

In [63]:
numerical = ['duration']
y_train = df[numerical].values

In [64]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [66]:
y_pred = lr.predict(X_train)

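**squared** : bool, default=True <br>
If True, `mean_squared_error` returns the MSE; if False, it returns the RMSE. <br>
Note: the `squared` argument was deprecated in scikit-learn 1.4, which added `sklearn.metrics.root_mean_squared_error`, so the call below may fail on newer releases.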

In [68]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, y_pred, squared = False)

10.528519107206316
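Since RMSE is just the square root of the mean squared residual, it can also be computed with NumPy alone, which avoids any dependence on the scikit-learn version. A small sketch with hypothetical values:

```python
import numpy as np

# Hypothetical true and predicted durations in minutes
y_true = np.array([17.0, 8.0, 15.0, 9.0])
y_hat = np.array([15.0, 10.0, 15.0, 13.0])

# np.ravel flattens (n, 1) arrays so the subtraction stays element-wise
residuals = np.ravel(y_true) - np.ravel(y_hat)
rmse = np.sqrt(np.mean(residuals ** 2))

print(rmse)
```

Here the squared residuals are 4, 4, 0 and 16, so the RMSE is the square root of 6. Passing `y_train` and `y_pred` from the cells above should give the same value as `mean_squared_error(..., squared = False)`.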

### Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?

* 6.01
* 11.01
* 16.01
* 21.01

In [76]:
df_val = pd.read_parquet('/home/ubuntu/data/homework_week1/fhv_tripdata_2021-02.parquet')

Preprocessing

In [86]:
categorical = ['PUlocationID', 'DOlocationID']

def preprocess_data(df):
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    # Trip duration in minutes
    df['duration'] = (df['dropOff_datetime'] - df['pickup_datetime']).dt.total_seconds() / 60
    # Keep trips lasting between 1 and 60 minutes (inclusive)
    df = df.loc[(df['duration'] >= 1) & (df['duration'] <= 60), :]
    # fillna(..., inplace=True) on the filtered slice raises SettingWithCopyWarning, so reassign instead
    df = df.fillna({'PUlocationID': -1, 'DOlocationID': -1})
    df[categorical] = df[categorical].astype(str)
    return df

In [87]:
df_val = preprocess_data(df_val)
val_dict = df_val[categorical].to_dict(orient = 'records')
X_val = dv.transform(val_dict)
y_val = df_val[numerical].values
y_pred = lr.predict(X_val)

In [88]:
# Argument order is (y_true, y_pred); the value is unchanged because squared error is symmetric
mean_squared_error(y_val, y_pred, squared = False)

11.014283149347039
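One detail worth knowing about `dv.transform` on validation data: features that were not seen during `fit` are silently ignored, so a row with an unseen location ID gets all zeros for that key. Whether the February data actually contains such IDs is not checked in this notebook; the sketch below uses hypothetical records to show the behavior.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and validation records
train_records = [{'PUlocationID': '-1.0'}, {'PUlocationID': '5.0'}]
val_records = [{'PUlocationID': '5.0'}, {'PUlocationID': '999.0'}]  # '999.0' is unseen

dv_toy = DictVectorizer()
dv_toy.fit(train_records)

# transform() keeps the columns learned during fit; unseen values add no column
X_val_toy = dv_toy.transform(val_records)

print(dv_toy.feature_names_)
print(X_val_toy.toarray())
```

The validation matrix keeps the training column layout, which is what lets the fitted `LinearRegression` be applied to it unchanged.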