# Homework - Training a ride duration prediction model

The assignment:

We want to develop a model that can predict the duration of a taxi trip from pick-up location to drop-off location.

To do that, we:
1. select and preprocess the data
2. train a regression model
3. evaluate the model

Data:

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In this notebook, I work with the following datasets:
* For-Hire Vehicle Trip Records January 2021: 'fhv_tripdata_2021-01.parquet'
* For-Hire Vehicle Trip Records February 2021: 'fhv_tripdata_2021-02.parquet'

In [2]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import pyarrow  # read .parquet files

## Q1. Downloading the data

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

* 1054112
* 1154112 X
* 1254112
* 1354112


In [3]:
# Read the .parquet file with pandas
df_jan = pd.read_parquet(r"C:\Users\JC\projects\MLOps_Zoomcamp_2022\data\fhv_tripdata_2021-01.parquet")
df_jan.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037


In [4]:
# How many records are there for January?
len(df_jan)

1154112

## Q2. Computing duration

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

* 15.16
* 19.16 X
* 24.16
* 29.16

In [5]:
# Create the target variable "trip_duration"
df_jan["trip_duration"] = df_jan.dropOff_datetime - df_jan.pickup_datetime

# Convert it into minutes (vectorized via the .dt accessor)
df_jan["trip_duration"] = df_jan["trip_duration"].dt.total_seconds() / 60

df_jan.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,trip_duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009,17.0
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013,110.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037,15.216667


In [6]:
# What's the average trip duration in January?
df_jan.trip_duration.mean()

19.167224093791006
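
As a quick sanity check on the minute conversion, the sketch below recomputes the duration for two rows copied from the January preview (index 0 and index 3). The frame `rides` is a small stand-in built by hand, not the real dataset.

```python
import pandas as pd

# Two rows copied from the January preview above (index 0 and index 3).
rides = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2021-01-01 00:27:00", "2021-01-01 00:13:09"]
        ),
        "dropOff_datetime": pd.to_datetime(
            ["2021-01-01 00:44:00", "2021-01-01 00:21:26"]
        ),
    }
)

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts it to float seconds without a Python-level loop.
delta = rides["dropOff_datetime"] - rides["pickup_datetime"]
rides["trip_duration"] = delta.dt.total_seconds() / 60

print(rides["trip_duration"].round(6).tolist())
```

The two values should match the `trip_duration` column shown for those rows (17.0 and roughly 8.28 minutes).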

## Data preparation

Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?
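
One possible way to check the distribution and apply an inclusive filter is sketched below. The `durations` values are made up for illustration; on the real data the same calls would be applied to `df_jan.trip_duration`.

```python
import pandas as pd

# Hypothetical durations in minutes, including the kinds of outliers
# (near-zero and multi-hour rides) that the filter is meant to drop.
durations = pd.Series([0.5, 1.0, 8.3, 15.2, 17.0, 60.0, 60.1, 110.0, 1440.0])

# Summary statistics with upper percentiles expose the long right tail.
print(durations.describe(percentiles=[0.5, 0.95, 0.99]))

# Series.between is inclusive on both ends by default,
# so rides of exactly 1 or 60 minutes are kept.
keep = durations.between(1, 60)
filtered = durations[keep]

n_dropped = len(durations) - len(filtered)
print(n_dropped)
```

`between(1, 60)` is equivalent to the explicit `>= 1` and `<= 60` comparison used in the next cell.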

In [7]:
# Filter on trip duration from 1 min up to 60 min
df_jan_filtered = df_jan[(df_jan.trip_duration >= 1) & (df_jan.trip_duration <= 60)]

In [8]:
# How many records did you drop?
len(df_jan) - len(df_jan_filtered)

44286

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

However, both columns have many missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? I.e., the fraction of "-1"s after you filled the NAs.

* 53%
* 63%
* 73%
* 83% X

In [9]:
# What's the fraction of missing values for the pickup location ID?
# (number of NaN * 100 / total number of records)
(df_jan_filtered["PUlocationID"].isnull().sum() * 100) / len(df_jan_filtered)

83.52732770722618
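
The same fraction can also be obtained as the mean of a boolean mask. The sketch below uses a made-up `pu` Series and additionally checks that the share of "-1" after filling equals the share of NaN before filling, which is what the question relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup IDs: 5 of 6 are missing.
pu = pd.Series([np.nan, np.nan, 72.0, np.nan, np.nan, np.nan], name="PUlocationID")

# The mean of a boolean mask is the share of True values.
missing_share = pu.isna().mean()

# After filling, the share of -1 equals the share of former NaNs.
filled = pu.fillna(-1)
minus_one_share = (filled == -1).mean()

print(f"{missing_share:.1%}")
```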

In [10]:
# Replace all NaN with "-1"
df_jan_filtered[["PUlocationID", "DOlocationID"]] = df_jan_filtered[["PUlocationID", "DOlocationID"]].fillna(-1)

df_jan_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_jan_filtered[["PUlocationID", "DOlocationID"]] = df_jan_filtered[["PUlocationID", "DOlocationID"]].fillna(-1)


Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,trip_duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,-1.0,-1.0,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,-1.0,-1.0,,B00009,17.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,-1.0,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,-1.0,61.0,,B00037,15.216667
5,B00037,2021-01-01 00:59:02,2021-01-01 01:08:05,-1.0,71.0,,B00037,9.05
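
The `SettingWithCopyWarning` printed above appears because `df_jan_filtered` was created by boolean indexing and pandas cannot tell whether it is a view or a copy of `df_jan`. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` after filtering is acceptable memory-wise, is shown below on a small made-up frame `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_jan with the two location columns.
df = pd.DataFrame(
    {
        "PUlocationID": [np.nan, 173.0, np.nan],
        "DOlocationID": [72.0, np.nan, 61.0],
        "trip_duration": [8.3, 90.0, 15.2],
    }
)

cols = ["PUlocationID", "DOlocationID"]

# .copy() makes the filtered frame an independent object, so later
# assignments are unambiguous and pandas has nothing to warn about.
df_filtered = df[df["trip_duration"].between(1, 60)].copy()
df_filtered[cols] = df_filtered[cols].fillna(-1)

print(df_filtered)
```

The original `df` keeps its NaN values; only the filtered copy is modified.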


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

* 2
* 152
* 352
* 525 X
* 725
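
To make the three steps concrete, here is a small sketch on made-up rides. The names `rides`, `dv_demo` and `X_demo` are illustrative only. `DictVectorizer` creates one column per distinct (feature, value) pair for string values, so the column count is the number of distinct pickup values plus the number of distinct drop-off values.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rides; IDs are strings so DictVectorizer treats them as categories.
rides = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
    {"PUlocationID": "173.0", "DOlocationID": "72.0"},
]

dv_demo = DictVectorizer()
X_demo = dv_demo.fit_transform(rides)

# One column per distinct (feature, value) pair:
# 2 pickup values + 2 drop-off values = 4 columns.
print(X_demo.shape)
print(dv_demo.feature_names_)
```

If the IDs were left as floats, `DictVectorizer` would treat each feature as a single numeric column, which is why the notebook converts them to strings first.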


In [11]:
# Function for preprocessing the features (X)
def create_dicts(df, features):
    # Convert to string (object)
    df[features] = df[features].astype(str)
    
    # Convert features to dictionary
    dicts = df[features].to_dict(orient="records")

    return dicts

In [20]:
# Define the features to predict the target variable
features = ["PUlocationID", "DOlocationID"]

# Process the features
feature_dicts_jan = create_dicts(df_jan_filtered, features)

# Create a DictVectorizer
dv = DictVectorizer()

# Fit to the feature dict and save it in a feature matrix
X_jan = dv.fit_transform(feature_dicts_jan)
y_jan = df_jan_filtered["trip_duration"].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[features] = df[features].astype(str)


In [13]:
# What's the dimensionality of this matrix? (The number of columns).
X_jan

<1109826x525 sparse matrix of type '<class 'numpy.float64'>'
	with 2219652 stored elements in Compressed Sparse Row format>

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

* Train a plain linear regression model with default parameters
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 5.52
* 10.52 X
* 15.52
* 20.52


In [14]:
# Train a regression model
lr = LinearRegression()
lr.fit(X_jan, y_jan)

# Evaluate the model on the training data
y_pred = lr.predict(X_jan)
mean_squared_error(y_jan, y_pred, squared=False)

10.52851938944385
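
For reference, RMSE is the square root of the mean squared difference between true and predicted values. Since the `squared` argument of `mean_squared_error` has been deprecated in more recent scikit-learn releases, the sketch below computes the same quantity by taking the square root explicitly. The arrays `y_true` and `y_hat` are made-up values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted durations in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE by definition: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same value via scikit-learn, without relying on the `squared` argument.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```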

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?

* 6.01
* 11.01 X
* 16.01
* 21.01
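
Applying the model to February means reusing the vectorizer fitted on January, so that both matrices have the same columns. The sketch below illustrates this on made-up data (`train`, `val` and `dv_demo` are hypothetical names): `transform` keeps the training vocabulary and silently ignores values it has not seen during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training rides (stand-in for January).
train = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "173.0", "DOlocationID": "61.0"},
]
# Hypothetical validation rides; "999.0" never appears in training.
val = [
    {"PUlocationID": "173.0", "DOlocationID": "72.0"},
    {"PUlocationID": "999.0", "DOlocationID": "61.0"},
]

dv_demo = DictVectorizer()
X_train = dv_demo.fit_transform(train)

# transform() reuses the fitted vocabulary, so the column count matches training.
X_val = dv_demo.transform(val)

print(X_train.shape, X_val.shape)
# The unseen pickup value is ignored: the second row has only one non-zero entry.
print(X_val.toarray())
```

Calling `fit_transform` on the validation data instead would build a different vocabulary, and the resulting matrix would no longer line up with the coefficients of the trained model.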


In [16]:
# Read the .parquet file with pandas
df_feb = pd.read_parquet(r"C:\Users\JC\projects\MLOps_Zoomcamp_2022\data\fhv_tripdata_2021-02.parquet")
df_feb.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00013,2021-02-01 00:01:00,2021-02-01 01:33:00,,,,B00014
1,B00021,2021-02-01 00:55:40,2021-02-01 01:06:20,173.0,82.0,,B00021
2,B00021,2021-02-01 00:14:03,2021-02-01 00:28:37,173.0,56.0,,B00021
3,B00021,2021-02-01 00:27:48,2021-02-01 00:35:45,82.0,129.0,,B00021
4,B00037,2021-02-01 00:12:50,2021-02-01 00:26:38,,225.0,,B00037


In [17]:
# Create the target variable "trip_duration"
df_feb["trip_duration"] = df_feb.dropOff_datetime - df_feb.pickup_datetime

# Convert it into minutes (vectorized via the .dt accessor)
df_feb["trip_duration"] = df_feb["trip_duration"].dt.total_seconds() / 60

# Filter on trip duration from 1 min up to 60 min
df_feb_filtered = df_feb[(df_feb.trip_duration >= 1) & (df_feb.trip_duration <= 60)]

# Replace all NaN with "-1"
df_feb_filtered[["PUlocationID", "DOlocationID"]] = df_feb_filtered[["PUlocationID", "DOlocationID"]].fillna(-1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_feb_filtered[["PUlocationID", "DOlocationID"]] = df_feb_filtered[["PUlocationID", "DOlocationID"]].fillna(-1)


In [19]:
df_feb_filtered.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,trip_duration
1,B00021,2021-02-01 00:55:40,2021-02-01 01:06:20,173.0,82.0,,B00021,10.666667
2,B00021,2021-02-01 00:14:03,2021-02-01 00:28:37,173.0,56.0,,B00021,14.566667
3,B00021,2021-02-01 00:27:48,2021-02-01 00:35:45,82.0,129.0,,B00021,7.95
4,B00037,2021-02-01 00:12:50,2021-02-01 00:26:38,-1.0,225.0,,B00037,13.8
5,B00037,2021-02-01 00:00:37,2021-02-01 00:09:35,-1.0,61.0,,B00037,8.966667


In [21]:
# Process the features
feature_dicts_feb = create_dicts(df_feb_filtered, features)

# Transform with the DictVectorizer already fitted on January (no refit)
X_feb = dv.transform(feature_dicts_feb)
y_feb = df_feb_filtered["trip_duration"].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[features] = df[features].astype(str)


In [22]:
# Evaluate the model on the validation dataset
y_pred = lr.predict(X_feb)
mean_squared_error(y_feb, y_pred, squared=False)

11.014286426107942