In [3]:
# !pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-16.1.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (3.0 kB)
Downloading pyarrow-16.1.0-cp311-cp311-macosx_11_0_arm64.whl (26.0 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-16.1.0


In [75]:
import pandas as pd
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction import DictVectorizer
import numpy as np

## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19 <-- Megan's Answer

In [76]:
taxi_jan23_df = pd.read_parquet("data/yellow_tripdata_2023-01.parquet")
taxi_feb23_df = pd.read_parquet("data/yellow_tripdata_2023-02.parquet")

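# The February file spells this column "Airport_fee"; rename it to match
# January so that concat aligns both months into a single column.
taxi_feb23_df.rename(columns={"Airport_fee": "airport_fee"}, inplace=True)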

taxi_df = pd.concat([taxi_jan23_df, taxi_feb23_df]).reset_index(drop=True)

In [77]:
print(taxi_jan23_df.shape)
print(taxi_feb23_df.shape)
print(taxi_df.shape)

(3066766, 19)
(2913955, 19)
(5980721, 19)
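
The rename before the concat matters because `pd.concat` aligns columns by exact name. A minimal sketch on two made-up one-row frames (the values and the `fare_amount` column are placeholders, not read from the parquet files) shows how to spot mismatched names and what happens without the rename:

```python
import pandas as pd

# Hypothetical one-row frames standing in for the monthly files
jan = pd.DataFrame({"fare_amount": [10.0], "airport_fee": [0.0]})
feb = pd.DataFrame({"fare_amount": [12.0], "Airport_fee": [1.25]})

# Columns that appear in only one of the two frames
mismatched = set(jan.columns) ^ set(feb.columns)
print(sorted(mismatched))  # ['Airport_fee', 'airport_fee']

# Without the rename, concat keeps both spellings and fills the gaps with NaN
print(pd.concat([jan, feb]).shape)  # (2, 3)

# After the rename, both months share one column
feb = feb.rename(columns={"Airport_fee": "airport_fee"})
print(pd.concat([jan, feb]).shape)  # (2, 2)
```

A check like this before concatenating is one way to confirm that the combined frame still has 19 columns rather than 20.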


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59 <-- Megan's Answer (closest option; I got 41.63 as the standard deviation)
* 52.59
* 62.59

In [78]:
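def duration_minutes(df):
    # The pickup and dropoff columns are already datetime dtypes, so
    # subtracting pickup from dropoff gives one timedelta per ride:
    df["duration"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]

    # Convert the duration to minutes.
    # NOTE: .dt.seconds is only the seconds component of each timedelta (whole
    # days are dropped); .dt.total_seconds() would convert the full duration.
    # This most likely explains the gap to the 42.59 option. The outputs in
    # this notebook were produced with .dt.seconds.
    df["duration"] = df["duration"].dt.seconds / 60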

In [79]:
duration_minutes(taxi_jan23_df)

In [80]:
# Now calculate the standard deviation of the trip durations in January
taxi_jan23_df["duration"].std()

41.62919110966266
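
The gap between the 41.63 computed above and the 42.59 option may come from how the timedelta is converted to minutes. A minimal sketch on three made-up trips (the timestamps are hypothetical, not taken from the dataset) shows how `.dt.seconds` and `.dt.total_seconds()` differ for a ride longer than a day and for a dropoff recorded before the pickup:

```python
import pandas as pd

# Hypothetical trips: 10 minutes, 1 day + 5 minutes, and -1 minute
pickup = pd.to_datetime(
    ["2023-01-01 00:00:00", "2023-01-01 00:00:00", "2023-01-01 00:01:00"]
)
dropoff = pd.to_datetime(
    ["2023-01-01 00:10:00", "2023-01-02 00:05:00", "2023-01-01 00:00:00"]
)
delta = pd.Series(dropoff - pickup)

# .dt.seconds keeps only the seconds component (0..86399) and drops whole days
minutes_component = delta.dt.seconds / 60

# .dt.total_seconds() converts the full timedelta, days and sign included
minutes_total = delta.dt.total_seconds() / 60

print(minutes_component.tolist())  # [10.0, 5.0, 1439.0]
print(minutes_total.tolist())      # [10.0, 1445.0, -1.0]
```

With `.dt.seconds`, the day-long trip looks like a 5-minute ride and the negative one like a 1439-minute ride, so both the standard deviation and the later outlier filter can shift slightly.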

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98% <-- Megan's Answer

In [81]:
# Dropping outliers
taxi_df_no_outliers = taxi_jan23_df[(taxi_jan23_df["duration"] >= 1) & (taxi_jan23_df["duration"] <= 60)]

In [82]:
# Calculate the percentage of records that remain after outliers are dropped
taxi_df_no_outliers.shape[0] / taxi_jan23_df.shape[0] * 100

98.12212604417813

In [83]:
# making a function to reuse this code later:
def get_rid_of_outliers(df):
    no_outliers_df = df[(df["duration"] >= 1) & (df["duration"] <= 60)]
    return no_outliers_df

In [84]:
taxi_jan23_df = get_rid_of_outliers(taxi_jan23_df)

In [85]:
taxi_jan23_df.shape

(3009176, 20)
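
The same filter and fraction can be written more compactly. This is a sketch on made-up durations, not the notebook's data; it assumes `Series.between`, which is inclusive on both ends by default, as a stand-in for the two comparisons used above:

```python
import pandas as pd

# Hypothetical durations in minutes, some outside the 1-60 range
durations = pd.Series([0.5, 1.0, 12.3, 45.0, 60.0, 61.0, 300.0, 7.5])

# between() is inclusive on both ends by default, matching ">= 1" and "<= 60"
mask = durations.between(1, 60)

# The mean of a boolean mask is the fraction of True values
fraction_kept = mask.mean()
print(f"{fraction_kept:.1%}")  # 62.5%
```

Taking the mean of the mask avoids dividing two `shape[0]` values by hand and gives the fraction directly.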

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings - otherwise it will 
  label encode them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515 <-- Megan's Answer (See code below)
* 715

In [86]:
# Define the feature dataframe:
X = taxi_jan23_df[["PULocationID", "DOLocationID"]].astype(str)

# Define the target variable:
y = taxi_jan23_df["duration"]

In [87]:
# Turn the feature dataframe into a list of dictionaries
X_dict = X.to_dict(orient="records")

In [88]:
# Use DictVectorizer to one-hot encode the list of dictionaries
dv = DictVectorizer(sparse=True)
X_OHE_array = dv.fit_transform(X_dict)

In [89]:
# Get the number of one-hot encoded columns
len(dv.get_feature_names_out())

515

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* 7.64 <-- Megan's Answer
* 11.64
* 16.64

In [90]:
# Instantiate the model (n_jobs only controls parallelism; everything else is default):
lr = LinearRegression(n_jobs=16)

# Train the model:
lr.fit(X_OHE_array, y)

In [91]:
predictions = lr.predict(X_OHE_array)
RMSE = np.sqrt(mean_squared_error(y, predictions))
RMSE

7.649261108323528
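
For reference, RMSE is the square root of the mean squared difference between true and predicted values. A minimal sketch on made-up numbers (the arrays are hypothetical) writes the formula out and compares it with the `mean_squared_error` route used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted durations in minutes
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 36.0])

# RMSE written out: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)  # both are sqrt(8.25), about 2.87
```

Because RMSE is in the same unit as the target, the 7.65 above reads as a typical error of roughly 7.6 minutes per ride on the training data.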

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

* 3.81
* 7.81 <-- Megan's Answer
* 11.81
* 16.81

In [92]:
# Add duration column to February data:
duration_minutes(taxi_feb23_df)

# Remove outliers in February data:
taxi_feb23_df = get_rid_of_outliers(taxi_feb23_df)

In [93]:
X_validate = taxi_feb23_df[["PULocationID", "DOLocationID"]].astype(str)
y_validate = taxi_feb23_df["duration"]

# Turn the feature dataframe into a list of dictionaries
X_validate_dict = X_validate.to_dict(orient="records")

X_validate_OHE_array = dv.transform(X_validate_dict)

validation_predictions = lr.predict(X_validate_OHE_array)

RMSE = np.sqrt(mean_squared_error(y_validate, validation_predictions))
RMSE

7.811820628330829
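
The validation data is passed through `dv.transform` rather than `fit_transform` so that its columns line up with what the model was trained on. A small sketch with made-up rides (the IDs are hypothetical) shows what happens to a location that was never seen during fitting:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training rides (IDs already cast to strings)
train = [
    {"PULocationID": "10", "DOLocationID": "20"},
    {"PULocationID": "11", "DOLocationID": "21"},
]
# A validation ride whose pickup zone "99" never appeared in training
valid = [{"PULocationID": "99", "DOLocationID": "20"}]

dv_demo = DictVectorizer()
dv_demo.fit(train)

# transform() reuses the fitted vocabulary, so the column count stays fixed
X_valid = dv_demo.transform(valid)
print(X_valid.shape)      # (1, 4)
print(X_valid.toarray())  # [[1. 0. 0. 0.]]
```

The unseen pickup zone is silently dropped, so that ride is predicted from its dropoff zone and the intercept alone.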

## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2024/homework/hw1
* If your answer doesn't match options exactly, select the closest one