In [1]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

**Info on column names** - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

If the download fails, comment out `%%capture` to see the `wget` output.

In [2]:
%%capture
!wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -P data/
!wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet -P data/

## Question 1 - Downloading the data
Read the data for January. How many columns are there?

In [3]:
jan_df = pd.read_parquet('./data/yellow_tripdata_2023-01.parquet')

print ('Number of columns in January data:', len(jan_df.columns))

Number of columns in January data: 19


## Question 2 - Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the standard deviation of the trip durations in January?

In [4]:
jan_df['duration'] = jan_df.tpep_dropoff_datetime - jan_df.tpep_pickup_datetime
# Convert the timedelta column to minutes with the vectorized .dt accessor
jan_df.duration = jan_df.duration.dt.total_seconds() / 60

print ('Standard deviation of the trips duration in January:', round(jan_df.duration.std(), 2))

Standard deviation of the trips duration in January: 42.59
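
To see what the duration conversion computes for a single trip, here is a minimal sketch using only the standard library. The timestamps are made up for illustration and are not taken from the dataset; pandas `Timedelta` objects expose the same `total_seconds()` method.

```python
from datetime import datetime

# Hypothetical pickup and dropoff times (not taken from the dataset)
pickup = datetime(2023, 1, 1, 0, 32, 10)
dropoff = datetime(2023, 1, 1, 0, 40, 36)

# Subtracting two datetimes gives a timedelta
delta = dropoff - pickup

# total_seconds() returns a float; dividing by 60 converts it to minutes
duration_minutes = delta.total_seconds() / 60

print(round(duration_minutes, 2))
```

Dividing the total number of seconds keeps the fractional part of a minute, which matters here because the outlier filter and the regression target both use fractional minutes.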


## Question 3 - Dropping outliers
Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after dropping the outliers?

In [5]:
jan_df_filtered = jan_df[(jan_df.duration >= 1) & (jan_df.duration <= 60)].copy()

print ("fraction of records left after dropping outliers:", round(len(jan_df_filtered) / len(jan_df), 2) * 100, "%")

fraction of records left after dropping outliers: 98.0 %
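
The inclusive bounds are easy to get wrong, so here is a small pure-Python sketch of the same filter. The duration values are hypothetical and chosen to sit on and around both boundaries.

```python
# Hypothetical durations in minutes, chosen to include both boundary values
durations = [0.5, 1.0, 12.3, 60.0, 60.01, 180.0]

# Keep trips lasting between 1 and 60 minutes, boundaries included
kept = [d for d in durations if 1 <= d <= 60]

# Share of the original records that survive the filter
fraction_left = len(kept) / len(durations)

print(f"{fraction_left:.2%} of records kept")
```

In pandas, `Series.between(1, 60)` expresses the same condition and is inclusive on both ends by default.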


## Question 4 - One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise the vectorizer treats them as numeric values and produces a single column per feature instead of one-hot encoding them)
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

In [6]:
categorical = ['PULocationID', 'DOLocationID']
jan_df_filtered[categorical] = jan_df_filtered[categorical].astype(str)

train_dicts = jan_df_filtered[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

print ("Sample of train_dicts:", train_dicts[:2])
print ("Shape of X_train:", X_train.shape)
print ("Num columns in X_train:", X_train.shape[1])

Sample of train_dicts: [{'PULocationID': '161', 'DOLocationID': '141'}, {'PULocationID': '43', 'DOLocationID': '237'}]
Shape of X_train: (3009173, 515)
Num columns in X_train: 515


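`DictVectorizer` creates one column per unique value of each string-valued feature (named `feature=value`) and fills it with 1 where that value is present and 0 elsewhere. Numerical features are passed through unchanged, one column per feature. With the default `sort=True`, columns are ordered by sorted feature name, so numerical columns are not necessarily placed at the end; `dv.get_feature_names_out()` shows the actual order.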
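
The encoding rule can be illustrated with a pure-Python sketch. This is not scikit-learn's implementation: the helpers `fit_feature_names` and `transform` are hypothetical, the output is a dense list of lists rather than a sparse matrix, and `Distance` is a made-up numeric feature added only to show how strings and numbers are handled differently.

```python
def fit_feature_names(records):
    """Collect column names: 'feature=value' for strings, 'feature' for numbers."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    # Columns are ordered by sorted name, mirroring the sort=True default
    return sorted(names)


def transform(records, feature_names):
    """Build dense rows; names that were not collected during fitting are skipped."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                name, cell = f"{key}={value}", 1.0  # one-hot indicator
            else:
                name, cell = key, float(value)      # numeric pass-through
            if name in index:
                row[index[name]] = cell
        rows.append(row)
    return rows


# Hypothetical records: location IDs as strings, Distance as a number
records = [
    {"PULocationID": "161", "DOLocationID": "141", "Distance": 0.97},
    {"PULocationID": "43", "DOLocationID": "141", "Distance": 1.1},
]
feature_names = fit_feature_names(records)
X = transform(records, feature_names)
print(feature_names)
print(X)

# With integer IDs there is no one-hot expansion: one numeric column per feature
int_names = fit_feature_names([{"PULocationID": 161}, {"PULocationID": 43}])
print(int_names)
```

In this sketch the numeric `Distance` column lands between the one-hot columns because of the name ordering, and the integer-valued IDs collapse into a single column, which is why the IDs are cast to strings before vectorizing.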

## Question 5 - Training a model

Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters, where duration is the response variable
- Calculate the RMSE of the model on the training data

What's the RMSE on train?

In [7]:
target = 'duration'
y_train = jan_df_filtered[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)
rmse = root_mean_squared_error(y_train, y_pred)

In [8]:
print (f"RMSE: {round(rmse, 3)}")

RMSE: 7.649
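
For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as the target (minutes here). The sketch below computes it by hand; the helper name `rmse_by_hand` and the numbers are hypothetical and unrelated to the taxi data.

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error of two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations in minutes and hypothetical predictions
y_true_example = [10.0, 20.0, 30.0]
y_pred_example = [12.0, 18.0, 33.0]

# Squared errors are 4, 4 and 9, so the mean is 17/3
example_rmse = rmse_by_hand(y_true_example, y_pred_example)
print(round(example_rmse, 3))
```

Because errors are squared before averaging, a few badly predicted trips raise the score more than many small misses.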


## Question 6 - Evaluating the model

Now let's apply this model to the validation dataset (February 2023).

What's the RMSE on validation?

In [9]:
feb_df = pd.read_parquet('./data/yellow_tripdata_2023-02.parquet')

feb_df['duration'] = feb_df.tpep_dropoff_datetime - feb_df.tpep_pickup_datetime
# Convert the timedelta column to minutes with the vectorized .dt accessor
feb_df.duration = feb_df.duration.dt.total_seconds() / 60

feb_df_filtered = feb_df[(feb_df.duration >= 1) & (feb_df.duration <= 60)].copy()
feb_df_filtered[categorical] = feb_df_filtered[categorical].astype(str)

val_dicts = feb_df_filtered[categorical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_val = feb_df_filtered[target].values

val_preds = lr.predict(X_val)
rmse = root_mean_squared_error(y_val, val_preds)

print (f"Val RMSE: {round(rmse, 3)}")

Val RMSE: 7.812
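
The validation data goes through `dv.transform` rather than `dv.fit_transform`, so it is encoded with the columns learned from January. A consequence worth knowing is that category values that never appeared during fitting are silently ignored. The pure-Python sketch below illustrates that behaviour for one feature; the `one_hot` helper and the ID values are hypothetical, and this is not scikit-learn's code.

```python
# Vocabulary learned from hypothetical training values (the "fit" step)
train_values = ["161", "43", "161"]
vocabulary = {value: i for i, value in enumerate(sorted(set(train_values)))}


def one_hot(value, vocabulary):
    """Encode one value with a fixed vocabulary; unseen values give all zeros."""
    row = [0] * len(vocabulary)
    if value in vocabulary:
        row[vocabulary[value]] = 1
    return row


# Hypothetical validation values: "999" never appeared in training
val_values = ["43", "999"]
X_val_sketch = [one_hot(v, vocabulary) for v in val_values]

print(vocabulary)
print(X_val_sketch)
```

Reusing the fitted vocabulary keeps the number and order of columns identical between training and validation, which the fitted linear model relies on when predicting.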
