In [1]:
import pandas as pd

#### **Data Ingestion**

In [5]:
# Download the January data
january_data_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'
df_january = pd.read_parquet(january_data_url)

# Download the February data
february_data_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet'
df_february = pd.read_parquet(february_data_url)

#### **Q1. Downloading the data**

##### Read the data for January. How many columns are there?

In [6]:
# Inspect the number of columns
num_columns = df_january.shape[1]
print(f'Number of columns in January data: {num_columns}')

Number of columns in January data: 19


#### **Q2. Computing duration**

##### What's the standard deviation of the trip durations in January?

In [7]:
# Convert pickup and dropoff times to datetime
df_january['tpep_pickup_datetime'] = pd.to_datetime(df_january['tpep_pickup_datetime'])
df_january['tpep_dropoff_datetime'] = pd.to_datetime(df_january['tpep_dropoff_datetime'])

# Compute the duration in minutes
df_january['duration'] = (df_january['tpep_dropoff_datetime'] - df_january['tpep_pickup_datetime']).dt.total_seconds() / 60

# Calculate the standard deviation of the duration
duration_std = df_january['duration'].std()
print(f'Standard deviation of trip duration: {duration_std:.2f}')

Standard deviation of trip duration: 42.59
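
To make the duration arithmetic concrete, here is a minimal sketch on a hypothetical two-trip frame (the timestamps are invented for illustration; only the column names match the taxi data). It also shows that pandas' `.std()` returns the sample standard deviation (`ddof=1`) by default, which is what the figure above reports.

```python
import pandas as pd

# Hypothetical two-trip frame; the timestamps are made up for illustration
trips = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 10:15:00']),
    'tpep_dropoff_datetime': pd.to_datetime(['2023-01-01 00:12:30', '2023-01-01 10:45:00']),
})

# Subtracting two datetime columns gives a timedelta; convert it to minutes
trips['duration'] = (
    trips['tpep_dropoff_datetime'] - trips['tpep_pickup_datetime']
).dt.total_seconds() / 60
durations = trips['duration'].tolist()

# pandas defaults to the sample standard deviation (ddof=1);
# pass ddof=0 for the population standard deviation
sample_std = trips['duration'].std()
population_std = trips['duration'].std(ddof=0)

print(durations)
print(f'sample std: {sample_std:.2f}, population std: {population_std:.2f}')
```

With millions of rows the two conventions give nearly identical values, so the choice matters little for the January data.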


#### **Q3. Dropping outliers**

##### What fraction of the records is left after you drop the outliers?

In [8]:
# Keep only trips that lasted between 1 and 60 minutes (inclusive)
filtered_df = df_january[(df_january['duration'] >= 1) & (df_january['duration'] <= 60)]

# Calculate the fraction of remaining records
fraction_remaining = len(filtered_df) / len(df_january)
print(f'Fraction of records after dropping outliers: {fraction_remaining:.2%}')

Fraction of records after dropping outliers: 98.12%


#### **Q4. One-hot encoding**

##### What's the dimensionality of this matrix (number of columns)?

In [9]:
from sklearn.feature_extraction import DictVectorizer

# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
filtered_df = filtered_df.copy()

# Convert the location IDs to strings so DictVectorizer one-hot encodes them
filtered_df['PULocationID'] = filtered_df['PULocationID'].astype(str)
filtered_df['DOLocationID'] = filtered_df['DOLocationID'].astype(str)

# Convert the DataFrame to a list of dictionaries
data_dicts = filtered_df[['PULocationID', 'DOLocationID']].to_dict(orient='records')

# Fit a DictVectorizer
dv = DictVectorizer()
X_train = dv.fit_transform(data_dicts)

# Get the dimensionality of the feature matrix
num_columns = X_train.shape[1]
print(f'Number of columns in the feature matrix: {num_columns}')

Number of columns in the feature matrix: 515


#### **Q5. Training a model**

##### What's the RMSE on train?

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Train a linear regression model
target = filtered_df['duration']
model = LinearRegression()
model.fit(X_train, target)

# Predict on the training data
y_pred = model.predict(X_train)

# Calculate the RMSE
rmse_train = np.sqrt(mean_squared_error(target, y_pred))
print(f'RMSE on train: {rmse_train:.2f}')


RMSE on train: 7.65
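
For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as the target (minutes here). A minimal sketch with made-up numbers, using only NumPy:

```python
import numpy as np

# Hypothetical targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(f'RMSE: {rmse:.2f}')
```

This is the same quantity as `np.sqrt(mean_squared_error(target, y_pred))` in the cell above.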


#### **Q6. Evaluating the model**

##### What's the RMSE on validation?

In [11]:
# Preprocess the February data similarly
df_february['tpep_pickup_datetime'] = pd.to_datetime(df_february['tpep_pickup_datetime'])
df_february['tpep_dropoff_datetime'] = pd.to_datetime(df_february['tpep_dropoff_datetime'])
df_february['duration'] = (df_february['tpep_dropoff_datetime'] - df_february['tpep_pickup_datetime']).dt.total_seconds() / 60
# Copy the filtered slice so the assignments below do not trigger SettingWithCopyWarning
filtered_df_feb = df_february[(df_february['duration'] >= 1) & (df_february['duration'] <= 60)].copy()
filtered_df_feb['PULocationID'] = filtered_df_feb['PULocationID'].astype(str)
filtered_df_feb['DOLocationID'] = filtered_df_feb['DOLocationID'].astype(str)

# Convert the February data to a list of dictionaries
data_dicts_feb = filtered_df_feb[['PULocationID', 'DOLocationID']].to_dict(orient='records')

# Transform the validation data using the fitted DictVectorizer
X_val = dv.transform(data_dicts_feb)

# Predict on the validation data
y_val_pred = model.predict(X_val)

# Calculate the RMSE on validation data
target_val = filtered_df_feb['duration']
rmse_val = np.sqrt(mean_squared_error(target_val, y_val_pred))
print(f'RMSE on validation: {rmse_val:.2f}')

RMSE on validation: 7.81
