In [1]:
import pandas as pd

In [2]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Q1. Reading January's data for yellow taxi trips

In [3]:
df = pd.read_parquet('../data/yellow_tripdata_2022-01.parquet')
num_cols = len(df.columns)
print(f"The number of columns is: {num_cols}")

The number of columns is: 19


### Q2. Computing duration

First, we look at the columns in our data and identify the ones needed to compute the trip duration.

In [4]:
df.PULocationID.dtype

dtype('int64')

In [5]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

In [6]:
(df['tpep_pickup_datetime'] - df['tpep_dropoff_datetime'])

0         -1 days +23:42:11
1         -1 days +23:51:36
2         -1 days +23:51:02
3         -1 days +23:49:58
4         -1 days +23:22:28
                 ...       
2463926   -1 days +23:54:02
2463927   -1 days +23:49:21
2463928   -1 days +23:49:00
2463929   -1 days +23:47:57
2463930   -1 days +23:33:00
Length: 2463931, dtype: timedelta64[ns]

The trip duration can be calculated by subtracting the pickup time from the dropoff time, which correspond to the "tpep_pickup_datetime" and "tpep_dropoff_datetime" columns respectively. (The exploratory cell above subtracts in the opposite order, which is why its values are negative.)

Subtracting two datetime (<M8[ns], i.e. datetime64[ns]) columns in pandas results in a column of type timedelta64[ns]. We can then apply a lambda function over the column (which is a Series object in pandas) to find each trip's duration in minutes.

In [7]:
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
df['duration'] = df.duration.apply(lambda td: td.total_seconds() / 60)
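
The element-wise `apply` above works, but pandas also offers a vectorized accessor for timedelta columns, which tends to be faster on millions of rows. A minimal sketch on a made-up two-row frame (the timestamps are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame with invented timestamps, mirroring the two datetime columns used above
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 10:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2022-01-01 00:17:30", "2022-01-01 10:45:00"]),
})

# Dropoff minus pickup gives a timedelta Series
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# .dt.total_seconds() converts the whole Series at once; dividing by 60 gives minutes
toy["duration"] = delta.dt.total_seconds() / 60
print(toy["duration"].tolist())  # [17.5, 30.0]
```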

Having defined and populated the 'duration' column, we can perform numeric operations on it.
To calculate the standard deviation of the trip durations, we use pandas' built-in std method for Series.

In [8]:
print("The std is: ", df['duration'].std())

The std is:  46.44530513776802


### Q3. Dropping outliers

We then remove the outliers from our dataset. Using the pandas DataFrame API, we filter out every record whose trip duration is over 60 minutes or under 1 minute.

In [9]:
df_filtered = df[(df.duration >= 1) & (df.duration <= 60)]
outliers_percentage = (len(df.index) - len(df_filtered.index))/len(df.index)

In [10]:
print(f"Outliers correspond to {round(100*outliers_percentage, 5)}% of our dataset")
print("Dataset size after removing outliers: ", len(df_filtered))

Outliers correspond to 1.72452% of our dataset
Dataset size after removing outliers:  2421440
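
The same filter can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses invented durations to show the mask and the dropped percentage on a frame small enough to check by hand:

```python
import pandas as pd

# Invented durations in minutes; only the values in [1, 60] should survive
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.3, 60.0, 75.0]})

# between(1, 60) matches (duration >= 1) & (duration <= 60)
mask = toy["duration"].between(1, 60)
kept = toy[mask]

# Share of rows dropped, expressed as a percentage
dropped_pct = 100 * (1 - mask.mean())
print(len(kept), round(dropped_pct, 1))  # 3 40.0
```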


### Q4. One-hot encoding

Our data contains categorical columns. To transform them into something a machine learning algorithm can work with, we use sklearn's DictVectorizer, which transforms categorical (string-valued) features into their one-hot encoded equivalents.

*Note that we cannot simply replace each categorical value with an integer index taken from a dictionary of all possible values. Doing so would introduce an implicit bias into our models: a linear model trained with a common loss function such as MSE would treat the indices as magnitudes, so categories with neighbouring indices would be treated as semantically similar even though the ordering is arbitrary.*

Furthermore, our encoded data will contain 

First we decide which categorical columns will be used by our model. Pickup and dropoff location seem suitable for a trip duration prediction model.
However, these columns contain integer data (int64, as seen in Q2), whereas DictVectorizer only one-hot encodes string values and passes numeric values through unchanged. Therefore, these columns have to be cast to type string before using our one-hot encoder vectorizer.

In [11]:
Categorical_columns = ['PULocationID', 'DOLocationID']

# Numerical_columns = ['trip_distance']

# df_filtered[Categorical_columns] = df_filtered[Categorical_columns].astype(str)
# df_condensed = pd.concat([df_filtered[Categorical_columns].astype(str), df_filtered[Numerical_columns]])
df_condensed = df_filtered[Categorical_columns]
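
How the value type affects DictVectorizer can be checked on a toy example. This is only a sketch: the records and zone IDs below are invented placeholders, not rows from the taxi data.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; the IDs are placeholders, not real taxi zones
records_int = [
    {"PULocationID": 10, "DOLocationID": 20},
    {"PULocationID": 30, "DOLocationID": 20},
    {"PULocationID": 10, "DOLocationID": 40},
]
# Same records with every value cast to str
records_str = [{k: str(v) for k, v in r.items()} for r in records_int]

# Integer values are treated as numeric features: one column per key
dv_int = DictVectorizer()
X_int = dv_int.fit_transform(records_int)
print(X_int.shape, dv_int.feature_names_)

# String values are one-hot encoded: one column per (key, value) pair
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(records_str)
print(X_str.shape, dv_str.feature_names_)
```

With integer values the matrix has 2 columns, one per key; with string values it has 4, one for each distinct key=value pair.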


In [12]:
dv = DictVectorizer()
X_train = dv.fit_transform(df_condensed.to_dict(orient='records'))
y_train = df_filtered['duration']

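After one-hot encoding, X_train should have one row per record in our dataset (after outlier filtering) and one column per unique categorical value, that is, as many columns as the sum of the unique values in each categorical column. The output below shows only 2 columns, however: the cell above leaves the location IDs as int64 (the astype(str) cast is commented out), so DictVectorizer treats them as two numeric features instead of one-hot encoding them.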

In [13]:
print(f"Shape of One-hot encoded Matrix: {X_train.shape}")
print(f"Total number of unique 'PULocationID' and 'DOLocationID' values: {len(set(df_condensed['DOLocationID'].astype(str)))+len(set(df_condensed['PULocationID'].astype(str)))}")

Shape of One-hot encoded Matrix: (2421440, 2)
Total number of unique 'PULocationID' and 'DOLocationID' values: 515


### Q5. Training a model

Now we use the feature matrix X_train to train a linear regression model. The target variable is y_train, that is, the variable our model must learn to predict from the features in X_train.

For the linear regression model, we use sklearn's LinearRegression class.

In [14]:
lr = LinearRegression()
lr.fit(X_train, y_train)


In [15]:
y_pred_train = lr.predict(X_train)
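
To build some intuition for what a linear model learns from one-hot features, here is a small sketch on invented data with a single categorical feature. In that special case, least squares ends up predicting the mean duration of each category:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Invented trips: durations (minutes) differ by pickup zone
records = [
    {"PULocationID": "1"}, {"PULocationID": "1"},
    {"PULocationID": "2"}, {"PULocationID": "2"},
]
durations = [10.0, 12.0, 20.0, 22.0]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)  # sparse one-hot matrix, shape (4, 2)

lr_toy = LinearRegression()
lr_toy.fit(X_toy, durations)

# With a single one-hot feature, the fitted values are the per-zone mean durations
preds = lr_toy.predict(X_toy)
print(preds.round(2))
```

The predictions should come out close to 11 for zone "1" and 21 for zone "2", the respective group means. With two categorical features, as in this notebook, the model instead learns an additive effect per pickup zone and per dropoff zone.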

We then calculate the training loss using the RMSE metric (root mean squared error; passing squared=False makes mean_squared_error return the square root of the MSE), again using sklearn's implementation.

In [16]:
print("Training RMSE: ", mean_squared_error(y_train, y_pred_train, squared=False))

Training RMSE:  8.920327827581444
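
With squared=False, the value reported above is the square root of the mean squared error, so it is expressed in the same unit as the target (minutes). The computation can be sketched by hand as follows; `rmse` is a hypothetical helper written for illustration and the numbers are invented:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented durations (minutes) and predictions
y_true_toy = [10.0, 20.0, 30.0]
y_pred_toy = [12.0, 18.0, 33.0]

# Errors are -2, 2, -3 -> squared 4, 4, 9 -> mean 17/3 -> root ~2.38
print(round(rmse(y_true_toy, y_pred_toy), 2))  # 2.38
```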


### Q6. Evaluation

Similar to Q5, we can apply the linear regression model to the validation subset of our dataset. This way we can evaluate the linear regressor on data that wasn't used to train it.

To build the feature matrix for the validation subset, we have to repeat every step we followed for the train subset. Therefore, we collect these steps in the "read_dataframe" function and apply it to both files.


In [17]:
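def read_dataframe(filename):
    # Load a yellow taxi trip file; CSV timestamps have to be parsed explicitly
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)

        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
    else:
        raise ValueError(f"Unsupported file type: {filename}")

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Keep trips of 1 to 60 minutes; copy so the assignment below does not hit a view
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Cast location IDs to strings so that DictVectorizer one-hot encodes them
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df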

In [18]:
df_train = read_dataframe('../data/yellow_tripdata_2022-01.parquet')
df_val = read_dataframe('../data/yellow_tripdata_2022-02.parquet')

In [19]:
print("Training subset size: ", len(df_train))
print("Validation subset size: ", len(df_val))

Training subset size:  2421440
Validation subset size:  2918187


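We can then, similarly, define a "calc_feature_matrix" function, which computes the feature matrix of each dataframe. The DictVectorizer is fitted on the train data only and then reused to transform the validation data.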

In [20]:
def calc_feature_matrix(df_train, df_val, cols=['PULocationID', 'DOLocationID']):
    dv = DictVectorizer()
    train_dicts = df_train[cols].to_dict(orient='records')
    X_train = dv.fit_transform(train_dicts)

    val_dicts = df_val[cols].to_dict(orient='records')
    X_val = dv.transform(val_dicts)
    
    return X_train, X_val
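
Fitting the vectorizer on the train data only has one consequence worth knowing: categories that appear only in the validation data are silently ignored by `transform`. A small sketch with invented records, where zone "99" is a made-up unseen value:

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; zone "99" appears only in the validation data
train_dicts_toy = [{"PULocationID": "1"}, {"PULocationID": "2"}]
val_dicts_toy = [{"PULocationID": "2"}, {"PULocationID": "99"}]

dv_toy = DictVectorizer()
X_train_toy = dv_toy.fit_transform(train_dicts_toy)  # vocabulary is learned here only
X_val_toy = dv_toy.transform(val_dicts_toy)          # reuses the train vocabulary

# Both matrices share the same columns; the unseen zone becomes an all-zero row
print(X_train_toy.shape, X_val_toy.shape)
print(X_val_toy.toarray())
```

Because both matrices have identical columns, a model fitted on the train matrix can be applied to the validation matrix directly.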

In [21]:
X_train, X_val = calc_feature_matrix(df_train, df_val)

In [22]:
y_train = df_train['duration'].values
y_val = df_val['duration'].values

In [23]:
lr = LinearRegression(n_jobs=-1)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

mean_squared_error(y_val, y_pred, squared=False)

7.786389417027388

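*Note that the validation RMSE (about 7.79) is lower than the training RMSE reported in Q5 (about 8.92). The two values are not directly comparable: the Q5 model was fitted on a two-column feature matrix built from integer location IDs, whereas the model here is fitted on string-cast IDs. Ordinarily we would expect the training loss to be the lower one, since the linear regressor is trained to minimize the loss on the train subset and not the validation subset.*

*To judge how well the linear regression model generalizes, its RMSE on the validation subset should be compared with the RMSE of the same model on the train subset (computed from lr.predict(X_train)); a validation RMSE close to that value indicates good generalization.*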