# Milestone 2
Grant Perkins

## Problem Statement
Can features be derived from the MBTA's reliability dataset to classify, with at least 75% accuracy, whether the Red Line Rapid Transit will be late during peak hours?

### Background
The Massachusetts Bay Transportation Authority (MBTA) is the primary operator of public transportation in the greater Boston area. It provides regular commuter rail, rapid transit, and bus service for a minimal fee. The MBTA has committed to providing “accurate and timely information” about delays in its services (https://www.mbta.com/policies/our-commitment-you). Predicting whether the MBTA’s Red Line Rapid Transit will be late, using the past month of data on how many people tried to ride MBTA Rapid Transit and how many boarded on time, would improve the accuracy of that reported information, in line with the MBTA’s business goals. I will initially attempt to solve this problem with a Recurrent Neural Network (RNN).

### Dataset Description
The dataset can be found here: [https://mbta-massdot.opendata.arcgis.com/datasets/MassDOT::mbta-bus-commuter-rail-rapid-transit-reliability/about](https://mbta-massdot.opendata.arcgis.com/datasets/MassDOT::mbta-bus-commuter-rail-rapid-transit-reliability/about)

The dataset I will use is the “MBTA Bus, Commuter Rail, & Rapid Transit Reliability” dataset. It is a daily report of reliability for all bus, commuter rail, and rapid transit services run by the MBTA. Each row provides a service date, the mode of transportation, the number of people who used the service, and the number of people who wanted to use it. There are over 8 million rows, with service dates ranging from 2015 to the start of December 2022. Records are daily aggregates rather than intraday data, so unfortunately there is no information about the specific times at which trains were late or canceled.

The dataset is easily accessible as a CSV file. It is updated every few months; the last update was November 30, 2022. This is the only data I will use for this project. I will discard a large portion of it, since I am only interested in Rapid Transit trains during peak hours.

# EDA Code


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import seaborn as sns

### Load Dataset
In this section, I load and preprocess the dataset for training and prediction. The following steps are taken:
 - Load raw csv
 - Isolate rapid transit trains during peak hours
 - Reshape df from one row per day per line to one row per day with columns for each line
 - Fill missing values with per-column medians (from EDA)
 - Save df for future use
 - Separate prediction variables from labels (red line reliability)
 - Use sliding window to create 30 day periods to predict
 - Split dataset 60:20:20 train:validation:test
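
The sliding-window step is the easiest one to get wrong by one day, so here is a minimal sketch of the intended alignment on toy data. The helper name `make_windows` and the toy arrays are assumptions made for illustration, not part of the notebook; the idea is only that each block of `window` consecutive days is paired with the label of the day that follows it.

```python
import numpy as np

def make_windows(features, labels, window):
    """Pair each `window`-day block of features with the label of the following day."""
    n_days = len(features)
    # The last day has no "next day", so there are n_days - window usable blocks
    X = np.stack([features[i:i + window] for i in range(n_days - window)])
    y = labels[window:]
    return X, y

# Toy data (made up): 6 days, 2 features per day, one binary label per day
features = np.arange(12).reshape(6, 2)
labels = np.array([0, 1, 0, 0, 1, 1])

X, y = make_windows(features, labels, window=3)
print(X.shape, y.shape)  # (3, 3, 2) (3,)
```

Here `X[0]` holds days 0 to 2 and `y[0]` is the label of day 3, which mirrors the 30-day version used for the real data.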

In [47]:
# load raw csv
df = pd.read_csv(r"C:\Users\gcper\OneDrive\WPI\ML\final\MBTA_Reliability.csv")

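# isolate rapid transit trains during peak hours
# .copy() makes the filtered frame independent, so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
df = df[(df["gtfs_route_desc"] == "Rapid Transit") & (df["mode_type"] == "Rail") & (df["peak_offpeak_ind"] == "PEAK")].copy()
df["service_date"] = df["service_date"].apply(lambda s: s.split()[0])  # keep the date part, drop the time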
routes = df["gtfs_route_id"].unique()

# reshape df
# make template df, one row for each date
new_df = pd.DataFrame({"service_date": df.service_date.unique()})
for route in routes:
    # make a column for each route's numerator and denominator
    route_df = df[df.gtfs_route_id == route][["service_date", "otp_numerator", "otp_denominator"]]
    route_df = route_df.rename({"otp_numerator": f"{route}_Numerator", "otp_denominator": f"{route}_Denominator"},
                               errors="raise", axis=1)
    new_df = new_df.merge(route_df, on="service_date",
                          how="left")  # puts NaNs where data is missing (how=left)
new_df.service_date = pd.to_datetime(new_df.service_date, format="%Y/%m/%d",
                                     errors="raise")  # convert service date to date type
new_df = new_df.sort_values(by="service_date", ascending=True)  # sort df by date, oldest to newest
new_df = new_df.rename({"service_date": "Date"}, axis=1)
new_df = new_df.set_index("Date")

# fill missing values with each column's median
for col in new_df.columns:
    new_df[col] = new_df[col].fillna(new_df[col].median())

# save final df
new_df.to_csv("mbta.csv")
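
As a possible alternative to the merge loop above, the same long-to-wide reshape can be sketched with `DataFrame.pivot`. This is only a sketch on made-up values, and it assumes each (date, route) pair appears at most once after filtering; `pivot` raises a `ValueError` on duplicates. The names `long_df` and `wide` are hypothetical.

```python
import pandas as pd

# Toy long-format frame mirroring the columns used above (values are made up)
long_df = pd.DataFrame({
    "service_date": ["2022/01/01", "2022/01/01", "2022/01/02"],
    "gtfs_route_id": ["Red", "Blue", "Red"],
    "otp_numerator": [90.0, 45.0, 80.0],
    "otp_denominator": [100.0, 50.0, 100.0],
})

# One row per date, one (value, route) column pair per combination; gaps become NaN
wide = long_df.pivot(index="service_date", columns="gtfs_route_id",
                     values=["otp_numerator", "otp_denominator"])

# Flatten the (value, route) column MultiIndex into names like "Red_Numerator"
wide.columns = [f"{route}_{value.split('_')[1].capitalize()}"
                for value, route in wide.columns]

# Parse the dates, sort oldest to newest, and name the index "Date"
wide.index = pd.to_datetime(wide.index, format="%Y/%m/%d")
wide = wide.sort_index().rename_axis("Date")
print(wide)
```

The Blue columns are `NaN` on the second day because the toy data has no Blue row for that date, which is the same gap the median fill is meant to handle.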

In [3]:
def load_dataset(mbta_df, window=30, val_split=0.2, test_split=0.2):
    # separate input and labels
    red_line = mbta_df["Red_Numerator"] / mbta_df["Red_Denominator"]
    variables_df = mbta_df[[col for col in mbta_df.columns if "Red" not in col]]
    
    # sliding window
    windows = variables_df.rolling(window=window)
    X = [w.to_numpy() for w in windows]  # `w` avoids shadowing the `window` size parameter
    X = X[window-1:-1] # first `window - 1` windows are cut short; drop the last window as well (no next-day label to predict)
    X = np.array(X)
    y = red_line.to_numpy()
    y = 1.0 * (y < 0.9) # define unreliable as < 90%
    y = y[window:] # offset labels. trying to predict the NEXT day
    
    # train - validation - test split
    # 60 : 20 : 20
    length = len(y)
    train_length = int((1 - val_split - test_split) * length)
    val_length = int(val_split * length)
    # the test set takes whatever remains, so it can be slightly larger than int(test_split * length)
    train_X, train_y = X[:train_length], y[:train_length]
    val_X, val_y = X[train_length:train_length+val_length], y[train_length:train_length+val_length]
    test_X, test_y = X[train_length+val_length:], y[train_length+val_length:]
    return train_X, train_y, val_X, val_y, test_X, test_y
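
Since the problem statement targets 75% accuracy, one sanity check that could be applied to the labels returned by `load_dataset` is the accuracy of always predicting the most common training label. The sketch below uses made-up labels, and the helper name `majority_baseline_accuracy` is an assumption for illustration; it says nothing about the actual class balance of the MBTA data.

```python
import numpy as np

def majority_baseline_accuracy(train_y, eval_y):
    """Accuracy obtained by always predicting the most common training label."""
    majority = 1.0 if np.mean(train_y) >= 0.5 else 0.0
    return float(np.mean(eval_y == majority))

# Made-up labels for illustration only (1.0 = unreliable day, 0.0 = reliable day)
toy_train_y = np.array([0.0, 0.0, 1.0, 0.0])
toy_val_y = np.array([0.0, 1.0, 0.0, 0.0])

print(majority_baseline_accuracy(toy_train_y, toy_val_y))  # 0.75
```

If the real labels turned out to be this imbalanced, a model would need to beat the baseline, not just the 75% figure, to show that it has learned anything.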

In [4]:
df = pd.read_csv("mbta.csv")
train_X, train_y, val_X, val_y, test_X, test_y = load_dataset(df)
print([i.shape for i in [train_X, train_y, val_X, val_y, test_X, test_y]])

[(1062, 30, 13), (1062,), (354, 30, 13), (354,), (355, 30, 13), (355,)]
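
The test split comes out one sample larger than the validation split because `int()` truncates the train and validation lengths and the test set takes the remainder. A small sketch of that arithmetic, using the total implied by the shapes printed above (the function name `split_lengths` is hypothetical):

```python
def split_lengths(length, val_split=0.2, test_split=0.2):
    """Reproduce the split arithmetic: truncate train/val, give the remainder to test."""
    train_length = int((1 - val_split - test_split) * length)
    val_length = int(val_split * length)
    test_length = length - train_length - val_length
    return train_length, val_length, test_length

total = 1062 + 354 + 355  # 1771 samples, taken from the printed shapes
print(split_lengths(total))  # (1062, 354, 355)
```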
