# CyRide ML
This notebook is an exploratory 'lab' in which I use CyRide data to calculate some statistics and do some light machine learning. For now, I'm using only the February 2022 data, but once a more precise learning goal is defined, I will use the combined dataset (October 2021 to June 2022).

## Imports & Installations

In [None]:
!python3 -m pip install pandas numpy scikit-learn

In [43]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from dateutil import parser

## Reading in Data

In [None]:
october_2021 = pd.read_csv('./data/2021_10.csv')
november_2021 = pd.read_csv('./data/2021_11.csv')
december_2021 = pd.read_csv('./data/2021_12.csv')
january_2022 = pd.read_csv('./data/2022_01.csv')
february_2022 = pd.read_csv('./data/2022_02.csv')
march_2022 = pd.read_csv('./data/2022_03.csv')
april_2022 = pd.read_csv('./data/2022_04.csv')
may_2022 = pd.read_csv('./data/2022_05.csv')
june_2022 = pd.read_csv('./data/2022_06.csv')

frames = [october_2021, november_2021, december_2021, january_2022, february_2022, march_2022, april_2022, may_2022, june_2022]
combined_data = pd.concat(frames)

combined_data.drop(['VehicleName', 'route', 'pattern', 'PatternName', 'stop', 'trip', 'tripName', 'block', 'ons', 'offs'], axis=1, inplace=True)
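
The nine near-identical `read_csv` calls above could also be written as a loop. The following is a minimal sketch rather than the notebook's actual code: it assumes the monthly files follow the `YYYY_MM.csv` naming shown above, and the `load_months` helper and the temporary demo directory are hypothetical names introduced only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_months(data_dir, drop_columns=()):
    """Read every YYYY_MM.csv file in data_dir and stack them into one frame."""
    paths = sorted(Path(data_dir).glob("[0-9][0-9][0-9][0-9]_[0-9][0-9].csv"))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each month's own row numbers.
    combined = pd.concat(frames, ignore_index=True)
    # errors="ignore" skips any listed column that the data does not contain.
    return combined.drop(columns=list(drop_columns), errors="ignore")


# Demo on two tiny made-up files (placeholder values, not CyRide data).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("2021_10.csv", "2021_11.csv"):
        pd.DataFrame(
            {"RouteName": ["1 Red East", "11 Cherry"], "block": [1, 2]}
        ).to_csv(Path(demo_dir) / name, index=False)

    demo = load_months(demo_dir, drop_columns=["block", "trip"])

print(demo.shape)
print(list(demo.columns))
```

Because the file names sort chronologically, `sorted` keeps the months in order, and adding a new month only requires dropping another file into the directory.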

In [None]:
# combined_data.head()
print(combined_data.columns)


## Preprocessing
I need to create an 'arrival diff' and a 'departure diff' to measure how far each actual time is from its scheduled time. I also need to encode some categorical data.

In [28]:
# Seconds between scheduled and actual departure (scheduled - actual).
february_2022 = february_2022.assign(departure_diff=february_2022.apply(
    lambda row: pd.Timedelta(
        parser.parse(row.scheduled_depart) - parser.parse(row.depart)
    ).total_seconds(),
    axis=1))

# Seconds between scheduled and actual arrival (scheduled - actual).
february_2022 = february_2022.assign(arrival_diff=february_2022.apply(
    lambda row: pd.Timedelta(
        parser.parse(row.scheduled_arrive) - parser.parse(row.arrive)
    ).total_seconds(),
    axis=1))



On any diff, a negative value means the bus was late by that many seconds (the diff is scheduled minus actual), while a positive value means it was early.
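
The same diff can be computed column-wise instead of row by row, which is usually much faster on a full month of data. This is a sketch under assumptions, not the notebook's method: `diff_seconds` is a hypothetical helper, the sample timestamps are simplified (no fractional seconds), and it assumes `pd.to_datetime` can parse the timestamp strings in the real files.

```python
import pandas as pd


def diff_seconds(scheduled, actual):
    """Return scheduled - actual in seconds; negative means late."""
    # utc=True puts every timestamp on one timeline, so rows with different
    # UTC offsets (for example across a daylight-saving change) still subtract.
    scheduled_ts = pd.to_datetime(scheduled, utc=True)
    actual_ts = pd.to_datetime(actual, utc=True)
    return (scheduled_ts - actual_ts).dt.total_seconds()


# First row mirrors the first arrival in the February data; second is made up.
sample = pd.DataFrame({
    "scheduled_arrive": ["2022-01-31 17:58:15 -06:00", "2022-01-31 18:05:00 -06:00"],
    "arrive": ["2022-01-31 17:59:54 -06:00", "2022-01-31 18:04:30 -06:00"],
})
sample["arrival_diff"] = diff_seconds(sample["scheduled_arrive"], sample["arrive"])
print(sample["arrival_diff"].tolist())  # [-99.0, 30.0]
```

The first bus arrived 99 seconds after its scheduled time, so its diff is negative; the second arrived 30 seconds ahead of schedule, so its diff is positive.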

In [29]:
february_2022[['RouteName', 'StopName', 'arrive', 'scheduled_arrive', 'arrival_diff', 'depart', 'scheduled_depart', 'departure_diff']].head()

| | RouteName | StopName | arrive | scheduled_arrive | arrival_diff | depart | scheduled_depart | departure_diff |
|---|---|---|---|---|---|---|---|---|
| 0 | 11 Cherry | Mortensen Road at Lawrence Avenue Eastbound | 2022-01-31 17:59:54.0000000 -06:00 | 2022-01-31 17:58:15.0000000 -06:00 | -99.0 | 2022-01-31 18:00:01.0000000 -06:00 | 2022-01-31 17:58:15.0000000 -06:00 | -106.0 |
| 1 | 2 Green West | Hyland Avenue at Forest Hills Drive Northbound | 2022-01-31 17:59:53.0000000 -06:00 | 2022-01-31 17:58:50.0000000 -06:00 | -63.0 | 2022-01-31 18:00:01.0000000 -06:00 | 2022-01-31 17:58:50.0000000 -06:00 | -71.0 |
| 2 | 1 Red East | 5th Street at Youth and Shelter Services | 2022-01-31 17:59:34.0000000 -06:00 | 2022-01-31 17:58:20.0000000 -06:00 | -74.0 | 2022-01-31 18:00:02.0000000 -06:00 | 2022-01-31 17:58:20.0000000 -06:00 | -102.0 |
| 3 | 6 Brown South | Mortensen Parkway at Hayward Avenue Eastbound | 2022-01-31 17:59:07.0000000 -06:00 | 2022-01-31 17:58:40.0000000 -06:00 | -27.0 | 2022-01-31 18:00:11.0000000 -06:00 | 2022-01-31 17:58:40.0000000 -06:00 | -91.0 |
| 4 | 2 Green East | Hyland Avenue at Ontario Street Southbound | 2022-01-31 17:59:43.0000000 -06:00 | 2022-01-31 17:59:00.0000000 -06:00 | -43.0 | 2022-01-31 18:00:11.0000000 -06:00 | 2022-01-31 17:59:00.0000000 -06:00 | -71.0 |


In [None]:
february_2022.arrival_diff.describe()

In [None]:
february_2022.departure_diff.describe()

In [38]:
february_2022.columns

Index(['vehicle', 'VehicleName', 'route', 'RouteName', 'pattern',
       'PatternName', 'stop', 'StopName', 'trip', 'tripName', 'run', 'runName',
       'block', 'arrive', 'scheduled_arrive', 'depart', 'scheduled_depart',
       'ons', 'offs', 'arrival_passengers', 'departure_passengers',
       'vehicle_capacity', 'departure_diff', 'arrival_diff'],
      dtype='object')

In [47]:
february_2022['RouteEncoded'] = LabelEncoder().fit_transform(february_2022['RouteName'])
february_2022[['RouteName', 'RouteEncoded']].head(10)

| | RouteName | RouteEncoded |
|---|---|---|
| 0 | 11 Cherry | 2 |
| 1 | 2 Green West | 6 |
| 2 | 1 Red East | 0 |
| 3 | 6 Brown South | 14 |
| 4 | 2 Green East | 5 |
| 5 | 11 Cherry | 2 |
| 6 | 23 Orange | 8 |
| 7 | 6 Brown North | 13 |
| 8 | 2 Green West | 6 |
| 9 | 1 Red East | 0 |
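
`LabelEncoder` assigns codes by sorting the distinct labels, which is why '1 Red East' gets 0 and '11 Cherry' gets 2 above. The cell above throws the fitted encoder away, so the codes cannot be mapped back to route names later. A small sketch of keeping it around follows; the variable name `route_encoder` and the three-route sample list are assumptions made for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A few route names taken from the table above.
routes = ["11 Cherry", "2 Green West", "1 Red East", "11 Cherry"]

# Keep a reference to the fitted encoder instead of discarding it.
route_encoder = LabelEncoder()
codes = route_encoder.fit_transform(routes)

# classes_ holds the sorted distinct labels; a label's position is its code.
print(list(route_encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original names.
print(route_encoder.inverse_transform(codes).tolist())
```

With only three routes in the sample the codes differ from the February table (here '11 Cherry' is 1, not 2), because the numbering depends on which labels the encoder has seen.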


# Machine Learning

## Random Forest
My first model will be a Random Forest. This is a relatively simple model that will show whether there's a relationship between departure diffs and various data points, such as ons, offs, and the route.

### Train/Test Split

In [53]:
X = february_2022[['ons', 'offs', 'RouteEncoded']]
y = february_2022['departure_diff']

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
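
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for validation. The sketch below shows this on placeholder data; the column names match the notebook, but the values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the notebook's column names; the values are made up.
demo = pd.DataFrame({
    "ons": range(100),
    "offs": range(100, 200),
    "RouteEncoded": [i % 5 for i in range(100)],
    "departure_diff": [float(i - 50) for i in range(100)],
})
demo_features = demo[["ons", "offs", "RouteEncoded"]]
demo_target = demo["departure_diff"]

# random_state fixes the shuffle so the split is the same on every run.
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_features, demo_target, random_state=0
)
print(len(demo_train_X), len(demo_val_X))  # 75 25
```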

### Model Training

In [54]:
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
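
One way to see which inputs a fitted forest relies on is its `feature_importances_` attribute. The following is a sketch on synthetic data, not a result from the CyRide data: the target is deliberately built to depend only on `ons`, and the `demo_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 300

# Synthetic features using the notebook's column names; values are made up.
demo_X = pd.DataFrame({
    "ons": rng.integers(0, 20, n_rows),
    "offs": rng.integers(0, 20, n_rows),
    "RouteEncoded": rng.integers(0, 15, n_rows),
})
# The synthetic target depends only on 'ons', plus some noise.
demo_y = -10.0 * demo_X["ons"] + rng.normal(0, 5, n_rows)

demo_model = RandomForestRegressor(n_estimators=50, random_state=1)
demo_model.fit(demo_X, demo_y)

# Importances are non-negative and sum to 1; larger means more influential.
importances = pd.Series(
    demo_model.feature_importances_, index=demo_X.columns
).sort_values(ascending=False)
print(importances)
```

Running the same two lines on `forest_model` with `train_X.columns` would give a first hint about whether ons, offs, or the route carries any signal for departure diffs.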

### Model Predictions and Testing

In [56]:
predictions = forest_model.predict(val_X)
error = mean_absolute_error(val_y, predictions)
print(error)


238.19425840434047
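
Since `departure_diff` is in seconds, this mean absolute error is about 238 seconds, or roughly four minutes. The number is easier to judge next to a trivial baseline that always predicts the same value. Below is a minimal sketch using scikit-learn's `DummyRegressor` on made-up data; no baseline figure for the CyRide data is implied.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up features and diffs (in seconds); these are not CyRide values.
demo_X = rng.integers(0, 20, size=(400, 3))
demo_y = rng.normal(-60, 120, size=400)

b_train_X, b_val_X, b_train_y, b_val_y = train_test_split(
    demo_X, demo_y, random_state=0
)

# The baseline ignores the features and always predicts the training median.
baseline = DummyRegressor(strategy="median")
baseline.fit(b_train_X, b_train_y)

baseline_mae = mean_absolute_error(b_val_y, baseline.predict(b_val_X))
print(baseline_mae)
```

If the forest's error on the real data is not clearly below the baseline's, the chosen features are adding little beyond the typical delay.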
