# Data Processing & Cleaning
 

This report outlines the steps taken to clean the following data sets:

- Open Toronto Data: TTC Subway, Streetcar, & Bus Delay data sets
- Open Toronto Data: TTC Ridership data
- Canada Historical Weather data

We will perform the following steps to process & clean the data into its final form for analysis: 

1. General data review
2. Data compilation/consolidation ('raw' --> 'processed')
3. Data cleaning ('processed' --> 'clean_final')

## 1. General Data Review

### TTC Delay Data

Toronto's open data portal does not currently have information regarding the data collection methods used for delay data (e.g., who records instances, how frequently, how manual or automated the process is). 

We assume that the delay data contains all known TTC delays during the period for which data was collected. 

The data for all modes of transportation contain the following features: 


In [223]:
import importlib
import pandas as pd
import src.paths as pt

importlib.reload(pt)  # pick up any edits made to src/paths.py
pd.read_excel(pt.sc_readme)

Unnamed: 0,Field Name,Description,Example
0,Report Date,The date (YYYY/MM/DD) when the delay-causing i...,2017-06-20 00:00:00
1,Route,The number of the streetcar route,51
2,Time,The time (hh:mm:ss AM/PM) when the delay-causi...,00:35:00
3,Day,The name of the day,Monday
4,Location,The location of the delay-causing incident,York Mills Station
5,Incident,The description of the delay-causing incident,Mechanical
6,Min Delay,"The delay, in minutes, to the schedule for the...",10
7,Min Gap,"The total scheduled time, in minutes, from the...",20
8,Direction,"The direction of the bus route where B,b or BW...",N
9,Vehicle,Vehicle number,1057


**Notes:**  
- In the subway delay data, the 'Incident' feature is denoted as 'Code', with a separate code book mapping the codes to their respective delay details.  
- In addition, the 'Location' feature is labelled 'Station', as subway delays typically occur at particular stations (as opposed to bus/streetcar delays, which may occur at specific intersections between stops).  
- 'Direction' is also labelled as 'Bound'. 
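
Because the subway files label these features differently, a small rename step could bring them in line with the bus and streetcar names before any combined analysis. The sketch below is illustrative only: `SUBWAY_TO_COMMON` and `harmonize_subway_columns` are hypothetical names that do not come from this repository, and the toy record (including the code value) is made up.

```python
import pandas as pd

# Hypothetical mapping from subway column names to the bus/streetcar names
# listed in the table above.
SUBWAY_TO_COMMON = {
    "Code": "Incident",
    "Station": "Location",
    "Bound": "Direction",
}


def harmonize_subway_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with subway-specific column names renamed."""
    return df.rename(columns=SUBWAY_TO_COMMON)


# Toy record standing in for the real subway data (values are made up).
subway_sample = pd.DataFrame(
    {"Station": ["York Mills Station"], "Code": ["ABCD"], "Bound": ["N"], "Min Delay": [10]}
)
harmonized = harmonize_subway_columns(subway_sample)
print(list(harmonized.columns))
```

Note that after such a rename the 'Incident' column would still hold raw codes rather than descriptions; the code book is still needed to translate them.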

### TTC Ridership Data

The dataset, reported by the TTC, tracks the passengers on the transit system and is shared every quarter. 

The extracted data includes the following features for each year + month from 2007 onwards (see README for data extraction process/parameters): 

- Average Weekday Ridership  
- Monthly Ridership

### Toronto Weather Data

The weather data set is extracted from the [Canada Weather Stats website](weatherstats.ca), which obtains its data directly from [Environment and Climate Change Canada](https://weather.gc.ca/).

Both hourly and daily data sets are extracted: the hourly data provides finer granularity, while the daily data provides snow and rain measures that are not present in the hourly data.

The hourly dataset contains the following features: 

In [224]:
hourly = pd.read_csv(pt.hourly_weather_path)
hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 19 columns):
date_time_local      60000 non-null object
unixtime             60000 non-null int64
pressure_station     59971 non-null float64
pressure_sea         59971 non-null float64
wind_dir             57248 non-null object
wind_dir_10s         59947 non-null float64
wind_speed           59973 non-null float64
wind_gust            13580 non-null float64
relative_humidity    59964 non-null float64
dew_point            59966 non-null float64
temperature          59973 non-null float64
windchill            14191 non-null float64
humidex              9722 non-null float64
visibility           59975 non-null float64
health_index         59428 non-null float64
cloud_cover_4        0 non-null float64
cloud_cover_8        56536 non-null float64
cloud_cover_10       0 non-null float64
solar_radiation      0 non-null float64
dtypes: float64(16), int64(1), object(2)
memory usage: 8.7+ MB


The daily dataset contains the same features, with columns for aggregates of the features (e.g., avg, min, max) as well as additional features including:  

- Snow  
- Rain  
- Sunrise  
- Sunset  
- Daylight 
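
Since snow and rain exist only in the daily data, one way to use them alongside hourly observations is to join the two on calendar date. The following is a minimal sketch under stated assumptions: `date_time_local` is the hourly column shown above, while the daily column names `date`, `snow`, and `rain` are assumed for illustration, and all values are made up.

```python
import pandas as pd

# Toy hourly data; 'date_time_local' matches the hourly column shown above.
hourly_sample = pd.DataFrame(
    {
        "date_time_local": ["2019-01-01 00:00:00", "2019-01-01 01:00:00", "2019-01-02 00:00:00"],
        "temperature": [-3.1, -3.4, -7.0],
    }
)

# Toy daily data; the 'date', 'snow' and 'rain' column names are assumptions.
daily_sample = pd.DataFrame(
    {"date": ["2019-01-01", "2019-01-02"], "snow": [2.0, 0.0], "rain": [0.0, 1.5]}
)

# Derive a calendar-date key on the hourly side, then left-join daily values.
hourly_sample["date"] = pd.to_datetime(hourly_sample["date_time_local"]).dt.normalize()
daily_sample["date"] = pd.to_datetime(daily_sample["date"])
weather = hourly_sample.merge(daily_sample, on="date", how="left")
print(weather[["date_time_local", "snow", "rain"]])
```

A left join keeps every hourly row, so hours on days without a daily record would simply carry missing snow and rain values.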

## 2. Data Compiling/Consolidation
### TTC Delay Data
The TTC delay data files have been compiled into their processed forms by the 'data_compiling.py' script in this repository. A few notes regarding the compilation steps taken: 

#### Subway Files

- Files are split by Month and Year of delay record. 
- Data for Jan 2014-April 2017 is stored in a single file, stacked in a single sheet. Therefore, we keep this as is and continue to stack all months of data together for analysis. 
- README and delay codes files moved into a separate 'docs' folder.  
- For simplicity, the raw delay code book file for subway data was reformatted to stack all codes into a single table.  

#### Streetcar & Bus Files

- Files are split by Year only, with individual sheets containing each month of data (unlike the Subway data files). Therefore, a slightly modified method was used to compile the streetcar & bus data: first concatenate sheets, then concatenate workbooks.
- Inconsistencies were present in column names for the following: 
    1. Streetcar: Inconsistent columns 'Delay', 'Gap' were merged to more conventional 'Min Delay' and 'Min Gap' columns. 'Incident ID' column removed.
    2. Bus: Inconsistent columns 'Delay'/' Min Delay', 'Gap' were merged to more conventional 'Min Delay' and 'Min Gap' columns. 'Incident ID' column removed.

Processed, pre-cleaned files are saved in their respective 'data/processed/ttc' folders.
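
The streetcar and bus steps described above (sheets to workbook, workbooks to one table, then folding inconsistent column names together) can be sketched as follows. This is not the code in 'data_compiling.py'; `stack_workbook` and `coalesce_columns` are hypothetical helpers, and the toy dictionaries stand in for what `pd.read_excel(path, sheet_name=None)` would return for a real workbook.

```python
import pandas as pd


def stack_workbook(sheets: dict) -> pd.DataFrame:
    """Concatenate the monthly sheets of one workbook into a single frame."""
    return pd.concat(list(sheets.values()), ignore_index=True)


def coalesce_columns(df: pd.DataFrame, target: str, variants: list) -> pd.DataFrame:
    """Fill `target` from each variant column in turn, then drop the variants."""
    out = df.copy()
    if target not in out.columns:
        out[target] = float("nan")
    for name in variants:
        if name in out.columns:
            out[target] = out[target].combine_first(out[name])
            out = out.drop(columns=name)
    return out


# Toy stand-ins for two yearly workbooks (values are made up).
workbook_a = {
    "Jan": pd.DataFrame({"Route": [501], "Min Delay": [5], "Min Gap": [10]}),
    "Feb": pd.DataFrame({"Route": [504], "Min Delay": [7], "Min Gap": [14]}),
}
workbook_b = {
    "Jan": pd.DataFrame({"Route": [510], "Delay": [3], "Gap": [6], "Incident ID": [1]}),
}

# Step 1: sheets -> workbook; step 2: workbooks -> one table.
combined = pd.concat(
    [stack_workbook(workbook_a), stack_workbook(workbook_b)], ignore_index=True
)

# Step 3: fold the inconsistent column names into the conventional ones.
combined = coalesce_columns(combined, "Min Delay", ["Delay", " Min Delay"])
combined = coalesce_columns(combined, "Min Gap", ["Gap"])
combined = combined.drop(columns="Incident ID", errors="ignore")
print(combined)
```

Using `combine_first` means a value already present under the conventional name is never overwritten by a variant column.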

### TTC Ridership Data
The ridership data is kept in melted format. For analysis purposes, the data is pivoted/unmelted within the 'data_compiling.py' script and is stored in the 'data/processed/ridership' folder.
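
As a rough illustration of the pivot/unmelt step, the sketch below reshapes a toy melted table into one row per year and month. The column names `year`, `month`, `measure`, and `value` are assumptions, as are the numbers; the actual layout of the ridership file may differ.

```python
import pandas as pd

# Toy melted ridership data; column names and values are assumptions.
melted = pd.DataFrame(
    {
        "year": [2007, 2007, 2007, 2007],
        "month": [1, 1, 2, 2],
        "measure": ["Average Weekday Ridership", "Monthly Ridership"] * 2,
        "value": [100, 2500, 110, 2600],
    }
)

# One row per year + month, one column per measure.
wide = (
    melted.pivot(index=["year", "month"], columns="measure", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(wide)
```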

### Weather Data
No additional compilation or formatting is required to prepare the data files for cleaning. These files will be kept in the 'data/raw/weather' folder until final cleaning is performed.

### Libraries

In [225]:
import os
import pandas as pd 
import numpy as np

## 3. Data Cleaning

Below are the general steps taken to clean the data: 

- Inspection: types, summaries, counts, outliers
- Cleaning: 
    - Remove irrelevant data if necessary  
    - Data types
    - Check for duplicates
    - Syntax, typos (re-mapping)
    - Check for missing values: 
        - Remove records if random or rare occurrences, 
        - Impute, 
        - Flag "missing"
    - Scaling/Transformations/Normalization if necessary 
    - Review outliers and determine keep/remove
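
The inspection step in the list above can be sketched with a small helper that reports data types, missing counts, and distinct counts per column, plus a duplicate-row count. `inspect_frame` is a hypothetical name used for illustration, and the toy frame is made up.

```python
import pandas as pd


def inspect_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and distinct count for each column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Toy frame with one duplicate row and one missing value (made-up values).
sample = pd.DataFrame(
    {"bound": ["N", "N", None, "S"], "min_delay": [5, 5, 3, 0]}
)

summary = inspect_frame(sample)
n_duplicates = int(sample.duplicated().sum())
print(summary)
print("duplicate rows:", n_duplicates)
```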

### TTC Delay Data
#### Subway Files 

Using the subway delay code book, we will map the Delay Codes to their respective descriptions for better analysis of the data. 

In [226]:
sub = pd.read_csv(sub_path_processed, index_col = 0)
codes = pd.read_excel(pt.sub_delay_codes)

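# Left join keeps every delay record, even when its code is missing from the code book
sub = sub.merge(codes, how = 'left', left_on = 'Code', right_on = 'CODE')

# Normalize column names to snake_case; note that 'Code' and 'CODE' both become 'code'
sub.columns = [col.lower().replace(' ', '_') for col in sub.columns]
sub.drop('unnamed:_0', axis = 1, inplace = True)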

In [227]:
sub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 121904 entries, 0 to 121903
Data columns (total 12 columns):
date           121904 non-null object
time           121904 non-null object
day            121904 non-null object
station        121904 non-null object
code           121903 non-null object
min_delay      121904 non-null int64
min_gap        121904 non-null int64
bound          95292 non-null object
line           121435 non-null object
vehicle        121904 non-null int64
code           119249 non-null object
description    119249 non-null object
dtypes: int64(3), object(9)
memory usage: 12.1+ MB
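
The summary above shows fewer non-null descriptions (119,249) than rows (121,904), which suggests that some delay codes have no entry in the code book. One possible way to list such codes is the `indicator` option of `DataFrame.merge`; the sketch below uses made-up codes and descriptions, and the `DESCRIPTION` column name is an assumption.

```python
import pandas as pd

# Toy stand-ins for the delay records and the code book (codes are made up).
delays = pd.DataFrame({"Code": ["AAA", "BBB", "ZZZ", "ZZZ"]})
code_book = pd.DataFrame(
    {"CODE": ["AAA", "BBB"], "DESCRIPTION": ["Example A", "Example B"]}
)

# indicator=True adds a '_merge' column marking rows found only on the left.
merged = delays.merge(
    code_book, how="left", left_on="Code", right_on="CODE", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Code"].value_counts()
print(unmatched)
```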


First, we'll review why roughly 26,600 values are missing in the 'bound' column (121,904 rows vs. 95,292 non-null). 

In [228]:
no_bound = sub['bound'].isna()
sub.loc[no_bound].description.value_counts(normalize = True).head(10)

Injured or ill Customer (In Station) - Transported            0.222429
Miscellaneous General Delays                                  0.206627
Miscellaneous Other                                           0.167163
Escalator/Elevator Incident                                   0.114834
Injured or ill Customer (In Station) - Medical Aid Refused    0.044387
Station Stairway Incident                                     0.040457
Disorderly Patron                                             0.029660
Injured or ill Customer (On Train) - Transported              0.019442
Assault / Patron Involved                                     0.016671
Station Other                                                 0.016505
Name: description, dtype: float64

The majority of the records with missing bound/direction values occurred in a station or are 'miscellaneous' delays (we have no information on what this bucket entails). Because these records make up a large portion of the total dataset, we will keep them and label them 'No Bound'.  

We will also remove records whose value appears to be a typo (i.e., *not* N, E, S, or W, as the readme gives no information on any other bounds/directions applicable to the dataset).

In [229]:
sub.loc[no_bound,'bound'] = 'No Bound'

keep_bound = sub.bound.isin(['No Bound','N','E','S','W'])
sub.drop(sub[~keep_bound].index, axis = 0, inplace = True)

sub.bound.value_counts()

No Bound    26612
W           25323
S           24992
E           23402
N           21523
Name: bound, dtype: int64

We will review the ~500 missing values in the 'line' column similarly:

In [230]:
no_line = sub['line'].isna()
sub.loc[no_line].description.value_counts(normalize = True).head(10)

Miscellaneous Other                                           0.246606
Injured or ill Customer (In Station) - Transported            0.162896
Miscellaneous General Delays                                  0.144796
Escalator/Elevator Incident                                   0.056561
Injured Employee                                              0.049774
Injured or ill Customer (In Station) - Medical Aid Refused    0.031674
Operator Overspeeding                                         0.027149
Miscellaneous Speed Control                                   0.022624
Station Stairway Incident                                     0.020362
Disorderly Patron                                             0.020362
Name: description, dtype: float64

The majority of these records are 'miscellaneous' or occur outside of the train. Therefore, we will keep these records and label them 'No Line' (stored as 'NoLine' once whitespace is stripped).

We will also standardize the values of the 'line' column by removing whitespace, slashes, and dashes, and by correcting general typos (via mapping). 

In [231]:
sub.loc[no_line,'line'] = 'No Line'

for char in ['/','-',' ']:
    sub.line = sub.line.map(lambda x: str(x).replace(char, ''))

line_mapping = {
        'BD': 'BD',
        'YU': 'YU',
        'SRT': 'SRT',
        'SHP': 'SHP',
        'YUBD': 'YUBD',
        'NoLine': 'NoLine',
        'YUS': 'YU',
        'BDYU': 'YUBD',
        'BDYUS': 'YUBD',
        'BDLINE': 'BD',
        'YULINE': 'YU',
        'BLOORDANFORTHLINES': 'BD'
    }

sub.line = sub.line.map(line_mapping)
# Remove records whose 'line' value does not map to a known subway line
not_sub = sub['line'].isna()
sub.drop(sub[not_sub].index, axis = 0, inplace = True)

sub.line.value_counts().head(20)

BD        56652
YU        55038
SRT        4412
SHP        3973
YUBD       1257
NoLine      469
Name: line, dtype: int64
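
Because `Series.map` with a dictionary returns NaN for any value that is not a key, every spelling absent from `line_mapping` is dropped by the step above. A quick look at those values before dropping them could catch a valid line that was simply misspelled in an unexpected way. The sketch below uses a shortened, hypothetical mapping and made-up values.

```python
import pandas as pd

# Shortened, hypothetical mapping and toy values for illustration only.
line_mapping_sample = {"BD": "BD", "YU": "YU", "BDLINE": "BD"}
lines = pd.Series(["BD", "YU", "BDLINE", "999", "YU"])

# Values missing from the mapping come back as NaN.
mapped = lines.map(line_mapping_sample)

# Tabulate the original spellings that would be dropped.
dropped_values = lines[mapped.isna()].value_counts()
print(dropped_values)
```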