# Data Cleaning & Processing
 

This report outlines the steps taken to clean the following data sets:

- Open Toronto Data: TTC Subway, Streetcar, & Bus Delay data sets
- Open Toronto Data: TTC Ridership data
- Canada Historical Weather data

We will perform the following steps to process & clean the data into its final form for analysis: 

1. General data review
2. Data compilation/consolidation ('raw' --> 'processed')
3. Data cleaning ('processed' --> 'clean_final')

## 1. General Data Review

### TTC Delay Data

Toronto's open data portal does not currently have information regarding the data collection methods used for delay data (e.g., who records instances, how frequently, how manual or automated the process is). 

We assume that the delay data contains all known delays in the TTC during the period for which data was collected. 

The data for all modes of transportation contain the following features: 


In [30]:
import importlib
import pandas as pd
import src.paths as pt

# Reload the paths module so recent edits to it are picked up
importlib.reload(pt)
pd.read_excel(pt.sc_readme)

| | Field Name | Description | Example |
|---|---|---|---|
| 0 | Report Date | The date (YYYY/MM/DD) when the delay-causing i... | 2017-06-20 00:00:00 |
| 1 | Route | The number of the streetcar route | 51 |
| 2 | Time | The time (hh:mm:ss AM/PM) when the delay-causi... | 00:35:00 |
| 3 | Day | The name of the day | Monday |
| 4 | Location | The location of the delay-causing incident | York Mills Station |
| 5 | Incident | The description of the delay-causing incident | Mechanical |
| 6 | Min Delay | The delay, in minutes, to the schedule for the... | 10 |
| 7 | Min Gap | The total scheduled time, in minutes, from the... | 20 |
| 8 | Direction | The direction of the bus route where B,b or BW... | N |
| 9 | Vehicle | Vehicle number | 1057 |


Notes: 
- In the subway delay data, the 'Incident' feature is labelled 'Code', with a separate code book mapping each code to its delay details.  
- In addition, the 'Location' feature is labelled 'Station', as subway delays typically occur at particular stations (whereas bus/streetcar delays may occur at intersections between stops).  
- 'Direction' is labelled 'Bound'. 
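
The naming differences in the notes above can be reconciled by renaming the subway columns to the bus/streetcar names. The following is a minimal sketch on a hand-built frame rather than the real subway file; the mapping name `SUBWAY_RENAMES` is illustrative and is not taken from the repository.

```python
import pandas as pd

# Hypothetical mapping from subway column names to the bus/streetcar names.
SUBWAY_RENAMES = {"Code": "Incident", "Station": "Location", "Bound": "Direction"}

# Toy stand-in for a couple of subway delay records.
subway = pd.DataFrame({
    "Station": ["FINCH STATION", "KIPLING STATION"],
    "Code": ["TUSC", "TUSC"],
    "Bound": ["S", "W"],
    "Min Delay": [0, 0],
})

# Rename only the columns listed in the mapping; all others are left untouched.
subway_aligned = subway.rename(columns=SUBWAY_RENAMES)
print(list(subway_aligned.columns))
```

Note that renaming 'Code' to 'Incident' only aligns the column label. The values remain codes until they are mapped through the separate code book.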

### TTC Ridership Data

The dataset, reported by the TTC, tracks the passengers on the transit system and is shared every quarter. 

The extracted data includes the following features for each year and month from 2007 onwards (see the README for the data extraction process and parameters): 

- Average Weekday Ridership  
- Monthly Ridership

### Toronto Weather Data

The weather data set is extracted from the [Canada Weather Stats website](https://weatherstats.ca), which obtains data directly from [Environment and Climate Change Canada](https://weather.gc.ca/).

Both hourly and daily data sets are extracted: the hourly data provides finer granularity, while the daily data provides snow and rain measures, which are not present in the hourly data.
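
One way the two extracts could be used together is to attach each day's rain and snow totals to the hourly rows for that day. The sketch below uses toy frames and is not necessarily how this project combines them. `date_time_local` and `date` are column names from the real files, while `rain` and `snow` are assumed names for the daily precipitation columns.

```python
import pandas as pd

# Toy stand-ins for the hourly and daily extracts.
# 'rain' and 'snow' are assumed column names, and the values are placeholders.
hourly = pd.DataFrame({
    "date_time_local": ["2018-04-01 00:00:00", "2018-04-01 01:00:00", "2018-04-02 00:00:00"],
    "temperature": [1.5, 1.1, -2.0],
})
daily = pd.DataFrame({
    "date": ["2018-04-01", "2018-04-02"],
    "rain": [0.0, 2.4],
    "snow": [1.2, 0.0],
})

# Derive a calendar-date key on both sides so the frames share a join column.
hourly["date"] = pd.to_datetime(hourly["date_time_local"]).dt.date
daily["date"] = pd.to_datetime(daily["date"]).dt.date

# Left join: every hourly row is kept and picks up that day's totals.
combined = hourly.merge(daily, on="date", how="left")
print(combined)
```

A left join keeps hourly rows even when no daily record exists for that date, in which case the precipitation columns are left as missing values.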

The hourly dataset contains the following features: 

In [40]:
hourly = pd.read_csv(pt.hourly_weather_path)
hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 19 columns):
date_time_local      60000 non-null object
unixtime             60000 non-null int64
pressure_station     59971 non-null float64
pressure_sea         59971 non-null float64
wind_dir             57248 non-null object
wind_dir_10s         59947 non-null float64
wind_speed           59973 non-null float64
wind_gust            13580 non-null float64
relative_humidity    59964 non-null float64
dew_point            59966 non-null float64
temperature          59973 non-null float64
windchill            14191 non-null float64
humidex              9722 non-null float64
visibility           59975 non-null float64
health_index         59428 non-null float64
cloud_cover_4        0 non-null float64
cloud_cover_8        56536 non-null float64
cloud_cover_10       0 non-null float64
solar_radiation      0 non-null float64
dtypes: float64(16), int64(1), object(2)
memory usage: 8.7+ MB


The daily dataset contains similar features, with additional aggregate features (e.g., avg, min, max) as well as snow and rain: 

In [41]:
daily = pd.read_csv(pt.daily_weather_path)
daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 70 columns):
date                             2400 non-null object
max_temperature                  2400 non-null float64
avg_hourly_temperature           2400 non-null float64
avg_temperature                  2400 non-null float64
min_temperature                  2400 non-null float64
max_humidex                      626 non-null float64
min_windchill                    862 non-null float64
max_relative_humidity            2400 non-null int64
avg_hourly_relative_humidity     2400 non-null float64
avg_relative_humidity            2400 non-null float64
min_relative_humidity            2400 non-null int64
max_dew_point                    2400 non-null float64
avg_hourly_dew_point             2400 non-null float64
avg_dew_point                    2400 non-null float64
min_dew_point                    2400 non-null float64
max_wind_speed                   2400 non-null int64
avg_hourly_wind_speed 

## 2. Data Compiling/Consolidation
### TTC Delay Data
The TTC delay data files have been compiled into their processed forms by the 'data_compiling.py' script in this repository. A few notes regarding the compilation steps taken: 

#### Subway Files

- Files are split by Month and Year of delay record. 
- Data for Jan 2014-April 2017 is stored in a single file, stacked in a single sheet. Therefore, we keep this as is and continue to stack all months of data together for analysis. 
- README and delay codes files moved into a separate 'docs' folder. 

#### Streetcar & Bus Files

- Files are split by Year only, with individual sheets containing each month of data (unlike the Subway data files). Therefore, a slightly modified method of compiling the streetcar & bus data was used: first concatenate the sheets, then concatenate the workbooks.
- Inconsistencies were present in column names for the following: 
    1. Streetcar: The inconsistent columns 'Delay' and 'Gap' were merged into the more conventional 'Min Delay' and 'Min Gap' columns. The 'Incident ID' column was removed.
    2. Bus: The inconsistent columns 'Delay'/' Min Delay' and 'Gap' were merged into the more conventional 'Min Delay' and 'Min Gap' columns. The 'Incident ID' column was removed.
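
The sheet-then-workbook concatenation and the column merging described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of 'data_compiling.py': the helper names `stack_sheets` and `coalesce` are hypothetical, and the toy dictionaries stand in for what `pd.read_excel(path, sheet_name=None)` returns (a dict mapping sheet names to DataFrames).

```python
import pandas as pd

def stack_sheets(sheets):
    """Concatenate the monthly sheets of one workbook into a single frame."""
    return pd.concat(list(sheets.values()), ignore_index=True)

def coalesce(df, target, aliases):
    """Fold inconsistently named alias columns into `target`."""
    present = [c for c in [target, *aliases] if c in df.columns]
    if not present:
        return df
    merged = df[present[0]]
    for col in present[1:]:
        # Fill gaps in the running result with values from the next column.
        merged = merged.combine_first(df[col])
    out = df.drop(columns=present)
    out[target] = merged
    return out

# Toy workbooks with placeholder values, one dict per year.
workbook_2014 = {
    "Jan 2014": pd.DataFrame({"Route": [501], "Min Delay": [5.0], "Min Gap": [10.0]}),
    "Feb 2014": pd.DataFrame({"Route": [504], "Delay": [7.0], "Gap": [14.0], "Incident ID": [1.0]}),
}
workbook_2015 = {
    "Jan 2015": pd.DataFrame({"Route": [510], "Min Delay": [3.0], "Min Gap": [6.0]}),
}

# First concatenate sheets within each workbook, then concatenate workbooks.
per_year = [stack_sheets(wb) for wb in (workbook_2014, workbook_2015)]
streetcar = pd.concat(per_year, ignore_index=True)

# Merge the inconsistent columns and drop the unused identifier.
streetcar = coalesce(streetcar, "Min Delay", ["Delay"])
streetcar = coalesce(streetcar, "Min Gap", ["Gap"])
streetcar = streetcar.drop(columns="Incident ID", errors="ignore")
print(streetcar)
```

For the bus files, the same `coalesce` call would take `["Delay", " Min Delay"]` as aliases to cover the variant with a leading space.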

Processed, pre-cleaned files are saved in their respective 'data/processed/ttc' folders.

### TTC Ridership Data
The raw ridership data is in melted (long) format. For analysis purposes, the data is pivoted (unmelted) by the 'data_compiling.py' script and stored in the 'data/processed/ridership' folder.
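
As an illustration of the pivot step, the sketch below reshapes a toy melted frame so that each measure becomes its own column. The column names `Year`, `Month`, `Measure`, and `Value` are assumptions, the numbers are placeholders rather than real ridership figures, and the actual script may differ.

```python
import pandas as pd

# Toy melted frame: one row per (year, month, measure).
melted = pd.DataFrame({
    "Year": [2007, 2007, 2007, 2007],
    "Month": [1, 1, 2, 2],
    "Measure": ["Average Weekday Ridership", "Monthly Ridership"] * 2,
    "Value": [100, 3000, 110, 3100],
})

# Pivot so each measure becomes a column, keyed by year and month.
wide = (
    melted.pivot(index=["Year", "Month"], columns="Measure", values="Value")
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'Measure' axis label
print(wide)
```

`DataFrame.pivot` raises an error if a (year, month, measure) combination appears more than once, which doubles as a quick check for duplicate records.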

### Weather Data
No additional compilation or formatting is required to prepare the data files for cleaning. These files will be kept in the 'data/raw/weather' folder until final cleaning is performed.

### Libraries

In [3]:
import os
import pandas as pd 
import numpy as np

## 3. Data Cleaning

Below are the general steps taken to clean the data: 

- Check for missing values: 
    - Impute if necessary.
    - Remove records if a critical feature is missing.
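
The missing-value steps above can be sketched as follows on a toy frame. Treating 'Date' as the critical feature and imputing 'Min Delay' with the median are assumptions made only for illustration, not choices documented in this report.

```python
import numpy as np
import pandas as pd

# Toy delay records with a few gaps.
delays = pd.DataFrame({
    "Date": ["2018-04-01", None, "2018-04-01"],
    "Station": ["FINCH STATION", "KIPLING STATION", "VICTORIA PARK STATION"],
    "Min Delay": [8.0, 0.0, np.nan],
    "Bound": ["W", None, "W"],
})

# 1. Count missing values per column.
missing_counts = delays.isna().sum()

# 2. Remove records missing a critical feature (assumed here to be 'Date').
cleaned = delays.dropna(subset=["Date"]).copy()

# 3. Impute a non-critical numeric feature (median used only as an example).
cleaned["Min Delay"] = cleaned["Min Delay"].fillna(cleaned["Min Delay"].median())
print(missing_counts)
print(cleaned)
```

Dropping before imputing means the imputed value is computed only from records that are kept.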

### TTC Delay Data
#### Subway Files 


In [22]:
sub = pd.read_csv(sub_path_processed, index_col = 0)
sub.head()

| Unnamed: 0 | Date | Time | Day | Station | Code | Min Delay | Min Gap | Bound | Line | Vehicle |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-04-01 | 00:27 | Sunday | ST GEORGE BD STATION | MUSAN | 8 | 12 | W | BD | 5256 |
| 1 | 2018-04-01 | 07:56 | Sunday | FINCH STATION | TUSC | 0 | 0 | S | YU | 5751 |
| 2 | 2018-04-01 | 08:00 | Sunday | YONGE UNIVERSITY LINE | MUO | 0 | 0 | | YU | 0 |
| 3 | 2018-04-01 | 09:50 | Sunday | KIPLING STATION | TUSC | 0 | 0 | W | BD | 5139 |
| 4 | 2018-04-01 | 10:18 | Sunday | VICTORIA PARK STATION | MUSC | 0 | 0 | W | BD | 5354 |
