# Data Processing & Cleaning - TTC Ridership
 

This report outlines the steps taken to clean the following dataset:

- Open Toronto Data: TTC Ridership data

We will perform the following steps to process & clean the data into its final form for analysis: 

1. General data review
2. Data compilation/consolidation ('raw' --> 'processed')
3. Data cleaning ('processed' --> 'clean_final')  


### Libraries

In [None]:
import os
import pandas as pd 
import numpy as np
import re
from datetime import datetime
import src.paths as pt
import src.mappings as maps
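# importlib replaces the deprecated imp module (removed in Python 3.12)
import importlib
# Reload the project modules so that edits made to them are picked up without restarting the kernel
importlib.reload(pt)
importlib.reload(maps)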

## 1. General Data Review

The dataset, reported by the TTC, tracks passenger ridership on the transit system and is published quarterly.

The extracted data includes the following features for each year and month from 2007 onwards (see the README for the data extraction process and parameters):

- Average Weekday Ridership
- Monthly Ridership

## 2. Data Compilation/Consolidation 

The ridership data is stored in melted (long) format. For analysis purposes, the data is pivoted (unmelted) into wide format by the 'data_compiling.py' script, and the result is stored in the 'data/processed/ridership' folder.
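
As a rough illustration of the melted-to-wide step, the sketch below pivots a small, made-up long-format table with pandas. The column names (`year`, `month`, `measure`, `value`) and all numbers are assumptions for illustration only; the actual logic lives in 'data_compiling.py' and may differ.

```python
import pandas as pd

# Hypothetical melted (long) data: one row per year, month and measure.
# Column names and values are illustrative, not taken from the TTC dataset.
melted = pd.DataFrame({
    "year": [2007, 2007, 2007, 2007],
    "month": [1, 1, 2, 2],
    "measure": ["Average Weekday Ridership", "Monthly Ridership"] * 2,
    "value": [100.0, 3000.0, 110.0, 3100.0],
})

# Pivot so that each measure becomes its own column, one row per year and month.
wide = melted.pivot_table(
    index=["year", "month"],
    columns="measure",
    values="value",
    aggfunc="first",  # each (year, month, measure) is expected to appear once
).reset_index()
wide.columns.name = None  # drop the leftover "measure" label on the column axis

print(wide)
```

Using `pivot_table` with `aggfunc="first"` rather than `pivot` is a design choice in this sketch: it does not raise if the long data happens to contain repeated keys, which leaves duplicate handling to the cleaning stage.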


## 3. Cleaning

Below are the general steps taken to clean the data: 

- Inspection: types, summaries, counts, outliers
- Cleaning: 
    - Remove irrelevant data if necessary  
    - Data types
    - Check for duplicates
    - Syntax, typos (re-mapping)
    - Check for missing values: 
        - Remove records if the missing values are random or rare occurrences
        - Impute
        - Flag as "missing"
    - Scaling/Transformations/Normalization if necessary 
    - Review outliers and determine keep/remove
