# Data Processing & Cleaning - TTC Ridership
 

This report outlines the steps taken to clean the following data sets:

- Open Toronto Data: TTC Ridership data

We will perform the following steps to process & clean the data into its final form for analysis: 

1. General data review
2. Data compilation/consolidation ('raw' --> 'processed')
3. Data cleaning ('processed' --> 'clean_final')  


### Libraries

In [159]:
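import importlib

import pandas as pd
from datetime import datetime
import src.paths as pt
import src.mappings as maps

# Reload the project modules so edits under src/ are picked up without restarting the kernel
# (importlib replaces the deprecated imp module)
importlib.reload(pt)
importlib.reload(maps)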

<module 'src.mappings' from 'c:\\Users\\Patrick\\OneDrive\\PET PROJECTS\\TTC Delay Analysis\\src\\mappings.py'>

## 1. General Data Review

The dataset, reported by the TTC, tracks passenger volumes on the transit system and is published every quarter. 

The extracted data includes the following features for each year and month from 2007 onwards (see README for data extraction process/parameters): 

- Average Weekday Ridership  
- Monthly Ridership

## 2. Data Compilation/Consolidation 

The raw ridership data is kept in long (melted) format. For analysis purposes, the 'data_compiling.py' script pivots it into wide format, with one column per measure, and stores the result in the 'data/processed/ridership' folder.
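
As a rough illustration of the pivot step, the sketch below reshapes a small long-format table into one column per measure. It is not the content of 'data_compiling.py': the column names 'Year', 'Period', 'Measure Name' and 'Value' are assumptions about the raw extract, and the numbers are made up.

```python
import pandas as pd

# Hypothetical long-format ("melted") extract: one row per year/period/measure.
# Column names and values are illustrative assumptions, not the real TTC data.
raw = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "Period": [1, 1, 2, 2],
    "Measure Name": ["Average Weekday Ridership", "Monthly Ridership"] * 2,
    "Value": [100.0, 2500.0, 120.0, 2400.0],
})

# Pivot so that each measure becomes its own column, one row per year/period.
wide = (
    raw.pivot(index=["Year", "Period"], columns="Measure Name", values="Value")
       .reset_index()
)
wide.columns.name = None  # drop the leftover 'Measure Name' label on the column axis

print(wide)
```

`pivot` raises an error if a year/period/measure combination appears more than once, which doubles as a cheap duplicate check on the raw extract.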


## 3. Cleaning

The data already conforms to general tidy data principles. However, to remain consistent with the TTC delay data, we will create an additional 'date' column using the 'Year' and 'Period' features in the processed, pivoted dataset. 

In [160]:
ridership = pd.read_csv(pt.ridership_path_processed, 
                        parse_dates = [[1,2]], 
                        skiprows = 1, 
                        index_col = 1)
ridership.columns = ['date', 
                    'avg_weekday_ridership', 
                    'monthly_total_ridership']

ridership.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 153 entries, 0 to 152
Data columns (total 3 columns):
date                       153 non-null datetime64[ns]
avg_weekday_ridership      153 non-null float64
monthly_total_ridership    141 non-null float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 4.8 KB
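
The cell above builds the 'date' column while reading the file, by combining two columns through `parse_dates`. Recent pandas releases have deprecated combining columns this way, so a more explicit alternative is sketched below. It assumes that 'Period' holds the month number (1 to 12), which may not match the actual processed file, and it uses a made-up frame rather than the real data.

```python
import pandas as pd

# Illustrative frame; assumes 'Period' is a month number between 1 and 12.
df = pd.DataFrame({"Year": [2007, 2007, 2008], "Period": [1, 2, 12]})

# pd.to_datetime accepts a frame with 'year', 'month' and 'day' columns,
# so rename the two features and pin every row to the first of the month.
parts = df[["Year", "Period"]].rename(columns={"Year": "year", "Period": "month"})
df["date"] = pd.to_datetime(parts.assign(day=1))

print(df)
```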


The dataset is currently missing 12 monthly ridership figures (the full year of 2007), but it does have the average weekday figures for those same months. 

Therefore, we will impute the missing values as follows:  

- Calculate avg_weekday_ridership as a proportion of monthly_total_ridership for every month where both figures are available  
- For each calendar month (January to December), average that proportion across all available years  
- Divide avg_weekday_ridership in each missing month by the average proportion for its calendar month to estimate monthly ridership  
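
To make the arithmetic concrete, here is a minimal, self-contained sketch of the three steps on a toy table. The numbers are made up for illustration, and the helper column 'mean_proportion' is a name introduced here, not one used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made-up numbers): the monthly totals for 2007 are missing.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"]),
    "avg_weekday_ridership": [100.0, 120.0, 110.0, 130.0],
    "monthly_total_ridership": [np.nan, np.nan, 2750.0, 2600.0],
})

# Step 1: weekday average as a share of the monthly total (NaN where the total is missing).
toy["proportion"] = toy["avg_weekday_ridership"] / toy["monthly_total_ridership"]

# Step 2: mean share per calendar month, broadcast back onto every row.
# mean() skips NaN, so the 2007 rows pick up the average of the other years.
month = toy["date"].dt.month
toy["mean_proportion"] = toy.groupby(month)["proportion"].transform("mean")

# Step 3: estimate the totals and fill only the gaps.
estimate = toy["avg_weekday_ridership"] / toy["mean_proportion"]
toy["monthly_total_ridership"] = toy["monthly_total_ridership"].fillna(estimate)

print(toy[["date", "avg_weekday_ridership", "monthly_total_ridership"]])
```

In this toy case the January share is 110 / 2750 = 0.04, so the January 2007 estimate is 100 / 0.04, or about 2500. The observed 2008 totals are left untouched.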


In [161]:
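# Steps 1 + 2 
# Step 1: average weekday ridership as a proportion of the monthly total
ridership['proportion'] = ridership['avg_weekday_ridership']/ridership['monthly_total_ridership']
# Step 2: mean proportion per calendar month (rows with a missing total are skipped by mean())
prop_by_mth = ridership.groupby(ridership.date.dt.month).proportion.mean()

prop_by_mth.sort_index()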

date
1     0.037806
2     0.041867
3     0.032978
4     0.041341
5     0.041562
6     0.032982
7     0.040539
8     0.040874
9     0.033916
10    0.041384
11    0.039642
12    0.035534
Name: proportion, dtype: float64

In [162]:
# Step 3
ridership['proportion'] = prop_by_mth[ridership.date.dt.month].values
ridership['est_monthly'] = ridership['avg_weekday_ridership']/ridership['proportion']

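# Fill only the missing monthly totals with the estimates
# (fillna avoids the undefined `np` name and the in-place replace on a column selection)
ridership['monthly_total_ridership'] = ridership['monthly_total_ridership'].fillna(ridership['est_monthly'])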

# Check previously nan values 
ridership.loc[ridership.date.dt.year == 2007 ,'monthly_total_ridership']

Measure Name
0     3.606615e+07
1     3.317548e+07
2     3.928584e+07
3     3.549331e+07
4     3.703124e+07
5     3.431264e+07
6     4.466076e+07
7     4.502974e+07
8     3.546481e+07
9     3.879707e+07
10    3.699460e+07
11    4.567127e+07
Name: monthly_total_ridership, dtype: float64

In [164]:
ridership = ridership.iloc[:,:3].sort_values('date')
ridership.to_csv(pt.ridership_path_cleaned)
ridership.head()

Measure Name,date,avg_weekday_ridership,monthly_total_ridership
4,2007-01-01,1400000.0,37031240.0
3,2007-02-01,1486000.0,35493310.0
7,2007-03-01,1485000.0,45029740.0
0,2007-04-01,1491000.0,36066150.0
8,2007-05-01,1474000.0,35464810.0
