Load in data from turnstile file

### Back Story

An email from a potential client:

> Lara & Alice -
>
> It was great to meet and chat with you at the recent event. We’d love to take some next steps to see if working together is something that would make sense for both parties.
>
> As we mentioned, we are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work, which is a significant portion of our fundraising efforts.
>
> WomenTechWomenYes (WTWY) has an annual gala at the beginning of the summer each year. As we are a new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach.
>
> To this end we place street teams at entrances to subway stations. The street teams collect email addresses and those who sign up are sent free tickets to our gala.
>
> Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.
>
> The ball is in your court now—do you think this is something that would be feasible for your group? From there we can explore what kind of an engagement would make sense for all of us.
>
> Best,
>
> Karrine and Dahlia
>
> WTWY International




#### Data:

 * MTA Data (Google it!)
 * Additional data sources welcome!

#### Skills:

 * `python` and `pandas`
 * visualizations via Matplotlib & seaborn

#### Analysis:

 * Exploratory Data Analysis


#### Deliverable/communication:


In [1]:
"""
Set Options
"""

# import libraries
import pandas as pd
import numpy as np
import os
import glob
import matplotlib.pyplot as plt
import matplotlib
from datetime import datetime

# configuration options
%matplotlib inline
# note: matplotlib 3.6+ renamed this style to "seaborn-v0_8-muted"
matplotlib.style.use("seaborn-muted")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

In [2]:
"""
load, parse, and initially sample data
"""
# load and concatenate sample data
path = r'/Users/tbowling/ds/metis/working/Benson/data/'
all_files = glob.glob(os.path.join(path, "*.txt"))
df = pd.concat((pd.read_csv(f) for f in all_files))

# strip spurious whitespace from column names (before any column is referenced by name)
df.columns = [column.strip() for column in df.columns]

# rename the control area column ("C/A" is awkward to reference)
df.rename(index=str, columns={"C/A":"CONTROL"},inplace=True)

# drop columns not used in this analysis
df = df.drop(['LINENAME','DIVISION'], axis=1)

# keep only regular scheduled readings (filters out audit rows)
df = df[df.DESC == 'REGULAR']
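
The cleaning steps in this cell can be tried without the downloaded files. The following is a minimal sketch on a synthetic two-row sample held in memory. The values are made up for illustration, and the trailing spaces after `EXITS` in the header are an assumption about the raw file layout, based on the whitespace-stripping step above.

```python
import io
import pandas as pd

# Synthetic sample in the turnstile file layout (illustrative values only).
# The header deliberately carries trailing spaces after EXITS.
sample = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS   \n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,00:00:00,REGULAR,100,50\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,04:00:00,RECOVR AUD,110,55\n"
)

raw = pd.read_csv(sample)

# Strip whitespace first so every later column reference uses a clean name.
raw.columns = [c.strip() for c in raw.columns]
raw = raw.rename(columns={"C/A": "CONTROL"})

# Keep regular readings only, then drop the columns the analysis does not use.
regular = raw[raw["DESC"] == "REGULAR"].drop(columns=["LINENAME", "DIVISION"])

print(regular.columns.tolist())
print(len(regular))  # 1: the audit row is filtered out
```

Stripping the column names before anything else means a later `df.EXITS` lookup cannot fail on a padded header.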

In [3]:
"""
Convert data to daily flux vs turnstile for entries
"""

# convert time data into datetime objects
df['TIMING'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')

# sort newest-first so that first() below picks the last reading of each day
df.sort_values(["CONTROL", "UNIT", "SCP", "STATION", "TIMING"], inplace=True,
               ascending=False)

# take the latest cumulative reading per turnstile per day
daily_entries = (df.groupby(["CONTROL", "UNIT", "SCP", "STATION", "DATE"])
                   .ENTRIES.first().reset_index())

# make columns for previous day's data
# (select columns with a list; tuple-style selection on a groupby is no longer supported)
daily_entries[["PREV_DATE", "PREV_ENTRIES"]] = (daily_entries
                                            .groupby(["CONTROL", "UNIT", "SCP", "STATION"])[["DATE", "ENTRIES"]]
                                            .shift(1))
# drop each turnstile's first day, which has no previous reading
daily_entries.dropna(subset=["PREV_DATE"], axis=0, inplace=True)

# deal with negative data - from Lara's example
def get_daily_counts_entries(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter
    if counter > max_counter:
        # Jump is implausibly large, so the counter was probably reset;
        # use the smaller of the two readings as a proxy for the day's count
        counter = min(row["ENTRIES"], row["PREV_ENTRIES"])
    if counter > max_counter:
        # Check it again to make sure we are not giving a counter that's too big
        return 0
    return counter

# If the daily change is > 1 million, the counter might have been reset.
# Fall back to zero when no plausible value remains, as different counters have different cycle limits
daily_entries["ENTRY_FLUX"] = daily_entries.apply(get_daily_counts_entries, axis=1, max_counter=1000000)
daily_entries.head()

|   | CONTROL | UNIT | SCP | STATION | DATE | ENTRIES | PREV_DATE | PREV_ENTRIES | ENTRY_FLUX |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A002 | R051 | 02-00-00 | 59 ST | 08/26/2018 | 6737067 | 08/25/2018 | 6736562.0 | 505.0 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | 08/27/2018 | 6738257 | 08/26/2018 | 6737067.0 | 1190.0 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | 08/28/2018 | 6739630 | 08/27/2018 | 6738257.0 | 1373.0 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | 08/29/2018 | 6740888 | 08/28/2018 | 6739630.0 | 1258.0 |
| 5 | A002 | R051 | 02-00-00 | 59 ST | 08/30/2018 | 6742199 | 08/29/2018 | 6740888.0 | 1311.0 |
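
To see how the three correction rules behave, here is a small sketch that applies the same logic to plain numbers instead of DataFrame rows. The function name `daily_count` is hypothetical. The first call uses the two readings from the first output row above, and the remaining inputs are made up to trigger each branch.

```python
def daily_count(current, previous, max_counter=1_000_000):
    """Same rules as get_daily_counts_entries, applied to two plain readings."""
    count = current - previous
    if count < 0:
        # Counter appears to run backwards: take the magnitude.
        count = -count
    if count > max_counter:
        # Implausibly large jump, probably a reset: use the smaller reading.
        count = min(current, previous)
    if count > max_counter:
        # Still implausible: give up and report zero.
        return 0
    return count

print(daily_count(6737067, 6736562))      # 505, a normal day
print(daily_count(1000, 1500))            # 500, reversed counter
print(daily_count(120, 5_000_000))        # 120, reset to a small value
print(daily_count(3_000_000, 9_000_000))  # 0, no plausible value remains
```

Note that the reset branch treats the smaller reading as the whole day's count, which is only an approximation when the reset happens partway through the day.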


In [4]:
"""
Convert data to daily flux vs turnstile for exits
"""

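# take the latest cumulative reading per turnstile per day
# (df is still sorted newest-first from the previous cell)
daily_exits = (df.groupby(["CONTROL", "UNIT", "SCP", "STATION", "DATE"])
                 .EXITS.first().reset_index())

# make columns for previous day's data
# (select columns with a list; tuple-style selection on a groupby is no longer supported)
daily_exits[["PREV_DATE", "PREV_EXITS"]] = (daily_exits
                                            .groupby(["CONTROL", "UNIT", "SCP", "STATION"])[["DATE", "EXITS"]]
                                            .shift(1))
# drop each turnstile's first day, which has no previous reading
daily_exits.dropna(subset=["PREV_DATE"], axis=0, inplace=True)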

# deal with negative data - from Lara's example
def get_daily_counts_exits(row, max_counter):
    counter = row["EXITS"] - row["PREV_EXITS"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter
    if counter > max_counter:
        # Jump is implausibly large, so the counter was probably reset;
        # use the smaller of the two readings as a proxy for the day's count
        counter = min(row["EXITS"], row["PREV_EXITS"])
    if counter > max_counter:
        # Check it again to make sure we are not giving a counter that's too big
        return 0
    return counter

# If the daily change is > 1 million, the counter might have been reset.
# Fall back to zero when no plausible value remains, as different counters have different cycle limits
daily_exits["EXIT_FLUX"] = daily_exits.apply(get_daily_counts_exits, axis=1, max_counter=1000000)
daily_exits.head()

|   | CONTROL | UNIT | SCP | STATION | DATE | EXITS | PREV_DATE | PREV_EXITS | EXIT_FLUX |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A002 | R051 | 02-00-00 | 59 ST | 08/26/2018 | 2283631 | 08/25/2018 | 2283425.0 | 206.0 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | 08/27/2018 | 2284091 | 08/26/2018 | 2283631.0 | 460.0 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | 08/28/2018 | 2284478 | 08/27/2018 | 2284091.0 | 387.0 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | 08/29/2018 | 2284941 | 08/28/2018 | 2284478.0 | 463.0 |
| 5 | A002 | R051 | 02-00-00 | 59 ST | 08/30/2018 | 2285388 | 08/29/2018 | 2284941.0 | 447.0 |
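
The entries and exits cells repeat the same steps with a different column name. One possible way to remove the duplication is sketched below: a single helper, here called `daily_flux` (a hypothetical name, not part of the notebook), that takes the counter column as a parameter and applies the same correction rules with vectorized operations instead of a row-wise `apply`. The demo frame is synthetic.

```python
import pandas as pd

KEYS = ["CONTROL", "UNIT", "SCP", "STATION"]

def daily_flux(df, column, flux_name, max_counter=1_000_000):
    """Hypothetical helper: daily change of a cumulative counter column.

    Expects the KEYS columns, a DATE string column, and a TIMING datetime column.
    """
    # Sort chronologically and keep the last reading of each day per turnstile.
    # sort=False keeps days in time order instead of sorting DATE as a string.
    ordered = df.sort_values(KEYS + ["TIMING"])
    daily = (ordered.groupby(KEYS + ["DATE"], sort=False)[column]
                    .last().reset_index())

    # Previous day's reading; each turnstile's first day has none and is dropped.
    daily["PREV"] = daily.groupby(KEYS)[column].shift(1)
    daily = daily.dropna(subset=["PREV"]).copy()

    # Same three rules as the row-wise functions above.
    diff = (daily[column] - daily["PREV"]).abs()          # reversed counters
    fallback = daily[[column, "PREV"]].min(axis=1)        # probable reset
    flux = diff.where(diff <= max_counter, fallback)
    daily[flux_name] = flux.where(flux <= max_counter, 0)  # give up: zero

    return daily.drop(columns="PREV").reset_index(drop=True)

# Synthetic readings for one turnstile over two days (illustrative values).
demo = pd.DataFrame({
    "CONTROL": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00"] * 4,
    "STATION": ["59 ST"] * 4,
    "DATE": ["08/25/2018", "08/25/2018", "08/26/2018", "08/26/2018"],
    "TIMING": pd.to_datetime(["2018-08-25 00:00", "2018-08-25 20:00",
                              "2018-08-26 00:00", "2018-08-26 20:00"]),
    "ENTRIES": [100, 150, 160, 400],
    "EXITS": [10, 20, 25, 90],
})

entries = daily_flux(demo, "ENTRIES", "ENTRY_FLUX")
exits = daily_flux(demo, "EXITS", "EXIT_FLUX")
print(entries[["DATE", "ENTRY_FLUX"]])
print(exits[["DATE", "EXIT_FLUX"]])
```

Grouping with `sort=False` after a chronological sort avoids relying on the `MM/DD/YYYY` strings sorting correctly, which they would not across a year boundary.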


In [5]:
"""
merge entries and exits into single dataframe
"""

# drop intermediate columns that are no longer needed, keeping only the merge keys and the flux
daily_entries.drop(['ENTRIES','PREV_ENTRIES','PREV_DATE'], axis=1, inplace=True)
daily_exits.drop(['EXITS','PREV_EXITS','PREV_DATE'], axis=1, inplace=True)

flux = pd.merge(daily_entries,daily_exits,on=['CONTROL','UNIT','SCP','STATION','DATE'])
flux.head()

|   | CONTROL | UNIT | SCP | STATION | DATE | ENTRY_FLUX | EXIT_FLUX |
|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | 59 ST | 08/26/2018 | 505.0 | 206.0 |
| 1 | A002 | R051 | 02-00-00 | 59 ST | 08/27/2018 | 1190.0 | 460.0 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | 08/28/2018 | 1373.0 | 387.0 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | 08/29/2018 | 1258.0 | 463.0 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | 08/30/2018 | 1311.0 | 447.0 |
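
`pd.merge` defaults to an inner join, so any turnstile-day present in only one of the two tables is dropped without warning. The sketch below shows one way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The frames are synthetic: the flux values are borrowed from the outputs above, and one exits row is deliberately left out to show the effect.

```python
import pandas as pd

keys = ["CONTROL", "UNIT", "SCP", "STATION", "DATE"]

# Two entry rows, but only one matching exit row (08/27 is missing on purpose).
entries = pd.DataFrame({
    "CONTROL": ["A002", "A002"],
    "UNIT": ["R051", "R051"],
    "SCP": ["02-00-00", "02-00-00"],
    "STATION": ["59 ST", "59 ST"],
    "DATE": ["08/26/2018", "08/27/2018"],
    "ENTRY_FLUX": [505.0, 1190.0],
})
exits = pd.DataFrame({
    "CONTROL": ["A002"],
    "UNIT": ["R051"],
    "SCP": ["02-00-00"],
    "STATION": ["59 ST"],
    "DATE": ["08/26/2018"],
    "EXIT_FLUX": [206.0],
})

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = pd.merge(entries, exits, on=keys, how="outer",
                  validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
print(len(merged), "rows after outer merge,", len(unmatched), "without a match")
```

If `unmatched` is empty on the real data, the inner join used in the cell above loses nothing.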
