# Subway Station Analysis
Compare stations by neighborhood, population, and number of riders to identify the communities that overload a given station. Include different times of day as well.

In [1]:
# Import libraries that will help us with data analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Exploratory Data Analysis
Let's begin by importing the data as a pandas DataFrame. 
We can then start getting a feel for the data before cleaning and feature
engineering.

In [2]:
# Read in the data, which is in a CSV (comma-separated values) format
data = pd.read_csv('turnstile_180127.txt')

# Let's now see if our import was successful and begin to understand the data
# by checking the first five rows of the dataframe
data.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/20/2018,03:00:00,REGULAR,6486774,2196363
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/20/2018,07:00:00,REGULAR,6486786,2196374
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/20/2018,11:00:00,REGULAR,6486844,2196436
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/20/2018,15:00:00,REGULAR,6487005,2196490
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/20/2018,19:00:00,REGULAR,6487314,2196567


Right off the bat, several columns are self-explanatory: **_'Station'_**, **_'Date'_**, and **_'Time'_** in particular. Knowing that this is turnstile data also makes the **_'Entries'_** and **_'Exits'_** columns easy to understand, but their values look like running counts that started at some unspecified point in the past. Let's consult the [MTA's official description](http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt) of these features to help us understand the other columns and make sense of the data. The MTA describes the feature columns as follows:

**_C/A_**      = Control Area (A002)

**_UNIT_**     = Remote Unit for a station (R051)

**_SCP_**      = Subunit Channel Position represents an specific address for a device (02-00-00)

**_STATION_**  = Represents the station name the device is located at

**_LINENAME_** = Represents all train lines that can be boarded at this station
           Normally lines are represented by one character.  LINENAME 456NQR repersents train server for 4, 5, 6, N, Q, and R trains.
           
**_DIVISION_** = Represents the Line originally the station belonged to BMT, IRT, or IND   

**_DATE_**     = Represents the date (MM-DD-YY)

**_TIME_**     = Represents the time (hh:mm:ss) for a scheduled audit event

**_DESc_**     = Represent the "REGULAR" scheduled audit event (Normally occurs every 4 hours)
           1. Audits may occur more that 4 hours due to planning, or troubleshooting activities. 
           2. Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered. 
           
**_ENTRIES_**  = The comulative entry register value for a device

**_EXIST_**    = The cumulative exit register value for a device

_You can see for yourself that these misspellings are on the MTA website :D_



### Dropping some features
It looks like we can drop a few of these columns, in particular the first two (**_C/A_** and **_UNIT_**). These matter little to riders or users like us, though they would be useful for internal MTA analysis. We can also drop the **_'Division'_** column, which no longer matters (again, at least to us) now that the MTA runs the subway as one cohesive system. You would be hard-pressed to find many riders who know what the BMT, IRT, or IND were.

In [3]:
smallData = data.drop(labels=['C/A', 'UNIT', 'DIVISION'], axis=1)

### Feature Engineering
Observe that we have **_'Time'_** of audit and **_'Date'_** in separate columns. To make our dataframe a bit more coherent, let's combine these two into one **_'Datetime'_** column with a timestamp format. 

In [4]:
# Concatenate DATE and TIME columns with a space in between and convert to
# datetime format
smallData['DATETIME'] = pd.to_datetime(data['DATE'] + ' ' + data['TIME'])
smallData.head()

Unnamed: 0,SCP,STATION,LINENAME,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME
0,02-00-00,59 ST,NQR456W,01/20/2018,03:00:00,REGULAR,6486774,2196363,2018-01-20 03:00:00
1,02-00-00,59 ST,NQR456W,01/20/2018,07:00:00,REGULAR,6486786,2196374,2018-01-20 07:00:00
2,02-00-00,59 ST,NQR456W,01/20/2018,11:00:00,REGULAR,6486844,2196436,2018-01-20 11:00:00
3,02-00-00,59 ST,NQR456W,01/20/2018,15:00:00,REGULAR,6487005,2196490,2018-01-20 15:00:00
4,02-00-00,59 ST,NQR456W,01/20/2018,19:00:00,REGULAR,6487314,2196567,2018-01-20 19:00:00
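
As a small aside, the combine-and-convert step can also be written with an explicit format string. The sketch below uses two rows copied from the table above; the `sample` frame and the format string are illustrative assumptions based on the MM/DD/YYYY dates visible in the data, not part of the original notebook.

```python
import pandas as pd

# Two sample rows copied from the table above
sample = pd.DataFrame({
    'DATE': ['01/20/2018', '01/20/2018'],
    'TIME': ['03:00:00', '07:00:00'],
})

# An explicit format makes the month/day order unambiguous and spares
# pandas from inferring the format
sample['DATETIME'] = pd.to_datetime(
    sample['DATE'] + ' ' + sample['TIME'],
    format='%m/%d/%Y %H:%M:%S',
)
print(sample)
```

If a row did not match the format, `pd.to_datetime` would raise an error instead of silently guessing, which can be a useful sanity check on a raw file.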


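Now that we've combined these features into a single DATETIME column of datetime type, the 'DATE' and 'TIME' columns are redundant and could be dropped. The drop below is left commented out, though, because 'DATE' is still used later to aggregate entries by day.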

In [5]:
#smallData.drop(labels=['DATE', 'TIME'], axis=1, inplace=True)

We need to be careful about the **_'DESC'_** column as well. In the next cell we see some absurd values in the 'RECOVR AUD' rows, such as one Cortlandt St. turnstile reporting more than 870 million entries! There are many other implausible values, so we should drop these rows from our dataframe entirely.

In [6]:
smallData[smallData['DESC'] == 'RECOVR AUD']

Unnamed: 0,SCP,STATION,LINENAME,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME
8384,02-00-00,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,485945,528638,2018-01-22 00:00:00
8426,02-00-01,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,551412,350362,2018-01-22 00:00:00
8468,02-00-02,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,515848,255138,2018-01-22 00:00:00
8510,02-00-03,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,871899400,871480652,2018-01-22 00:00:00
8552,02-01-00,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,565171,1170954,2018-01-22 00:00:00
8594,02-01-01,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,608060,1664083,2018-01-22 00:00:00
8636,02-01-02,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,873279,1325673,2018-01-22 00:00:00
8678,02-03-00,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,722740,165856,2018-01-22 00:00:00
8720,02-03-01,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,511351,140220,2018-01-22 00:00:00
8762,02-03-02,CORTLANDT ST,RNW,01/22/2018,00:00:00,RECOVR AUD,554753,190120,2018-01-22 00:00:00


In [7]:
smallData = smallData[smallData['DESC'] != 'RECOVR AUD']
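
Before filtering like this, it can help to check how many rows each audit type accounts for. The following is a minimal sketch on a made-up three-row frame (the `toy` name and the mix of rows are assumptions for illustration; the values are borrowed from rows shown earlier).

```python
import pandas as pd

# Toy frame for illustration; values borrowed from rows shown earlier
toy = pd.DataFrame({
    'SCP': ['02-00-00', '02-00-00', '02-00-03'],
    'DESC': ['REGULAR', 'RECOVR AUD', 'RECOVR AUD'],
    'ENTRIES': [6486774, 485945, 871899400],
})

# Count the rows of each audit type
desc_counts = toy['DESC'].value_counts()
print(desc_counts)

# Drop the recovered audits, keeping everything else
kept = toy[toy['DESC'] != 'RECOVR AUD']
print(len(toy) - len(kept), 'rows dropped')
```

Running the same `value_counts()` on the real `smallData['DESC']` would show how much data the filter removes.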

In [8]:
test = smallData[(smallData['STATION'] == 'AVENUE U') & (smallData['LINENAME'] == 'F')]

In [9]:
test.head()

Unnamed: 0,SCP,STATION,LINENAME,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME
106482,00-00-00,AVENUE U,F,01/20/2018,00:00:00,REGULAR,4833628,1868000,2018-01-20 00:00:00
106483,00-00-00,AVENUE U,F,01/20/2018,04:00:00,REGULAR,4833628,1868001,2018-01-20 04:00:00
106484,00-00-00,AVENUE U,F,01/20/2018,08:00:00,REGULAR,4833629,1868002,2018-01-20 08:00:00
106485,00-00-00,AVENUE U,F,01/20/2018,12:00:00,REGULAR,4833630,1868005,2018-01-20 12:00:00
106486,00-00-00,AVENUE U,F,01/20/2018,16:00:00,REGULAR,4833639,1868007,2018-01-20 16:00:00


In [10]:
def calculate_entries_diffs(data):
    """Return the entries counted since the previous audit, row by row.

    ENTRIES is a cumulative counter, so riders per interval are the
    difference between consecutive readings of the same turnstile.
    Rows are assumed to be ordered by turnstile and then by time.
    """
    curr_scp = ''
    curr_station = ''
    curr_entries = 0
    diff_column = []

    for _, row in data.iterrows():
        if curr_scp != row['SCP'] or curr_station != row['STATION']:
            # First reading of a new turnstile: no previous value to
            # subtract, so record 0 and start tracking this turnstile
            curr_station = row['STATION']
            curr_scp = row['SCP']
            diff_column.append(0)
        else:
            diff_column.append(row['ENTRIES'] - curr_entries)
        curr_entries = row['ENTRIES']

    return diff_column
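
For comparison, here is a vectorized sketch of the same idea using `groupby(...).diff()`. This is an alternative, not the function used in this notebook. It groups all rows sharing a (STATION, SCP) pair, so it matches the loop only when each turnstile's rows are contiguous and sorted by time. The second turnstile's readings in the toy data are made up.

```python
import pandas as pd

# Toy data: the first turnstile's ENTRIES come from the AVENUE U rows
# above; the second turnstile's readings are invented for illustration
toy = pd.DataFrame({
    'STATION': ['AVENUE U'] * 6,
    'SCP': ['00-00-00'] * 3 + ['00-00-01'] * 3,
    'ENTRIES': [4833628, 4833628, 4833629, 100, 130, 175],
})

# diff() runs within each (STATION, SCP) group; the first row of each
# group has no predecessor, so its NaN is replaced with 0
toy['DIFF_ENTRIES'] = (
    toy.groupby(['STATION', 'SCP'])['ENTRIES']
       .diff()
       .fillna(0)
       .astype(int)
)
print(toy)
```

On a full week of turnstile data this avoids the Python-level loop of `iterrows()`, which tends to be slow on large frames.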

In [11]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that direct column assignment on a filtered slice triggers
test = test.assign(DIFF_ENTRIES=calculate_entries_diffs(test))


In [20]:
# Let's also add a column for the day of the week (Monday=0 ... Sunday=6).
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that direct column assignment on a filtered slice triggers
test = test.assign(DAY_OF_WEEK=test['DATETIME'].dt.weekday)
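
The numeric weekday codes can be hard to read at a glance. As a hedged sketch, the `.dt` accessor can also produce day names and a weekend flag; the `stamps` series below is a made-up sample using dates from this dataset's week.

```python
import pandas as pd

# Three timestamps from the week covered by this file
stamps = pd.Series(pd.to_datetime([
    '2018-01-20 00:00:00',
    '2018-01-21 00:00:00',
    '2018-01-22 00:00:00',
]))

day_numbers = stamps.dt.weekday    # Monday=0 ... Sunday=6
day_names = stamps.dt.day_name()   # English day names
is_weekend = day_numbers >= 5      # Saturday or Sunday

print(pd.DataFrame({
    'DATETIME': stamps,
    'DAY_OF_WEEK': day_numbers,
    'DAY_NAME': day_names,
    'IS_WEEKEND': is_weekend,
}))
```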


In [21]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
test

Unnamed: 0,SCP,STATION,LINENAME,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME,DIFF_ENTRIES,DAY_OF_WEEK
106482,00-00-00,AVENUE U,F,01/20/2018,00:00:00,REGULAR,4833628,1868000,2018-01-20 00:00:00,0,5
106483,00-00-00,AVENUE U,F,01/20/2018,04:00:00,REGULAR,4833628,1868001,2018-01-20 04:00:00,0,5
106484,00-00-00,AVENUE U,F,01/20/2018,08:00:00,REGULAR,4833629,1868002,2018-01-20 08:00:00,1,5
106485,00-00-00,AVENUE U,F,01/20/2018,12:00:00,REGULAR,4833630,1868005,2018-01-20 12:00:00,1,5
106486,00-00-00,AVENUE U,F,01/20/2018,16:00:00,REGULAR,4833639,1868007,2018-01-20 16:00:00,9,5
106487,00-00-00,AVENUE U,F,01/20/2018,20:00:00,REGULAR,4833640,1868009,2018-01-20 20:00:00,1,5
106488,00-00-00,AVENUE U,F,01/21/2018,00:00:00,REGULAR,4833642,1868010,2018-01-21 00:00:00,2,6
106489,00-00-00,AVENUE U,F,01/21/2018,04:00:00,REGULAR,4833642,1868013,2018-01-21 04:00:00,0,6
106490,00-00-00,AVENUE U,F,01/21/2018,08:00:00,REGULAR,4833642,1868014,2018-01-21 08:00:00,0,6
106491,00-00-00,AVENUE U,F,01/21/2018,12:00:00,REGULAR,4833643,1868018,2018-01-21 12:00:00,1,6


In [16]:
test.pivot_table(values='DIFF_ENTRIES', index=['STATION', 'DATE'], aggfunc='sum')

STATION,DATE,DIFF_ENTRIES
AVENUE U,01/20/2018,254
AVENUE U,01/21/2018,210
AVENUE U,01/22/2018,489
AVENUE U,01/23/2018,527
AVENUE U,01/24/2018,531
AVENUE U,01/25/2018,540
AVENUE U,01/26/2018,516
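
To compare different times, one option is to split these daily totals into weekdays and weekends. The sketch below re-enters the seven totals from the table above by hand; the `daily` frame and the `DAY_TYPE` column are names introduced only for this illustration.

```python
import pandas as pd

# Daily DIFF_ENTRIES totals copied from the pivot table above
daily = pd.DataFrame({
    'DATE': pd.to_datetime(
        ['01/20/2018', '01/21/2018', '01/22/2018', '01/23/2018',
         '01/24/2018', '01/25/2018', '01/26/2018'],
        format='%m/%d/%Y',
    ),
    'DIFF_ENTRIES': [254, 210, 489, 527, 531, 540, 516],
})

# Label each date as weekday or weekend (Saturday=5, Sunday=6)
daily['DAY_TYPE'] = daily['DATE'].dt.weekday.map(
    lambda d: 'weekend' if d >= 5 else 'weekday'
)

# Average daily entries for each day type
avg_by_day_type = daily.groupby('DAY_TYPE')['DIFF_ENTRIES'].mean()
print(avg_by_day_type)
```

For this one week the weekday average is roughly twice the weekend average, which is the kind of contrast the analysis aims to surface across stations.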
