# Objective

Help the nonprofit organization YoLocal Snack find three potential locations to open up shop. Our goal is to find the stations with the highest entries and exits during meal hours. Because our target market is New Yorkers with long commutes, we will filter stations using indicators of a long commute.

Long Commute Indicators:

1. Boroughs Outside of the City
2. Stations with only one or two subway lines
3. Number of unlimited and student MetroCards used

After filtering and identifying potential stations, we can hand-check them by opening Google Maps to see how many local food stores are near each station. Google's activity tracker can also reveal whether traffic within these stores is higher during meal hours. In the future, YoLocal Snack will work with these vendors to efficiently cater to local commuters.


# Gathering Data

I will use MTA data from January 2021 to April 2021 as the basis of my analysis. This is a good time frame for looking at New York's commuter cycle: students go back to school in January and workers resume work after the major holidays.

To check consistency, I also gathered data for January to April of previous years to compare with traffic in 2021. If a station remains consistently busy during meal hours across the last three years, it is a strong choice for a YoLocal Snack store.

Datasets stored: 

- MTA Location Data
- MTA Turnstile Data January to April 2019 - 2021
- MTA Fare Data January to April 2019 - 2021

In [2]:
from sqlalchemy import create_engine
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime 
from datetime import timedelta
%config InlineBackend.figure_format = 'svg'
%matplotlib inline 

turnstile_url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
fare_url = "http://web.mta.info/developers/data/nyct/fares/fares_{}.csv"
location_url = "https://atisdata.s3.amazonaws.com/Station/Stations.csv"

In [3]:
pd.set_option('display.max_colwidth', None)

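```python
def get_serial_date(start_date, end_date, month):
    """Return weekly date strings (YYMMDD) from start_date to end_date, keeping only the given months."""
    week_nums = []
    # start_date and end_date are (year, month, day) tuples
    date = datetime.date(*start_date)
    end_date = datetime.date(*end_date)
    delta = timedelta(weeks = 1)
    while date <= end_date:
        date_month = date.month
        # keep only the weeks that fall in the requested months
        if date_month in month:
            week_nums.append(date.strftime("%y%m%d"))
        date += delta
    return week_nums
```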
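
As a rough usage sketch, the helper can be combined with `turnstile_url` to list the weekly files for January to April 2021. The function is restated compactly here only so the snippet runs on its own. The start date of January 2, 2021 assumes the weekly files are dated on Saturdays, and the names `weeks_2021` and `urls_2021` are made up for illustration.

```python
import datetime
from datetime import timedelta

# Compact restatement of get_serial_date so this snippet is self-contained.
def get_serial_date(start_date, end_date, month):
    week_nums = []
    date = datetime.date(*start_date)
    end = datetime.date(*end_date)
    while date <= end:
        if date.month in month:
            week_nums.append(date.strftime("%y%m%d"))
        date += timedelta(weeks=1)
    return week_nums

turnstile_url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

# Assumption: weekly files are dated on Saturdays, so we start on Saturday 2021-01-02.
weeks_2021 = get_serial_date((2021, 1, 2), (2021, 4, 24), month=[1, 2, 3, 4])

# One URL per weekly file (hypothetical variable name).
urls_2021 = [turnstile_url.format(week) for week in weeks_2021]
print(len(urls_2021), urls_2021[0])
```

Because the function steps in whole weeks from `start_date`, the weekday of the start date decides which file names are generated.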

In [4]:
engine = create_engine("sqlite:///Data/mta.db")
turnstile_df_21 = pd.read_sql("SELECT * FROM turnstile_data WHERE DATE LIKE '%2021';", engine)

In [5]:
turnstile_df_21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3377520 entries, 0 to 3377519
Data columns (total 11 columns):
 #   Column                                                                Dtype 
---  ------                                                                ----- 
 0   C/A                                                                   object
 1   UNIT                                                                  object
 2   SCP                                                                   object
 3   STATION                                                               object
 4   LINENAME                                                              object
 5   DIVISION                                                              object
 6   DATE                                                                  object
 7   TIME                                                                  object
 8   DESC                                                          

In [6]:
turnstile_df_21.columns = turnstile_df_21.columns.str.replace(' ','')

In [7]:
mta_dfs = [turnstile_df_21]
#mta_dfs = [turnstile_df_19, turnstile_df_20, turnstile_df_21]

for mta_df in mta_dfs:
    
    mta_df['DATETIME'] = pd.to_datetime(mta_df.DATE + " " + mta_df.TIME, 
                                        format="%m/%d/%Y %H:%M:%S")
    
    mta_df['TURNSTILES'] = mta_df['C/A'] + " - " +\
                           mta_df['UNIT'] + " - " +\
                           mta_df['SCP'] + " - " +\
                           mta_df['STATION'] 

In [8]:
turnstile_df_21 = turnstile_df_21[['TURNSTILES', 'C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DATETIME', 'DATE', 'TIME',
                   'ENTRIES', 'EXITS']].copy()  # explicit copy so later assignments do not act on a slice

In [14]:
turnstile_df_21['ENTRIES'] = turnstile_df_21['ENTRIES'].astype('int')
turnstile_df_21['EXITS'] = turnstile_df_21['EXITS'].astype('int')


In [11]:
turnstile_df_21.describe()


| | TURNSTILES | C/A | UNIT | SCP | STATION | LINENAME | DATETIME | DATE | TIME | ENTRIES | EXITS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3377520 | 3377520 | 3377520 | 3377520 | 3377520 | 3377520.0 | 3377520 | 3377520 | 3377520 | 3377520.0 | 3377520.0 |
| unique | 5056 | 749 | 468 | 226 | 378 | 114.0 | 216474 | 113 | 62524 | 2127757.0 | 2686004.0 |
| top | N339A - R114 - 00-00-00 - PARSONS BLVD | PTH22 | R549 | 00-00-00 | 34 ST-PENN STA | 1.0 | 2021-04-05 08:00:00 | 01/01/2021 | 04:00:00 | 0.0 | 0.0 |
| freq | 810 | 29007 | 46309 | 313237 | 69058 | 413514.0 | 2562 | 30696 | 239910 | 47280.0 | 14505.0 |
| first | | | | | | | 2021-01-01 00:00:00 | | | | |
| last | | | | | | | 2021-04-23 23:59:55 | | | | |


In [13]:
(turnstile_df_21.groupby(['TURNSTILES','DATETIME'])
[['ENTRIES', 'EXITS']].count()
.reset_index()
.sort_values(["ENTRIES", "EXITS"], ascending=False)).head()


| | TURNSTILES | DATETIME | ENTRIES | EXITS |
| --- | --- | --- | --- | --- |
| 304390 | B028 - R136 - 01-00-01 - SHEEPSHEAD BAY | 2021-01-08 04:00:00 | 2 | 2 |
| 912572 | N071 - R013 - 00-00-00 - 34 ST-PENN STA | 2021-04-08 08:00:00 | 2 | 2 |
| 913251 | N071 - R013 - 00-00-01 - 34 ST-PENN STA | 2021-04-08 08:00:00 | 2 | 2 |
| 913930 | N071 - R013 - 00-00-02 - 34 ST-PENN STA | 2021-04-08 08:00:00 | 2 | 2 |
| 914609 | N071 - R013 - 00-00-03 - 34 ST-PENN STA | 2021-04-08 08:00:00 | 2 | 2 |


In [17]:
turnstile_df_21['ENTRIES'].describe()

count    3.377520e+06
mean     4.215707e+07
std      2.186629e+08
min      0.000000e+00
25%      2.253830e+05
50%      1.505995e+06
75%      6.173308e+06
max      2.147432e+09
Name: ENTRIES, dtype: float64

In [18]:
turnstile_df_21['EXITS'].describe()

count    3.377520e+06
mean     3.392197e+07
std      1.943887e+08
min      0.000000e+00
25%      9.431400e+04
50%      9.045045e+05
75%      4.055988e+06
max      2.123068e+09
Name: EXITS, dtype: float64

# DATA CLEANING Part 1

A quick exploration of the dataset reveals several cleaning tasks. There are duplicate rows (some turnstile and timestamp pairs appear twice), the ENTRIES and EXITS columns contain outliers far from the mean (the maxima are above two billion), and the TIME column has 62,524 unique values instead of the expected 14. The ENTRIES and EXITS columns also hold cumulative counter readings rather than the number of entries or exits in each interval.

The next steps are:

1. Remove the duplicate rows.
2. Locate the outliers and save their indexes. Use the unique turnstile identifiers to replace the outlier values with numbers from a previous year, if traffic patterns were similar to the current year.
3. Check the unique time values.
4. Calculate the number of entries and exits per interval.

In [21]:
turnstile_df_21.sort_values(['TURNSTILES','DATETIME'], 
                   ascending = True, inplace = True)
turnstile_df_21.drop_duplicates(subset = ['TURNSTILES', 'DATETIME'], keep = 'first',
                      inplace = True)



In [52]:
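# DATE holds zero-padded MM/DD/YYYY strings, so this comparison is lexicographic;
# it only behaves like a date filter because every row here is from 2021.
exit_mask_0 = (turnstile_df_21['EXITS'] == 0) & (turnstile_df_21['DATE'] > '03/01/2021')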
turnstile_df_21[exit_mask_0].shape

(63291, 11)

In [55]:
entry_mask_0 = (turnstile_df_21['ENTRIES'] == 0) & (turnstile_df_21['DATE'] > '03/01/2021')
turnstile_df_21[entry_mask_0].shape

(22046, 11)

In [73]:
turnstile_df_21[entry_mask_0].groupby('STATION')['ENTRIES'].value_counts()


STATION          ENTRIES
111 ST           0          319
14 ST            0          316
168 ST           0          408
175 ST           0          632
21 ST-QNSBRIDGE  0          316
                           ... 
THIRTY ST        0            1
THIRTY THIRD ST  0          303
UTICA AV         0          328
W 4 ST-WASH SQ   0            9
W 8 ST-AQUARIUM  0           44
Name: ENTRIES, Length: 71, dtype: int64

In [None]:
#Decide what to do with turnstiles that report zero. We can exclude a station if it has low traffic and many zero readings after March 1st, 2021.
#turnstile_df_21_not_performing = pd.concat([turnstile_df_21[exit_mask_0], turnstile_df_21[entry_mask_0]])

In [None]:
#Identify irregular readings of a billion or more (10 digits). Exclude them if they come from low-traffic stations, or replace them with 2020's or 2019's values.
#mta_df['irr_entry']=mta_df['ENTRIES'].apply(lambda x: len(str(x))==10) 
#irr_entry_df = mta_df[mta_df['irr_entry'] == True]
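
The digit-length check in the commented code above can also be written as a direct numeric comparison. The sketch below runs on a small made-up frame (`toy`, with placeholder station names) and is only one possible way to flag these readings. `BILLION` and `irr_by_station` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only; station names and most values are made up.
toy = pd.DataFrame({
    "STATION": ["A", "A", "B", "B"],
    "ENTRIES": [7511448, 2147432000, 1505995, 1000000000],
})

BILLION = 1_000_000_000

# For non-negative integers this matches len(str(x)) == 10 as long as values stay below ten billion.
toy["irr_entry"] = toy["ENTRIES"] >= BILLION

# Count irregular readings per station to see where they cluster.
irr_by_station = toy.groupby("STATION")["irr_entry"].sum()
print(irr_by_station)
```

A per-station count like this could help decide whether to drop the readings or substitute values from an earlier year.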

In [79]:
turnstile_df_21['TIME'].value_counts()

04:00:00    239909
16:00:00    239882
08:00:00    239879
12:00:00    239858
20:00:00    239783
             ...  
20:24:56         1
14:52:35         1
23:15:51         1
22:03:50         1
22:01:16         1
Name: TIME, Length: 62524, dtype: int64

In [80]:
#Examine this time period closer. We'll need to reformat the time here.
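
One possible way to reformat the times is to map each reading onto the four-hour block that contains it. The sketch below uses a small made-up frame; `TIME_BIN` is a hypothetical column name, and labeling bins by their start and end hour is an assumption rather than a settled design.

```python
import pandas as pd

# Toy readings, including an off-schedule timestamp like those seen in TIME.
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2021-01-01 03:00:00",
        "2021-01-01 07:00:00",
        "2021-01-01 08:00:00",
        "2021-01-01 20:24:56",
    ])
})

# Floor each hour to the start of its four-hour block: 0, 4, 8, 12, 16, or 20.
start_hour = toy["DATETIME"].dt.hour // 4 * 4

# Label the block as "HH-HH" (24-hour clock).
toy["TIME_BIN"] = start_hour.map(lambda h: f"{h:02d}-{h + 4:02d}")
print(toy)
```

Note that a cumulative reading taken at 08:00 mostly reflects traffic from the preceding interval, so it may be more natural to assign readings to the block that ends at the reading time. That choice is left open here.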

In [82]:
turnstile_df_21[["PREV_DATE", "PREV_ENTRIES", "PREV_EXITS"]] = (turnstile_df_21
                                                       .groupby(["TURNSTILES"])[["DATE", "ENTRIES", "EXITS"]]
                                                       .shift(1))


In [83]:
turnstile_df_21.head()

| | TURNSTILES | C/A | UNIT | SCP | STATION | LINENAME | DATETIME | DATE | TIME | ENTRIES | EXITS | PREV_DATE | PREV_ENTRIES | PREV_EXITS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 03:00:00 | 01/01/2021 | 03:00:00 | 7511448 | 2558786 | | | |
| 1 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 07:00:00 | 01/01/2021 | 07:00:00 | 7511451 | 2558789 | 01/01/2021 | 7511448.0 | 2558786.0 |
| 2 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 11:00:00 | 01/01/2021 | 11:00:00 | 7511461 | 2558813 | 01/01/2021 | 7511451.0 | 2558789.0 |
| 3 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 15:00:00 | 01/01/2021 | 15:00:00 | 7511495 | 2558831 | 01/01/2021 | 7511461.0 | 2558813.0 |
| 4 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 19:00:00 | 01/01/2021 | 19:00:00 | 7511620 | 2558857 | 01/01/2021 | 7511495.0 | 2558831.0 |


In [85]:
turnstile_df_21.dropna(subset=["PREV_DATE"], axis=0, inplace=True)



In [86]:
turnstile_df_21.head()

| | TURNSTILES | C/A | UNIT | SCP | STATION | LINENAME | DATETIME | DATE | TIME | ENTRIES | EXITS | PREV_DATE | PREV_ENTRIES | PREV_EXITS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 07:00:00 | 01/01/2021 | 07:00:00 | 7511451 | 2558789 | 01/01/2021 | 7511448.0 | 2558786.0 |
| 2 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 11:00:00 | 01/01/2021 | 11:00:00 | 7511461 | 2558813 | 01/01/2021 | 7511451.0 | 2558789.0 |
| 3 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 15:00:00 | 01/01/2021 | 15:00:00 | 7511495 | 2558831 | 01/01/2021 | 7511461.0 | 2558813.0 |
| 4 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 19:00:00 | 01/01/2021 | 19:00:00 | 7511620 | 2558857 | 01/01/2021 | 7511495.0 | 2558831.0 |
| 5 | A002 - R051 - 02-00-00 - 59 ST | A002 | R051 | 02-00-00 | 59 ST | NQR456W | 2021-01-01 23:00:00 | 01/01/2021 | 23:00:00 | 7511647 | 2558865 | 01/01/2021 | 7511620.0 | 2558857.0 |


# DATA CLEANING Part 2

We have to bin the reading times into four-hour ranges (0-4, 4-8, 8-12 am and 12-4, 4-8, 8-12 pm) to make analysis easier. Outliers, judged against the 25th and 75th percentiles of the dataset, need to be removed or replaced with numbers from previous years, assuming MTA traffic patterns are consistent over the years.
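
As an illustration of quartile-based outlier detection, the sketch below applies the common 1.5 × IQR rule (Tukey's fences) to a handful of made-up per-interval counts. This rule is a stand-in chosen for the example and not necessarily the rule this analysis will settle on. The names `counts`, `lower`, `upper`, and `is_outlier` are hypothetical.

```python
import pandas as pd

# Made-up per-interval counts with one implausible value.
counts = pd.Series([3, 10, 34, 125, 27, 8, 2_000_000], name="INTERVAL_ENTRIES")

# 25th and 75th percentiles and the interquartile range.
q1, q3 = counts.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (counts < lower) | (counts > upper)

print(counts[is_outlier])
```

Flagged rows could then be dropped or replaced with values from an earlier year, as described above.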

Before we calculate the entries and exits for a particular interval, we need to perform a gut check. Ideally, every previous reading is less than or equal to the current reading, since the counters are cumulative. We want to check for cases where PREV_ENTRIES > ENTRIES or PREV_EXITS > EXITS and then decide how to calculate the entries and exits.

In [88]:
mask = (turnstile_df_21["ENTRIES"] < turnstile_df_21["PREV_ENTRIES"])
turnstile_df_21[mask].groupby(["TURNSTILES"]).size()

TURNSTILES
A002 - R051 - 02-03-02 - 59 ST                1
A011 - R080 - 01-03-00 - 57 ST-7 AV         666
A011 - R080 - 01-03-01 - 57 ST-7 AV           1
A025 - R023 - 01-06-00 - 34 ST-HERALD SQ      1
A031 - R083 - 00-00-01 - 23 ST                1
                                           ... 
R624 - R124 - 00-00-02 - KINGSTON AV          1
R627 - R063 - 00-00-02 - SUTTER AV-RUTLD      1
R627 - R063 - 00-03-02 - SUTTER AV-RUTLD      2
R730 - R431 - 00-00-04 - EASTCHSTER/DYRE    597
S101 - R070 - 00-00-04 - ST. GEORGE           1
Length: 280, dtype: int64
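
One possible way to handle these cases, sketched on toy data: take the absolute difference between readings (treating a decreasing counter as one that counts in reverse) and zero out jumps that are implausibly large for a single interval (treating them as counter resets). Both rules are assumptions for illustration, as are the 10,000 cap and the names `MAX_PER_INTERVAL` and `INTERVAL_ENTRIES`.

```python
import pandas as pd

# Toy rows: a normal reading, a reversed counter, and two reset-like jumps.
toy = pd.DataFrame({
    "ENTRIES":      [7511451, 100, 5, 2147432000],
    "PREV_ENTRIES": [7511448, 110, 1_000_000, 7],
})

# Assumed upper bound on plausible entries through one turnstile per interval.
MAX_PER_INTERVAL = 10_000

# Absolute difference handles counters that run backwards.
delta = (toy["ENTRIES"] - toy["PREV_ENTRIES"]).abs()

# Keep plausible deltas; replace implausible jumps with 0.
toy["INTERVAL_ENTRIES"] = delta.where(delta <= MAX_PER_INTERVAL, 0)
print(toy)
```

The same pattern would apply to EXITS and PREV_EXITS. Replacing large jumps with 0 undercounts traffic for those intervals, so substituting values from a previous year, as proposed earlier, is an alternative.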

# DATA ANALYSIS WITH ONLY TURNSTILE DATA

After finding the entries and exits values, we can add the two to get the total traffic for a particular turnstile at a given time of day.

Questions:

1. Find the top 20 stations with the highest number of exits, entries, and traffic.
    - Then find the top stations served by only one or two lines with the highest number of exits, entries, and traffic.
2. Using the results from question one, find the stations with the highest exits, entries, and traffic for the meal-hour time ranges 8-12, 12-4, and 4-8.
    - Which stations have the most entries around 8-12 am?
    - Which stations have the most exits around 4-8 pm?
    - Which stations have the most exits around 8-12 am?
    - Which stations have the most entries around 4-8 pm?
3. Find the average total of exits, entries, and traffic for each weekday.
    - Do entries equal exits?
    - Is traffic consistent throughout the weekdays?
    - Using total traffic, can we establish the percentage of people in certain stations?
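
A minimal sketch of question 1 on toy data, assuming per-interval counts already exist in hypothetical columns `INTERVAL_ENTRIES` and `INTERVAL_EXITS`. The numbers are made up. The line filter assumes each character of LINENAME stands for one subway line.

```python
import pandas as pd

# Toy per-interval counts; all values are made up for illustration.
toy = pd.DataFrame({
    "STATION": ["59 ST", "59 ST", "PARSONS BLVD", "SHEEPSHEAD BAY", "SHEEPSHEAD BAY"],
    "LINENAME": ["NQR456W", "NQR456W", "F", "BQ", "BQ"],
    "INTERVAL_ENTRIES": [120, 80, 60, 40, 30],
    "INTERVAL_EXITS": [100, 90, 20, 50, 10],
})

# Total traffic is entries plus exits.
toy["TRAFFIC"] = toy["INTERVAL_ENTRIES"] + toy["INTERVAL_EXITS"]

# Keep stations served by one or two lines (one character per line).
few_lines = toy[toy["LINENAME"].str.len() <= 2]

# Sum traffic per station and keep the 20 busiest.
top_stations = (few_lines
                .groupby(["STATION", "LINENAME"])["TRAFFIC"]
                .sum()
                .sort_values(ascending=False)
                .head(20))
print(top_stations)
```

Grouping by both STATION and LINENAME keeps apart different stations that share a name.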


# DATA VISUALIZATIONS WITH ONLY TURNSTILE DATA

Plot the answers to the questions to identify insights and potential gaps in the data.

- Bar Chart -> Top 20 stations with the highest exits, entries, traffic
- Line Chart -> Consistency of entries and exits over time for a station (we're looking for consistent traffic)
- Scatter Plot -> Exits versus entries for a particular station
- Heatmap -> Traffic flow by weekday and TIME for a particular station
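
As a rough sketch of the heatmap idea, the snippet below pivots made-up traffic values for one station into a weekday-by-hour grid and draws it with matplotlib. The column names `TRAFFIC`, `WEEKDAY`, and `HOUR` are hypothetical, and the mean is only one possible aggregation.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up traffic readings for a single station.
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2021-04-05 08:00:00", "2021-04-05 16:00:00",
        "2021-04-06 08:00:00", "2021-04-06 16:00:00",
        "2021-04-12 08:00:00",
    ]),
    "TRAFFIC": [100, 300, 120, 280, 140],
})
toy["WEEKDAY"] = toy["DATETIME"].dt.day_name()
toy["HOUR"] = toy["DATETIME"].dt.hour

# Rows are weekdays, columns are reading hours, cells are mean traffic.
heat = toy.pivot_table(index="WEEKDAY", columns="HOUR",
                       values="TRAFFIC", aggfunc="mean")

fig, ax = plt.subplots()
im = ax.imshow(heat.to_numpy(), aspect="auto", cmap="viridis")
ax.set_xticks(range(len(heat.columns)))
ax.set_xticklabels(heat.columns)
ax.set_yticks(range(len(heat.index)))
ax.set_yticklabels(heat.index)
ax.set_xlabel("Hour of reading")
ax.set_ylabel("Weekday")
fig.colorbar(im, ax=ax, label="Mean traffic")
```

The pivot sorts weekdays alphabetically, so a real chart would probably reorder the index from Monday to Sunday.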

# ADDING FARE AND LOCATION DATASETS

# CONCLUSION

# FUTURE IDEAS