<h2>MTA Ridership On New York City Map</h2>
<h5><i>By Kyoosik Kim</i><h5>
<h3 style="font-family:Tahoma;">Introduction</h3>
<p style="font-family:sans-serif;">This project is to visualize MTA ridership on the map of New York City. In the course of the project, I will encompass various aspects of data science from data munging, data visualizaition, to analysis. The base data set is <a href="https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv">NYC turnstile data</a> from the Udacity course <a href="https://www.udacity.com/course/intro-to-data-science--ud359">Intro to Data Science</a>. Additionally, I will utilize two other external data sets; <a href="https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49/data">NYC Subway Station</a> and <a href="http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls
">MTA Remote Unit Code</a>. The location information from the NYC subway station is to be added onto the other data set. After then, the combined data set is merged with the base data set to provide the ridership data by station and its indivisual physical location. The final outcome will be the map of New York City with MTA ridership as dots in the size of popularity.</p>

<h3 style="font-family:Tahoma;">Data Preparation</h3>
<p style="font-family:sans-serif;">&bull; Import Libraries and Files</p>

In [12]:
import numpy as np
import pandas as pd

import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib inline

import string
from fuzzywuzzy import fuzz

master_df = pd.read_csv('./Data/turnstile_data_master_with_weather.csv')
master_df.drop(['Unnamed: 0'], axis = 1, inplace = True)

master_df['DATEn'] = pd.to_datetime(master_df['DATEn'])
master_df['day'] = master_df['DATEn'].apply(lambda d : d.day)

day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
master_df['day_of_week'] = master_df['DATEn'].dt.weekday_name
master_df['day_of_week'] = master_df['day_of_week'].map(dict(zip(day_list, np.arange(1, 8))))

<br/>
<p style="font-family:sans-serif;">&bull; Define Functions</p>

In [13]:
# stations on the map by ridership - library & functions


#
def process_st_number(words):
    ordinal = ["st", "nd", "rd", "th"]
    result = ""
    words = words.split(" ")
    
    for word in words:
        if word[0].isdigit():
            for match in ordinal:            
                word = word.replace(match, "")
        
        result = result + " " + word
    
    return result.strip()

#
def swap_station_name(name):
    name = name.replace("(", "- ")
    name = name.replace(")", "")
    
    name = process_st_number(name)
    name = name.replace(" - ", "-").upper()
    
    split = name.split("-")
    
    if len(split) == 2:
        if (not split[0][0].isdigit()) and (split[1][0].isdigit()):
            return split[1] + "-" + split[0]
        else:
            return split[0] + "-" + split[1]
    
    return name

# make the two station list in accordance
def process_station_name(name):
    name = swap_station_name(name)
    
    if len(name) >= 15:
        if name[13] is " ":
            return name[0:13]
        else:
            return name[0:15]
        
    return name

# 
def compare_station_name(name, match_list):
    if (name in match_list):
        return name
    
    highest = 0
    similar_name = ""
    for match in match_list:
        score = fuzz.ratio(name, match)
        if score > highest:
            highest = score
            similar_name = match
            
    return similar_name

<br/>
<p style="font-family:sans-serif;">&bull; Read and Process Data Sets in Format</p>

In [14]:
# stations on the map by ridership - read & process data

remote_unit_df = pd.read_csv('./Data/Remote-Booth-Station.csv',
                             usecols = ['Remote', 'Station'])
remote_unit_df.drop_duplicates(subset = 'Remote', keep = 'first', inplace = True)
remote_unit_df.columns = ['UNIT', 'station']
remote_unit_df['station'] = remote_unit_df['station'].apply(
    lambda s : swap_station_name(s))

station_position_df = pd.read_csv('./Data/DOITT_SUBWAY_STATION_01_13SEPT2010.csv',
                                  usecols = ['NAME', 'the_geom'])
station_position_df.drop_duplicates(subset = 'NAME', keep = 'first', inplace = True)
station_position_df.columns = ['station', 'position']
station_position_df['station'] = station_position_df['station'].apply(
    lambda s : process_station_name(s))

<br/>
<p style="font-family:sans-serif;">&bull; Process Data Sets for Comparison</p>

In [15]:
# stations on the map by ridership - prepare to merge

station_position_df['station_temp'] = station_position_df['station'].apply(
    lambda name : compare_station_name(name, list(remote_unit_df['station'])))

diff = []
for i in range(0, len(station_position_df)):
    if (station_position_df['station'].iloc[i] is not 
            station_position_df['station_temp'].iloc[i]):
        diff.append(i)
        
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    display((station_position_df.iloc[diff]).sort_values('station'))

Unnamed: 0,station,position,station_temp
13,104-102 STS,POINT (-73.84443500029684 40.69516599823373),104 ST
6,110 ST-CATHEDRA,POINT (-73.95806670661364 40.800581558114956),110 ST-CATHEDRL
159,110 ST-CENTRAL,POINT (-73.95182200176913 40.79907499977324),110 ST-CATHEDRL
140,138 ST-GRAND CO,POINT (-73.92984899935611 40.81322399958908),138 ST-GR CONC
260,148 ST-HARLEM,POINT (-73.93647000005559 40.82388000080457),148 ST-LENOX
46,149 ST-GRAND CO,POINT (-73.9273847542618 40.81830344372315),149 ST-GR CONC
106,15 ST-PROSPECT,POINT (-73.97973580592873 40.66003568810021),15 ST-PROSPECT
411,163 ST-AMSTERDA,POINT (-73.93989200188344 40.83601299923096),163 ST-AMSTERDM
57,168 ST,POINT (-73.93956099985425 40.84071899990795),18 ST
20,174-175 STS,POINT (-73.91013600050078 40.84589999983414),174-175 ST


<br/>
<p style="font-family:sans-serif;">&bull; Fine Tuning Before Merging</p>

In [16]:
# stations on the map by ridership - fine tuning

correct_station_name_df = pd.read_csv('./Data/correct_station_name.txt')

for row, index in correct_station_name_df.iterrows():
    station_position_df['station_temp'][index[0]] = index[1]
    
station_position_df['station'] = station_position_df['station_temp']
station_position_df.drop('station_temp', axis = 1, inplace = True)

<br/>
<p style="font-family:sans-serif;">&bull; Merge the Data Sets</p>

In [17]:
# stations on the map by ridership - finalize date preparation

# merge on station names and UNITs
remote_unit_df = pd.merge(remote_unit_df, station_position_df, on = 'station')
master_df = pd.merge(master_df, remote_unit_df, on = 'UNIT')
master_df

entry_exit_by_station = master_df.groupby(
    ['station', 'position'])['ENTRIESn_hourly', 'EXITSn_hourly'].mean()
entry_exit_by_station_df = pd.DataFrame(entry_exit_by_station)
entry_exit_by_station_df.drop_duplicates(inplace = True)
entry_exit_by_station_df.reset_index(level = ['station', 'position'], inplace = True)
entry_exit_by_station_df.columns = ['station', 'position', 'entry_mean', 'exit_mean']

entry_exit_by_station_df.head()

Unnamed: 0,station,position,ENTRIESn_hourly,EXITSn_hourly
0,1 AVE,POINT (-73.98168087489128 40.73097497580066),3295.478495,3594.61828
1,103 ST,POINT (-73.96837899960818 40.799446000334825),1630.593002,1214.834254
2,103 ST-CORONA,POINT (-73.86269999830412 40.749865000555545),2735.22043,2116.392473
3,104 ST,POINT (-73.83768300060997 40.681711001091195),367.005882,299.8
4,110 ST,POINT (-73.94424999687163 40.795020000113105),1713.263441,1484.026882


<h3 style="font-family:Tahoma;">Data Visualization</h3>
<p style="font-family:sans-serif;">&bull; Import Libraries and Files</p>

<h3 style="font-family:Tahoma;">Analysis</h3>
<p style="font-family:sans-serif;">&bull; Import Libraries and Files</p>

<h3 style="font-family:Tahoma;">Conclusion</h3>
<p style="font-family:sans-serif;">&bull; Import Libraries and Files</p>