# Finding Similar Public Transportation Systems to Washington D.C. to Explore Next Steps for Growth

#### Part of the IBM Data Science Capstone Project

### I. Introduction

The D.C. Metro is one of the busiest rapid transit systems in the United States, second only to the New York City Subway. As of May 2020, the network comprises 91 stations on six lines and serves thousands of riders in the surrounding areas of Maryland and Virginia. The Washington Metropolitan Area Transit Authority (WMATA) anticipates an average of one million riders daily by 2030. In response to the growing population and number of commuters, the WMATA has focused on extending service, building new stations, and constructing additional lines to alleviate congestion.

This necessity for growth can either be an obstacle that hurts a city fiscally and fails to meet the demands of its citizens, or it can be an opportunity to expand efficiently, decreasing congestion and easing daily commutes as the population grows. Rather than grow blindly, cities can use the growing arsenal of data analytics techniques to anticipate demand.

To address this need, I propose a twofold analysis. The first stage involves examining and clustering metro stations in D.C. based on surrounding venues. Leveraging the Foursquare API, we can explore the venue types surrounding each station, which allows us to build a model that clusters stations based on their primary usage. In just this first step, city planners can begin to predict travel demand from each station. For example, if a station is in a largely residential neighborhood, it can be presumed that riders there need transit to commercial and professional neighborhoods.
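
The first stage can be illustrated with a small, dependency-free sketch. It assumes k-means as the clustering method (the text above only says "clusters") and assumes each station is described by the share of nearby venues in each category. The station names and venue categories are made-up placeholders, not Foursquare results.

```python
from collections import Counter


def category_frequencies(venues_by_station):
    """Turn {station: [category, ...]} into (category order, {station: frequency vector})."""
    categories = sorted({c for cats in venues_by_station.values() for c in cats})
    vectors = {}
    for station, cats in venues_by_station.items():
        counts = Counter(cats)
        total = len(cats) or 1  # avoid division by zero for stations with no venues
        vectors[station] = [counts[c] / total for c in categories]
    return categories, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k stations in sorted order seed the centroids."""
    names = sorted(vectors)
    centroids = [list(vectors[n]) for n in names[:k]]
    labels = {}
    for _ in range(iterations):
        # Assignment step: each station joins its nearest centroid.
        labels = {
            n: min(range(k), key=lambda i: squared_distance(vectors[n], centroids[i]))
            for n in names
        }
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [vectors[n] for n in names if labels[n] == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical input: venue categories found near four placeholder stations.
toy = {
    "Station A": ["Office", "Office", "Coffee Shop"],
    "Station B": ["Park", "Grocery Store", "Park"],
    "Station C": ["Office", "Coffee Shop", "Office"],
    "Station D": ["Park", "Park", "Grocery Store"],
}
categories, vectors = category_frequencies(toy)
labels = kmeans(vectors, k=2)
print(labels)
```

In this toy input the two office-heavy stations end up in one cluster and the two park-heavy stations in the other. A library implementation such as scikit-learn's `KMeans` would be the more usual choice on real data; the hand-rolled version is used here only to keep the example self-contained.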

The second stage involves looking at a variety of cities with efficient and expansive transit networks and conducting similar clustering analyses. Based on the most common classifications of metro stations across several cities, we can determine how similar or dissimilar the system in D.C. is to those of other cities. Cities that cluster similarly to D.C. can be explored further by city planners. By looking at more extensive but similar transit systems across the globe, the WMATA can examine how these cities expanded their metro networks. This gives the Washington Metro the opportunity to learn from its predecessors' successes and mistakes in the expansion process.
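
One possible way to compare systems, sketched below, is to summarize each city by the share of its stations in each cluster and then rank cities by cosine similarity to the reference city. Both the profile representation and the similarity measure are assumptions of mine rather than something fixed by the proposal. The city names and labels are placeholders.

```python
import math
from collections import Counter


def cluster_profile(station_labels, k):
    """Share of a city's stations that fall in each of the k clusters."""
    counts = Counter(station_labels)
    total = len(station_labels) or 1  # guard against an empty city
    return [counts[i] / total for i in range(k)]


def cosine_similarity(a, b):
    """Cosine of the angle between two profiles (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(target, profiles):
    """Other cities, ordered from most to least similar to the target city."""
    others = [city for city in profiles if city != target]
    return sorted(
        others,
        key=lambda city: cosine_similarity(profiles[target], profiles[city]),
        reverse=True,
    )


# Hypothetical cluster labels per station for three placeholder cities.
labels_by_city = {
    "City X": [0, 0, 1, 2],
    "City Y": [0, 0, 0, 1, 1, 2],
    "City Z": [2, 2, 2, 1],
}
profiles = {city: cluster_profile(lbls, k=3) for city, lbls in labels_by_city.items()}
ranking = rank_by_similarity("City X", profiles)
print(ranking)
```

With these placeholder labels, "City Y" ranks ahead of "City Z" because its mix of station types is closer to that of "City X".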



### II. Data

For the first stage, the location of each metro station and the venues surrounding it are needed.
1. All the metro stations in the D.C. metropolitan area are scraped from [this Wikipedia page.](https://en.wikipedia.org/wiki/List_of_Washington_Metro_stations)

    To obtain the coordinates for each station, a geolocator will be used.
2. Using the Foursquare API, the venues within 500 m will be examined and their type determined. Foursquare will assist in the categorization. 
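
As a rough sketch of the venue lookup in step 2, the request for one station could be assembled as below. This assumes the legacy Foursquare v2 `venues/explore` endpoint and its `ll`, `radius`, `limit`, and `v` parameters. The helper name, the default version date, the limit, and the coordinates in the example are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20200501", radius_m=500, limit=50):
    """Build the URL that asks for venues within radius_m meters of (lat, lng)."""
    params = {
        "client_id": client_id,          # credentials from a Foursquare developer account
        "client_secret": client_secret,
        "v": version,                    # API version date in YYYYMMDD form
        "ll": f"{lat},{lng}",            # search center as "latitude,longitude"
        "radius": radius_m,              # 500 m, as described above
        "limit": limit,                  # maximum number of venues to return
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials and coordinates, for illustration only.
url = build_explore_url(38.9, -77.0, "MY_CLIENT_ID", "MY_CLIENT_SECRET")
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the category of each returned venue read from the JSON response.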

For the second stage other metro systems must be selected, and a similar examination must occur.
1. A list of metro systems around the world will be scraped from [this Wikipedia page.](https://en.wikipedia.org/wiki/List_of_metro_systems)

   The metro systems with the highest annual ridership will be selected for further examination. Selecting by ridership, rather than hand-picking cities with high-quality transit systems, lets the WMATA also see the shortcomings of larger systems that do not rank among the best. This will allow D.C.'s city planners to learn from similar cities' mistakes as well as from some of the greatest comparable transit systems in the world. 

2. A list of each metro station in a city and the geographical coordinates will be obtained either through scraping the Wikipedia page or using a geolocator. 

   This is necessary because, while the Wikipedia pages do list every metro station in a city's system, they do not always include the coordinates. How the coordinates for a station are obtained therefore depends on what information can be scraped from the Wikipedia page. For example, the [list of stations in Hong Kong](https://en.wikipedia.org/wiki/List_of_MTR_stations) does not include longitude and latitude, so each station's location must be determined using a geolocator. 
   
3. Using the Foursquare API, the venues within 500 m will be examined and their type determined. Foursquare will assist in the categorization. 
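
The selection and fallback logic in steps 1 and 2 might look like the sketch below. The field names (`annual_ridership_millions`, `lat`, `lon`), the system and station names, and the ridership figures are all placeholders of my own. The `geocode` argument stands in for whatever geolocator call is used.

```python
def top_systems_by_ridership(systems, n):
    """Names of the n systems with the highest annual ridership.

    Rows with no ridership figure are skipped rather than guessed.
    """
    ranked = [s for s in systems if s.get("annual_ridership_millions") is not None]
    ranked.sort(key=lambda s: s["annual_ridership_millions"], reverse=True)
    return [s["name"] for s in ranked[:n]]


def resolve_coordinates(station, geocode):
    """Prefer scraped coordinates; otherwise fall back to the geocoder.

    `geocode` is any callable mapping a station name to (lat, lon) or None.
    """
    if station.get("lat") is not None and station.get("lon") is not None:
        return station["lat"], station["lon"]
    return geocode(station["name"])


# Placeholder rows standing in for the scraped table of metro systems.
systems = [
    {"name": "System A", "annual_ridership_millions": 120.0},
    {"name": "System B", "annual_ridership_millions": None},
    {"name": "System C", "annual_ridership_millions": 450.0},
]
selected = top_systems_by_ridership(systems, n=2)


def fake_geocode(name):
    """Stand-in geocoder so the example runs without network access."""
    return (0.0, 0.0)


with_coords = resolve_coordinates({"name": "Stop 1", "lat": 1.5, "lon": 2.5}, fake_geocode)
without_coords = resolve_coordinates({"name": "Stop 2", "lat": None, "lon": None}, fake_geocode)
print(selected, with_coords, without_coords)
```

Keeping the geocoder behind a plain callable means the same function works whether a city's Wikipedia page supplies coordinates or not.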

This concludes the data gathering portion of this project. 

In [4]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values



In [8]:
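address = 'Lo Wu Station, HK'

# Nominatim expects a user_agent string that identifies the application
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)  # returns None if the address cannot be resolved
if location is None:
    raise ValueError(f"Could not geocode {address!r}")
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)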

22.5292045 114.1142734
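
To extend the single lookup above to a whole list of stations, a helper along these lines could be used. It is a sketch only: `geocode_stations` and its `suffix` parameter are names I am introducing, and `fake_geocode` is a stand-in that returns the coordinates printed above, so the example runs without contacting Nominatim.

```python
from types import SimpleNamespace


def geocode_stations(names, geocode, suffix=""):
    """Geocode each station name; return (coords, missing).

    `geocode` is any callable that returns an object with .latitude and
    .longitude attributes, or None when nothing matches (as geopy does).
    """
    coords, missing = {}, []
    for name in names:
        location = geocode(f"{name}{suffix}")
        if location is None:
            missing.append(name)  # keep track of stations to fix by hand
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords, missing


def fake_geocode(query):
    """Stand-in for geolocator.geocode, used here to avoid network access."""
    if query == "Lo Wu Station, HK":
        return SimpleNamespace(latitude=22.5292045, longitude=114.1142734)
    return None


coords, missing = geocode_stations(
    ["Lo Wu Station", "Nonexistent Station"], fake_geocode, suffix=", HK"
)
print(coords, missing)
```

In real use, `geolocator.geocode` would be passed in place of `fake_geocode`, ideally wrapped in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) so that requests to Nominatim are spaced out.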
