<h3> Capstone Project </h3>
   

<h4> Introduction/Business problem </h4>

The problem: Which London Underground stations could be classified as residential, leisure or business?

London is a huge, bustling city of 9 million residents. How this population moves is tied closely to the famous London Underground system, so by categorising each station we can learn more about the population and how it flows through the city. Once the stations are categorised, we would know what each one is primarily used for and from which other stations people are likely to commute. Answering this question could help a start-up choose which area to open in, or help a town planner understand the flow of London's population when considering how to extend the rail system.

   

<h4> Data </h4>

The data used for this project will be a list of London Underground stations and their coordinates from https://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations.

We will also need to know which venues surround each station (within a certain radius) and what type each venue is, so that we can categorise the station. We can find these using the Foursquare API, which incidentally comes with its own set of categories that we will use.

    Arts & Entertainment
    College & University
    Event
    Food
    Nightlife Spot
    Outdoors & Recreation
    Professional & Other Places
    Residence
    Shop & Service
    Travel & Transport
    

<h4>Methodology</h4>

We must first scrape the required data, clean it and place it into a pandas dataframe.

In [101]:
import pandas as pd
from pandas import DataFrame
!pip install folium
import folium
import requests



In [2]:
url = 'https://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations'

df = pd.read_html(url)

df = df[0]

df

Unnamed: 0,Name,Latitude,Longitude,Platform / Entrance,Collected By,Collected On,Line,Step free
0,Acton Town,51.502500,-0.278126,Platform,User:Gagravarr,24/11/06,"District, Piccadilly",
1,Acton Central,51.50883531,-0.263033174,Entrance,User:Firefishy,08/05/2007,London Overground,
2,Acton Central,51.50856013,-0.262879534,Platform,User:Firefishy,08/05/2007,London Overground,
3,Aldgate,51.51394,-0.07537,Aldgate High Street entrance,User:Morwen,28/4/2007,Metropolitan,No
4,Aldgate East,51.51514,-0.07178,Entrance,User:Parsingphase,(2006),"District, Hammersmith & City",
...,...,...,...,...,...,...,...,...
297,Wimbledon,51.42200,-0.20544,Platform,User:Mattwestcott,18/05/2007,District,Unsure
298,Wimbledon Park,51.43391,-0.19864,Platform,User:Mattwestcott,18/05/2007,District,
299,Wood Green,51.59709,-0.10939,Entrance,User:Morwen,26/05/2007,Piccadilly,No (Escalators)
300,Woodford,51.60582,+0.03328,platforms,User:Morwen,21/4/07,Central,


In [3]:
df = df.drop(columns = ['Platform / Entrance', 'Collected By', 'Collected On', 'Step free'])
df.at[42, 'Latitude'] = 51.49787
df.at[42, 'Longitude'] = -0.04967
df = df.drop_duplicates(subset=['Name'])
df.head()

Unnamed: 0,Name,Latitude,Longitude,Line
0,Acton Town,51.5025,-0.278126,"District, Piccadilly"
1,Acton Central,51.50883531,-0.263033174,London Overground
3,Aldgate,51.51394,-0.07537,Metropolitan
4,Aldgate East,51.51514,-0.07178,"District, Hammersmith & City"
5,Alperton,51.54097,-0.30061,Piccadilly


In [4]:
df.shape

(291, 4)
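
The coordinates scraped from the wiki table arrive as text rather than numbers (the Foursquare query output later in the notebook shows values such as '51.502500' and '+0.03328' in quotes). A minimal sketch of converting them to floats with `pd.to_numeric`, using a small hand-made frame named `sample` (a stand-in, not the scraped `df`):

```python
import pandas as pd

# Hypothetical stand-in for the scraped table; values copied from rows shown in the notebook.
sample = pd.DataFrame({
    'Name': ['Acton Town', 'Woodford'],
    'Latitude': ['51.502500', '51.60582'],
    'Longitude': ['-0.278126', '+0.03328'],
})

# errors='coerce' turns anything unparseable into NaN instead of raising,
# so malformed rows can be inspected afterwards.
for col in ['Latitude', 'Longitude']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

print(sample.dtypes)
```

The same loop could be applied to `df` after the cleaning step, which would make the coordinate columns usable for arithmetic such as distance calculations.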

Visualising the stations on a map

In [5]:
Latitude = 51.5074
Longitude = -0.1278

map_london = folium.Map(location=[Latitude, Longitude], zoom_start=10)

# add markers to map
for lat, lng, name in zip(df['Latitude'], df['Longitude'], df['Name']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)  
    
map_london

Now we define our client credentials for the Foursquare query.

In [12]:
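import os

# Read the Foursquare credentials from the environment rather than hard-coding them.
# The two variable names below are an assumed convention, not part of the original notebook.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']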
VERSION = '20180605' # Foursquare API version
LIMIT = 200 # A default Foursquare API limit value
radius = 500 # define radius
CATEGORIES = {
    "Arts & Entertainment": "4d4b7104d754a06370d81259",
    "College & University": "4d4b7105d754a06372d81259",
    "Event": "4d4b7105d754a06373d81259",
    "Food": "4d4b7105d754a06374d81259",
    "Nightlife Spot": "4d4b7105d754a06376d81259",
    "Outdoors & Recreation": "4d4b7105d754a06377d81259",
    "Professional & Other Places": "4d4b7105d754a06375d81259",
    "Residence": "4e67e38e036454776db1fb3a",
    "Shop & Service": "4d4b7105d754a06378d81259",
    "Travel & Transport": "4d4b7105d754a06379d81259"}

Now we write the function that will be called for each station to find its nearby venues.

In [25]:
def getNearbyVenues(names, latitudes, longitudes, categories, radius=500):
    """Return one row per station: [name, lat, lng, venue count per category...]."""
    category_count = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)  # progress indicator
        row = [name, lat, lng]
        for category in categories.values():
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}'.format(
                CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, category)
            # make the GET request and keep only the number of matching venues
            row.append(requests.get(url).json()['response']['totalResults'])
        category_count.append(row)
    return category_count
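
The function above formats the query string by hand and does not check the HTTP status. A minimal sketch of a single-category lookup that passes the same query fields through the `params` argument of `requests.get`, sets a timeout and raises on HTTP errors. The helper name `count_venues`, the `get` argument (there so a stand-in can be swapped in for testing) and the 10-second timeout are assumptions, not part of the original notebook:

```python
import requests

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'

def count_venues(lat, lng, category_id, client_id, client_secret,
                 version='20180605', radius=500, get=requests.get):
    """Return totalResults for one category around one point."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'categoryId': category_id,
    }
    response = get(EXPLORE_URL, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError later
    return response.json()['response']['totalResults']
```

Letting `requests` encode the parameters avoids mistakes in the hand-built URL, and the explicit status check makes a rejected request easier to tell apart from a genuinely empty result.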

Calling this function on the whole dataframe at once fails because it sends too many requests, so we break df up into smaller pieces.

In [26]:
result = getNearbyVenues(df['Name'][0:5], df['Latitude'][0:5], df['Longitude'][0:5], CATEGORIES)
print(result)

Acton Town
Acton Central
Aldgate
Aldgate East
Alperton
[['Acton Town', '51.502500', '-0.278126', 1, 2, 0, 8, 2, 2, 6, 5, 2, 6], ['Acton Central', '51.50883531', '-0.263033174', 4, 0, 0, 6, 5, 9, 6, 0, 13, 3], ['Aldgate', '51.51394', '-0.07537', 11, 29, 0, 94, 66, 40, 64, 9, 43, 52], ['Aldgate East', '51.51514', '-0.07178', 9, 33, 0, 100, 45, 28, 45, 11, 47, 37], ['Alperton', '51.54097', '-0.30061', 0, 4, 0, 7, 3, 7, 5, 2, 9, 4]]


In [27]:
len(result)

5

In [97]:
result2 = getNearbyVenues(df['Name'][285:291], df['Latitude'][285:291], df['Longitude'][285:291], CATEGORIES)
for i in result2:
    result.append(i)
print(result[-5:])

Willesden Junction
Wimbledon
Wimbledon Park
Wood Green
Woodford
Woodside Park
[['Wimbledon', '51.42200', '-0.20544', 5, 2, 0, 52, 16, 7, 39, 1, 61, 22], ['Wimbledon Park', '51.43391', '-0.19864', 0, 0, 0, 8, 1, 4, 7, 0, 4, 3], ['Wood Green', '51.59709', '-0.10939', 4, 2, 0, 31, 11, 6, 23, 6, 47, 10], ['Woodford', '51.60582', '+0.03328', 0, 0, 0, 8, 2, 5, 3, 0, 10, 3], ['Woodside Park', '51.6181717295887', '-0.185578887883903', 0, 0, 0, 3, 0, 6, 3, 4, 3, 3]]
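
Slicing the dataframe by hand, five stations at a time, can be automated. A minimal sketch of a batching helper that calls a fetch function on consecutive slices and pauses between calls. The name `fetch_in_batches`, the `fetch` callback and the pause length are assumptions; the notebook does not state the actual Foursquare rate limit, so the pause would need tuning:

```python
import time

def fetch_in_batches(rows, fetch, batch_size=5, pause_seconds=1.0):
    """Call fetch(batch) on consecutive slices of rows, pausing between calls."""
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results.extend(fetch(batch))
        if start + batch_size < len(rows):
            time.sleep(pause_seconds)  # give the API a moment before the next batch
    return results

# Toy usage: a stand-in fetch that just upper-cases station names.
stations = ['Acton Town', 'Acton Central', 'Aldgate', 'Aldgate East',
            'Alperton', 'Wimbledon', 'Wood Green']
demo = fetch_in_batches(stations, lambda batch: [s.upper() for s in batch],
                        batch_size=5, pause_seconds=0)
print(len(demo))
```

In the notebook, `rows` could be `df` itself and `fetch` a small wrapper that passes the `Name`, `Latitude` and `Longitude` columns of each slice to `getNearbyVenues`.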


Convert to dataframe

In [142]:
df2 = DataFrame(result,columns=['Name','Latitude','Longitude','Arts & Entertainment', 'College & University','Event','Food','Nightlife Spot','Outdoors & Recreation','Professional & Other Places','Residence','Shop & Service','Travel & Transport'])
df2

Unnamed: 0,Name,Latitude,Longitude,Arts & Entertainment,College & University,Event,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Acton Town,51.502500,-0.278126,1,2,0,8,2,2,6,5,2,6
1,Acton Central,51.50883531,-0.263033174,4,0,0,6,5,9,6,0,13,3
2,Aldgate,51.51394,-0.07537,11,29,0,94,66,40,64,9,43,52
3,Aldgate East,51.51514,-0.07178,9,33,0,100,45,28,45,11,47,37
4,Alperton,51.54097,-0.30061,0,4,0,7,3,7,5,2,9,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,Wimbledon,51.42200,-0.20544,5,2,0,52,16,7,39,1,61,22
286,Wimbledon Park,51.43391,-0.19864,0,0,0,8,1,4,7,0,4,3
287,Wood Green,51.59709,-0.10939,4,2,0,31,11,6,23,6,47,10
288,Woodford,51.60582,+0.03328,0,0,0,8,2,5,3,0,10,3


Saving the dataframe, since it took ages to send those Foursquare queries 5 at a time.

In [111]:
from project_lib import Project
from pyspark import SparkFiles as sc
project = Project(sc,"0fae3d18-58f3-466c-8a24-183728810356","p-c5c6fdd6d62d9eb686a45ffcc5d9a4e79700b7ce")
project.save_data(file_name = "LondonUndergroundDF.csv",data = df2.to_csv(index=False))

2020-10-27 00:02:52,329 - __PROJECT_LIB__ - ERROR - failed to initialize ibmos2spark integration
Traceback (most recent call last):
  File "/opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages/project_lib/storage/bcos.py", line 138, in _initialize_bcos2spark
    import ibmos2spark
ModuleNotFoundError: No module named 'ibmos2spark'


{'file_name': 'LondonUndergroundDF.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'datasciencecourse-donotdelete-pr-osiwcs4hziaax0',
 'asset_id': '9fc46801-1cc5-4da8-b6ee-20a53af91579'}
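
The `project_lib` call above is specific to the hosted notebook environment. Outside it, a minimal sketch of the same checkpoint using plain pandas, writing the table to a local CSV and reading it back. The frame `sample` is a two-row stand-in for `df2`, and the temporary directory is only there to keep the example tidy:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df2, using two rows from the table above.
sample = pd.DataFrame(
    [['Acton Town', 51.5025, -0.278126, 8], ['Aldgate', 51.51394, -0.07537, 94]],
    columns=['Name', 'Latitude', 'Longitude', 'Food'])

# Written to a temporary directory here; in practice any local path would do.
path = os.path.join(tempfile.mkdtemp(), 'LondonUndergroundDF.csv')
sample.to_csv(path, index=False)

# Reload later without repeating the Foursquare queries.
reloaded = pd.read_csv(path)
print(reloaded.shape)
```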

Cluster the stations

In [112]:
from sklearn.cluster import KMeans
import numpy as np

In [117]:
df3 = df2[['Arts & Entertainment', 'College & University','Event','Food','Nightlife Spot','Outdoors & Recreation','Professional & Other Places','Residence','Shop & Service','Travel & Transport']]
df3.head()

Unnamed: 0,Arts & Entertainment,College & University,Event,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,1,2,0,8,2,2,6,5,2,6
1,4,0,0,6,5,9,6,0,13,3
2,11,29,0,94,66,40,64,9,43,52
3,9,33,0,100,45,28,45,11,47,37
4,0,4,0,7,3,7,5,2,9,4


In [136]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(df3)
kmeans.labels_

array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 2, 0, 0, 2, 2,
       0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 2, 1, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0], dtype=int32)
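
The labels 0, 1 and 2 do not by themselves say which cluster is residential, leisure or business. One way to interpret them is to tabulate the cluster centres per category. A minimal sketch, using the five rows of `df3` shown above as a stand-in for the full table and two clusters instead of three because the sample is so small:

```python
import pandas as pd
from sklearn.cluster import KMeans

columns = ['Arts & Entertainment', 'College & University', 'Event', 'Food', 'Nightlife Spot',
           'Outdoors & Recreation', 'Professional & Other Places', 'Residence',
           'Shop & Service', 'Travel & Transport']

# The five rows of df3 shown above, used here as a stand-in for the full table.
sample = pd.DataFrame([
    [1, 2, 0, 8, 2, 2, 6, 5, 2, 6],
    [4, 0, 0, 6, 5, 9, 6, 0, 13, 3],
    [11, 29, 0, 94, 66, 40, 64, 9, 43, 52],
    [9, 33, 0, 100, 45, 28, 45, 11, 47, 37],
    [0, 4, 0, 7, 3, 7, 5, 2, 9, 4],
], columns=columns)

# Two clusters only because the sample is tiny; the notebook uses three.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(sample)

# One row per cluster, one column per category: the average venue counts.
centres = pd.DataFrame(model.cluster_centers_, columns=columns).round(1)
print(centres)
print(centres.idxmax(axis=1))  # category with the highest average count in each cluster
```

Because the features are raw counts, the centres may mostly separate busy areas from quiet ones. Dividing each row by its total before clustering is one option for comparing the mix of venues instead.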

In [146]:
colours = ['yellow', 'orange', 'red']
Latitude = 51.5074
Longitude = -0.1278
# centre the map on London
map_london = folium.Map(location=[Latitude, Longitude], zoom_start=10)

# add markers to map, coloured by cluster label
for lat, lng, name, cluster in zip(df2['Latitude'], df2['Longitude'], df2['Name'], kmeans.labels_):
    label = folium.Popup('{}'.format(name), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colours[cluster],
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)
    
map_london