# Ron Daniel Analysis of East London Suburbs with Parks and Amenities

## Introduction

The Covid crisis has highlighted the need for easy access to open spaces, especially for people who live in apartments and may not have a private garden. As a result, many are looking to move to neighbourhoods and suburbs with parks and leisure facilities, particularly ones within walking distance. In London, there has always been a migration from the city centre towards the suburbs. Young couples and recent immigrants often start off in apartments near the city centre, where night life and pubs are abundant; as they mature, they seek areas where houses with gardens are more plentiful and affordable, and where the quality of schooling is often better. Covid has accelerated this trend.

This project aims to identify which boroughs and suburbs in East London are the most attractive in terms of parks and open spaces. I have chosen East London because migration from the centre of London is often towards the east, where housing is relatively cheap (the 'East End' was heavily bombed during the Second World War) compared to the wealthier north, west and south of London. I also live in East London, so it is of personal interest to me.


## Data Sources

In this project, I plan to scrape a list of London suburbs from Wikipedia, filter it down to the suburbs within the eight London boroughs in E/NE London, and use the Foursquare API to get venue information for those suburbs. I will then use KMeans to cluster the suburbs by venue profile to identify which areas have the most parks and open spaces.

## 1. Import Libraries

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 


import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


## 2. Scrape London Suburbs data from Wikipedia

In [6]:
## Visit the page and see the attributes of the desired table in page source code.  In Firefox code inspection mode can be enabled with Ctrl+Shift+C

## Get page into Python


import requests

url_="https://en.wikipedia.org/wiki/List_of_areas_of_London"
class_="wikitable"

response=requests.get(url_)
if response.status_code == 200:
    print('Successfully loaded {}'.format(url_))
else:
    raise Exception('Experienced a problem while loading {url} : {code}'.format(url=url_, code=response.status_code))

Successfully loaded https://en.wikipedia.org/wiki/List_of_areas_of_London


In [7]:
## import beautifulsoup for parsing HTML
from bs4 import BeautifulSoup as bs

# Parse the response.text with an html parser
# lxml is one of the faster parsers out there, but you may use 'html.parser' as well
soup = bs(response.text, 'lxml')

# This line finds every table in the entire webpage (no class filter is applied yet).
# all_tables will be a list of all matches
all_tables = soup.find_all('table')

print('Found {} objects of attribute type \'table\''.format(len(all_tables)))

Found 5 objects of attribute type 'table'


In [8]:
## Find our table with class 'wikitable'
table = soup.find_all('table', class_=class_)
print('Found {} objects of attribute type \'table\' and class \'{}\''.format(len(table), class_))

Found 1 objects of attribute type 'table' and class 'wikitable'


## 3. Take html into pandas dataframe

In [10]:
# Automatic parsing via Pandas. NOTE: read_html returns a list of DataFrames, not a single DataFrame
df_list = pd.read_html(str(table))

# So we take the first (and only) table as our dataframe:
df = pd.DataFrame(df_list[0])
# Preview the first few rows
display(df.head())

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


## 4. Extract lat/Longs using Nominatim and a function to apply to each row of the dataframe

In [12]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

locator = Nominatim(user_agent="myGeocoder")
# We are using RateLimiter to not overload the public service.
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

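def get_geo(location):
    """Return (latitude, longitude) for a place name, or 'No location' if nothing is found."""
    try:
        geo = geocode(location)
    except Exception:
        print("Error! Connection failed for location {}!".format(location))
        # Without this, 'geo' would be undefined below and raise a NameError
        geo = None
    if geo:
        return geo.latitude or 'Lat Not Found', geo.longitude or 'Long Not Found'
    else:
        # THIS SPECIFIC GEOCODER FOUND NO COORDINATES BY THAT ADDRESS
        return "No location"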

In [13]:
## Test function with location for London

get_geo('London')

(51.5073219, -0.1276474)
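
`get_geo` returns a tuple on success but a string on failure, which is why the coordinates later have to be untangled with string operations. The following is a minimal sketch of a wrapper that always returns a pair, under the assumption that the geocoder returns either `None` or an object with `latitude` and `longitude` attributes. `safe_geocode`, `FakeLocation` and `fake_geocode` are hypothetical names used only for illustration; the fake geocoder stands in for the rate-limited Nominatim call so the example runs without network access.

```python
def safe_geocode(geocode_fn, query):
    """Call geocode_fn(query) and always return a (lat, lon) pair.

    Returns (None, None) when the lookup raises or finds nothing, so every
    row ends up with the same shape of result.
    """
    try:
        result = geocode_fn(query)
    except Exception as exc:  # e.g. a timeout raised by the geocoding service
        print("Lookup failed for {}: {}".format(query, exc))
        return (None, None)
    if result is None:
        return (None, None)
    return (result.latitude, result.longitude)


# Stand-in geocoder (hypothetical) used only to demonstrate the wrapper.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    if query == "London":
        return FakeLocation(51.5073219, -0.1276474)
    if query == "boom":
        raise RuntimeError("connection failed")
    return None


print(safe_geocode(fake_geocode, "London"))
print(safe_geocode(fake_geocode, "Nowhere"))
```

In the notebook, the first argument would be the `geocode` object built with `RateLimiter` above.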

In [14]:
## Define search name for all locations
df['SearchLocation'] = df["Post town"] + ", " + df['Location']
df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref,SearchLocation
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785,"LONDON, Abbey Wood"
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805,"LONDON, Acton"
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645,"CROYDON, Addington"
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665,"CROYDON, Addiscombe"
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728,"BEXLEY, SIDCUP, Albany Park"


In [15]:
## Let's test getting locations on a smaller dataframe:

# Take an explicit copy so that adding a column does not trigger a SettingWithCopyWarning
df_test = df[:5].copy()

df_test['Geodata'] = df_test['SearchLocation'].apply(get_geo)
df_test.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref,SearchLocation,Geodata
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785,"LONDON, Abbey Wood","(51.487621, 0.1140504)"
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805,"LONDON, Acton","(51.5081402, -0.2732607)"
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645,"CROYDON, Addington","(44.4206405, -76.978248)"
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665,"CROYDON, Addiscombe","(51.3796916, -0.0742821)"
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728,"BEXLEY, SIDCUP, Albany Park","(51.4353837, 0.1259653)"


In [16]:
## Get locations for all suburbs
df['Geodata'] = df['SearchLocation'].apply(get_geo)
df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref,SearchLocation,Geodata
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785,"LONDON, Abbey Wood","(51.487621, 0.1140504)"
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805,"LONDON, Acton","(51.5081402, -0.2732607)"
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645,"CROYDON, Addington","(44.4206405, -76.978248)"
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665,"CROYDON, Addiscombe","(51.3796916, -0.0742821)"
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728,"BEXLEY, SIDCUP, Albany Park","(51.4353837, 0.1259653)"
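
In the output above, "CROYDON, Addington" resolves to (44.4206405, -76.978248), which is far from London, so some lookups are clearly matching places of the same name elsewhere. A rough sanity check can flag such rows before they reach the map. This is a sketch only: the bounding box limits below are loose values assumed for illustration, not official boundary figures, and `looks_like_london` is a hypothetical helper.

```python
# Rough bounding box for Greater London. These limits are an assumption
# chosen for illustration, not official boundary values.
LAT_MIN, LAT_MAX = 51.2, 51.8
LON_MIN, LON_MAX = -0.6, 0.4


def looks_like_london(lat, lon):
    """Return True if the point falls inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Two results copied from the table above.
results = {
    "Addington": (44.4206405, -76.978248),
    "Addiscombe": (51.3796916, -0.0742821),
}

suspect = [name for name, (lat, lon) in results.items()
           if not looks_like_london(lat, lon)]
print(suspect)  # ['Addington']
```

Rows flagged this way could be re-queried with a more specific search string, or dropped.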


In [17]:
## Check size of dataframe
df.shape

(531, 8)

In [18]:
## examine data for filtering down to East London
df["Post town"].unique()

array(['LONDON', 'CROYDON', 'BEXLEY, SIDCUP', 'ILFORD', 'WEMBLEY',
       'WESTERHAM', 'HORNCHURCH', 'BARNET, LONDON', 'BARKING',
       'BEXLEYHEATH', 'DARTFORD', 'LONDON, BARNET', 'BARNET',
       'BECKENHAM, LONDON', 'LONDON, BARKING', 'DAGENHAM',
       'WALLINGTON, CROYDON', 'HARROW, STANMORE', 'SUTTON', 'BELVEDERE',
       'SURBITON', 'BEXLEY', 'BEXLEYHEATH, LONDON', 'BROMLEY', 'SIDCUP',
       'ENFIELD', 'BRENTFORD', 'EDGWARE', 'CARSHALTON', 'ROMFORD',
       'SUTTON/MERTON', 'ORPINGTON', 'CHESSINGTON', 'CHISLEHURST',
       'ERITH', 'WEST WICKHAM', 'KINGSTON UPON THAMES', 'COULSDON',
       'UXBRIDGE', 'HOUNSLOW', 'UPMINSTER', 'SEVENOAKS', 'FELTHAM',
       'WELLING', 'PINNER', 'BECKENHAM', 'TWICKENHAM', 'LONDON, WELLING',
       'TEDDINGTON, HAMPTON', 'GREENFORD', 'CHIGWELL', 'RICHMOND',
       'HAMPTON', 'HAYES', 'WEST DRAYTON', 'HARROW', 'ISLEWORTH',
       'KENLEY', 'HARROW, LONDON', 'KESTON', 'LONDON, SIDCUP', 'MORDEN',
       'MITCHAM', 'NEW MALDEN', 'NORTHOLT', 'NORTHWOO

In [19]:
## examine data for filtering down to East London subset
for col in df:
    print(df[col].unique())

['Abbey Wood' 'Acton' 'Addington' 'Addiscombe' 'Albany Park'
 'Aldborough Hatch' 'Aldgate' 'Aldwych' 'Alperton' 'Anerley' 'Angel'
 'Aperfield' 'Archway' 'Ardleigh Green' 'Arkley' 'Arnos Grove' 'Balham'
 'Bankside' 'Barbican' 'Barking' 'Barkingside' 'Barnehurst' 'Barnes'
 'Barnes Cray' 'Barnet Gate' 'Barnet (also Chipping Barnet, High Barnet)'
 'Barnsbury' 'Battersea' 'Bayswater' 'Beckenham' 'Beckton' 'Becontree'
 'Becontree Heath' 'Beddington' 'Bedford Park' 'Belgravia' 'Bellingham'
 'Belmont' 'Belsize Park' 'Belvedere' 'Bermondsey' 'Berrylands'
 'Bethnal Green' 'Bexley (also Old Bexley, Bexley Village)'
 'Bexleyheath (also Bexley New Town)' 'Bickley' 'Biggin Hill' 'Blackfen'
 'Blackfriars' 'Blackheath' 'Blackheath Royal Standard' 'Blackwall'
 'Blendon' 'Bloomsbury' 'Botany Bay' 'Bounds Green' 'Bow' 'Bowes Park'
 'Brentford' 'Brent Cross' 'Brent Park' 'Brimsdown' 'Brixton' 'Brockley'
 'Bromley' 'Bromley (also Bromley-by-Bow)' 'Bromley Common' 'Brompton'
 'Brondesbury' 'Brunswick Park' 

## 5. Clean data; drop unwanted columns and slice to boroughs in East London north of the river Thames

In [21]:
## Rename the columns to short names without spaces, then drop the ones that are no longer needed
df.columns = ['Location', 'borough', 'post_town', 'post_district', 'dial_code', 'grid', 'SearchLocation', 'Geodata']
df.drop(columns=['post_town', 'post_district', 'dial_code', 'grid', 'SearchLocation'], inplace=True)

In [22]:
## List Columns in dataframe
df.columns

Index(['Location', 'borough', 'Geodata'], dtype='object')

In [23]:
## examine data for filtering down to East London
df["borough"].unique()

array(['Bexley, Greenwich [7]', 'Ealing, Hammersmith and Fulham[8]',
       'Croydon[8]', 'Bexley', 'Redbridge[9]', 'City[10]',
       'Westminster[10]', 'Brent[11]', 'Bromley[11]', 'Islington[8]',
       'Islington[12]', 'Havering[12]', 'Barnet[12]', 'Enfield[12]',
       'Wandsworth[13]', 'Southwark[14]', 'City[14]',
       'Barking and Dagenham[14]', 'Redbridge[15]', 'Bexley[15]',
       'Richmond upon Thames[15]', 'Bexley[16]', 'Barnet', 'Barnet[16]',
       'Islington[17]', 'Wandsworth[18]', 'Westminster[19]',
       'Bromley[20]', 'Newham[20]', 'Barking and Dagenham[20]',
       'Barking and Dagenham[21]', 'Sutton[21]', 'Ealing[21]',
       'Westminster[22]', 'Lewisham[22]', 'Harrow[22]', 'Sutton[22]',
       'Camden[23]', 'Bexley[23]', 'Southwark[24]',
       'Kingston upon Thames[24]', 'Tower Hamlets[25]', 'Bexley[25]',
       'Bexley[26]', 'Bromley[26]', 'Bexley[27]', 'City[27]',
       'Lewisham[28]', 'Greenwich', 'Tower Hamlets[28]', 'Camden[29]',
       'Enfield[30]', 'Hari
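
The borough values above still carry Wikipedia footnote markers such as `[8]`, and some cells list more than one borough. The substring filter used later tolerates this, but the names could also be normalised first. A minimal sketch, assuming a plain regular expression is enough for these markers; `clean_borough` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches a footnote marker such as "[8]" or " [7]" (optional leading space).
FOOTNOTE = re.compile(r"\s*\[\d+\]")


def clean_borough(raw):
    """Return the list of borough names held in one raw 'borough' cell."""
    without_notes = FOOTNOTE.sub("", raw)
    return [name.strip() for name in without_notes.split(",") if name.strip()]


print(clean_borough("Bexley, Greenwich [7]"))
print(clean_borough("Ealing, Hammersmith and Fulham[8]"))
print(clean_borough("Croydon[8]"))
```

With pandas, the same function could be applied through `df['borough'].apply(clean_borough)`.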

In [24]:
df['Geodata']=df['Geodata'].astype(str)
df['Geodata'].dtype

dtype('O')

In [25]:
## Separate 'Geodata' column into 'latitude' and 'longitude' columns
df['Geodata']=df['Geodata'].str.replace(" ","").str.strip('(').str.strip(')')
df['latitude']=df['Geodata'].str.split(',').str[0]
df['longitude']=df['Geodata'].str.split(',').str[1]
df.head(10)

Unnamed: 0,Location,borough,Geodata,latitude,longitude
0,Abbey Wood,"Bexley, Greenwich [7]","51.487621,0.1140504",51.487621,0.1140504
1,Acton,"Ealing, Hammersmith and Fulham[8]","51.5081402,-0.2732607",51.5081402,-0.2732607
2,Addington,Croydon[8],"44.4206405,-76.978248",44.4206405,-76.978248
3,Addiscombe,Croydon[8],"51.3796916,-0.0742821",51.3796916,-0.0742821
4,Albany Park,Bexley,"51.4353837,0.1259653",51.4353837,0.1259653
5,Aldborough Hatch,Redbridge[9],Nolocation,Nolocation,
6,Aldgate,City[10],"51.5142477,-0.0757186",51.5142477,-0.0757186
7,Aldwych,Westminster[10],"51.5131312,-0.1175934",51.5131312,-0.1175934
8,Alperton,Brent[11],"51.5408036,-0.3000963",51.5408036,-0.3000963
9,Anerley,Bromley[11],"51.4075993,-0.0619394",51.4075993,-0.0619394
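
The table above shows the side effect of splitting the coordinates as strings: the "No location" placeholder ends up in the latitude column as `Nolocation`. A small parser that returns numbers or nothing is one possible alternative. This is a sketch under the assumption that each value is either a string like `(51.487621, 0.1140504)` or the placeholder text; `parse_geodata` is a hypothetical helper.

```python
def parse_geodata(text):
    """Turn a string such as '(51.487621, 0.1140504)' into a pair of floats.

    Returns (None, None) when the text does not hold two numbers,
    for example the 'No location' placeholder.
    """
    parts = text.strip().strip("()").split(",")
    if len(parts) != 2:
        return (None, None)
    try:
        return (float(parts[0]), float(parts[1]))
    except ValueError:
        return (None, None)


print(parse_geodata("(51.487621, 0.1140504)"))
print(parse_geodata("No location"))
```

Because the result is already numeric, the later `astype(float)` conversion would not be needed for values parsed this way.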


In [26]:
## Strip dataframe to boroughs in East London near where I live
ELondon_df = df[df["borough"].str.contains("Redbridge|City|Havering|Barking and Dagenham|Tower Hamlets|Haringey|Waltham Forest|Hackney")].reset_index(drop=True)
ELondon_df.head()

Unnamed: 0,Location,borough,Geodata,latitude,longitude
0,Aldborough Hatch,Redbridge[9],Nolocation,Nolocation,
1,Aldgate,City[10],"51.5142477,-0.0757186",51.5142477,-0.0757186
2,Ardleigh Green,Havering[12],"51.5712468,0.2190799",51.5712468,0.2190799
3,Barbican,City[14],"51.5201501,-0.0986832",51.5201501,-0.0986832
4,Barking,Barking and Dagenham[14],"51.5402677,0.0793235",51.5402677,0.0793235


In [27]:
## Drop 'Geodata' column
ELondon_df.drop(columns=['Geodata'], axis=1, inplace=True)
ELondon_df.head(20)

Unnamed: 0,Location,borough,latitude,longitude
0,Aldborough Hatch,Redbridge[9],Nolocation,
1,Aldgate,City[10],51.5142477,-0.0757186
2,Ardleigh Green,Havering[12],51.5712468,0.2190799
3,Barbican,City[14],51.5201501,-0.0986832
4,Barking,Barking and Dagenham[14],51.5402677,0.0793235
5,Barkingside,Redbridge[15],51.581935349999995,0.0700570832366836
6,Becontree,Barking and Dagenham[20],51.5403111,0.1265241
7,Becontree Heath,Barking and Dagenham[21],51.5610299,0.1478793
8,Bethnal Green,Tower Hamlets[25],51.5303456,-0.0561633
9,Blackfriars,City[27],51.5115854,-0.1037671


In [28]:
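## Re-create the East London subset from df, this time without Haringey (this also restores the 'Geodata' column)
ELondon_df = df[df["borough"].str.contains("Redbridge|City|Havering|Barking and Dagenham|Tower Hamlets|Waltham Forest|Hackney")].reset_index(drop=True)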

In [29]:
## Make copy of Dataframe
ELondondf_new = ELondon_df.copy()

In [30]:
## Save dataframe to csv file for future use
ELondon_df.to_csv('C:/Users/rbd63/Documents/Jobs Stuff/Jobs Stuff/Data Science/IBM Data Science/09_Applied Data Science Capstone/ELondon_df.csv')

In [31]:
ELondon_df.describe()

Unnamed: 0,Location,borough,Geodata,latitude,longitude
count,97,97,97,97,94.0
unique,97,25,91,91,90.0
top,Temple,Havering,Nolocation,Nolocation,0.0689113
freq,1,20,3,3,2.0


In [32]:
## Drop rows with NaN values
ELondon_clean = ELondondf_new.dropna()
ELondon_clean = ELondon_clean.reset_index(drop=True)
ELondon_clean.describe()

Unnamed: 0,Location,borough,Geodata,latitude,longitude
count,94,94,94,94.0,94.0
unique,94,23,90,90.0,90.0
top,Temple,Havering,"51.5589375,0.0689113",51.5589375,0.0689113
freq,1,19,2,2.0,2.0


In [33]:
## Drop rows with NaN values from ELondon_df
ELondon_df2 = ELondon_df.dropna()
ELondon_df2 = ELondon_df2.reset_index(drop=True)
ELondon_df2.describe()

Unnamed: 0,Location,borough,Geodata,latitude,longitude
count,94,94,94,94.0,94.0
unique,94,23,90,90.0,90.0
top,Temple,Havering,"51.5589375,0.0689113",51.5589375,0.0689113
freq,1,19,2,2.0,2.0


In [34]:
ELondon_df2.head()

Unnamed: 0,Location,borough,Geodata,latitude,longitude
0,Aldgate,City[10],"51.5142477,-0.0757186",51.5142477,-0.0757186
1,Ardleigh Green,Havering[12],"51.5712468,0.2190799",51.5712468,0.2190799
2,Barbican,City[14],"51.5201501,-0.0986832",51.5201501,-0.0986832
3,Barking,Barking and Dagenham[14],"51.5402677,0.0793235",51.5402677,0.0793235
4,Barkingside,Redbridge[15],"51.581935349999995,0.07005708323668366",51.58193535,0.0700570832366836


In [35]:
## Drop 'Geodata' column
ELondon_df2.drop(columns=['Geodata'], axis=1, inplace=True)
ELondon_df2.head(20)

Unnamed: 0,Location,borough,latitude,longitude
0,Aldgate,City[10],51.5142477,-0.0757186
1,Ardleigh Green,Havering[12],51.5712468,0.2190799
2,Barbican,City[14],51.5201501,-0.0986832
3,Barking,Barking and Dagenham[14],51.5402677,0.0793235
4,Barkingside,Redbridge[15],51.58193535,0.0700570832366836
5,Becontree,Barking and Dagenham[20],51.5403111,0.1265241
6,Becontree Heath,Barking and Dagenham[21],51.5610299,0.1478793
7,Bethnal Green,Tower Hamlets[25],51.5303456,-0.0561633
8,Blackfriars,City[27],51.5115854,-0.1037671
9,Blackwall,Tower Hamlets[28],51.5079378,-0.0071843


In [36]:
## Save dataframe to csv file for future use
ELondon_df2.to_csv('C:/Users/rbd63/Documents/Jobs Stuff/Jobs Stuff/Data Science/IBM Data Science/09_Applied Data Science Capstone/ELondon_df2.csv')

In [37]:
ELondon_df2['latitude'].dtype

dtype('O')

## 6. Visualise London Suburbs Data on Folium Map


In [38]:
## convert lat/long data from string to float
ELondon_df2['latitude'] = ELondon_df2['latitude'].astype(float)
ELondon_df2['longitude'] = ELondon_df2['longitude'].astype(float)

In [39]:
# Use geopy library to get the latitude and longitude values of London.
address = 'London'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


In [40]:
# Create a map of London with suburbs superimposed on top

# create map of London using latitude and longitude values
map_London = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, location, borough in zip(ELondon_df2['latitude'], ELondon_df2['longitude'], ELondon_df2['Location'], ELondon_df2['borough']):
    label = '{}, {}'.format(location, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_London)  
    
map_London

In [4]:
## Define Foursquare Credentials and Version
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

{
    "tags": [
        "remove-cell",
    ]
}

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


{'tags': ['remove-cell']}
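
Since the credentials are blanked out in the cell above, one way to keep them out of the notebook altogether is to read them from environment variables. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical, chosen for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID", "")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "abc", "FOURSQUARE_CLIENT_SECRET": "xyz"}
print(load_credentials(demo_env))
```

Failing early with a clear message also avoids sending requests with empty credentials.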

## 7. Get venues data for East London suburbs and analyse

In [42]:
## create a function to get venues data for London suburbs, within 1000m

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Suburb', 
                  'Suburb Latitude', 
                  'Suburb Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [43]:
# create a new dataframe called London_venues
London_venues = getNearbyVenues(names=ELondon_df2['Location'],
                                   latitudes=ELondon_df2['latitude'],
                                   longitudes=ELondon_df2['longitude']
                                  )

Aldgate


KeyError: 'groups'
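
The `KeyError: 'groups'` means the decoded response had no `groups` entry under `response`, most likely because the request was rejected (the credentials cell above is blank). `getNearbyVenues` indexes straight into the payload, so a single bad reply stops the whole loop. The following is a minimal sketch of a more defensive extraction step. It assumes only the payload shape that `getNearbyVenues` already relies on; `extract_venue_rows` is a hypothetical helper, and the sample payloads are made up for the demonstration.

```python
def extract_venue_rows(payload, suburb, lat, lng):
    """Build (suburb, lat, lng, venue, venue_lat, venue_lng, category) rows
    from one decoded JSON response; return [] if the expected keys are missing."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        # Probably an error reply; 'meta' (if present) may say why.
        print("No venues for {}: {}".format(suburb, payload.get("meta")))
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        # Some venues may have no category; label them rather than crash.
        categories = venue.get("categories") or [{"name": "Unknown"}]
        rows.append((suburb, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows


# Made-up payloads: one well-formed, one resembling a rejected request.
good = {"response": {"groups": [{"items": [
    {"venue": {"name": "Some Park",
               "location": {"lat": 51.51, "lng": -0.07},
               "categories": [{"name": "Park"}]}}]}]}}
bad = {"meta": {"code": 400}, "response": {}}

print(extract_venue_rows(good, "Aldgate", 51.5142477, -0.0757186))
print(extract_venue_rows(bad, "Aldgate", 51.5142477, -0.0757186))
```

Inside the loop, the rows returned for each suburb could be appended to `venues_list` in place of the list comprehension.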

In [None]:
# Check the size of the resulting dataframe
print(London_venues.shape)
London_venues.head(20)

In [None]:
# Check how many venues were returned for each suburb
London_venues.groupby('Suburb').count()

In [None]:
# Find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(London_venues['Venue Category'].unique())))

In [None]:
# one hot encoding
London_onehot = pd.get_dummies(London_venues[['Venue Category']], prefix="", prefix_sep="")

# add suburb column back to dataframe
London_onehot['Suburb'] = London_venues['Suburb']

# move suburb column to the first column
fixed_columns = [London_onehot.columns[-1]] + list(London_onehot.columns[:-1])
London_onehot = London_onehot[fixed_columns]

London_onehot.shape

In [None]:
## group rows by suburb, taking the mean of the frequency of occurrence of each category
London_grouped = London_onehot.groupby('Suburb').mean().reset_index()
London_grouped
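
To make the grouping step concrete: the mean of a one-hot column within a suburb is simply the share of that suburb's venues belonging to the category. The sketch below reproduces that calculation in plain Python on a few made-up (suburb, category) pairs; the data is hypothetical and only stands in for `London_venues`.

```python
from collections import Counter, defaultdict

# Hypothetical (suburb, venue category) pairs standing in for London_venues.
venues = [
    ("Aldgate", "Pub"), ("Aldgate", "Park"), ("Aldgate", "Pub"), ("Aldgate", "Gym"),
    ("Barking", "Park"), ("Barking", "Park"),
]

# Count categories per suburb.
counts = defaultdict(Counter)
for suburb, category in venues:
    counts[suburb][category] += 1

# Share of each category within a suburb: the same number that the mean
# of the one-hot column gives for that suburb.
frequencies = {
    suburb: {cat: n / sum(c.values()) for cat, n in c.items()}
    for suburb, c in counts.items()
}

print(frequencies["Aldgate"]["Pub"])   # 0.5
print(frequencies["Barking"]["Park"])  # 1.0
```

A suburb with a high value in the `Park` column therefore has parks as a large fraction of its listed venues, not necessarily a large number of parks.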

In [None]:
# Let's confirm the new size
London_grouped.shape

In [None]:
# Let's print each suburb along with the top 5 most common venues
num_top_venues = 5

for hood in London_grouped['Suburb']:
    print("----"+hood+"----")
    temp = London_grouped[London_grouped['Suburb'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# Put that into a pandas dataframe
# First, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
# Create the new dataframe and display the top 10 venues for each suburb.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Suburb']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
suburbs_venues_sorted = pd.DataFrame(columns=columns)
suburbs_venues_sorted['Suburb'] = London_grouped['Suburb']

for ind in np.arange(London_grouped.shape[0]):
    suburbs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

suburbs_venues_sorted.head()
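
The column labels above rely on a three-item suffix list plus a fallback for everything else, which works for 1 to 10 but would produce "21th" for larger values. A small helper that handles the general case is sketched below; `ordinal` is a hypothetical name, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Suburb"] + ["{} Most Common Venue".format(ordinal(i))
                        for i in range(1, 11)]
print(columns[:4])
```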

## 8. Cluster Suburbs using KMeans

In [None]:
# set number of clusters
kclusters = 4

London_grouped_clustering = London_grouped.drop(columns='Suburb')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(London_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
## Find best value of k for clustering, using elbow method
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

# determine best k
distortions = []
K = range(1,10)
for k in K:
    # fit once per k (calling .fit() a second time only repeats the work)
    kmeanModel = KMeans(n_clusters=k).fit(London_grouped_clustering)
    # distortion = mean distance from each suburb to its nearest cluster centre
    distortions.append(sum(np.min(cdist(London_grouped_clustering, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / London_grouped_clustering.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing optimum value for k')
plt.show()

## It is difficult to make out a clear elbow in the above plot, but the gradient changes at k=3 and k=5, so I will use 4 as the value of k
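
For readers unfamiliar with the quantity being plotted, the distortion is the mean distance from each point to its nearest cluster centre. The sketch below computes it in plain Python on made-up 2-D points with hand-picked centres (no clustering is run), just to show why the value falls as centres are added. `distortion` is a hypothetical helper, and `math.dist` requires Python 3.8 or later.

```python
import math


def distortion(points, centres):
    """Mean Euclidean distance from each point to its nearest centre."""
    return sum(min(math.dist(p, c) for c in centres) for p in points) / len(points)


# Hypothetical 2-D points forming two tight groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

one_centre = [(5.0, 0.5)]                 # a single centre between the groups
two_centres = [(0.0, 0.5), (10.0, 0.5)]   # one centre per group

print(distortion(points, one_centre))   # large: one centre must cover both groups
print(distortion(points, two_centres))  # 0.5: every point is 0.5 from its centre
```

The elbow is the value of k after which adding further centres no longer reduces this number by much.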

## create a new dataframe that includes the cluster label as well as the top 10 venues for each suburb

In [None]:
## add cluster labels (the column name must match the one used in the cells below)
suburbs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

London_merged = ELondon_df2

# merge suburbs_venues_sorted with ELondon_df2 to add latitude/longitude for each suburb
London_merged2 = London_merged.join(suburbs_venues_sorted.set_index('Suburb'), on='Location')

In [None]:
London_merged2.describe()

In [None]:
## Drop suburbs that have no cluster label because no venues were returned for them
London_merged3 = London_merged2.dropna().copy()
London_merged3.shape

In [None]:
## The join leaves the labels as floats wherever NaNs were present, so convert them back to int
London_merged3['Cluster Labels'] = London_merged3['Cluster Labels'].astype(int)
London_merged3.head()

## 9. Visualise resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(London_merged3['latitude'], London_merged3['longitude'], London_merged3['Location'], London_merged3['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 10. Examine Clusters

Now I examine each cluster and determine the discriminating venue categories that distinguish it. Based on the defining categories, I assign a name to each cluster.

### Cluster 0

In [None]:
London_merged3.loc[London_merged3['Cluster Labels'] == 0, London_merged3.columns[[0, 1] + list(range(5, London_merged3.shape[1]))]]

### Cluster 1

In [None]:
London_merged3.loc[London_merged3['Cluster Labels'] == 1, London_merged3.columns[[0, 1] + list(range(5, London_merged3.shape[1]))]]

### Cluster 2

In [None]:
London_merged3.loc[London_merged3['Cluster Labels'] == 2, London_merged3.columns[[0, 1] + list(range(5, London_merged3.shape[1]))]]

### Cluster 3

In [None]:
London_merged3.loc[London_merged3['Cluster Labels'] == 3, London_merged3.columns[[0, 1] + list(range(5, London_merged3.shape[1]))]]

## End of Analysis