# How Similar are Neighbourhoods in London, Sydney and New York?
### Aaron Armour

We begin by importing all of the Python modules which we will use in this notebook.

In [1]:
# Module imports
import requests
from bs4 import BeautifulSoup
import re
from geopy.geocoders import Nominatim
import pandas as pd
from sklearn.cluster import KMeans
# Install folium if it is not already available (comment out the line below otherwise)
!pip install folium
import folium
import numpy as np
from scipy.optimize import linear_sum_assignment

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


## Obtain lists of neighbourhoods from Wikipedia

#### Construct a list of neighbourhoods for London

In [2]:
# Has table containing boroughs and whether they are inner or outer:
wp_london_b_url = 'https://en.wikipedia.org/wiki/List_of_places_in_London'
# Has table of neighbourhoods:
wp_london_n_url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'

response = requests.get(wp_london_b_url)
london_b_data = response.content

response = requests.get(wp_london_n_url)
london_n_data = response.content

Define a function which will help us to extract data from tables in HTML.

In [3]:
# A function to help with extracting data out of a table in HTML

def table_extractor(table, columns_to_keep, skip_first=True):
    data = []
    
    for i, row in enumerate(table.children):
        if i == 0 and skip_first:
            # skip first row
            continue
        if row.name == 'tr':
            data.append(tuple([item.string for j, item in enumerate(row.find_all('td')) if j in columns_to_keep]))
            #data.append(tuple([item for j, item in enumerate(row.find_all('td')) if j in columns_to_keep]))

    return data

In [4]:
# Beautiful Soup seems to have problems when opening and closing tags are separated by a line break.
# So let's get rid of those pesky line breaks.
london_b_data = london_b_data.replace(b'\n</td>', b'</td>')

soup = BeautifulSoup(london_b_data)
table = soup.find('tbody')

inner_boroughs = []
for borough, category in table_extractor(table, [1,2]):
    if category.startswith('Inner'):
        inner_boroughs.append(borough)

print('Inner boroughs of London are:')
print(inner_boroughs)

Inner boroughs of London are:
['Camden', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Islington', 'Kensington and Chelsea', 'Lambeth', 'Lewisham', 'Southwark', 'Tower Hamlets', 'Wandsworth', 'Westminster']
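
The line-break workaround is probably explained by how Beautiful Soup's `.string` attribute behaves: it returns `None` whenever a tag has more than one child, and a newline following a link counts as a second child. The snippet below is a minimal sketch on a made-up table cell (an assumption, not the actual Wikipedia markup). It shows this behaviour and also shows `get_text(strip=True)` as one possible alternative to editing the raw HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup, loosely modelled on a Wikipedia table cell
raw = b'<table><tr><td><a href="#">Camden</a>\n</td></tr></table>'

cell = BeautifulSoup(raw, 'html.parser').find('td')
# Two children (the <a> tag and the newline string), so .string is None
string_before = cell.string

cleaned = raw.replace(b'\n</td>', b'</td>')
cell_cleaned = BeautifulSoup(cleaned, 'html.parser').find('td')
# Only one child remains, so .string descends into the <a> tag
string_after = cell_cleaned.string

# get_text(strip=True) drops the stray whitespace without pre-processing
text_alternative = cell.get_text(strip=True)

print(string_before, string_after, text_alternative)
```

If `get_text(strip=True)` were used inside `table_extractor`, the byte-level replacements might become unnecessary, although that has not been checked against the live pages.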


In [5]:
# Again let's get rid of those pesky line breaks.
london_n_data = london_n_data.replace(b'\n</td>', b'</td>')

# Remove some tags which are causing problems with extracting the data
london_n_data = london_n_data.replace(b'<br />', b', ')
# Raw byte strings avoid invalid escape sequence warnings for \s and \(
london_n_data = re.sub(rb'\s*<sup.*?</sup>', b'', london_n_data)
london_n_data = re.sub(rb'\s*\(also.*?\)', b'', london_n_data)

soup = BeautifulSoup(london_n_data)
table = soup.find('tbody')
table = table.find_next('tbody')

london_nbhds = []
for nbhd, borough in table_extractor(table, [0,1]):
    if ',' in borough or '&' in borough:
        for inner_b in inner_boroughs:
            if inner_b in borough:
                london_nbhds.append(nbhd)
                break
    else:
        if borough in inner_boroughs:
            london_nbhds.append(nbhd)

print('{} neighbourhoods in the Inner Boroughs of London:'.format(len(london_nbhds)))
print(london_nbhds)

187 neighbourhoods in the Inner Boroughs of London:
['Abbey Wood', 'Acton', 'Aldwych', 'Angel', 'Archway', 'Balham', 'Bankside', 'Barnsbury', 'Battersea', 'Bayswater', 'Belgravia', 'Bellingham', 'Belsize Park', 'Bermondsey', 'Bethnal Green', 'Blackheath', 'Blackheath Royal Standard', 'Blackwall', 'Bloomsbury', 'Bow', 'Brixton', 'Brockley', 'Bromley', 'Brompton', 'Camberwell', 'Cambridge Heath', 'Camden Town', 'Canary Wharf', 'Canonbury', 'Catford', 'Chalk Farm', 'Charing Cross', 'Charlton', 'Chelsea', 'Chinatown', 'Chinbrook', 'Chiswick', 'Clapham', 'Clerkenwell', 'Covent Garden', 'Cricklewood', 'Crofton Park', 'Cubitt Town', 'Dalston', 'De Beauvoir Town', 'Denmark Hill', 'Deptford', 'Downham', 'Dulwich', 'Earls Court', 'Earlsfield', 'East Dulwich', 'Elephant and Castle', 'Eltham', 'Falconwood', 'Farringdon', 'Finsbury', 'Finsbury Park', 'Fitzrovia', 'Forest Hill', 'Frognal', 'Fulham', 'Gipsy Hill', 'Gospel Oak', 'Greenwich', 'Grove Park', 'Hackney', 'Hackney Central', 'Hackney Marshes

#### Construct a list of neighbourhoods for Sydney

In [6]:
wp_sydney_url = 'https://en.wikipedia.org/wiki/City_of_Sydney'

response = requests.get(wp_sydney_url)
sydney_data = response.content

In [7]:
soup = BeautifulSoup(sydney_data)
ulists = soup.find_all('ul')

# We want ulists[15] and ulists[16]
sydney_nbhds = [item.string for item in ulists[15].children if item != '\n']
sydney_nbhds += [item.string for item in ulists[16].children if item != '\n']

print('{} neighbourhoods in the City of Sydney:'.format(len(sydney_nbhds)))
print(sydney_nbhds)

48 neighbourhoods in the City of Sydney:
['Alexandria', 'Barangaroo', 'Beaconsfield', 'Camperdown', 'Centennial Park', 'Chippendale', 'Darlinghurst', 'Darlington', 'Dawes Point', 'Elizabeth Bay', 'Erskineville', 'Eveleigh', 'Forest Lodge', 'Glebe', 'Haymarket', 'Millers Point', 'Moore Park', 'Newtown', 'Paddington', 'Potts Point', 'Pyrmont', 'Redfern', 'Rosebery', 'Rushcutters Bay', 'Surry Hills', 'Sydney CBD', 'The Rocks', 'Ultimo', 'Waterloo', 'Woolloomooloo', 'Zetland', 'Broadway', 'Central', 'Central Park', 'Chinatown', 'Circular Quay', 'Darling Harbour', 'The Domain', 'East Sydney', 'Goat Island', 'Garden Island', 'Green Square', 'Kings Cross', 'Macdonaldtown', 'Railway Square', 'St James', 'Strawberry Hills', 'Wynyard']


#### Construct a list of neighbourhoods for New York

In [8]:
wp_manhattan_url = 'https://en.wikipedia.org/wiki/List_of_Manhattan_neighborhoods'

response = requests.get(wp_manhattan_url)
new_york_data = response.content

In [9]:
# Again let's get rid of those pesky line breaks.
new_york_data = new_york_data.replace(b'\n</td>', b'</td>')

new_york_data = re.sub(b'<sup.*?</sup>', b'', new_york_data)
new_york_data = re.sub(b'</a>(.*?)</td>', b'</a></td>', new_york_data)

soup = BeautifulSoup(new_york_data)
table1 = soup.find('tbody')
table2 = table1.find_next('tbody')
table3 = table2.find_next('tbody')
table4 = table3.find_next('tbody')

new_york_nbhds = [item[0] for item in table_extractor(table1, [0])]
new_york_nbhds += [item[0] for item in table_extractor(table2, [0])]
new_york_nbhds += [item[0] for item in table_extractor(table3, [0])]
new_york_nbhds += [item[0] for item in table_extractor(table4, [0])]

ind = new_york_nbhds.index('Murray Hill aka Curry Hill aka Little India')
new_york_nbhds[ind] = 'Murray Hill' # Let's just call it Murray Hill.

print('{} neighbourhoods in Manhattan, New York:'.format(len(new_york_nbhds)))
print(new_york_nbhds)

85 neighbourhoods in Manhattan, New York:
['Upper Manhattan', 'Marble Hill', 'Inwood', 'Fort George', 'Washington Heights', 'Hudson Heights', 'West Harlem', 'Hamilton Heights', 'Manhattanville', 'Morningside Heights', 'Central Harlem', 'Harlem', 'St. Nicholas Historic District', 'Astor Row', 'Sugar Hill', 'Marcus Garvey Park', 'Le Petit Senegal', 'East Harlem', 'Upper East Side', 'Lenox Hill', 'Carnegie Hill', 'Yorkville', 'Upper West Side', 'Manhattan Valley', 'Lincoln Square', 'Midtown', 'Columbus Circle', 'Sutton Place', 'Rockefeller Center', 'Diamond District', 'Theater District', 'Turtle Bay', 'Midtown East', 'Midtown', 'Tudor City', 'Little Brazil', 'Times Square', 'Hudson Yards', 'Midtown West', "Hell's Kitchen", 'Garment District', 'Herald Square', 'Koreatown', 'Murray Hill', 'Tenderloin', 'Madison Square', 'Flower District', 'Brookdale', 'Hudson Yards', 'Kips Bay', 'Rose Hill', 'NoMad', 'Peter Cooper Village', 'Chelsea', 'Flatiron District', 'Gramercy Park', 'Stuyvesant Square

## Obtain geographic coordinates for each neighbourhood

First we define a couple of functions to help us obtain the latitude and longitude of each neighbourhood. We will need to enter the latitude and longitude manually for any neighbourhood where geopy's Nominatim geocoder fails to return a result.

In [10]:
geolocator = Nominatim(user_agent='My_Coursera_Applied_Data_Science_Capstone')

def get_lat_long(address):
    place = geolocator.geocode(address)
    
    if place is None:
        return None, None
    return place.latitude, place.longitude


def get_lls(neighbourhood_list, city):
    lls = {}
    for nbhd in neighbourhood_list:
        lat, long = get_lat_long('{}, {}'.format(nbhd, city))
        lls[nbhd] = (lat, long)
        if lat is None or long is None:
            print('Geolocation failed for {}, {}'.format(nbhd, city))
    return lls

In [11]:
london_nbhd_lls = get_lls(london_nbhds, 'London')

Geolocation failed for Somerstown, London


In [12]:
london_nbhd_lls['Somerstown'] = (51.5310, -0.1304)

In [13]:
sydney_nbhd_lls = get_lls(sydney_nbhds, 'Sydney')

In [14]:
new_york_nbhd_lls = get_lls(new_york_nbhds, 'New York')

Geolocation failed for St. Nicholas Historic District, New York
Geolocation failed for Astor Row, New York
Geolocation failed for Le Petit Senegal, New York
Geolocation failed for Little Brazil, New York
Geolocation failed for Tenderloin, New York
Geolocation failed for Little Germany, New York
Geolocation failed for Little Australia, New York
Geolocation failed for Cooperative Village, New York
Geolocation failed for Radio Row, New York
Geolocation failed for Little Syria, New York


In [15]:
new_york_nbhd_lls['St. Nicholas Historic District'] = (40.8173, -73.9426)
new_york_nbhd_lls['Astor Row'] = (40.8106, -73.9417)
new_york_nbhd_lls['Le Petit Senegal'] = (40.8040, -73.9542)
new_york_nbhd_lls['Little Brazil'] = (40.7567, -73.9807)
new_york_nbhd_lls['Tenderloin'] = (40.749, -73.988)
new_york_nbhd_lls['Little Germany'] = (40.7261, -73.9814)
new_york_nbhd_lls['Little Australia'] = (40.7227, -73.9960)
new_york_nbhd_lls['Cooperative Village'] = (40.7148, -73.9810)
new_york_nbhd_lls['Radio Row'] = (40.7109, -74.0126)
new_york_nbhd_lls['Little Syria'] = (40.7061, -74.0152)
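
The manual corrections above could also be kept in one place and applied during the lookup. The sketch below assumes a hypothetical `geocode` callable that returns a `(lat, long)` pair or `(None, None)`, standing in for `get_lat_long`. The function name `get_lls_with_overrides`, the `overrides` argument and the stand-in geocoder are illustrative and not part of the original notebook.

```python
def get_lls_with_overrides(neighbourhood_list, city, geocode, overrides=None):
    """Geocode each neighbourhood, falling back to manually supplied coordinates."""
    overrides = overrides or {}
    lls, missing = {}, []
    for nbhd in neighbourhood_list:
        lat, long = geocode('{}, {}'.format(nbhd, city))
        if lat is None or long is None:
            if nbhd in overrides:
                lat, long = overrides[nbhd]
            else:
                missing.append(nbhd)
        lls[nbhd] = (lat, long)
    return lls, missing


# Stand-in geocoder for illustration only: it knows one address,
# and the coordinates it returns are illustrative
def fake_geocode(address):
    known = {'Angel, London': (51.53, -0.11)}
    return known.get(address, (None, None))


lls, missing = get_lls_with_overrides(
    ['Angel', 'Somerstown', 'Nowhere'], 'London', fake_geocode,
    overrides={'Somerstown': (51.5310, -0.1304)})

print(lls)
print('Still missing:', missing)
```

Returning the list of unresolved names, rather than only printing them, makes it easy to check that nothing is left without coordinates before querying for venues.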

## Obtain venue information by querying Foursquare's API

In [16]:
CLIENT_ID = '<My_Foursquare_ID>' # your Foursquare ID
CLIENT_SECRET = '<My_Foursquare_secret>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [17]:
LIMIT = 100

def get_nearby_venues(neighbourhood_list, latlong_dict, radius=500, verbose=False):
    nbhd_venues = {}
    if verbose:
        print('Querying Foursquare for venues near to:')
        
    for nbhd in neighbourhood_list:
        lat, lng = latlong_dict[nbhd]
        if verbose:
            print('\t-{}'.format(nbhd))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
               CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        nbhd_venues[nbhd] = [v['venue']['categories'][0]['name'] for v in results]

    return nbhd_venues
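
The request URL in `get_nearby_venues` is assembled by string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`, which takes care of escaping. The helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration, and no request is sent.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Return the venues/explore URL for one latitude and longitude."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return base + '?' + urlencode(params)


# Placeholder credentials; coordinates are those entered manually for Somerstown
example_url = build_explore_url('<id>', '<secret>', '20180605', 51.5310, -0.1304)
print(example_url)
```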

In [18]:
london_venues = get_nearby_venues(london_nbhds, london_nbhd_lls)

In [19]:
sydney_venues = get_nearby_venues(sydney_nbhds, sydney_nbhd_lls)

In [20]:
new_york_venues = get_nearby_venues(new_york_nbhds, new_york_nbhd_lls)

Now we check whether any neighbourhoods share a name across different cities.

In [21]:
set(london_nbhds) & set(sydney_nbhds)

{'Chinatown', 'Paddington'}

In [22]:
set(london_nbhds) & set(new_york_nbhds)

{'Chelsea', 'Chinatown'}

In [23]:
set(sydney_nbhds) & set(new_york_nbhds)

{'Chinatown'}

We will rename these neighbourhoods by appending `(<City>)` to the neighbourhood name. This avoids combining data for neighbourhoods which share a name but belong to different cities.

In [24]:
# Add the city to the names of those neighbourhoods which don't have unique names in this data set
london_venues['Chinatown (London)'] = london_venues['Chinatown']
london_venues['Paddington (London)'] = london_venues['Paddington']
london_venues['Chelsea (London)'] = london_venues['Chelsea']
del(london_venues['Chinatown'])
del(london_venues['Paddington'])
del(london_venues['Chelsea'])

sydney_venues['Chinatown (Sydney)'] = sydney_venues['Chinatown']
sydney_venues['Paddington (Sydney)'] = sydney_venues['Paddington']
del(sydney_venues['Chinatown'])
del(sydney_venues['Paddington'])

new_york_venues['Chinatown (New York)'] = new_york_venues['Chinatown']
new_york_venues['Chelsea (New York)'] = new_york_venues['Chelsea']
del(new_york_venues['Chinatown'])
del(new_york_venues['Chelsea'])
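
The renaming above is done by hand for the three clashing names. A more general version could detect the clashes itself. The following is a minimal sketch under the assumption that the per-city venue dictionaries are gathered into one dictionary keyed by city name; the function name `add_city_suffix` and the toy data are hypothetical.

```python
from collections import Counter


def add_city_suffix(venues_by_city):
    """Append ' (<City>)' to any neighbourhood name used in more than one city."""
    name_counts = Counter(
        nbhd for venues in venues_by_city.values() for nbhd in venues)
    renamed = {}
    for city, venues in venues_by_city.items():
        renamed[city] = {
            ('{} ({})'.format(nbhd, city) if name_counts[nbhd] > 1 else nbhd): cats
            for nbhd, cats in venues.items()
        }
    return renamed


# Toy data, not real Foursquare results
toy = {
    'London': {'Chelsea': ['Pub'], 'Angel': ['Cafe']},
    'New York': {'Chelsea': ['Art Gallery'], 'Harlem': ['Jazz Club']},
}
renamed = add_city_suffix(toy)
print(renamed)
```

Names that occur in only one city are left untouched, matching the manual approach.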

## Construct a DataFrame with proportions of each venue category for a given neighbourhood

First we create a DataFrame in which each row is a (neighbourhood, city, venue category) triple. We filter out neighbourhoods for which Foursquare returned too few venues, as insufficient data could distort the analysis that follows.

In [25]:
def create_df(neighbourhood_venues, city, min_venues):
    data = []
    for nbhd in neighbourhood_venues:
        if len(neighbourhood_venues[nbhd]) < min_venues:
            # Too few venues, skip this one
            continue
        data += [(nbhd, city, venue) for venue in neighbourhood_venues[nbhd]]
    
    df = pd.DataFrame(data)
    df.columns = ['Neighbourhood', 'City', 'Venue Category']
        
    return df

In [26]:
# filter the lists to remove neighbourhoods having too few venues
# Because Sydney has fewer neighbourhoods, let's take the neighbourhoods with 30 or more venues
# and for London and New York let's take the neighbourhoods with 50 or more venues

london_df = create_df(london_venues, 'London', 50)
sydney_df = create_df(sydney_venues, 'Sydney', 30)
new_york_df = create_df(new_york_venues, 'New York', 50)

print('After filtering there are {} neighbourhoods from London.'.format(london_df['Neighbourhood'].nunique()))
print('After filtering there are {} neighbourhoods from Sydney.'.format(sydney_df['Neighbourhood'].nunique()))
print('After filtering there are {} neighbourhoods from New York.'.format(new_york_df['Neighbourhood'].nunique()))

After filtering there are 61 neighbourhoods from London.
After filtering there are 30 neighbourhoods from Sydney.
After filtering there are 62 neighbourhoods from New York.


In [27]:
# Combine all of these into a single DataFrame
df = pd.concat([london_df, sydney_df, new_york_df], ignore_index=True)

In [28]:
df.groupby('Neighbourhood').count()

Neighbourhood,City,Venue Category
Aldwych,64,64
Alphabet City and Loisaida,100,100
Angel,61,61
Astor Row,61,61
Balham,54,54
Battery Park City,93,93
Bayswater,100,100
Belsize Park,52,52
Bethnal Green,63,63
Bloomsbury,62,62


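Now we one-hot encode the venue categories and take the mean over each neighbourhood, which gives the proportion of that neighbourhood's venues falling in each category.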

In [29]:
# one hot encoding
onehot_df = pd.get_dummies(df[['Venue Category']], prefix="", prefix_sep="")

# Add in the Neighbourhood 
onehot_df['Neighbourhood'] = df['Neighbourhood']

# Move Neighbourhood to the first column
cols = [onehot_df.columns[-1]] + list(onehot_df.columns[:-1])
onehot_df = onehot_df[cols]

In [30]:
grouped_df = onehot_df.groupby('Neighbourhood').mean().reset_index()
grouped_df.head(10)

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Warehouse Store,Watch Shop,Water Park,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio
0,Aldwych,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,...,0.0,0.0,0.0,0.0,0.015625,0.015625,0.0,0.0,0.0,0.0
1,Alphabet City and Loisaida,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.0,0.0,0.02
2,Angel,0.0,0.016393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.0,0.016393
3,Astor Row,0.0,0.0,0.04918,0.016393,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016393,0.016393,0.016393,0.0,0.016393
4,Balham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018519
5,Battery Park City,0.0,0.0,0.0,0.010753,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.010753,0.032258,0.0,0.0,0.0,0.0
6,Bayswater,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Belsize Park,0.0,0.0,0.0,0.019231,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,0.0,0.0
8,Bethnal Green,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,...,0.0,0.0,0.0,0.0,0.031746,0.0,0.0,0.0,0.0,0.015873
9,Bloomsbury,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
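
To see why averaging one-hot columns yields proportions, the sketch below repeats the two steps on a tiny made-up DataFrame and compares the result with row-normalised counts from `pd.crosstab`. The toy neighbourhoods and categories are assumptions for illustration only.

```python
import pandas as pd

# Toy data, not real Foursquare results
toy_df = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Pub', 'Park', 'Pub', 'Pub'],
})

# One-hot encode, then average the indicator columns per neighbourhood
onehot = pd.get_dummies(toy_df['Venue Category']).astype(float)
onehot.insert(0, 'Neighbourhood', toy_df['Neighbourhood'])
proportions = onehot.groupby('Neighbourhood').mean()

# The same table obtained directly as row-normalised counts
direct = pd.crosstab(toy_df['Neighbourhood'], toy_df['Venue Category'],
                     normalize='index')

print(proportions)
print(direct)
```

Each row sums to 1, so every neighbourhood is represented by a vector of category shares regardless of how many venues were returned for it.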


## Cluster analysis to address question 1: similarity of neighbourhoods across different cities

We use K-means to cluster the neighbourhoods. Since we have neighbourhoods from three different cities, we set the number of clusters to 3. If neighbourhoods are most similar to other neighbourhoods from the same city, then we would expect each cluster to correspond to one city.

In [31]:
# set number of clusters as three - the number of the cities we are analysing
kclusters = 3

clustering_df = grouped_df.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clustering_df)

In [32]:
clustered_df = df.drop('Venue Category', axis=1)
clustered_df.drop_duplicates(inplace=True)
clustered_df.reset_index(drop=True, inplace=True)

clustered_df.sort_values(by=['Neighbourhood'], inplace=True)

Now, we create a DataFrame which contains the cluster labels for each neighbourhood. We also split this data into separate DataFrames for each of the cities.

In [33]:
clustered_df.insert(2, 'Cluster label', kmeans.labels_)

In [34]:
london_clustered_df = clustered_df[clustered_df['City'] == 'London']
london_clustered_df.reset_index(drop=True, inplace=True)

In [35]:
sydney_clustered_df = clustered_df[clustered_df['City'] == 'Sydney']
sydney_clustered_df.reset_index(drop=True, inplace=True)

In [36]:
new_york_clustered_df = clustered_df[clustered_df['City'] == 'New York']
new_york_clustered_df.reset_index(drop=True, inplace=True)

We count how many of the neighbourhoods in each city belong to each of the clusters.

In [37]:
london_clustered_df.groupby('Cluster label').count()

Cluster label,Neighbourhood,City
0,4,4
1,39,39
2,18,18


In [38]:
sydney_clustered_df.groupby('Cluster label').count()

Cluster label,Neighbourhood,City
0,28,28
2,2,2


In [39]:
new_york_clustered_df.groupby('Cluster label').count()

Cluster label,Neighbourhood,City
2,62,62
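
The three sets of counts are easier to compare side by side. The sketch below rebuilds them from the numbers printed above and arranges them with `pd.crosstab`; the intermediate names `counts`, `rows` and `labels_df` are introduced here for illustration.

```python
import pandas as pd

# Cluster counts copied from the three outputs above
counts = {
    'London': {0: 4, 1: 39, 2: 18},
    'Sydney': {0: 28, 2: 2},
    'New York': {2: 62},
}
rows = [(city, label) for city, by_label in counts.items()
        for label, n in by_label.items() for _ in range(n)]
labels_df = pd.DataFrame(rows, columns=['City', 'Cluster label'])

# One row per city, one column per cluster; absent combinations become 0
city_by_cluster = pd.crosstab(labels_df['City'], labels_df['Cluster label'])
print(city_by_cluster)
```

In the notebook itself the same table would presumably come from `pd.crosstab(clustered_df['City'], clustered_df['Cluster label'])`. Read this way, cluster 1 contains only London neighbourhoods, cluster 0 is mostly Sydney, and cluster 2 holds all of the New York neighbourhoods together with some from London and Sydney.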


We use folium to display the clusters visually for each of the cities.

In [72]:
lat, long = get_lat_long('London')

# create map
map_clusters = folium.Map(location=[lat, long], width='65%', height='65%', zoom_start=11)

# set color scheme for the clusters
rainbow = ['blue', 'green', 'red']

# add markers to the map
markers_colors = []
for nbhd, cluster in zip(london_clustered_df['Neighbourhood'], london_clustered_df['Cluster label']):
    # Strip the (<city>) suffix which we added to some neighbourhood names
    loc = nbhd.find('(')
    if loc > -1:
        nbhd = nbhd[:loc-1]
    lat, long = london_nbhd_lls[nbhd]
    label = folium.Popup(str(nbhd) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [73]:
lat, long = get_lat_long('City of Sydney')

# create map
map_clusters = folium.Map(location=[lat, long], width='65%', height='65%', zoom_start=13)

# set color scheme for the clusters
rainbow = ['blue', 'green', 'red']

# add markers to the map
markers_colors = []
for nbhd, cluster in zip(sydney_clustered_df['Neighbourhood'], sydney_clustered_df['Cluster label']):
    # Strip the (<city>) suffix which we added to some neighbourhood names
    loc = nbhd.find('(')
    if loc > -1:
        nbhd = nbhd[:loc-1]
    lat, long = sydney_nbhd_lls[nbhd]
    label = folium.Popup(str(nbhd) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [74]:
lat, long = get_lat_long('Manhattan, New York')

# create map
map_clusters = folium.Map(location=[lat, long], width='65%', height='65%', zoom_start=11)

# set color scheme for the clusters
rainbow = ['blue', 'green', 'red']

# add markers to the map
markers_colors = []
for nbhd, cluster in zip(new_york_clustered_df['Neighbourhood'], new_york_clustered_df['Cluster label']):
    # Strip the (<city>) suffix which we added to some neighbourhood names
    loc = nbhd.find('(')
    if loc > -1:
        nbhd = nbhd[:loc-1]
    lat, long = new_york_nbhd_lls[nbhd]
    label = folium.Popup(str(nbhd) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Analysis to answer question 2: most similar neighbourhoods in other cities

We create a dictionary and list to allow us to retrieve the feature vectors for each neighbourhood.

In [43]:
# Create a dict mapping each neighbourhood to its row position in grouped_df, and create
# a list of feature vectors
temp = grouped_df.to_numpy()
index_finder = dict(zip(list(temp[:,0]),range(len(temp[:,0]))))
features = [temp[i,1:] for i in range(len(temp[:,0]))]

In [44]:
# Construct lists of the neighbourhoods from each city which we are analysing.
london_nbhds = list(london_clustered_df.iloc[:,0])
sydney_nbhds = list(sydney_clustered_df.iloc[:,0])
new_york_nbhds = list(new_york_clustered_df.iloc[:,0])

We calculate the average distances between (feature vectors of) the neighbourhoods in each city.

In [45]:
def avg_nbhd_dist_intra_city(nbhd_list):
    n = len(nbhd_list)
    total = 0
    for i, nbhd1 in enumerate(nbhd_list):
        for nbhd2 in nbhd_list[i+1:]:
            v1 = features[index_finder[nbhd1]]
            v2 = features[index_finder[nbhd2]]
            total += np.linalg.norm(v1 - v2)
    return total / (n * (n-1) / 2)

In [46]:
avg_nbhd_dist_intra_city(london_nbhds)

0.21162958616761138

In [47]:
avg_nbhd_dist_intra_city(sydney_nbhds)

0.20112308980514276

In [48]:
avg_nbhd_dist_intra_city(new_york_nbhds)

0.20001567641312398
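
The intra-city figure is the mean Euclidean distance over all n(n-1)/2 unordered pairs of neighbourhoods. The following sketch reproduces that calculation on three made-up feature vectors, using `itertools.combinations` to enumerate the pairs; the names and values are assumptions chosen so the arithmetic is easy to follow.

```python
from itertools import combinations

import numpy as np

# Toy feature vectors, not real neighbourhood data
toy_features = {
    'A': np.array([0.0, 0.0]),
    'B': np.array([3.0, 4.0]),
    'C': np.array([6.0, 8.0]),
}

# Distances for the pairs (A, B), (A, C) and (B, C)
pair_dists = [np.linalg.norm(toy_features[a] - toy_features[b])
              for a, b in combinations(toy_features, 2)]
avg_dist = sum(pair_dists) / len(pair_dists)

print(pair_dists, avg_dist)
```

Here the three pairwise distances are 5, 10 and 5, so the average is 20/3.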

We construct matrices of distances between (feature vectors of) neighbourhoods in different cities.

In [49]:
def create_dist_matrix(row_nbhds, col_nbhds):
    dists = []
    row_features = [features[index_finder[nbhd1]] for nbhd1 in row_nbhds]
    col_features = [features[index_finder[nbhd2]] for nbhd2 in col_nbhds]
    
    for row in row_features:
        #dists.append(tuple([LA.norm(row - col) for col in col_features]))
        dists.append(tuple([np.linalg.norm(row - col) for col in col_features]))
    
    return np.array(dists)
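
For larger inputs the nested loop could be replaced by NumPy broadcasting. The sketch below is one way to do this, assuming the feature vectors can be converted to a float array; `dist_matrix_broadcast` is a hypothetical name and the toy vectors are made up.

```python
import numpy as np


def dist_matrix_broadcast(row_features, col_features):
    """Euclidean distances between every row vector and every column vector."""
    rows = np.asarray(row_features, dtype=float)  # shape (n_rows, n_features)
    cols = np.asarray(col_features, dtype=float)  # shape (n_cols, n_features)
    # Broadcasting gives an (n_rows, n_cols, n_features) array of differences
    diffs = rows[:, None, :] - cols[None, :, :]
    return np.linalg.norm(diffs, axis=2)


toy_rows = [[0.0, 0.0], [1.0, 1.0]]
toy_cols = [[3.0, 4.0], [1.0, 1.0], [0.0, 1.0]]
toy_dists = dist_matrix_broadcast(toy_rows, toy_cols)
print(toy_dists)
```

The result has one row per entry of `toy_rows` and one column per entry of `toy_cols`, the same layout as the matrices returned by `create_dist_matrix`.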

In [50]:
dist_london_sydney = create_dist_matrix(london_nbhds, sydney_nbhds)
dist_london_new_york = create_dist_matrix(london_nbhds, new_york_nbhds)
dist_sydney_new_york = create_dist_matrix(sydney_nbhds, new_york_nbhds)


Now, we use these distance matrices to find which neighbourhoods from the other cities are most similar to a given neighbourhood. We consider "most similar" to mean smallest distance between corresponding feature vectors.

In [51]:
# A function to find which of the neighbourhoods in col_nbhds is closest to each of the row_nbhds.
# dist_matrix is the distance matrix giving the distances between the features of row_nbhds and
# the features of col_nbhds
def find_closest_nbhds(dist_matrix, row_nbhds, col_nbhds):
    data = []
    for i, r_nbhd in enumerate(row_nbhds):
        minimum = np.amin(dist_matrix[i,:])
        j = list(dist_matrix[i,:]).index(minimum)
        
        data.append((r_nbhd, col_nbhds[j], round(minimum, 4)))
    
    return pd.DataFrame(data)
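
The row-wise minimum search can also be written with `argmin`. The sketch below applies it to a small made-up distance matrix; the labels `R1`, `R2`, `C1`, `C2` and `C3` are placeholders rather than real neighbourhoods.

```python
import numpy as np

# Toy distance matrix: 2 row neighbourhoods by 3 column neighbourhoods
toy_dist = np.array([[0.30, 0.10, 0.20],
                     [0.25, 0.40, 0.05]])
row_names = ['R1', 'R2']
col_names = ['C1', 'C2', 'C3']

# Index of the smallest distance in each row
nearest_idx = toy_dist.argmin(axis=1)
closest = [(r, col_names[j], float(toy_dist[i, j]))
           for i, (r, j) in enumerate(zip(row_names, nearest_idx))]
print(closest)
```

When a row contains ties, `argmin` returns the first minimal column, which is the same choice that `list(...).index(minimum)` makes. Note that several rows may select the same column, which is why a few neighbourhoods dominate the counts in the tables that follow.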

#### Analysis for London and Sydney neighbourhoods

In [52]:
lon_syd_closest = find_closest_nbhds(dist_london_sydney, london_nbhds, sydney_nbhds)
lon_syd_closest.columns = ['London neighbourhood', 'Sydney neighbourhood', 'feature distance']

# Note: the distance matrix is transposed here
syd_lon_closest = find_closest_nbhds(dist_london_sydney.T, sydney_nbhds, london_nbhds)
syd_lon_closest.columns = ['Sydney neighbourhood', 'London neighbourhood', 'feature distance']

In [53]:
lon_syd_closest.groupby('Sydney neighbourhood').count()

Sydney neighbourhood,London neighbourhood,feature distance
Chinatown (Sydney),1,1
Darling Harbour,1,1
Darlinghurst,3,3
East Sydney,7,7
Glebe,2,2
The Rocks,17,17
Woolloomooloo,24,24
Wynyard,6,6


In [54]:
syd_lon_closest.groupby('London neighbourhood').count()

London neighbourhood,Sydney neighbourhood,feature distance
Bayswater,2,2
Brompton,1,1
Chalk Farm,14,14
Hackney Central,2,2
Hammersmith,2,2
Highbury,1,1
Knightsbridge,1,1
Shoreditch,3,3
Stoke Newington,2,2
West Hackney,2,2


#### Analysis for London and New York neighbourhoods

In [55]:
lon_ny_closest = find_closest_nbhds(dist_london_new_york, london_nbhds, new_york_nbhds)
lon_ny_closest.columns = ['London neighbourhood', 'New York neighbourhood', 'feature distance']

# Note: the distance matrix is transposed here
ny_lon_closest = find_closest_nbhds(dist_london_new_york.T, new_york_nbhds, london_nbhds)
ny_lon_closest.columns = ['New York neighbourhood', 'London neighbourhood', 'feature distance']

In [56]:
lon_ny_closest.groupby('New York neighbourhood').count()

New York neighbourhood,London neighbourhood,feature distance
Central Harlem,4,4
Columbus Circle,1,1
Financial District,6,6
Flower District,3,3
Garment District,2,2
Little Australia,3,3
Lower East Side,6,6
Meatpacking District,1,1
Midtown,1,1
Midtown East,1,1


In [57]:
ny_lon_closest.groupby('London neighbourhood').count()

London neighbourhood,New York neighbourhood,feature distance
Brompton,2,2
Canary Wharf,4,4
Chalk Farm,2,2
Chinatown (London),3,3
Hammersmith,1,1
Marylebone,5,5
Nag's Head,2,2
Putney,4,4
Shepherd's Bush,10,10
Shoreditch,20,20


#### Analysis for Sydney and New York neighbourhoods

In [58]:
syd_ny_closest = find_closest_nbhds(dist_sydney_new_york, sydney_nbhds, new_york_nbhds)
syd_ny_closest.columns = ['Sydney neighbourhood', 'New York neighbourhood', 'feature distance']

# Note: the distance matrix is transposed here
ny_syd_closest = find_closest_nbhds(dist_sydney_new_york.T, new_york_nbhds, sydney_nbhds)
ny_syd_closest.columns = ['New York neighbourhood', 'Sydney neighbourhood', 'feature distance']

In [59]:
syd_ny_closest.groupby('New York neighbourhood').count()

New York neighbourhood,Sydney neighbourhood,feature distance
Bowery,2,2
Carnegie Hill,2,2
Financial District,2,2
Flower District,3,3
Hudson Heights,10,10
Koreatown,1,1
Little Australia,1,1
Lower East Side,4,4
Tudor City,5,5


In [60]:
ny_syd_closest.groupby('Sydney neighbourhood').count()

Sydney neighbourhood,New York neighbourhood,feature distance
Chinatown (Sydney),13,13
East Sydney,30,30
Newtown,2,2
The Rocks,13,13
Woolloomooloo,3,3
Wynyard,1,1


#### Construct DataFrames for each city summarising cluster labels and most similar neighbourhoods

In [61]:
pd.set_option('display.max_rows', 70)

london_complete_df = london_clustered_df
london_complete_df = london_complete_df.drop('City', axis=1)
london_complete_df.columns = ['London neighbourhood', 'Cluster Labels']
london_complete_df = london_complete_df.merge(lon_syd_closest, on='London neighbourhood')
london_complete_df = london_complete_df.merge(lon_ny_closest, on='London neighbourhood')
london_complete_df

Unnamed: 0,London neighbourhood,Cluster Labels,Sydney neighbourhood,feature distance_x,New York neighbourhood,feature distance_y
0,Aldwych,1,The Rocks,0.1951,Midtown,0.1939
1,Angel,1,Woolloomooloo,0.1866,Tudor City,0.1761
2,Balham,1,Wynyard,0.2026,Yorkville,0.1989
3,Bayswater,1,Woolloomooloo,0.1638,Tudor City,0.17
4,Belsize Park,0,Woolloomooloo,0.167,Tudor City,0.1782
5,Bethnal Green,1,Wynyard,0.1833,Financial District,0.1859
6,Bloomsbury,1,Wynyard,0.176,Financial District,0.2
7,Brompton,2,The Rocks,0.1681,Murray Hill,0.1573
8,Camberwell,1,Woolloomooloo,0.2029,Tudor City,0.1874
9,Camden Town,1,Woolloomooloo,0.1872,Lower East Side,0.194


In [62]:
sydney_complete_df = sydney_clustered_df
sydney_complete_df = sydney_complete_df.drop('City', axis=1)
sydney_complete_df.columns = ['Sydney neighbourhood', 'Cluster Labels']
sydney_complete_df = sydney_complete_df.merge(syd_lon_closest, on='Sydney neighbourhood')
sydney_complete_df = sydney_complete_df.merge(syd_ny_closest, on='Sydney neighbourhood')
sydney_complete_df

Unnamed: 0,Sydney neighbourhood,Cluster Labels,London neighbourhood,feature distance_x,New York neighbourhood,feature distance_y
0,Central Park,0,Chalk Farm,0.1544,Lower East Side,0.1959
1,Chinatown (Sydney),2,Brompton,0.1881,Koreatown,0.1718
2,Chippendale,0,Chalk Farm,0.1505,Lower East Side,0.1992
3,Circular Quay,0,Hackney Central,0.1682,Lower East Side,0.1908
4,Darling Harbour,0,Shoreditch,0.1753,Tudor City,0.1864
5,Darlinghurst,0,Chalk Farm,0.1652,Tudor City,0.1988
6,Darlington,0,Highbury,0.2189,Tudor City,0.289
7,Dawes Point,0,Hackney Central,0.1711,Flower District,0.2048
8,East Sydney,0,Chalk Farm,0.1425,Flower District,0.1679
9,Elizabeth Bay,0,Chalk Farm,0.2015,Hudson Heights,0.2268


In [63]:
new_york_complete_df = new_york_clustered_df
new_york_complete_df = new_york_complete_df.drop('City', axis=1)
new_york_complete_df.columns = ['New York neighbourhood', 'Cluster Labels']
new_york_complete_df = new_york_complete_df.merge(ny_lon_closest, on='New York neighbourhood')
new_york_complete_df = new_york_complete_df.merge(ny_syd_closest, on='New York neighbourhood')
new_york_complete_df

Unnamed: 0,New York neighbourhood,Cluster Labels,London neighbourhood,feature distance_x,Sydney neighbourhood,feature distance_y
0,Alphabet City and Loisaida,2,Shoreditch,0.147,East Sydney,0.1954
1,Astor Row,2,Shepherd's Bush,0.1831,East Sydney,0.2077
2,Battery Park City,2,St Giles,0.1746,East Sydney,0.1996
3,Bowery,2,Shepherd's Bush,0.1691,Chinatown (Sydney),0.1737
4,Carnegie Hill,2,Chalk Farm,0.1698,East Sydney,0.1813
5,Central Harlem,2,Shepherd's Bush,0.1523,East Sydney,0.197
6,Chelsea (New York),2,Marylebone,0.4005,Woolloomooloo,0.4359
7,Chinatown (New York),2,Shepherd's Bush,0.186,Chinatown (Sydney),0.1959
8,Civic Center,2,Shoreditch,0.1958,Chinatown (Sydney),0.2005
9,Columbus Circle,2,St Giles,0.1549,East Sydney,0.1913


## Analysis to answer question 3: similarity between pairs of cities

We compute the average distance between neighbourhoods in one city with neighbourhoods in another city.

In [64]:
def avg_nbhd_dist_inter_city(dist_matrix):
    nr, nc = dist_matrix.shape
    return np.sum(dist_matrix) / (nr * nc)

In [65]:
avg_nbhd_dist_inter_city(dist_london_sydney)

0.2426325933873283

In [66]:
avg_nbhd_dist_inter_city(dist_london_new_york)

0.2245948407879716

In [67]:
avg_nbhd_dist_inter_city(dist_sydney_new_york)

0.25767306193647455

We also compute another measure of similarity between cities. For a pair of cities, consider pairing neighbourhoods from one city with neighbourhoods from the other such that each neighbourhood belongs to at most one pair (i.e. once a neighbourhood has been used it cannot be used again), and such that the sum of distances between paired neighbourhoods is minimised. This minimal sum is then averaged by dividing by the number of neighbourhoods in the city with fewer neighbourhoods (since neighbourhoods cannot be repeated, this is the number of pairings between the two cities).

Finding the pairing of neighbourhoods which gives the minimal sum of distances is an instance of the [Assignment problem](https://en.wikipedia.org/wiki/Assignment_problem), which can be solved with the Hungarian algorithm. We use [scipy's implementation, linear_sum_assignment](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html) to compute the pairing of neighbourhoods and determine the minimal sum of distances.

In [68]:
def avg_matching_dist_inter_city(dist_matrix):
    row_ind, col_ind = linear_sum_assignment(dist_matrix)
    return dist_matrix[row_ind, col_ind].sum() / min(dist_matrix.shape)
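
To make the difference between this measure and simple nearest-neighbour matching concrete, the sketch below solves a tiny made-up instance by exhaustive search. This is deliberately not the Hungarian algorithm: it tries every one-to-one pairing, which is only feasible for very small matrices, but it should give the same minimal sum that `linear_sum_assignment` finds. The function name and the toy matrix are assumptions for illustration.

```python
from itertools import permutations


def brute_force_matching_avg(dist):
    """Average distance of the cheapest one-to-one pairing, by exhaustive search."""
    n_rows, n_cols = len(dist), len(dist[0])
    if n_rows > n_cols:
        # Transpose so that the smaller city indexes the rows
        dist = [list(col) for col in zip(*dist)]
        n_rows, n_cols = n_cols, n_rows
    best = min(
        sum(dist[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(n_cols), n_rows)
    )
    return best / n_rows


# Toy distances: 2 neighbourhoods in one city, 3 in the other
toy = [[0.1, 0.2, 0.9],
       [0.1, 0.8, 0.9]]

# Nearest-neighbour average lets both rows pick column 0
nearest_avg = sum(min(row) for row in toy) / len(toy)
# One-to-one matching must send the rows to different columns
matching_avg = brute_force_matching_avg(toy)

print(nearest_avg, matching_avg)
```

In this toy case both rows are closest to the first column, so the nearest-neighbour average is 0.1, while the best one-to-one pairing costs 0.2 + 0.1 and averages to about 0.15. The matching measure is therefore never smaller than the nearest-neighbour average taken over the smaller city.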

In [69]:
avg_matching_dist_inter_city(dist_london_sydney)

0.19594147702626127

In [70]:
avg_matching_dist_inter_city(dist_london_new_york)

0.19189345085078272

In [71]:
avg_matching_dist_inter_city(dist_sydney_new_york)

0.22355513408665612