<h1 align="center">TALE OF TWO CITIES - UNDERSTANDING NEIGHBOURHOODS OF LONDON AND PARIS</h1>


# Introduction

A Tale of Two Cities, the novel by Charles Dickens, is set in London and Paris during the French Revolution. Both cities were vibrant then and remain so now. A lot has changed over the years, and we now take a look at how the cities have grown.

London and Paris are popular tourist and vacation destinations for people from all around the world. They are diverse and multicultural and offer a wide variety of sought-after experiences. We group the neighbourhoods of London and of Paris and draw insights into what they look like now.

# Business Problem


The aim is to help tourists choose their destinations based on the experiences that each neighbourhood offers and what they are looking for. It also helps people who are thinking about migrating to London or Paris, or relocating to another neighbourhood within the city. Our findings will help stakeholders make informed decisions and address their concerns, including the kinds of cuisines and provision stores available and what the city has to offer.


# Data Description

We require geolocation data for both London and Paris. Postal codes in each city serve as a starting point. Using postal codes, we can find the neighbourhoods, boroughs, venues and their most popular venue categories.


## London

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_areas_of_London

This Wikipedia page has information about all the neighbourhoods; we limit it to London.

1. *borough* : Name of Neighbourhood
2. *town* : Name of borough
3. *post_code* : Postal codes for London.

This Wikipedia page lacks geographical coordinates. To solve this problem, we use the ArcGIS API.

### ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups. 

More specifically, we use ArcGIS to get the geolocations of the neighbourhoods of London. The following columns are added to our initial dataset to complete it.

4. *latitude* : Latitude for Neighbourhood
5. *longitude* : Longitude for Neighbourhood

## Paris

To derive our solution, we leverage JSON data available at https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e

The JSON file has data about all the neighbourhoods in France; we limit it to Paris.

1. *postal_code* : Postal codes for France
2. *nom_comm* : Name of Neighbourhoods in France
3. *nom_dept* : Name of the boroughs, equivalent to towns in France
4. *geo_point_2d* : Tuple containing the latitude and longitude of the Neighbourhoods.
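
As a rough illustration of the filtering step, the sketch below uses two made-up records in a flat layout. The field names come from the list above, but the record shape and the values are assumptions; the real file may nest these fields differently.

```python
import pandas as pd

# Hypothetical records: field names follow the list above,
# the flat layout and the values are assumptions for illustration.
sample_records = [
    {"postal_code": "75001", "nom_comm": "PARIS-1ER-ARRONDISSEMENT",
     "nom_dept": "PARIS", "geo_point_2d": [48.8626, 2.3363]},
    {"postal_code": "69001", "nom_comm": "LYON-1ER-ARRONDISSEMENT",
     "nom_dept": "RHONE", "geo_point_2d": [45.7699, 4.8292]},
]

france = pd.DataFrame(sample_records)

# Keep only the rows whose department is Paris.
paris = france[france["nom_dept"] == "PARIS"].copy()

# Split the (latitude, longitude) pair into two numeric columns.
paris["latitude"] = paris["geo_point_2d"].apply(lambda p: p[0])
paris["longitude"] = paris["geo_point_2d"].apply(lambda p: p[1])
paris = paris.drop(columns=["geo_point_2d"])
```

After this step the Paris frame has the same latitude and longitude columns as the London data, which keeps the later steps symmetrical.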

## Foursquare API Data

We need data about the venues in the neighbourhoods of each borough. To get that information we use Foursquare location data. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus and even photos. The Foursquare location platform is therefore our sole source of venue data, since all the required venue information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.

The data retrieved from Foursquare contains information about venues within a specified distance of the latitude and longitude of each postcode. The information obtained per venue is as follows:

1. *Neighbourhood* : Name of the Neighbourhood
2. *Neighbourhood Latitude* : Latitude of the Neighbourhood
3. *Neighbourhood Longitude* : Longitude of the Neighbourhood
4. *Venue* : Name of the Venue
5. *Venue Latitude* : Latitude of Venue
6. *Venue Longitude* : Longitude of Venue
7. *Venue Category* : Category of Venue


Based on all the information collected for both London and Paris, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data, our stakeholders can make the necessary decisions.

# Methodology

We will be creating our model with the help of Python so we start off by importing all the required packages.

In [244]:
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
import folium
import re
from sklearn.cluster import KMeans

The approach taken here is to explore each of the cities individually, plot the map to show the neighbourhoods being considered and then build our model by clustering all of the similar neighbourhoods together and finally plot the new map with the clustered neighbourhoods. We draw insights and then compare and discuss our findings.

# Exploring London

### Neighbourhoods of London

We begin by collecting and refining the data needed for our business solution.

### Data Collection

To get the neighbourhoods in London, we start by scraping the Wikipedia page that lists the areas of London.

In [245]:
url_london = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
wiki_london_url = requests.get(url_london)
wiki_london_url

<Response [200]>

A response status of 200 means that the request succeeded and the page was retrieved.
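
Rather than checking the status by eye, the request can fail loudly on an error status. The following is a minimal sketch, not the notebook's own code: `fetch_html` and its `get` parameter are hypothetical names, and `get` exists only so the function can be exercised without a network connection.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body, raising an exception on a 4xx or 5xx status.

    The get parameter is a hypothetical hook (not part of the notebook)
    that allows a stand-in for requests.get to be passed in.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

With this helper, the scraping cell could call `fetch_html(url_london)` and pass the result straight to `pd.read_html`.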

In [246]:
wiki_london_data = pd.read_html(wiki_london_url.text)
wiki_london_data

[                                                   0
 0  Map all coordinates in "Category:Areas of Lond...
 1                       Download coordinates as: KML,
             Location                     London borough       Post town  \
 0         Abbey Wood              Bexley, Greenwich [7]          LONDON   
 1              Acton  Ealing, Hammersmith and Fulham[8]          LONDON   
 2          Addington                         Croydon[8]         CROYDON   
 3         Addiscombe                         Croydon[8]         CROYDON   
 4        Albany Park                             Bexley  BEXLEY, SIDCUP   
 ..               ...                                ...             ...   
 527         Woolwich                          Greenwich          LONDON   
 528   Worcester Park       Sutton, Kingston upon Thames  WORCESTER PARK   
 529  Wormwood Scrubs             Hammersmith and Fulham          LONDON   
 530          Yeading                         Hillingdon           HAYES   
 

Scraping the webpage gives us all the tables present on the page. We need the second table, so we select it by index.

In [247]:
wiki_london_data = wiki_london_data[1]
wiki_london_data

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
527,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
528,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
529,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
530,Yeading,Hillingdon,HAYES,UB4,020,TQ115825
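
Selecting the table by position breaks if the page gains or loses a table. One possible alternative, sketched here with a hypothetical helper `pick_table`, is to choose the first table that contains the columns we rely on.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table contains columns: {}".format(sorted(required)))
```

For this page the call would look like `pick_table(pd.read_html(wiki_london_url.text), ['Location', 'Post town'])`, using the column names shown in the output above.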


### Data Preprocessing and Feature Selection

We could alter the column names by replacing " " with "_" for ease of use. The code is commented out because we do not require it.

In [248]:
#def replaceSpaceAndNonBreakingSpace(sentence):
#    sentence = sentence.strip().replace(chr(160), "_")
#    return sentence.strip().replace(" ", "_")

In [249]:
#wiki_london_data.rename(columns=lambda x: replaceSpaceAndNonBreakingSpace(x), inplace=True)
#wiki_london_data

In [321]:
london_processed_data = wiki_london_data.drop( [ wiki_london_data.columns[0], wiki_london_data.columns[4], wiki_london_data.columns[5] ], axis=1)

In [322]:
london_processed_data.head()

Unnamed: 0,London borough,Post town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


In [323]:
london_processed_data.columns = ['borough','town','post_code']
london_processed_data

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
527,Greenwich,LONDON,SE18
528,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
529,Hammersmith and Fulham,LONDON,W12
530,Hillingdon,HAYES,UB4


Let's remove the Square brackets [ ] and numbers from the borough column

In [324]:
london_processed_data['borough'] = london_processed_data['borough'].map(lambda x: re.sub(r'\[[0-9]*\]','',x).strip())
london_processed_data

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
527,Greenwich,LONDON,SE18
528,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
529,Hammersmith and Fulham,LONDON,W12
530,Hillingdon,HAYES,UB4
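
The substitution above can be checked in isolation. This small sketch wraps the same pattern in a hypothetical helper, `strip_footnotes`, and runs it on values taken from the table.

```python
import re

# Same pattern as above: a pair of square brackets around zero or more digits.
FOOTNOTE = re.compile(r'\[[0-9]*\]')

def strip_footnotes(text):
    """Remove footnote markers such as [7] and trim surrounding whitespace."""
    return FOOTNOTE.sub('', text).strip()

cleaned = [strip_footnotes(s) for s in ["Bexley, Greenwich [7]", "Croydon[8]", "Bexley"]]
```

Values without a marker pass through unchanged, so the function is safe to apply to the whole column.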


Only one row has misplaced data: all three values ended up in the borough column.

In [325]:
london_processed_data[341:342]

Unnamed: 0,borough,town,post_code
341,Haringey || London || N17 ||,,


In [326]:
value = london_processed_data[london_processed_data['town'].isna() == True].loc[:,['borough']].reset_index().drop(columns = ['index'])['borough']

In [335]:
values = value[0].split('||')[:-1]
columns = ['borough', 'town', 'post_code']
# index of the single row whose town is missing
index_ = london_processed_data[london_processed_data['town'].isna()].index[0]
# write each stripped value into its proper column
for column, i in zip(columns, values):
    london_processed_data.loc[index_, column] = i.strip()

In [337]:
london_processed_data[341:342]

Unnamed: 0,borough,town,post_code
341,Haringey,London,N17


We currently have 533 records and 3 columns in our data. It is time to perform feature engineering.

### Feature Engineering

We focus only on the neighbourhoods whose post town is London, so we filter the data accordingly.

In [338]:
london_processed_data = london_processed_data[london_processed_data['town'].str.contains('LONDON')]
london_processed_data

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
...,...,...,...
522,Redbridge,LONDON,"IG8, E18"
523,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8
526,Barnet,LONDON,N12
527,Greenwich,LONDON,SE18
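
One detail worth noting about the filter above: `str.contains` is case-sensitive by default. The row repaired earlier shows its post town as `London` in mixed case, so it may not survive this filter. The toy series below (made-up values, not the notebook's data) sketches the difference that `case=False` makes.

```python
import pandas as pd

towns = pd.Series(['LONDON', 'London', 'CROYDON', 'LONDON, SIDCUP'])

# Default behaviour: case-sensitive, so 'London' is not matched.
strict = towns[towns.str.contains('LONDON')]

# case=False also keeps mixed-case spellings.
relaxed = towns[towns.str.contains('LONDON', case=False)]
```

Whether the mixed-case row should be kept is a judgment call; the sketch only shows how to keep it if desired.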


In [339]:
london_processed_data.shape

(309, 3)

We now have only 309 rows and can proceed with the next steps. First, a summary of the dataset:

In [340]:
london_processed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 309 entries, 0 to 529
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   borough    309 non-null    object
 1   town       309 non-null    object
 2   post_code  309 non-null    object
dtypes: object(3)
memory usage: 9.7+ KB


## Geolocations of the London Neighbourhoods

### ArcGIS API

We need the geographical coordinates of the neighbourhoods to plot our map. We use the arcgis package to do so.

ArcGIS does not limit the number of API calls we can make, so it fits our use case well.

In [142]:
!pip install arcgis


In [341]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

Defining London arcgis geocode function to return latitude and longitude

In [342]:
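def get_x_y_uk(address1):
    # geocode() returns a list of matches; take the first one
    g = geocode(address='{}, London, England, GBR'.format(address1))[0]
    lng_coords = g['location']['x']  # x is the longitude
    lat_coords = g['location']['y']  # y is the latitude
    # return "latitude,longitude" as a single string
    return str(lat_coords) + "," + str(lng_coords)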

Checking sample data

In [343]:
c = get_x_y_uk('SE2')

In [344]:
c

'51.499741450000045,0.12406135200006929'
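
The function above assumes that the geocoder always returns at least one match and that each cell holds a single postcode. The sketch below shows one defensive variant. The helper names are hypothetical, `geocode_fn` stands for any callable shaped like `arcgis.geocoding.geocode`, and using only the first district of a cell such as `W3, W4` is an assumption rather than what the notebook does.

```python
def first_postcode(post_code):
    """Return the first district from a cell such as 'W3, W4'."""
    return post_code.split(',')[0].strip()

def safe_lat_lng(geocode_fn, post_code):
    """Geocode one postcode; return (lat, lng), or (None, None) if nothing is found.

    geocode_fn takes address=... and returns a list of matches, each with a
    'location' dict holding 'x' (longitude) and 'y' (latitude).
    """
    address = '{}, London, England, GBR'.format(first_postcode(post_code))
    matches = geocode_fn(address=address)
    if not matches:
        return None, None
    location = matches[0]['location']
    return location['y'], location['x']
```

In the notebook this would be called as `safe_lat_lng(geocode, 'SE2')`, and rows that come back as `(None, None)` could be dropped before plotting.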

Looks good. We copy over the postal codes of London to pass them into the geocoding function that we just defined above.

In [345]:
geo_coordinates_uk = london_processed_data['post_code']    
geo_coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
522    IG8, E18
523         IG8
526         N12
527        SE18
529         W12
Name: post_code, Length: 309, dtype: object

Passing postal codes of london to get the geographical co-ordinates

In [346]:
coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
coordinates_latlng_uk

0       51.499741450000045,0.12406135200006929
1       51.49716026685069,-0.25251983676694634
6      51.513145000000065,-0.07873241499993355
7       51.51462500000008,-0.11486033199997792
9        51.48249000000004,0.11919361600007505
                        ...                   
522    51.514136428718516,-0.07020212713173243
523    51.507408360000056,-0.12769869299995662
526    51.542635000000075,-0.09858089899995548
527    51.503130000000056,-0.10802582299993446
529    51.515085000000056,-0.24269643599996016
Name: post_code, Length: 309, dtype: object

### Latitude

Extracting the latitude from our previously collected coordinates

In [347]:
lat_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[0])
lat_uk

0      51.499741450000045
1       51.49716026685069
6      51.513145000000065
7       51.51462500000008
9       51.48249000000004
              ...        
522    51.514136428718516
523    51.507408360000056
526    51.542635000000075
527    51.503130000000056
529    51.515085000000056
Name: post_code, Length: 309, dtype: object

### Longitude

Extracting the Longitude from our previously collected coordinates

In [348]:
lng_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[1])
lng_uk

0       0.12406135200006929
1      -0.25251983676694634
6      -0.07873241499993355
7      -0.11486033199997792
9       0.11919361600007505
               ...         
522    -0.07020212713173243
523    -0.12769869299995662
526    -0.09858089899995548
527    -0.10802582299993446
529    -0.24269643599996016
Name: post_code, Length: 309, dtype: object
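
The two extraction steps can also be done in one pass. This is a sketch on two shortened, made-up coordinate strings; `str.split(..., expand=True)` returns one column per piece.

```python
import pandas as pd

coords = pd.Series(['51.4997,0.1240', '51.4971,-0.2525'], index=[0, 1])

# expand=True yields a DataFrame with one column per piece;
# astype(float) converts both columns at once.
lat_lng = coords.str.split(',', expand=True).astype(float)
lat_lng.columns = ['latitude', 'longitude']
```

The result keeps the original index, so it can be concatenated with the source data in the same way as the separate series.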

We now have the geographical coordinates of the London neighbourhoods.

We proceed by merging our source data with the geographical coordinates to make our dataset ready for the next stage.

In [349]:
london_merged = pd.concat([london_processed_data,lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['borough','town','post_code','latitude','longitude']
london_merged

Unnamed: 0,borough,town,post_code,latitude,longitude
0,"Bexley, Greenwich",LONDON,SE2,51.499741,0.124061
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.497160,-0.252520
6,City,LONDON,EC3,51.513145,-0.078732
7,Westminster,LONDON,WC2,51.514625,-0.114860
9,Bromley,LONDON,SE20,51.482490,0.119194
...,...,...,...,...,...
522,Redbridge,LONDON,"IG8, E18",51.514136,-0.070202
523,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.507408,-0.127699
526,Barnet,LONDON,N12,51.542635,-0.098581
527,Greenwich,LONDON,SE18,51.503130,-0.108026


In [350]:
london_merged.dtypes

borough       object
town          object
post_code     object
latitude     float64
longitude    float64
dtype: object

### Co-ordinates for London

Getting the geocode for London to help visualize it on the map

In [351]:
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']
london_lng_coords

-0.12769869299995662

In [352]:
london_lat_coords

51.507408360000056

## Visualize the Map of London

To help visualize the Map of London and the neighbourhoods in London, we make use of the folium package.

In [353]:
# Creating the map of London
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)
map_London

# adding markers to map
for latitude, longitude, borough, town in zip(london_merged['latitude'], london_merged['longitude'], london_merged['borough'], london_merged['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London)  
    
map_London

### Venues in London

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in London.

In [354]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # keep the secret out of shared notebooks
VERSION = '20180605' # Foursquare API version

Defining a function to get the nearby venues in each neighbourhood. This gives us the venue categories, which are important for our analysis.

In [356]:
LIMIT=100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
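
The function above indexes straight into the response, so an empty or unexpected payload raises an error. The sketch below separates the two concerns, building the query parameters and flattening the payload. Both helper names are hypothetical; the payload shape (`response`, `groups`, `items`, `venue`, `categories`) is the one the function above already relies on.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Query parameters for the venues/explore request used above."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }

def parse_explore_payload(payload, name, lat, lng):
    """Flatten one response into (name, lat, lng, venue, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # a venue without categories gets None instead of raising IndexError
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng, venue['name'], categories[0].get('name')))
    return rows
```

The parameters can be passed as `requests.get('https://api.foursquare.com/v2/venues/explore', params=...)`, which lets requests handle the URL encoding.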

Getting the venues in London

In [357]:
venues_in_London = getNearbyVenues(london_merged['borough'], london_merged['latitude'], london_merged['longitude'])

Bexley, Greenwich
Ealing, Hammersmith and Fulham
City
Westminster
Bromley
Islington
Islington
Barnet
Enfield
Wandsworth
Southwark
City
Richmond upon Thames
Barnet
Islington
Wandsworth
Westminster
Bromley
Newham
Ealing
Westminster
Lewisham
Camden
Southwark
Tower Hamlets
Bexley
City
Lewisham
Greenwich
Tower Hamlets
Camden
Haringey
Tower Hamlets
Haringey
Barnet
Brent
Lambeth
Lewisham
Tower Hamlets
Kensington and Chelsea, Hammersmith and Fulham
Brent
Barnet
Barnet
Southwark
Tower Hamlets
Camden
Tower Hamlets
Waltham Forest
Newham
Islington
Richmond upon Thames
Lewisham
Camden
Westminster
Greenwich
Kensington and Chelsea
Barnet
Westminster
Lewisham
Waltham Forest
Hounslow, Ealing, Hammersmith and Fulham
Brent
Barnet
Lambeth, Wandsworth
Islington
Barnet
Merton
Barnet
Westminster
Barnet, Brent, Camden
Lewisham
Bexley
Haringey
Bromley
Tower Hamlets
Newham
Hackney
Islington
Southwark
Lewisham
Brent
Southwark
Ealing
Kensington and Chelsea
Wandsworth
Southwark
Barnet
Newham
Richmond upon Thames
E

Sampling our data

In [358]:
venues_in_London.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.499741,0.124061,Southmere Lake,Lake
1,"Ealing, Hammersmith and Fulham",51.49716,-0.25252,Hack & Veldt,Coffee Shop
2,"Ealing, Hammersmith and Fulham",51.49716,-0.25252,Lara Restaurant,Mediterranean Restaurant
3,"Ealing, Hammersmith and Fulham",51.49716,-0.25252,Chief Coffee,Coffee Shop
4,"Ealing, Hammersmith and Fulham",51.49716,-0.25252,Good Boy Coffee,Coffee Shop


In [359]:
venues_in_London.shape

(12427, 5)

We have collected 12427 venue records. This will definitely make the clustering interesting.



### Grouping by Venue Categories
We now need to see how many venue categories there are for further processing.

In [379]:
venues_in_London.groupby('Venue Category').count().sort_values(by = 'Neighbourhood',ascending=False)

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Pub,929,929,929,929
Coffee Shop,828,828,828,828
Café,603,603,603,603
Hotel,565,565,565,565
Park,325,325,325,325
...,...,...,...,...
Pool Hall,1,1,1,1
Poke Place,1,1,1,1
Jewish Restaurant,1,1,1,1
Piercing Parlor,1,1,1,1


We can see 295 distinct venue categories, which goes to show how diverse and interesting the place is.
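
The same two numbers, the count per category and the number of distinct categories, can be read off a single column. A sketch on a made-up five-row frame:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B', 'B', 'B'],
    'Venue Category': ['Pub', 'Café', 'Pub', 'Pub', 'Park'],
})

# Occurrences per category, most frequent first.
category_counts = venues['Venue Category'].value_counts()

# Number of distinct categories.
n_categories = venues['Venue Category'].nunique()
```

On the real data, `venues_in_London['Venue Category'].nunique()` would give the category count directly.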

### One Hot Encoding 
We need to one-hot encode the venue categories because K-means works on numeric features.

In [361]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")
London_venue_cat

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12422,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12423,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12424,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12425,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighbourhood into the mix.

In [362]:
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

# moving neighborhood column to the first column
fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

London_venue_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Ealing, Hammersmith and Fulham",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Ealing, Hammersmith and Fulham",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Ealing, Hammersmith and Fulham",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Ealing, Hammersmith and Fulham",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean value
We will group the Neighbourhoods and calculate the mean venue categories value in each Neighbourhood

In [363]:
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Barnet,0.0,0.0,0.003333,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.013333,0.0,0.0,0.003333,0.0,0.0,0.0,0.001667
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Brent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010017,...,0.0,0.0,0.003339,0.008347,0.0,0.0,0.001669,0.0,0.0,0.0
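
To see why the mean of one-hot columns is useful, note that it equals the share of a neighbourhood's venues that fall into each category. A toy sketch with made-up venues:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Café', 'Park', 'Pub', 'Park'],
})

# One indicator column per category, as floats so the mean is well defined.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Mean of the indicators per neighbourhood = share of venues per category.
shares = onehot.groupby('Neighbourhood').mean()
```

Neighbourhood A has two pubs out of four venues, so its `Pub` share is 0.5, and every row of `shares` sums to 1.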



Let's make a function to get the top most common venue categories

In [364]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
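
When a neighbourhood has fewer distinct categories than `num_top_venues`, the function above fills the remaining slots with categories whose share is zero, which appears to explain the runs of unrelated "Outdoor..." entries in the results. The variant below is a sketch of one way to avoid that; `top_categories` is a hypothetical name, and dropping zero shares is a design choice, not the notebook's behaviour.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Return the most frequent categories in one grouped row, skipping zeros."""
    shares = row.iloc[1:].astype(float)  # skip the Neighbourhood label
    shares = shares[shares > 0]          # drop categories that never occur
    return list(shares.nlargest(num_top_venues).index)

row = pd.Series({'Neighbourhood': 'A', 'Café': 0.4, 'Park': 0.0, 'Pub': 0.6, 'Zoo': 0.0})
top3 = top_categories(row, 3)
```

Because the result can be shorter than `num_top_venues`, a caller that fills fixed-width columns would need to pad it, for example with `None`.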


There are far too many venue categories to read at a glance, so we keep the top 10 per neighbourhood to summarise it.

Creating the column labels for the top venues

In [365]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
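
The try/except above works because only the first three positions have a special suffix. A small helper makes the rule explicit and also handles larger numbers; `ordinal` is a hypothetical name, not something the notebook defines.

```python
def ordinal(n):
    """Return n with its English ordinal suffix: 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
```

For ten columns this produces the same labels as the loop above.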


### Top venue categories

Getting the top venue categories in London


In [366]:
# create a new dataframe for London
neighborhoods_venues_sorted_london = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london['Neighbourhood'] = London_grouped['Neighbourhood']

for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_london.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnet,Pub,Park,Coffee Shop,Café,Bakery,Bus Stop,Cocktail Bar,Gastropub,Furniture / Home Store,Train Station
1,"Barnet, Brent, Camden",Park,Construction & Landscaping,Pizza Place,Furniture / Home Store,Office,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space,Organic Grocery,Optical Shop
2,Bexley,Lake,Gym / Fitness Center,Pet Store,Motorcycle Shop,Convenience Store,Supermarket,Grocery Store,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture
3,"Bexley, Greenwich",Indian Restaurant,Bus Stop,Pizza Place,Lake,Grocery Store,Accessories Store,Okonomiyaki Restaurant,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space
4,Brent,Coffee Shop,Pub,Pizza Place,Greek Restaurant,Italian Restaurant,Park,Middle Eastern Restaurant,Bar,Cocktail Bar,Restaurant


## Model Building

### K Means
Let's group the neighbourhoods of London into 5 clusters to make them easier to analyze.

We use the K Means clustering technique to do so.

In [383]:
# set number of clusters
k_num_clusters = 5

London_grouped_clustering = London_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)
kmeans_london

KMeans(n_clusters=5, random_state=0)
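
The number of clusters is fixed at 5 here. One common way to sanity-check such a choice is the elbow heuristic: fit K-means for several values of k and look at where the inertia stops dropping sharply. The sketch below uses synthetic blobs as a stand-in for the grouped venue shares, so the numbers it produces say nothing about London.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Toy data: three well-separated blobs of 20 points each.
toy = np.vstack([
    rng.normal(loc=c, scale=0.05, size=(20, 2)) for c in (0.0, 1.0, 2.0)
])

# Inertia (within-cluster sum of squares) for k = 1..6.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias[k] = model.inertia_
```

On this toy data the inertia falls steeply up to k = 3 and flattens afterwards. Running the same loop on `London_grouped_clustering` would show whether 5 is a reasonable choice.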

### Labelling Clustered Data

In [368]:
kmeans_london.labels_

array([0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0])

Our model has now assigned a cluster label to each neighbourhood.

In [369]:
neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)

Join london_merged with our sorted neighbourhood venues to add the latitude and longitude of each neighbourhood and prepare the data for plotting.

In [370]:
london_data = london_merged

london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='borough')

london_data.head()

Unnamed: 0,borough,town,post_code,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.499741,0.124061,5,Indian Restaurant,Bus Stop,Pizza Place,Lake,Grocery Store,Accessories Store,Okonomiyaki Restaurant,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.49716,-0.25252,1,Coffee Shop,Pizza Place,Grocery Store,Italian Restaurant,Bus Stop,Movie Theater,Park,Liquor Store,Massage Studio,Thai Restaurant
6,City,LONDON,EC3,51.513145,-0.078732,1,Coffee Shop,Hotel,Pub,Italian Restaurant,Wine Bar,French Restaurant,Gym / Fitness Center,Cocktail Bar,Restaurant,Sandwich Place
7,Westminster,LONDON,WC2,51.514625,-0.11486,1,Hotel,Pub,Coffee Shop,Café,Theater,Cocktail Bar,Italian Restaurant,French Restaurant,Restaurant,Sandwich Place
9,Bromley,LONDON,SE20,51.48249,0.119194,1,Theater,Hotel,Plaza,Bus Stop,Cocktail Bar,Pub,Sandwich Place,Garden,Bakery,Wine Bar



Drop the rows that have no cluster label so that they do not skew the results.

In [371]:
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])
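
To see why missing labels appear at all, here is a minimal sketch of the same join-then-drop pattern on made-up data. `merged`, `venues_sorted`, `joined` and `labelled` are illustrative names and the values are invented. `DataFrame.join` performs a left join by default, so any neighbourhood without a match keeps its row but gets `NaN` in the joined columns, which also turns the integer labels into floats.

```python
import pandas as pd

# Toy stand-ins for london_merged and neighborhoods_venues_sorted_london
merged = pd.DataFrame({
    "borough": ["A", "B", "C"],
    "latitude": [51.50, 51.51, 51.52],
    "longitude": [-0.10, -0.11, -0.12],
})
venues_sorted = pd.DataFrame({
    "Cluster Labels": [1, 2],
    "Neighbourhood": ["A", "C"],
    "1st Most Common Venue": ["Pub", "Lake"],
})

# Left join on the neighbourhood name: every row of `merged` is kept
joined = merged.join(venues_sorted.set_index("Neighbourhood"), on="borough")
print(joined["Cluster Labels"].tolist())  # "B" has no match, so its label is NaN

# Keep only rows that received a cluster label, then restore integer labels
labelled = joined.dropna(subset=["Cluster Labels"]).copy()
labelled["Cluster Labels"] = labelled["Cluster Labels"].astype(int)
print(labelled)
```

Casting back to `int` is optional, but it keeps the labels usable as list indices later without an explicit conversion.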

### Visualizing the clustered neighbourhoods

Let's plot the clusters on a map of London, using one colour per cluster.

In [173]:
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# set color scheme for the clusters: one colour per cluster along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, k_num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(london_data_nonan['latitude'], london_data_nonan['longitude'], london_data_nonan['borough'], london_data_nonan['Cluster Labels']):
    # 'Cluster Labels' is already 1-based, so it is shown as-is in the popup
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
        
map_clusters_london
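
Because `Cluster Labels` is stored 1-based while the colour list is indexed from 0, the colour lookup subtracts one. The sketch below illustrates that mapping using the standard library's `colorsys` as a stand-in for the matplotlib colormap used above. `make_palette` and `colour_for` are hypothetical helper names, and the HSV sweep is not the same palette as `cm.rainbow`.

```python
import colorsys


def make_palette(n):
    """Return n evenly spaced hex colours (HSV hue sweep, a stand-in for cm.rainbow)."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette


def colour_for(cluster_label, palette):
    """Map a 1-based (possibly float) cluster label to its palette entry."""
    index = int(cluster_label) - 1
    if not 0 <= index < len(palette):
        raise ValueError(f"cluster label {cluster_label} is outside 1..{len(palette)}")
    return palette[index]


palette = make_palette(5)
print(colour_for(1, palette), colour_for(5.0, palette))
```

The explicit range check matters because a label of 0 would otherwise index `palette[-1]` silently and reuse the last cluster's colour.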

## Examining our Clusters

Cluster 1

In [174]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,"BEXLEYHEATH, LONDON",1,Lake,Gym / Fitness Center,Pet Store,Motorcycle Shop,Convenience Store,Supermarket,Grocery Store,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture
124,LONDON,1,Lake,Gym / Fitness Center,Pet Store,Motorcycle Shop,Convenience Store,Supermarket,Grocery Store,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture
291,"LONDON, SIDCUP",1,Lake,Gym / Fitness Center,Pet Store,Motorcycle Shop,Convenience Store,Supermarket,Grocery Store,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture
506,LONDON,1,Lake,Gym / Fitness Center,Pet Store,Motorcycle Shop,Convenience Store,Supermarket,Grocery Store,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture


Cluster 2

In [175]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 2, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,LONDON,2,Coffee Shop,Pizza Place,Grocery Store,Italian Restaurant,Bus Stop,Movie Theater,Park,Liquor Store,Massage Studio,Thai Restaurant
6,LONDON,2,Coffee Shop,Hotel,Pub,Italian Restaurant,Wine Bar,French Restaurant,Gym / Fitness Center,Cocktail Bar,Restaurant,Sandwich Place
7,LONDON,2,Hotel,Pub,Coffee Shop,Café,Theater,Cocktail Bar,Italian Restaurant,French Restaurant,Restaurant,Sandwich Place
9,LONDON,2,Theater,Hotel,Plaza,Bus Stop,Cocktail Bar,Pub,Sandwich Place,Garden,Bakery,Wine Bar
10,LONDON,2,Pub,Café,Coffee Shop,Bus Stop,Bar,Grocery Store,Cocktail Bar,Thai Restaurant,Vegetarian / Vegan Restaurant,Pizza Place
...,...,...,...,...,...,...,...,...,...,...,...,...
522,LONDON,2,Hotel,Coffee Shop,Pub,Indian Restaurant,Café,Gym / Fitness Center,Korean Restaurant,Pizza Place,Cocktail Bar,Sandwich Place
523,"LONDON, WOODFORD GREEN",2,Hotel,Coffee Shop,Pub,Theater,Indian Restaurant,Café,Cocktail Bar,Sandwich Place,Steakhouse,Pizza Place
526,LONDON,2,Pub,Park,Coffee Shop,Café,Bakery,Bus Stop,Cocktail Bar,Gastropub,Furniture / Home Store,Train Station
527,LONDON,2,Pub,Coffee Shop,Hotel,Bar,Café,Gym / Fitness Center,Sandwich Place,Italian Restaurant,Grocery Store,Thai Restaurant


Cluster 3

In [176]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 3, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,LONDON,3,Lake,Accessories Store,Okonomiyaki Restaurant,Pakistani Restaurant,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space,Organic Grocery,Optical Shop,Opera House


Cluster 4

In [177]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 4, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
167,"LONDON, WELLING",4,Indian Restaurant,Grocery Store,Bus Stop,Pizza Place,Office,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space,Organic Grocery,Optical Shop
378,"HARROW, STANMOREEDGWARE, LONDON",4,Bus Stop,Indian Restaurant,Bakery,Gym,Nail Salon,Optical Shop,Palace,Music Venue,Pakistani Restaurant,Outdoors & Recreation
458,"LONDON, ERITH",4,Indian Restaurant,Grocery Store,Bus Stop,Pizza Place,Office,Outdoors & Recreation,Outdoor Sculpture,Outdoor Event Space,Organic Grocery,Optical Shop


Cluster 5

In [178]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 5, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
270,LONDON,5,Outdoors & Recreation,Italian Restaurant,Stables,Grocery Store,Accessories Store,Office,Outdoor Sculpture,Outdoor Event Space,Organic Grocery,Optical Shop
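
The five queries above differ only in the cluster number, so they could be wrapped in one helper. This is a minimal sketch on a made-up table; the helper name `cluster_view` and the toy rows are assumptions for illustration. Selecting columns by name rather than by position avoids depending on the column order of the joined frame.

```python
import pandas as pd

# Toy stand-in for london_data_nonan (values are illustrative)
data = pd.DataFrame({
    "borough": ["A", "B", "C", "D"],
    "town": ["LONDON"] * 4,
    "post_code": ["SE2", "W3", "EC3", "WC2"],
    "latitude": [51.50, 51.49, 51.51, 51.52],
    "longitude": [0.12, -0.25, -0.08, -0.11],
    "Cluster Labels": [1, 2, 2, 3],
    "1st Most Common Venue": ["Lake", "Coffee Shop", "Coffee Shop", "Hotel"],
})


def cluster_view(df, label):
    """Rows of one cluster, showing town, cluster label and the venue columns."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    return df.loc[df["Cluster Labels"] == label, ["town", "Cluster Labels"] + venue_cols]


# Print every cluster in turn instead of repeating the query by hand
for label in sorted(data["Cluster Labels"].unique()):
    print(f"Cluster {label}")
    print(cluster_view(data, label), end="\n\n")
```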




---

