<h1>The Battle of Neighborhoods</h1>

London is a multicultural city made up of many neighborhoods, each with its own history and populated by different ethnic groups. The Italian community is one of the largest, and Italian cuisine is one of the most appreciated.

In this project, we use a data-driven approach to understand which neighborhoods offer the best business opportunities for a new Italian restaurant. Specifically, we would like to answer the following question: if someone is looking to open an Italian restaurant, where should they open it?

To answer this question, we segment and cluster the city by neighborhood. This allows us to analyse the distribution of Italian restaurants in London and thus to identify the areas where this business is still at an early stage.

<h1>Data</h1>

To perform the analysis, we extract data on London neighborhoods from publicly available Wikipedia pages. In addition, we use Foursquare to obtain data on the venues in each neighborhood.

We start by importing the required packages:

In [91]:
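import json

import requests
import numpy as np
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup

import matplotlib.cm as cm
import matplotlib.colors as colors

import folium  # map rendering

from sklearn.cluster import KMeans  # clustering

import pgeocode  # postcode-to-coordinates lookup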

London is divided into the City of London and 32 London boroughs. A list of London areas, together with the borough each one belongs to, can be found on Wikipedia at the following link:

https://en.wikipedia.org/wiki/List_of_areas_of_London

Using BeautifulSoup, we scrape this page and look for the first table with the class "wikitable sortable", as follows:

In [92]:
link = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
page = requests.get(link)
print(page)
soup = BeautifulSoup(page.content, 'lxml')  # name the parser explicitly
table = soup.find('table', {'class':'wikitable sortable'}).tbody
table

<Response [200]>


<tbody><tr>
<th>Location</th>
<th>London borough</th>
<th>Post town</th>
<th>Postcode district</th>
<th>Dial code</th>
<th>OS grid ref
</th></tr>
<tr>
<td><a href="/wiki/Abbey_Wood" title="Abbey Wood">Abbey Wood</a></td>
<td>Bexley,  Greenwich <sup class="reference" id="cite_ref-mills1_1-0"><a href="#cite_note-mills1-1">[1]</a></sup></td>
<td>LONDON</td>
<td>SE2</td>
<td>020</td>
<td><span class="plainlinks nourlexpansion" style="white-space: nowrap"><a class="external text" href="https://tools.wmflabs.org/os/coor_g/?pagename=List_of_areas_of_London&amp;params=TQ465785_region%3AGB_scale%3A25000">TQ465785</a></span>
</td></tr>
<tr>
<td><a href="/wiki/Acton,_London" title="Acton, London">Acton</a></td>
<td>Ealing, Hammersmith and Fulham<sup class="reference" id="cite_ref-mills2_2-0"><a href="#cite_note-mills2-2">[2]</a></sup></td>
<td>LONDON</td>
<td>W3, W4</td>
<td>020</td>
<td><span class="plainlinks nourlexpansion" style="white-space: nowrap"><a class="external text" href="https://too

Next, we find all the table rows, read the column headers from the first row, and build an empty dataframe:

In [93]:

rows = table.find_all('tr')

columns = [i.text.replace('\n', '') for i in rows[0].find_all('th')]

df = pd.DataFrame(columns = columns)

df.columns

Index(['Location', 'London borough', 'Post town', 'Postcode district',
       'Dial code', 'OS grid ref'],
      dtype='object')
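
The header cells on this page can contain non-breaking spaces (`\xa0`), which is why the renaming step further on spells them out explicitly. As a minimal sketch, a small helper could normalise the labels up front so that they can be matched with ordinary spaces. The name `normalize_label` and the sample headers are hypothetical and not part of the original notebook:

```python
def normalize_label(text):
    """Collapse newlines, non-breaking spaces and repeated blanks into single spaces."""
    # str.split() with no argument splits on any run of whitespace,
    # which includes '\n' and the non-breaking space '\xa0'
    return ' '.join(text.split())


# Hypothetical raw header texts, as they might come out of the <th> cells
raw_headers = ['Location\n', 'London\xa0borough\n', 'Postcode\xa0district\n']
labels = [normalize_label(h) for h in raw_headers]
print(labels)
```

With labels cleaned this way, a later rename could use 'London borough' as the key instead of 'London\xa0borough'.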

Then, we fill the dataframe by looping over the remaining rows:

In [94]:
records = []
for row in rows[1:]:
    # Clean every cell: drop newlines and non-breaking spaces
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in row.find_all('td')]
    # Keep only rows that match the header; rows with a different number of cells are skipped
    if len(values) == len(columns):
        records.append(values)

# DataFrame.append is not available in recent pandas, so build the dataframe in one step
df = pd.DataFrame(records, columns=columns)

df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [1]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[2]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[2],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[2],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


We rename some columns for convenience and drop the columns we do not need. We also keep only the locations whose post town is LONDON:

In [95]:
df=df.rename(columns = {'London\xa0borough':'Borough'})
df=df.rename(columns = {'Postcode\xa0district':'Postcode'})
df['Borough'] = df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))

df = df[['Location', 'Borough', 'Postcode', 'Post town']].reset_index(drop=True)

df=df[df['Post town']=='LONDON']

df.head()

Unnamed: 0,Location,Borough,Postcode,Post town
0,Abbey Wood,"Bexley, Greenwich",SE2,LONDON
1,Acton,"Ealing, Hammersmith and Fulham","W3, W4",LONDON
6,Aldgate,City,EC3,LONDON
7,Aldwych,Westminster,WC2,LONDON
9,Anerley,Bromley,SE20,LONDON
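
The chain of `rstrip` calls removes a trailing citation marker such as [2] one piece at a time, but it can leave a trailing space behind when the marker is preceded by one. A possible alternative, shown here only as a sketch, uses a regular expression. Both helper names are hypothetical:

```python
import re


def strip_citation_rstrip(name):
    """The approach used in the cell above: peel off ']', then digits, then '['."""
    return name.rstrip(']').rstrip('0123456789').rstrip('[')


def strip_citation(name):
    """Remove one or more trailing markers like '[2]' plus surrounding whitespace."""
    return re.sub(r'(\s*\[\d+\])+\s*$', '', name)


for raw in ['Croydon[2]', 'Bexley, Greenwich [1]', 'Bexley']:
    # repr() makes any leftover trailing space visible
    print(repr(strip_citation_rstrip(raw)), repr(strip_citation(raw)))
```

The two versions agree on names without a space before the marker; they differ only in whether a trailing blank survives.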


When a location has several postcodes, we keep only the first one. We also restrict the analysis to postcode districts starting with W, which covers both the W and WC areas:

In [96]:
df['Postcode']=df['Postcode'].str.split(',', expand=True).iloc[:,0]


df.Postcode = df.Postcode.str.strip()
df= df[df['Postcode'].str.startswith(('W'))].reset_index(drop=True)
df.head()

Unnamed: 0,Location,Borough,Postcode,Post town
0,Acton,"Ealing, Hammersmith and Fulham",W3,LONDON
1,Aldwych,Westminster,WC2,LONDON
2,Bayswater,Westminster,W2,LONDON
3,Bedford Park,Ealing,W4,LONDON
4,Bloomsbury,Camden,WC1,LONDON
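
To make the effect of these two steps concrete, here is the same logic applied to plain strings taken from the table above. This is only an illustration; the function names are assumptions and not part of the notebook:

```python
def first_postcode(districts):
    """Keep the first district of a comma-separated list, e.g. 'W3, W4' -> 'W3'."""
    return districts.split(',')[0].strip()


def in_west_area(postcode):
    """True for districts starting with 'W', which covers both W and WC."""
    return postcode.startswith('W')


samples = ['SE2', 'W3, W4', 'EC3', 'WC2', 'SE20']
kept = [first_postcode(s) for s in samples if in_west_area(first_postcode(s))]
print(kept)
```

Because the test is a simple prefix match on 'W', the WC districts are kept alongside the W ones.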


Let us check the shape:

In [97]:
df.shape

(33, 4)

To obtain the coordinates (latitude and longitude) of each location, we use the pgeocode package:

In [98]:
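# Geocoder for Great Britain postcodes
nomi = pgeocode.Nominatim('gb')

lat=[]
lon=[]
for x in df['Postcode'].tolist():
    # Look up each postcode district and store its coordinates
    nomi_db = nomi.query_postal_code(x)
    lat.append(nomi_db.latitude)
    lon.append(nomi_db.longitude)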

data = {'Latitude': lat, 'Longitude': lon}

data

{'Latitude': [51.5114,
  51.5142,
  51.5143,
  51.4927,
  51.5236,
  51.5142,
  51.5166,
  51.4927,
  51.5142,
  51.5122,
  51.5166,
  51.4927,
  51.4927,
  51.4938,
  51.5118,
  51.5236,
  51.5009,
  51.5236,
  51.5303,
  51.5303,
  51.5166,
  51.5166,
  51.523,
  51.5075,
  51.5143,
  51.505,
  51.5166,
  51.5142,
  51.5236,
  51.5136,
  51.4938,
  51.505,
  51.505],
 'Longitude': [-0.26571666666666666,
  -0.12338181818181815,
  -0.1886454545454546,
  -0.258,
  -0.12229999999999998,
  -0.12338181818181815,
  -0.09864285714285716,
  -0.258,
  -0.12338181818181815,
  -0.2851888888888889,
  -0.09864285714285716,
  -0.258,
  -0.258,
  -0.2204,
  -0.3358999999999999,
  -0.12229999999999998,
  -0.1985,
  -0.12229999999999998,
  -0.18458,
  -0.18458,
  -0.09864285714285716,
  -0.09864285714285716,
  -0.2188,
  -0.205,
  -0.1886454545454546,
  -0.2211,
  -0.09864285714285716,
  -0.12338181818181815,
  -0.12229999999999998,
  -0.311625,
  -0.21798,
  -0.2211,
  -0.2211]}

We then add them to the main dataframe:

In [99]:
coordinates = pd.DataFrame.from_dict(data)
df['Latitude'] = coordinates['Latitude']
df['Longitude'] = coordinates['Longitude']
df.head()

Unnamed: 0,Location,Borough,Postcode,Post town,Latitude,Longitude
0,Acton,"Ealing, Hammersmith and Fulham",W3,LONDON,51.5114,-0.265717
1,Aldwych,Westminster,WC2,LONDON,51.5142,-0.123382
2,Bayswater,Westminster,W2,LONDON,51.5143,-0.188645
3,Bedford Park,Ealing,W4,LONDON,51.4927,-0.258
4,Bloomsbury,Camden,WC1,LONDON,51.5236,-0.1223
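
Several locations share a postcode district, so the lookup loop above queries the same code more than once. The sketch below shows one way to query each distinct postcode only once and to flag codes without coordinates. It assumes that a failed lookup is reported as NaN; `geocode_unique`, `missing` and `fake_lookup` are hypothetical names, and `fake_lookup` merely stands in for a real geocoder so that the example runs offline:

```python
import math


def geocode_unique(postcodes, lookup):
    """Return {postcode: (lat, lon)}, calling `lookup` once per distinct postcode."""
    coords = {}
    for code in postcodes:
        if code not in coords:
            coords[code] = lookup(code)
    return coords


def missing(coords):
    """Postcodes whose latitude or longitude came back as NaN."""
    return [c for c, (lat, lon) in coords.items()
            if math.isnan(lat) or math.isnan(lon)]


def fake_lookup(code):
    """Stand-in geocoder using two values from the output above; unknown codes give NaN."""
    table = {'W3': (51.5114, -0.265717), 'WC2': (51.5142, -0.123382)}
    return table.get(code, (float('nan'), float('nan')))


coords = geocode_unique(['W3', 'WC2', 'W3', 'ZZ9'], fake_lookup)
print(coords)
print(missing(coords))
```

With pgeocode, `lookup` could be a small wrapper around `query_postal_code` that returns the `latitude` and `longitude` fields of the result as a tuple.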


We then check the final dimensions of our dataframe:

In [100]:
df.shape

(33, 6)

Finally, with Foursquare we can obtain information on the venues around each of these London locations. This allows us to suggest the most promising locations.
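
As a rough sketch of what such a request might look like, the snippet below builds a URL for Foursquare's legacy v2 venues/explore endpoint. The choice of endpoint, the credentials, the version date, the radius and the limit are all assumptions or placeholders rather than values taken from this project, and no request is actually sent:

```python
from urllib.parse import urlencode


def explore_url(lat, lon, client_id, client_secret,
                version='20180605', radius=500, limit=100):
    """Build a venues/explore URL around one coordinate (all defaults are placeholders)."""
    params = {
        'client_id': client_id,          # placeholder credential
        'client_secret': client_secret,  # placeholder credential
        'v': version,                    # API version date, assumed
        'll': f'{lat},{lon}',            # "latitude,longitude" of the location
        'radius': radius,                # search radius in metres, assumed
        'limit': limit,                  # maximum number of venues, assumed
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Coordinates of Acton (W3) from the dataframe above
url = explore_url(51.5114, -0.265717, 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
print(url)
```

Calling this once per row of the dataframe, and passing each URL to `requests.get`, would give one venue list per location.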
