# Analyzing boroughs in London for Starting a Restaurant

## Introduction
London is the capital and largest city of England and the United Kingdom. It is one of the world's most important centers of finance, commerce and education. The city is home to a diverse range of people and cultures, and more than 300 languages are spoken in the region. With an estimated population of roughly 9 million, it is the third-most populous city in Europe. This makes it one of the best cities in which to look for a location for a new restaurant. This project is aimed at business owners and entrepreneurs who are looking to invest in a restaurant. Its main objective is to analyze the relevant data and produce location recommendations for these stakeholders.

## Data Collection
The data required for this project was collected from multiple sources, which are summarized below.

### Borough geo coordinates data
The data of the boroughs in London was scraped from https://en.wikipedia.org/wiki/List_of_London_boroughs.

### Borough earnings data
Income information for each borough comes from two sources: the income of taxpayers living in the borough (https://data.london.gov.uk/dataset/average-income-tax-payers-borough) and the earnings of people working in the borough (https://data.london.gov.uk/dataset/earnings-workplace-borough).

### Geographical Coordinates
The geographical coordinates of London were obtained with the GeoPy library in Python.

### Venue Data
The venue data has been extracted using the Foursquare API. This data contains venue recommendations for all boroughs in London and is used to study the popular venues of different boroughs.

## Data usage
The venue data is used with a K-Means clustering model to group the boroughs into clusters and determine the best location to start a restaurant business. An adjustment based on the income level of the people working and living in each borough is then appended to the borough's cluster label, to clarify how attractive the location is for opening a restaurant.

## Import libraries

In [395]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import math
from sklearn.preprocessing import StandardScaler

print('Libraries imported.')

Libraries imported.


## Preparing data for analysis

In [2]:
#Creating soup object
url = 'https://en.wikipedia.org/wiki/List_of_London_boroughs'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

In [106]:
#Creating borough Dataframe
table_contents=[]
row = {}
counter = 0

table=soup.find('table')

for cell in table.findAll('td'):
    if counter > 9:
        counter = 0
        table_contents.append(row)
        row = {}
    
    if counter == 0:
        row['Borough'] = cell.text.strip()
    elif counter == 6:
        row['Area_sq_mi'] = float(cell.text.strip())
    elif counter == 8:
        row['Latitude'] = float(cell.text.split('/')[2].split(';')[0])
        row['Longitude'] = float(cell.text.split('/')[2].split(';')[1][1:7])
        
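    counter +=1

# Note: a row is appended only when the next row begins, so the final table row
# is never added to table_contents (the borough list printed by the venue query ends at Wandsworth)
df=pd.DataFrame(table_contents)
# Strip the footnote markers that Wikipedia attaches to some borough names
df['Borough']=df['Borough'].replace({'Barking and Dagenham [note 1]':'Barking and Dagenham',
                                     'Greenwich [note 2]':'Greenwich',
                                     'Hammersmith and Fulham [note 4]':'Hammersmith and Fulham'})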

display(df)



| | Borough | Area_sq_mi | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Barking and Dagenham | 13.93 | 51.5607 | 0.1557 |
| 1 | Barnet | 33.49 | 51.6252 | -0.151 |
| 2 | Bexley | 23.38 | 51.4549 | 0.1505 |
| 3 | Brent | 16.7 | 51.5588 | -0.281 |
| 4 | Bromley | 57.97 | 51.4039 | 0.0198 |
| 5 | Camden | 8.4 | 51.529 | -0.125 |
| 6 | Croydon | 33.41 | 51.3714 | -0.097 |
| 7 | Ealing | 21.44 | 51.513 | -0.308 |
| 8 | Enfield | 31.74 | 51.6538 | -0.079 |
| 9 | Greenwich | 18.28 | 51.4892 | 0.0648 |
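
The coordinate extraction above depends on fixed string positions (`split('/')[2]` and the `[1:7]` slice). A more tolerant alternative could use a regular expression, as in the minimal sketch below. It assumes the cell text ends with a decimal `latitude; longitude` pair; the sample string is an illustrative reconstruction of that format and was not copied from the page, and `parse_decimal_coords` is a hypothetical helper name.

```python
import re

# Matches a decimal "lat; lon" pair such as "51.5607; 0.1557" or "51.6252; -0.1510".
_COORD_RE = re.compile(r'(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)')

def parse_decimal_coords(cell_text):
    """Return (latitude, longitude) as floats from the last 'lat; lon' pair in the text."""
    matches = _COORD_RE.findall(cell_text)
    if not matches:
        raise ValueError('no decimal coordinate pair found in: {!r}'.format(cell_text))
    lat, lon = matches[-1]
    return float(lat), float(lon)

# Illustrative cell text (assumed format, not scraped)
sample = '51°37′31″N 0°09′06″W / 51.6252°N 0.1510°W / 51.6252; -0.1510'
print(parse_decimal_coords(sample))
```

Because the pattern looks for the numbers themselves, trailing footnote text or extra whitespace after the pair would not affect the result.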


In [326]:
# create map of London to visualize borough locations
address = 'London, England'

geolocator = Nominatim(user_agent="London_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_london = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)
    
map_london

In [295]:
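# Approximate each borough as a circle of equal area: convert sq mi to sq km
# (factor 2.58999), take r = sqrt(area / pi) in km, then convert to metres
df['Radius'] = (df['Area_sq_mi'] * 2.58999 / math.pi)**0.5*1000
display(df)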

| | Borough | Area_sq_mi | Latitude | Longitude | Radius |
|---|---|---|---|---|---|
| 0 | Barking and Dagenham | 13.93 | 51.5607 | 0.1557 | 3388.829082 |
| 1 | Barnet | 33.49 | 51.6252 | -0.151 | 5254.503444 |
| 2 | Bexley | 23.38 | 51.4549 | 0.1505 | 4390.321866 |
| 3 | Brent | 16.7 | 51.5588 | -0.281 | 3710.499205 |
| 4 | Bromley | 57.97 | 51.4039 | 0.0198 | 6913.146454 |
| 5 | Camden | 8.4 | 51.529 | -0.125 | 2631.562871 |
| 6 | Croydon | 33.41 | 51.3714 | -0.097 | 5248.223785 |
| 7 | Ealing | 21.44 | 51.513 | -0.308 | 4204.230299 |
| 8 | Enfield | 31.74 | 51.6538 | -0.079 | 5115.376082 |
| 9 | Greenwich | 18.28 | 51.4892 | 0.0648 | 3882.059638 |
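
The `Radius` column treats each borough as a circle with the same area, so that a single search radius in metres can be passed to the venue query. The sketch below restates that calculation as a standalone function; `equivalent_radius_m` and `SQ_MI_TO_SQ_KM` are names introduced here for illustration, while the conversion factor is the one used in the notebook.

```python
import math

SQ_MI_TO_SQ_KM = 2.58999  # conversion factor used in the notebook

def equivalent_radius_m(area_sq_mi):
    """Radius in metres of a circle whose area equals the given area in square miles."""
    area_sq_km = area_sq_mi * SQ_MI_TO_SQ_KM
    return math.sqrt(area_sq_km / math.pi) * 1000

# Barking and Dagenham has an area of 13.93 sq mi in the table above
print(round(equivalent_radius_m(13.93), 1))
```

Since real boroughs are not circular, a circle of this radius around the borough's coordinates will miss some of its area and cover parts of its neighbours, so the resulting venue lists are only an approximation of each borough.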


In [166]:
#Get information about venues in the boroughs
def getNearbyVenues(names, latitudes, longitudes, radius, LIMIT=500):
    CLIENT_ID = client_id
    CLIENT_SECRET = client_secret
    VERSION = '20180605'
    
    venues_list=[]
    # `rad` avoids shadowing the `radius` argument that is being iterated over
    for name, lat, lng, rad in zip(names, latitudes, longitudes, radius):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            rad, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
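
The list comprehension in `getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which raises an `IndexError` if a venue comes back with an empty category list. The sketch below shows one defensive way to flatten the items. It only assumes the fields the notebook already reads; `flatten_explore_items` is a hypothetical helper, and the sample items are hand-made and are not real API output.

```python
def flatten_explore_items(borough, lat, lng, items):
    """Turn a list of 'explore' items into flat rows, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None instead of raising IndexError when no category is listed
        category = categories[0].get('name') if categories else None
        rows.append((borough, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     category))
    return rows

# Hand-made items mimicking the fields the notebook reads (not real API output)
fake_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 51.56, 'lng': 0.15},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 51.57, 'lng': 0.16},
               'categories': []}},
]
rows = flatten_explore_items('Barking and Dagenham', 51.5607, 0.1557, fake_items)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept explicitly before the categories are one-hot encoded.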

In [136]:
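import os

# Foursquare credentials are read from the environment instead of being hard-coded in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumed convention)
client_id = os.environ['FOURSQUARE_CLIENT_ID']
client_secret = os.environ['FOURSQUARE_CLIENT_SECRET']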

london_venues = getNearbyVenues(names=df['Borough'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=df['Radius'])
london_venues.shape

Barking and Dagenham
Barnet
Bexley
Brent
Bromley
Camden
Croydon
Ealing
Enfield
Greenwich
Hackney
Hammersmith and Fulham
Haringey
Harrow
Havering
Hillingdon
Hounslow
Islington
Kensington and Chelsea
Kingston upon Thames
Lambeth
Lewisham
Merton
Newham
Redbridge
Richmond upon Thames
Southwark
Sutton
Tower Hamlets
Waltham Forest
Wandsworth


(3084, 7)

## Explore and cluster the boroughs (venue data)

In [389]:
#Count the restaurants found in each borough
df_restaurant = london_venues[london_venues['Venue Category'].str.contains('Restaurant')]
rest_count = df_restaurant[['Borough', 'Venue']].groupby('Borough').count().sort_values('Venue', ascending=False)

#Use the quartiles as bounds to exclude the boroughs with the most and the fewest restaurants
high_border = np.percentile(rest_count['Venue'], 75)
low_border = np.percentile(rest_count['Venue'], 25)

#List of Boroughs to explore
interesting_boroughs = rest_count[(rest_count['Venue']<high_border)&(rest_count['Venue']>low_border)].index

#Keep only the boroughs of interest
df_explore = df[df['Borough'].isin(interesting_boroughs)]
london_venues_explore = london_venues[london_venues['Borough'].isin(interesting_boroughs)]
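
The filter above keeps only the boroughs whose restaurant count lies strictly between the first and third quartile. The sketch below shows the same rule using only the standard library; the counts are made up for illustration and `middle_band` is a hypothetical name. It requires Python 3.8 or later for `statistics.quantiles`.

```python
from statistics import quantiles

def middle_band(counts):
    """Return keys whose count lies strictly between the first and third quartile."""
    # method='inclusive' interpolates linearly between data points,
    # which is the same rule np.percentile applies by default
    q1, _, q3 = quantiles(counts.values(), n=4, method='inclusive')
    return [name for name, c in counts.items() if q1 < c < q3]

# Made-up restaurant counts for illustration only
toy_counts = {'A': 5, 'B': 10, 'C': 15, 'D': 20, 'E': 25}
print(middle_band(toy_counts))
```

Because both comparisons are strict, a borough whose count equals one of the quartile values is excluded as well.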

In [390]:
#One-hot encode the venue categories and compute the relative frequency of each category per borough
london_processing = pd.get_dummies(london_venues_explore[['Venue Category']], prefix="", prefix_sep="")
london_processing['Borough'] = london_venues_explore['Borough'] 

fixed_columns = [london_processing.columns[-1]] + list(london_processing.columns[:-1])
london_processing = london_processing[fixed_columns]

london_grouped = london_processing.groupby('Borough').mean().reset_index()
london_grouped.shape

(13, 213)
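
One-hot encoding the categories and then taking the mean per borough yields, for every borough, the share of its venues that fall into each category. The sketch below computes the same quantity with plain Python on made-up data; `category_frequencies` is a hypothetical helper and the toy pairs are not taken from the Foursquare results.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map each borough to {category: share of that borough's venues}."""
    counts = defaultdict(Counter)
    for borough, category in venues:
        counts[borough][category] += 1
    freqs = {}
    for borough, counter in counts.items():
        total = sum(counter.values())
        freqs[borough] = {cat: n / total for cat, n in counter.items()}
    return freqs

# Made-up (borough, category) pairs for illustration only
toy = [('X', 'Pub'), ('X', 'Pub'), ('X', 'Park'), ('X', 'Café'),
       ('Y', 'Café'), ('Y', 'Park')]
freqs = category_frequencies(toy)
print(freqs['X'])
```

The only difference from the DataFrame version is that categories absent from a borough are simply missing from its dictionary, whereas `london_grouped` stores them as zeros so that every borough has the same columns for clustering.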

In [391]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [392]:
#Explore top 10 venue categories for each borough
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough'] = london_grouped['Borough']

for ind in np.arange(london_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

display(boroughs_venues_sorted)

| | Borough | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Barking and Dagenham | Grocery Store | Supermarket | Coffee Shop | Park | Pub | Furniture / Home Store | Fast Food Restaurant | Hotel | Metro Station | Italian Restaurant |
| 1 | Barnet | Café | Turkish Restaurant | Coffee Shop | Park | Pub | Bakery | Grocery Store | Greek Restaurant | Bar | Garden Center |
| 2 | Brent | Indian Restaurant | Coffee Shop | Clothing Store | Hotel | Gym / Fitness Center | Park | Hookah Bar | Sandwich Place | Café | Pub |
| 3 | Bromley | Pub | Park | Coffee Shop | Gym / Fitness Center | Pizza Place | Italian Restaurant | Gastropub | Indian Restaurant | Supermarket | Bar |
| 4 | Enfield | Pub | Coffee Shop | Park | Turkish Restaurant | Café | Gym / Fitness Center | Garden Center | Greek Restaurant | Supermarket | Bakery |
| 5 | Hillingdon | Pub | Indian Restaurant | Coffee Shop | Supermarket | Gym / Fitness Center | Hotel | Park | Thai Restaurant | Sandwich Place | Burger Joint |
| 6 | Islington | Pub | Coffee Shop | Café | Park | Bakery | Gastropub | French Restaurant | Theater | Fish Market | Mediterranean Restaurant |
| 7 | Kingston upon Thames | Pub | Café | Park | Coffee Shop | Garden | Gym / Fitness Center | Gastropub | Thai Restaurant | Italian Restaurant | Hotel |
| 8 | Lambeth | Pub | Coffee Shop | Park | Café | Brewery | Market | Gastropub | Pizza Place | Beer Bar | Farmers Market |
| 9 | Lewisham | Pub | Park | Coffee Shop | Gastropub | Italian Restaurant | Café | Indian Restaurant | Gym / Fitness Center | Farmers Market | Bakery |


In [393]:
#Cluster processing for boroughs based on venue data
kclusters = 3

london_grouped_clustering = london_grouped.drop(columns='Borough')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

boroughs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = df_explore

london_merged = london_merged.merge(boroughs_venues_sorted.set_index('Borough'), how='left',on='Borough')

display(london_merged)

| | Borough | Area_sq_mi | Latitude | Longitude | Radius | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Barking and Dagenham | 13.93 | 51.5607 | 0.1557 | 3388.829082 | 2 | Grocery Store | Supermarket | Coffee Shop | Park | Pub | Furniture / Home Store | Fast Food Restaurant | Hotel | Metro Station | Italian Restaurant |
| 1 | Barnet | 33.49 | 51.6252 | -0.151 | 5254.503444 | 0 | Café | Turkish Restaurant | Coffee Shop | Park | Pub | Bakery | Grocery Store | Greek Restaurant | Bar | Garden Center |
| 2 | Brent | 16.7 | 51.5588 | -0.281 | 3710.499205 | 2 | Indian Restaurant | Coffee Shop | Clothing Store | Hotel | Gym / Fitness Center | Park | Hookah Bar | Sandwich Place | Café | Pub |
| 3 | Bromley | 57.97 | 51.4039 | 0.0198 | 6913.146454 | 1 | Pub | Park | Coffee Shop | Gym / Fitness Center | Pizza Place | Italian Restaurant | Gastropub | Indian Restaurant | Supermarket | Bar |
| 4 | Enfield | 31.74 | 51.6538 | -0.079 | 5115.376082 | 0 | Pub | Coffee Shop | Park | Turkish Restaurant | Café | Gym / Fitness Center | Garden Center | Greek Restaurant | Supermarket | Bakery |
| 5 | Hillingdon | 44.67 | 51.5441 | -0.476 | 6068.510162 | 2 | Pub | Indian Restaurant | Coffee Shop | Supermarket | Gym / Fitness Center | Hotel | Park | Thai Restaurant | Sandwich Place | Burger Joint |
| 6 | Islington | 5.74 | 51.5416 | -0.102 | 2175.354565 | 1 | Pub | Coffee Shop | Café | Park | Bakery | Gastropub | French Restaurant | Theater | Fish Market | Mediterranean Restaurant |
| 7 | Kingston upon Thames | 14.38 | 51.4085 | -0.306 | 3443.13103 | 1 | Pub | Café | Park | Coffee Shop | Garden | Gym / Fitness Center | Gastropub | Thai Restaurant | Italian Restaurant | Hotel |
| 8 | Lambeth | 10.36 | 51.4607 | -0.116 | 2922.496401 | 1 | Pub | Coffee Shop | Park | Café | Brewery | Market | Gastropub | Pizza Place | Beer Bar | Farmers Market |
| 9 | Lewisham | 13.57 | 51.4452 | -0.02 | 3344.75284 | 1 | Pub | Park | Coffee Shop | Gastropub | Italian Restaurant | Café | Indian Restaurant | Gym / Fitness Center | Farmers Market | Bakery |


In [396]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Borough'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
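
The map only needs one distinct color per cluster label. As an alternative to the matplotlib colormap used above, the sketch below builds such a palette with the standard library alone; it is a swapped-in approach and does not reproduce the `rainbow` colors, and `cluster_palette` is a hypothetical name.

```python
import colorsys

def cluster_palette(n):
    """Return n hex colors with evenly spaced hues, one per cluster index."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(3)
print(palette)
```

A marker for cluster `k` would then use `palette[k]` directly, which keeps the relation between label and color easy to read.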

## Prepare data and cluster the boroughs (income data)

In [416]:
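#Creating income Dataframe
income_data = pd.read_csv('income_by_borough.csv', sep=';')
# .copy() so the cluster labels can later be inserted without modifying a slice of income_data
income_data_explore = income_data[income_data['Borough'].isin(interesting_boroughs)].copy()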

income_data_explore

| | Borough | Tax_payers | Workplace |
|---|---|---|---|
| 0 | Barking and Dagenham | 23900 | 28553 |
| 1 | Barnet | 28700 | 32143 |
| 3 | Brent | 24700 | 30134 |
| 4 | Bromley | 32000 | 29819 |
| 8 | Enfield | 26300 | 29134 |
| 15 | Hillingdon | 27100 | 33596 |
| 17 | Islington | 33400 | 39348 |
| 19 | Kingston upon Thames | 32400 | 31308 |
| 20 | Lambeth | 29900 | 35036 |
| 21 | Lewisham | 27300 | 33294 |


In [417]:
#Cluster processing for boroughs based on income data
income_clusters = 3

income_data_clustering = income_data_explore.drop(columns='Borough')

kmeans = KMeans(n_clusters=income_clusters, random_state=0).fit(income_data_clustering)

income_data_explore.insert(3, 'Income Cluster Labels', kmeans.labels_)

display(income_data_explore)

| | Borough | Tax_payers | Workplace | Income Cluster Labels |
|---|---|---|---|---|
| 0 | Barking and Dagenham | 23900 | 28553 | 0 |
| 1 | Barnet | 28700 | 32143 | 1 |
| 3 | Brent | 24700 | 30134 | 0 |
| 4 | Bromley | 32000 | 29819 | 1 |
| 8 | Enfield | 26300 | 29134 | 0 |
| 15 | Hillingdon | 27100 | 33596 | 1 |
| 17 | Islington | 33400 | 39348 | 1 |
| 19 | Kingston upon Thames | 32400 | 31308 | 1 |
| 20 | Lambeth | 29900 | 35036 | 1 |
| 21 | Lewisham | 27300 | 33294 | 1 |


In [418]:
#Creating final dataframe
borough_clustered = london_merged.merge(income_data_explore.set_index('Borough')['Income Cluster Labels'], how='left',on='Borough')
borough_clustered = borough_clustered[['Borough','Latitude', 'Longitude','Cluster Labels', 'Income Cluster Labels']]

borough_clustered['Cluster Labels'] = borough_clustered['Cluster Labels'].astype('str').replace({'0':'C', '1':'B', '2':'A',})
borough_clustered['Income Cluster Labels'] = borough_clustered['Income Cluster Labels'].astype('str').replace({'0':'-', '1':'', '2':'+',})

borough_clustered['Final cluster'] = borough_clustered['Cluster Labels'].astype('str') + borough_clustered['Income Cluster Labels']
borough_clustered['Final cluster'].unique()

array(['A-', 'C', 'B', 'C-', 'A', 'A+', 'B-'], dtype=object)
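
K-Means assigns cluster numbers arbitrarily, so a fixed mapping such as `'0' -> '-'` is only meaningful for the particular fitted model. One way to make the income suffix independent of the numbering is to rank the clusters by their mean income first, as in the sketch below. The labels and incomes are made up for illustration, `ordered_income_symbols` is a hypothetical helper, and which income figure to rank by (for example the mean of `Tax_payers` and `Workplace`) is an assumption left to the reader.

```python
def ordered_income_symbols(labels, incomes, symbols=('-', '', '+')):
    """Map arbitrary cluster labels to symbols ordered by mean income per cluster."""
    totals, sizes = {}, {}
    for label, income in zip(labels, incomes):
        totals[label] = totals.get(label, 0) + income
        sizes[label] = sizes.get(label, 0) + 1
    # Sort cluster labels from lowest to highest mean income
    ranked = sorted(totals, key=lambda lab: totals[lab] / sizes[lab])
    if len(ranked) > len(symbols):
        raise ValueError('more clusters than symbols')
    return {label: symbols[rank] for rank, label in enumerate(ranked)}

# Made-up labels and incomes for illustration only
toy_labels = [2, 0, 2, 1, 0]
toy_incomes = [24000, 40000, 25000, 31000, 42000]
mapping = ordered_income_symbols(toy_labels, toy_incomes)
print(mapping)
print(['B' + mapping[lab] for lab in toy_labels])
```

With such a mapping the suffix `-` always marks the lowest-income cluster and `+` the highest, whatever numbers the clustering run happens to assign.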

In [420]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one color per possible final label (A+, A, A-, ..., C-)
colors_dict = dict(zip([letter+symb for letter in ['A', 'B', 'C'] for symb in ['+', '', '-']], range(9)))
colors_array = cm.rainbow(np.linspace(0, 1, len(colors_dict)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(borough_clustered['Latitude'], borough_clustered['Longitude'], borough_clustered['Borough'], borough_clustered['Final cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[colors_dict[cluster]],
        fill=True,
        fill_color=rainbow[colors_dict[cluster]],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters