<h1 style="text-align:center">Applied Data Science Capstone Project</h1>

### Module 4 Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods

You have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. 

### Review criteria
Part 1:
1. A description of the problem and a discussion of the background. (15 marks)
2. A description of the data and how it will be used to solve the problem. (15 marks)

Part 2:
1. A link to your Notebook on your Github repository, showing your code. (15 marks)
2. A full report consisting of all of the following components (15 marks):
- Introduction where you discuss the business problem and who would be interested in this project.
- Data where you describe the data that will be used to solve the problem and the source of the data.
- Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.
- Results section where you discuss the results.
- Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
- Conclusion section where you conclude the report.
3. Your choice of a presentation or blogpost.

## Part 1

### 1. Problem and background description 

I live in the UK, and once the national lockdown restrictions are lifted, I would like to book a weekend getaway in London to enjoy city life and live as a tourist for a couple of days.

The question now is where I should book my stay. We will answer it based on location only; rate per night, ratings and other factors will not influence our decision in this project.  
The aim of this project is to find the best neighborhood in London from a tourist's point of view, judged by how many restaurants, cafes, pubs and other attractions are located in the area.

This analysis may also help other tourists discover interesting areas in London, as well as current residents who have not yet had the chance to explore their surroundings and would like to know more about their own or other neighborhoods.

### 2. What data we need and how it will be used to answer our question

For this analysis, we require geolocation data for the city of London. The starting point is the postal codes, from which we can explore neighborhoods and the venues in these areas, as well as additional tourist attractions.

For the neighborhoods and postal codes of London, we will scrape the <a href='https://en.wikipedia.org/wiki/List_of_areas_of_London'>Wikipedia page </a> and we will select information as follows: 
* London borough = borough 
* Post town = town in the borough 
* Postcode district = postcode

There have been 32 London boroughs since 1 April 1965:
* 12 are designated Inner London boroughs
* the remaining 20 are designated Outer London boroughs

For more details about the boroughs of London, such as local authority or headquarters, visit this <a href='https://en.wikipedia.org/wiki/List_of_London_boroughs'>link.</a>

We will work only with areas whose post town is LONDON.

Because Wikipedia lacks information about the latitude and longitude of each area in discussion, we'll use ArcGIS API to get the geo locations of the neighbourhoods of London.

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups.

The following columns are added to our initial dataset:

* latitude : Latitude for Neighbourhood
* longitude : Longitude for Neighbourhood

For additional information about different venues in our neighborhoods, we will use Foursquare.

Foursquare is a location data provider that updates its data continuously, which helps keep it current.

The information retrieved within an area of interest includes venue names, locations, menus and even photos. As such, the Foursquare location platform will be used as the sole source of venue data, since all the required venue information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.

The data retrieved from Foursquare contains venues within a specified distance of the latitude and longitude of each postcode. The information obtained per venue is as follows:

* Neighbourhood : Name of the Neighbourhood
* Neighbourhood Latitude : Latitude of the Neighbourhood
* Neighbourhood Longitude : Longitude of the Neighbourhood
* Venue : Name of the Venue
* Venue Latitude : Latitude of Venue
* Venue Longitude : Longitude of Venue
* Venue Category : Category of Venue

Based on all the information collected for London, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings.

## Part 2

#### Installing required libraries for beginning of analysis 

In [1]:
#installing beautiful soup (pandas & numpy are assumed to be available already)
!pip install beautifulsoup4

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

from bs4 import BeautifulSoup

print('Libraries imported')

Libraries imported


#### Scraping the web page

In [2]:
# scraping the wikipedia page for raw data

url = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
london_table = pd.read_html(url,header=0,flavor='html5lib')[1]
print(london_table.head())

      Location                     London borough       Post town  \
0   Abbey Wood              Bexley, Greenwich [7]          LONDON   
1        Acton  Ealing, Hammersmith and Fulham[8]          LONDON   
2    Addington                         Croydon[8]         CROYDON   
3   Addiscombe                         Croydon[8]         CROYDON   
4  Albany Park                             Bexley  BEXLEY, SIDCUP   

  Postcode district Dial code OS grid ref  
0               SE2       020    TQ465785  
1            W3, W4       020    TQ205805  
2               CR0       020    TQ375645  
3               CR0       020    TQ345665  
4         DA5, DA14       020    TQ478728  


#### Data pre-processing

In [3]:
# replace the spaces with underscores in the column names
london_table.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
london_table.head()

Unnamed: 0,Location,London borough,Post_town,Postcode district,Dial code,OS_grid_ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


#### Feature selection

In [5]:
# we need only the borough, town & postal codes
df = london_table.drop( [ london_table.columns[0], london_table.columns[4], london_table.columns[5] ], axis=1)
df.head()

Unnamed: 0,London borough,Post_town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


In [6]:
df = df.rename(columns={'London borough': 'Borough'})
df.head()

Unnamed: 0,London borough,Post_town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


In [7]:
# rename the postcode district column and the london borough to something simpler
df.columns = ['borough','town','post_code']
df

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4


In [8]:
#remove the Square brackets [ ] and numbers from the borough column
df['borough'] = df['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4
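
The chained `rstrip` calls above leave a trailing space whenever the footnote marker is preceded by one (for example `"Bexley, Greenwich [7]"`), which may be why that borough name later shows up twice. A minimal sketch of an alternative, assuming a regular expression is acceptable here; the sample values are copied from the table above and the variable names are illustrative only:

```python
import pandas as pd

# Sample values copied from the borough column shown above
boroughs = pd.Series([
    "Bexley, Greenwich [7]",
    "Ealing, Hammersmith and Fulham[8]",
    "Croydon[8]",
    "Bexley",
])

# What the chained rstrip approach produces for the first value
chained = "Bexley, Greenwich [7]".rstrip(']').rstrip('0123456789').rstrip('[')

# Drop any "[n]" footnote marker, then trim whitespace left behind
cleaned = boroughs.str.replace(r"\s*\[\d+\]", "", regex=True).str.strip()

print(repr(chained))
print(cleaned.tolist())
```

With this variant, names that differ only by trailing whitespace collapse into a single borough label.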


#### Feature Engineering

The dataset covers every post town in the London area. We narrow it down by keeping only the neighbourhoods whose post town includes 'LONDON'.


In [9]:
# select only London town
df = df[df['town'].str.contains('LONDON')]
df.head()

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20


In [10]:
#check how many records we are working with
df.shape

(308, 3)

#### Location data

Installing additional tools in order to get the geographical location data.

In [11]:
# installing ArcGis

!pip install arcgis

from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

print("library installed")

library installed


**To plot our map, we need the geographical coordinates of the neighbourhoods, which we obtain with the ArcGIS package. ArcGIS does not appear to limit the number of geocoding calls for this kind of use, so it fits our use case well.**

In [12]:
#defining London arcgis geocode function to return latitude and longitude
def get_x_y_uk(address1):
    # take the first (best) geocoding match for the postcode
    g = geocode(address='{}, London, England, GBR'.format(address1))[0]
    lng_coords = g['location']['x']
    lat_coords = g['location']['y']
    return str(lat_coords) + "," + str(lng_coords)

#Checking sample data
c = get_x_y_uk('SE2')
c

'51.492450000000076,0.12127000000003818'

Looks good. We will copy over the postal codes of London to pass it into the geolocator function that we just defined above.


In [13]:
geo_coordinates_uk = df['post_code']    
geo_coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
521    IG8, E18
522         IG8
525         N12
526        SE18
528         W12
Name: post_code, Length: 308, dtype: object

Passing the postal codes of London to get the geographical coordinates.


In [14]:
coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
coordinates_latlng_uk

0       51.492450000000076,0.12127000000003818
1        51.51324000000005,-0.2674599999999714
6       51.51200000000006,-0.08057999999994081
7       51.51651000000004,-0.11967999999995982
9       51.41009000000008,-0.05682999999993399
                        ...                   
521    51.589770000000044,0.030520000000024083
522      51.50642000000005,-0.1272099999999341
525     51.615920000000074,-0.1767399999999384
526      51.48207000000008,0.07143000000002075
528      51.50645000000003,-0.2369099999999662
Name: post_code, Length: 308, dtype: object

**Latitude coordinates**

In [15]:
#extracting the latitude from our previously collected coordinates
lat_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[0])
lat_uk

0      51.492450000000076
1       51.51324000000005
6       51.51200000000006
7       51.51651000000004
9       51.41009000000008
              ...        
521    51.589770000000044
522     51.50642000000005
525    51.615920000000074
526     51.48207000000008
528     51.50645000000003
Name: post_code, Length: 308, dtype: object

**Longitude coordinates**

In [16]:
#extracting the longitude from our previously collected coordinates
lng_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[1])
lng_uk

0       0.12127000000003818
1       -0.2674599999999714
6      -0.08057999999994081
7      -0.11967999999995982
9      -0.05682999999993399
               ...         
521    0.030520000000024083
522     -0.1272099999999341
525     -0.1767399999999384
526     0.07143000000002075
528     -0.2369099999999662
Name: post_code, Length: 308, dtype: object
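
The latitude and longitude are extracted above with two separate `split` passes over the same strings. As a hedged alternative, the sketch below splits once and converts both parts to floats in one step; the sample strings are copied from the output above and the names `coords` and `latlng` are illustrative only:

```python
import pandas as pd

# Sample "lat,lng" strings in the same format get_x_y_uk returns
coords = pd.Series(
    ["51.492450000000076,0.12127000000003818",
     "51.51324000000005,-0.2674599999999714"],
    index=[0, 1],
)

# Split once, convert both parts to float, and name the columns
latlng = coords.str.split(",", expand=True).astype(float)
latlng.columns = ["latitude", "longitude"]

print(latlng)
```

Because the result keeps the original index, it can be concatenated with the source frame in the same way as the separate series.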

We have now gathered the geographical coordinates of the London neighbourhoods. Next, we merge our source data with these coordinates to prepare the dataset for the next stage.

In [17]:
london_merged = pd.concat([df,lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['borough','town','post_code','latitude','longitude']
london_merged

Unnamed: 0,borough,town,post_code,latitude,longitude
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746
6,City,LONDON,EC3,51.51200,-0.08058
7,Westminster,LONDON,WC2,51.51651,-0.11968
9,Bromley,LONDON,SE20,51.41009,-0.05683
...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721
525,Barnet,LONDON,N12,51.61592,-0.17674
526,Greenwich,LONDON,SE18,51.48207,0.07143


In [18]:
#check data type
london_merged.dtypes

borough       object
town          object
post_code     object
latitude     float64
longitude    float64
dtype: object

Getting the geocode for London to help visualize it on the map.

In [19]:
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']
print("Latitude:", london_lat_coords, "and longitude:",london_lng_coords)

Latitude: 51.50642000000005 and longitude: -0.1272099999999341
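
These centre coordinates are identical to the ones returned earlier for row 522 (postcode IG8), which suggests the geocoder may have fallen back to the city centre for that row. A small sketch of how such rows could be flagged, assuming a tolerance of `1e-5` degrees; the sample rows are copied from the merged table and the names `sample`, `tol` and `suspect` are illustrative only:

```python
import pandas as pd

# Rows copied from the merged table above
sample = pd.DataFrame(
    {
        "post_code": ["SE2", "IG8", "N12"],
        "latitude": [51.49245, 51.50642, 51.61592],
        "longitude": [0.12127, -0.12721, -0.17674],
    },
    index=[0, 522, 525],
)

# Centre of London as returned by the geocoder (rounded)
centre_lat, centre_lng = 51.50642, -0.12721

# Flag rows whose coordinates coincide with the city centre (assumed tolerance)
tol = 1e-5
is_fallback = (
    (sample["latitude"] - centre_lat).abs().lt(tol)
    & (sample["longitude"] - centre_lng).abs().lt(tol)
)
suspect = sample[is_fallback]

print(suspect)
```

Flagged rows could then be re-geocoded with a more specific address or excluded before querying venues.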


#### Visualize the Map of London

To visualize the map of London and its boroughs, we use the folium package.  
Folium is a visualization library for interactive maps. Feel free to zoom into the map below and click on each circle marker to reveal the name of the borough.

In [20]:
#installing additional libraries 
!conda install -c conda-forge folium=0.5.0 --yes  # skip this line if folium is already installed
import folium # map rendering library
print("Folium installed")

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Folium installed


Creating the map of London

In [21]:
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)
map_London

In [22]:
# adding markers to map
for latitude, longitude, borough, town in zip(london_merged['latitude'], london_merged['longitude'], london_merged['borough'], london_merged['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London)  

In [23]:
map_London #zoom in enough to click on the markers and view the name of the neighborhood

Now that we have visualized the boroughs, we need to find out which areas have the most venues and venue categories within a 500 m radius.

This is where Foursquare comes into play. Using Foursquare, we define a function that collects information for each neighbourhood: its name, its geo-coordinates, and the venues and venue categories nearby.

In [24]:
#define Foursquare API credentials (placeholders: use your own and never publish real keys)
CLIENT_ID = 'your-foursquare-client-id'
CLIENT_SECRET = 'your-foursquare-client-secret'
VERSION = '20210310' # Foursquare API version

In [25]:
radius = 500
LIMIT = 100

lng = london_lng_coords
lat = london_lat_coords

Defining a function to get the nearby venues in the neighbourhood in order to get venue categories.

In [26]:
#function to collect nearby venues from Foursquare for each neighbourhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
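
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a failed request or a venue without a category would raise an exception partway through the loop. The sketch below shows one defensive way to flatten a response. It assumes only the JSON structure already indexed above; `extract_venues`, the `"Uncategorized"` fallback label and the hand-written payloads are hypothetical and not part of the Foursquare API:

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response into (name, lat, lng, venue, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories", [])
        # Hypothetical fallback label for venues that carry no category
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue.get("name"), category))
    return rows


# Hand-written payload mimicking the structure indexed in getNearbyVenues
fake_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Lesnes Abbey",
                               "categories": [{"name": "Historic Site"}]}},
                    {"venue": {"name": "Unnamed kiosk", "categories": []}},
                ]
            }
        ]
    }
}

rows = extract_venues("Bexley, Greenwich", 51.49245, 0.12127, fake_payload)
# A payload without a "response" key yields no rows instead of an exception
empty = extract_venues("Nowhere", 0.0, 0.0, {"meta": {"code": 429}})

print(rows)
print(empty)
```

Returning an empty list for a bad response lets the loop continue with the remaining neighbourhoods.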

In [27]:
venues_in_London = getNearbyVenues(london_merged['borough'], london_merged['latitude'], london_merged['longitude'])

Bexley, Greenwich 
Ealing, Hammersmith and Fulham
City
Westminster
Bromley
Islington
Islington
Barnet
Enfield
Wandsworth
Southwark
City
Richmond upon Thames
Barnet
Islington
Wandsworth
Westminster
Bromley
Newham
Ealing
Westminster
Lewisham
Camden
Southwark
Tower Hamlets
Bexley
City
Lewisham
Greenwich
Tower Hamlets
Camden
Haringey
Tower Hamlets
Haringey
Barnet
Brent
Lambeth
Lewisham
Tower Hamlets
Kensington and Chelsea, Hammersmith and Fulham
Brent
Barnet
Barnet
Southwark
Tower Hamlets
Camden
Tower Hamlets
Waltham Forest
Newham
Islington
Richmond upon Thames
Lewisham
Camden
Westminster
Greenwich
Kensington and Chelsea
Barnet
Westminster
Lewisham
Waltham Forest
Hounslow, Ealing, Hammersmith and Fulham
Brent
Barnet
Lambeth, Wandsworth
Islington
Barnet
Merton
Barnet
Westminster
Barnet, Brent, Camden
Lewisham
Bexley
Haringey
Bromley
Tower Hamlets
Newham
Hackney
Islington
Southwark
Lewisham
Brent
Southwark
Ealing
Kensington and Chelsea
Wandsworth
Southwark
Barnet
Newham
Richmond upon Thames


In [28]:
# import additional libraries 
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe

In [29]:
venues_in_London.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.49245,0.12127,Lesnes Abbey,Historic Site
1,"Bexley, Greenwich",51.49245,0.12127,Sainsbury's,Supermarket
2,"Bexley, Greenwich",51.49245,0.12127,Lidl,Supermarket
3,"Bexley, Greenwich",51.49245,0.12127,Abbey Wood Railway Station (ABW),Train Station
4,"Bexley, Greenwich",51.49245,0.12127,Bean @ Work,Coffee Shop


In [30]:
venues_in_London.shape

(10349, 5)

We have retrieved 10349 venue records. This should make the clustering interesting.

#### Grouping by Venue Categories

We now need to see how many venue categories there are before further processing.

In [31]:
venues_in_London.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,Westminster,51.51656,-0.14770,Balenciaga
Adult Boutique,Islington,51.52969,-0.08697,Sh! Women's Erotic Emporium
African Restaurant,Westminster,51.52587,-0.08808,Red Sea Restaurant
American Restaurant,Waltham Forest,51.61780,0.02795,Spielburger
Antique Shop,Westminster,51.51651,-0.11968,The London Silver Vaults
...,...,...,...,...
Wings Joint,Hammersmith and Fulham,51.54187,-0.19795,Wingmans
Women's Store,Westminster,51.55457,0.00278,Vivien of Holloway
Xinjiang Restaurant,Southwark,51.47480,-0.09313,Silk Road
Yoga Studio,Westminster,51.55457,-0.03558,yogahaven


The grouped result has 301 rows, one per venue category, which shows how diverse and interesting London is.
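
Grouping with `max()` works for reading off the row count, but the number of categories can also be obtained directly. A minimal sketch on a toy table, assuming the same `Venue Category` column name; the rows are taken from the sample output above and the variable names are illustrative only:

```python
import pandas as pd

# Toy venue table with the same 'Venue Category' column as venues_in_London
venues = pd.DataFrame({
    "Venue": ["Lesnes Abbey", "Sainsbury's", "Lidl", "Bean @ Work"],
    "Venue Category": ["Historic Site", "Supermarket", "Supermarket", "Coffee Shop"],
})

# Number of distinct categories
n_categories = venues["Venue Category"].nunique()

# How many venues fall into each category, most frequent first
counts = venues["Venue Category"].value_counts()

print(n_categories)
print(counts)
```

`value_counts()` additionally shows which categories dominate, which is useful context before clustering.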

#### One Hot Encoding

K-means needs numeric input, so we one-hot encode the venue categories before clustering.


In [32]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")
London_venue_cat

Unnamed: 0,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10344,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10345,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
#adding neighborhood
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

# moving neighborhood column to the first column
fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

London_venue_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Venue categories mean value

Group the neighborhoods and calculate the mean venue categories value in each neighborhood.

In [34]:
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Barnet,0.0,0.0,0.0,0.001764,0.0,0.0,0.0,0.007055,0.0,...,0.001764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
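
To make the meaning of these mean values concrete, here is a small worked sketch on toy data (the neighbourhood names `A` and `B` and their venues are made up for illustration). After one-hot encoding, the mean of each category column within a neighbourhood is the share of that neighbourhood's venues belonging to the category:

```python
import pandas as pd

# Toy data: four venues in neighbourhood A, two in neighbourhood B
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pub", "Pub", "Hotel", "Park", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean per neighbourhood = relative frequency of each category
freq = onehot.groupby("Neighbourhood").mean()

print(freq)
```

Each row sums to 1, so neighbourhoods are compared by their mix of venue types rather than by their raw venue counts.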


In [35]:
#make a function to get the top most common venue categories
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

There are far too many venue categories to inspect at once, so we keep the 10 most common ones to describe each neighborhood.

In [36]:
#Building the column labels for the top venues
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd all take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))


Top venue categories

Getting the top venue categories in London

In [37]:
# create a new dataframe for London
neighborhoods_venues_sorted_london = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london['Neighbourhood'] = London_grouped['Neighbourhood']

for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_london.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnet,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Italian Restaurant,Supermarket,Pharmacy,Turkish Restaurant,Gym / Fitness Center
1,"Barnet, Brent, Camden",Gym / Fitness Center,Hardware Store,Supermarket,Clothing Store,Zoo Exhibit,Filipino Restaurant,Event Space,Exhibit,Falafel Restaurant,Farmers Market
2,Bexley,Supermarket,Historic Site,Train Station,Platform,Coffee Shop,Park,Golf Course,Construction & Landscaping,Bus Stop,Fishing Store
3,"Bexley, Greenwich",Bus Stop,Sports Club,Home Service,Massage Studio,Golf Course,Historic Site,Park,Construction & Landscaping,Event Space,Food & Drink Shop
4,"Bexley, Greenwich",Supermarket,Coffee Shop,Platform,Train Station,Historic Site,Event Space,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant
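
Sorting the whole row always returns `num_top_venues` names, even for neighbourhoods with fewer distinct categories, so some lower-ranked entries above may be zero-frequency categories that only fill the remaining slots. A hedged sketch of a variant that keeps only categories actually present; `top_categories` and the toy frequencies are illustrative, not part of the notebook:

```python
import pandas as pd

# Toy frequency table: neighbourhood B has a single category
freq = pd.DataFrame(
    {"Hotel": [0.3, 0.0], "Park": [0.2, 1.0], "Pub": [0.5, 0.0]},
    index=["A", "B"],
)


def top_categories(row, n):
    """Return up to n category names with non-zero frequency, highest first."""
    present = row[row > 0]
    return present.sort_values(ascending=False).index.tolist()[:n]


top_a = top_categories(freq.loc["A"], 2)
top_b = top_categories(freq.loc["B"], 3)

print(top_a)
print(top_b)
```

The returned list can be shorter than `n`, so it would need padding (for example with `None`) before being written into fixed-width columns.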


#### Model Building: K Means

We'll group the London neighbourhoods into 10 clusters using the K-means clustering technique, which keeps the analysis manageable.

In [38]:
#importing additional libraries
from sklearn.cluster import KMeans
print("Kmeans imported")

Kmeans imported


In [39]:
# set number of clusters
k_num_clusters = 10

London_grouped_clustering = London_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)
kmeans_london

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

In [40]:
#Labelling Clustered Data
kmeans_london.labels_

array([1, 4, 5, 3, 5, 1, 1, 2, 1, 1, 8, 8, 1, 1, 8, 1, 6, 1, 1, 9, 8, 8,
       8, 8, 8, 7, 8, 8, 1, 8, 1, 1, 1, 1, 8, 8, 1, 1, 1, 0, 1, 1, 8, 1,
       8, 1, 1, 1, 1, 1], dtype=int32)
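
Before interpreting the clusters, it helps to see how many neighbourhoods landed in each one. The sketch below counts the labels printed above (copied verbatim) and reports them with the same `+1` offset used for the `Cluster Labels` column; the variable names are illustrative only:

```python
import numpy as np

# Labels copied from the kmeans_london.labels_ output above
labels = np.array([1, 4, 5, 3, 5, 1, 1, 2, 1, 1, 8, 8, 1, 1, 8, 1, 6, 1, 1, 9, 8, 8,
                   8, 8, 8, 7, 8, 8, 1, 8, 1, 1, 1, 1, 8, 8, 1, 1, 1, 0, 1, 1, 8, 1,
                   8, 1, 1, 1, 1, 1])

# Number of neighbourhoods per raw label 0..9
sizes = np.bincount(labels, minlength=10)

for label, size in enumerate(sizes):
    print(f"cluster {label + 1}: {size} neighbourhoods")
```

Two clusters hold 41 of the 50 neighbourhoods, while most of the others contain a single neighbourhood, which is worth keeping in mind when reading the per-cluster tables.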

In [41]:
#model has labeled the city
neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)

In [42]:
#join London_merged with our neighbourhood venues sorted to add latitude & longitude for each of the neighborhood to prepare it for plotting
london_data = london_merged
london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='borough')
london_data.head()

Unnamed: 0,borough,town,post_code,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127,6,Supermarket,Coffee Shop,Platform,Train Station,Historic Site,Event Space,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746,7,Grocery Store,Train Station,Park,Breakfast Spot,Indian Restaurant,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market
6,City,LONDON,EC3,51.512,-0.08058,2,Hotel,Coffee Shop,Gym / Fitness Center,Italian Restaurant,Pub,Sandwich Place,Restaurant,Wine Bar,French Restaurant,Salad Place
7,Westminster,LONDON,WC2,51.51651,-0.11968,2,Coffee Shop,Hotel,Pub,Café,Italian Restaurant,Sandwich Place,Hotel Bar,Theater,Clothing Store,Restaurant
9,Bromley,LONDON,SE20,51.41009,-0.05683,2,Supermarket,Hotel,Grocery Store,Convenience Store,Fast Food Restaurant,Park,Gym / Fitness Center,Historic Site,Golf Course,Gastropub


In [43]:
#drop rows without a cluster label (NaN) so they are not plotted
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])

#### Visualizing the clustered neighborhoods

Plot the clusters

In [44]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [45]:
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_data_nonan['latitude'], london_data_nonan['longitude'], london_data_nonan['borough'], london_data_nonan['Cluster Labels']):
    # 'Cluster Labels' already holds the 1-based label, so no further offset is needed
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
        
map_clusters_london

#### Examining our Clusters

**Cluster 1**

In [55]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
453,LONDON,1,Flower Shop,Pub,Park,Train Station,Gym / Fitness Center,Restaurant,Tennis Court,Wine Shop,Fish & Chips Shop,Film Studio


**Cluster 2**

In [56]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 2, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,LONDON,2,Hotel,Coffee Shop,Gym / Fitness Center,Italian Restaurant,Pub,Sandwich Place,Restaurant,Wine Bar,French Restaurant,Salad Place
7,LONDON,2,Coffee Shop,Hotel,Pub,Café,Italian Restaurant,Sandwich Place,Hotel Bar,Theater,Clothing Store,Restaurant
9,LONDON,2,Supermarket,Hotel,Grocery Store,Convenience Store,Fast Food Restaurant,Park,Gym / Fitness Center,Historic Site,Golf Course,Gastropub
10,LONDON,2,Coffee Shop,Pub,Café,Food Truck,Vietnamese Restaurant,Italian Restaurant,Park,Gym / Fitness Center,Cocktail Bar,Hotel
12,LONDON,2,Coffee Shop,Pub,Café,Food Truck,Vietnamese Restaurant,Italian Restaurant,Park,Gym / Fitness Center,Cocktail Bar,Hotel
...,...,...,...,...,...,...,...,...,...,...,...,...
518,LONDON,2,Pub,Coffee Shop,Bar,Clothing Store,Indian Restaurant,Sushi Restaurant,Café,Grocery Store,Pharmacy,Platform
519,LONDON,2,Italian Restaurant,Coffee Shop,Café,Pizza Place,Fast Food Restaurant,Supermarket,Grocery Store,Turkish Restaurant,Pub,Pharmacy
522,"LONDON, WOODFORD GREEN",2,Hotel,Monument / Landmark,Pub,Garden,Café,Plaza,Theater,Sandwich Place,Bakery,Restaurant
525,LONDON,2,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Italian Restaurant,Supermarket,Pharmacy,Turkish Restaurant,Gym / Fitness Center


**Cluster 3**

In [57]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 3, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
356,LONDON,3,Chinese Restaurant,Bus Stop,Convenience Store,Warehouse Store,Sandwich Place,Discount Store,Fast Food Restaurant,Pharmacy,Filipino Restaurant,Event Space


**Cluster 4**

In [58]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 4, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
167,"LONDON, WELLING",4,Bus Stop,Sports Club,Home Service,Massage Studio,Golf Course,Historic Site,Park,Construction & Landscaping,Event Space,Food & Drink Shop
457,"LONDON, ERITH",4,Bus Stop,Sports Club,Home Service,Massage Studio,Golf Course,Historic Site,Park,Construction & Landscaping,Event Space,Food & Drink Shop


**Cluster 5**

In [50]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 5, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
121,LONDON,5,Gym / Fitness Center,Hardware Store,Supermarket,Clothing Store,Zoo Exhibit,Filipino Restaurant,Event Space,Exhibit,Falafel Restaurant,Farmers Market


**Cluster 6**

In [59]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 6, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,LONDON,6,Supermarket,Coffee Shop,Platform,Train Station,Historic Site,Event Space,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant
45,"BEXLEYHEATH, LONDON",6,Supermarket,Historic Site,Train Station,Platform,Coffee Shop,Park,Golf Course,Construction & Landscaping,Bus Stop,Fishing Store
124,LONDON,6,Supermarket,Historic Site,Train Station,Platform,Coffee Shop,Park,Golf Course,Construction & Landscaping,Bus Stop,Fishing Store
291,"LONDON, SIDCUP",6,Supermarket,Historic Site,Train Station,Platform,Coffee Shop,Park,Golf Course,Construction & Landscaping,Bus Stop,Fishing Store
505,LONDON,6,Supermarket,Historic Site,Train Station,Platform,Coffee Shop,Park,Golf Course,Construction & Landscaping,Bus Stop,Fishing Store


**Cluster 7**

In [60]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 7, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,LONDON,7,Grocery Store,Train Station,Park,Breakfast Spot,Indian Restaurant,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market


**Cluster 10**

In [61]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 10, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
248,LONDON,10,Polish Restaurant,Platform,Grocery Store,Train Station,Fish & Chips Shop,Fried Chicken Joint,Café,Italian Restaurant,Pub,Fishing Store


#### Results 
By analyzing each cluster, we can identify its most common venue categories. In Cluster 2, the most frequent venues are coffee shops, pubs and restaurants, which makes it the most attractive area for a tourist. In Cluster 6, by contrast, the most common venue is a supermarket, which is of interest to residents but much less so to a visitor.
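
The per-cluster comparison can also be summarized in a single table rather than read off each output by eye. The snippet below is a minimal sketch, not the analysis run in this notebook: `top_venue_per_cluster` and the small `sample` frame are hypothetical names introduced for illustration, and the only assumption about the real data is that it has the `Cluster Labels` and `1st Most Common Venue` columns shown in the outputs above.

```python
import pandas as pd

# Toy stand-in for london_data_nonan (illustrative rows only, not the real data).
sample = pd.DataFrame({
    "Cluster Labels": [2, 2, 2, 6, 6],
    "1st Most Common Venue": [
        "Coffee Shop", "Coffee Shop", "Hotel", "Supermarket", "Supermarket",
    ],
})


def top_venue_per_cluster(df, label_col="Cluster Labels",
                          venue_col="1st Most Common Venue"):
    """Return the most frequent first-ranked venue category in each cluster."""
    return df.groupby(label_col)[venue_col].agg(
        lambda venues: venues.value_counts().idxmax()
    )


summary = top_venue_per_cluster(sample)
print(summary)
```

Calling `top_venue_per_cluster(london_data_nonan)` would apply the same summary to the clustered data, assuming the column names match.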

#### Discussion 
The neighborhoods of London are multicultural. A wide range of cuisines, including Indian, Italian, Turkish and Chinese, is represented among the restaurants, bars, juice bars, coffee shops, fish and chips shops and breakfast spots. 
London also offers plenty of shopping options: flower shops, fish markets, fishing stores, clothing stores and supermarkets. 
The main modes of transport appear to be buses and trains. 
For leisure, the neighborhoods offer many parks, golf courses, a zoo, gyms and historic sites.
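
Observations like these can be given a rough number, in line with the project's aim of judging areas by their restaurants, cafes and pubs. The following is a hedged sketch under stated assumptions: the keyword list `TOURIST_KEYWORDS`, the function `tourist_share` and the `toy` frame are all hypothetical and not part of the original analysis, and plain substring matching is a crude proxy for "tourist-relevant".

```python
import pandas as pd

# Assumption: which venue categories count as tourist-relevant is a judgment call.
TOURIST_KEYWORDS = ("Restaurant", "Café", "Coffee Shop", "Pub", "Bar")


def tourist_share(df, label_col="Cluster Labels"):
    """Average share of tourist-relevant entries among the ranked venue columns,
    computed per row and then averaged within each cluster."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    pattern = "|".join(TOURIST_KEYWORDS)
    # True where a venue category contains any of the keywords (case-sensitive).
    is_tourist = df[venue_cols].apply(
        lambda col: col.str.contains(pattern, regex=True, na=False)
    )
    return is_tourist.mean(axis=1).groupby(df[label_col]).mean()


# Toy stand-in for london_data_nonan with three ranked columns (illustrative only).
toy = pd.DataFrame({
    "Cluster Labels": [2, 6],
    "1st Most Common Venue": ["Coffee Shop", "Supermarket"],
    "2nd Most Common Venue": ["Pub", "Train Station"],
    "3rd Most Common Venue": ["Hotel", "Coffee Shop"],
})

share = tourist_share(toy)
print(share)
```

A higher share suggests a cluster whose typical venues are more oriented towards eating and drinking out. The keyword list should be adjusted to whatever a given traveller considers an attraction.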


#### Conclusion
With such variety available in every area, it is hard to pick one neighborhood on venue mix alone. Choosing the optimal neighborhood to visit requires taking other factors into consideration, which leaves the decision open for discussion. Further analysis could include accommodation ratings, crime rates in each neighborhood and, of course, the proximity of accommodation to public transport.
For the moment, based on this project, I would personally pick a central location purely for convenience: it avoids spending too much money on transportation and lets me explore all these areas on foot while keeping fit.
