# Capstone Project - The Battle of Neighborhoods (Week 1)

## Introduction

Following recent political events in Hong Kong, our client, Mr.X, has started looking into investment immigration in three cities.
These cities are:
* Montreal, Canada
* London, UK
* Tokyo, Japan

To reduce the risk of homesickness, Mr.X would first like to know which city is most similar to where he currently lives, so that he can focus his in-depth research on a single city.

The project will carry out the analysis in four steps:
1. Define Mr.X's neighborhood
2. Define the neighborhoods in the three cities listed above
3. Cluster all the neighborhoods, including Mr.X's neighborhood
4. Calculate the similarity of the cities and provide a recommendation

## Data

We will use the following data for the analysis:
1. Area data of the three cities

Using the Beautiful Soup library, we will extract the information from

 * https://en.wikipedia.org/wiki/London_boroughs - List of London boroughs
 * https://en.wikipedia.org/wiki/Boroughs_of_Montreal - List of boroughs of Montreal
 * https://en.wikipedia.org/wiki/Special_wards_of_Tokyo - List of special wards of Tokyo

 
2. Foursquare
Using the Foursquare API, we will extract venue information for each neighborhood.
 * https://foursquare.com/city-guide

## Analysis

### Step 1: Define Mr.X's neighborhood

In [1]:
!pip -q install folium
import folium #This is a map visualization library
print("Folium library installed")

Folium library installed


#### 1. Show the area where Mr.X lives

In [2]:
X_Lat = 22.29716
X_Lng = 114.17419
X_map = folium.Map(location=[X_Lat, X_Lng], zoom_start=16)
folium.CircleMarker(
        location = [X_Lat, X_Lng],
        radius = 10,
        popup = 'Mr.X\'s neighborhood',
        color = 'red',
        fill = True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(X_map)
X_map

#### 2. Use the Foursquare API to find up to 50 venues within 1 km of the area

In [3]:
CLIENT_ID = 'VZ0HSTNIOI5XILLIDGV4EJAIT44AFNV0MLNYRMJFXAI0A0PW' # your Foursquare ID
CLIENT_SECRET = 'IP3GTNEJCA4MJYEJDLXCHC1TMDSURJDONDEDEMZJDCL2ULS5' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: VZ0HSTNIOI5XILLIDGV4EJAIT44AFNV0MLNYRMJFXAI0A0PW
CLIENT_SECRET:IP3GTNEJCA4MJYEJDLXCHC1TMDSURJDONDEDEMZJDCL2ULS5


In [4]:
radius = 1000 #Search radius in meters (1 kilometer)
limit = 50

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    X_Lat, 
    X_Lng, 
    radius, 
    limit)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=VZ0HSTNIOI5XILLIDGV4EJAIT44AFNV0MLNYRMJFXAI0A0PW&client_secret=IP3GTNEJCA4MJYEJDLXCHC1TMDSURJDONDEDEMZJDCL2ULS5&v=20180605&ll=22.29716,114.17419&radius=1000&limit=50'
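As an alternative to formatting the query string by hand, the same request URL can be assembled from a dictionary of parameters. The sketch below uses only the standard library; the credential strings are placeholders, and the variable names `params`, `base_url`, and `explore_url` are introduced here for illustration only.

```python
from urllib.parse import urlencode

# Placeholder credentials: substitute your own Foursquare keys.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",              # API version date used in this notebook
    "ll": "22.29716,114.17419",   # "lat,lng" of Mr.X's neighborhood
    "radius": 1000,               # search radius in meters
    "limit": 50,                  # maximum number of venues per request
}

base_url = "https://api.foursquare.com/v2/venues/explore"
# urlencode escapes special characters, so the comma in "ll" becomes %2C
explore_url = base_url + "?" + urlencode(params)
print(explore_url)
```

Keeping the parameters in a dictionary makes it easier to reuse the same request for each city by changing only the `ll` entry. The `requests` library can also take such a dictionary directly through the `params` argument of `requests.get`.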

In [5]:
import requests
X_results = requests.get(url).json()
X_results

{'meta': {'code': 200, 'requestId': '5d3870b84c7b08002342a012'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Tsim Sha Tsui',
  'headerFullLocation': 'Tsim Sha Tsui, Hong Kong',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 230,
  'suggestedBounds': {'ne': {'lat': 22.30616000900001,
    'lng': 114.1838991742224},
   'sw': {'lat': 22.288159990999993, 'lng': 114.16448082577759}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54e8649b498e2016978cc814',
       'name': 'Ichiran (一蘭)',
       'location': {'address': 'Shop B, B/F, 8 Minden Ave',
        'lat': 22.29677912364417,
        'lng': 114.17389215285702,
        'labeledLatLngs': [{'label': 'display',
    

#### 3. Turn it into a pandas dataframe

In [6]:
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
print("Libraries installed")

Libraries installed


In [7]:
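def get_category_type(row):
    # Return the name of the venue's first category, or None if it has none
    try:
        categories_list = row['categories']
    except KeyError:
        # json_normalize prefixes nested fields with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']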

In [8]:
X_venues = json_normalize(X_results['response']['groups'][0]['items'])
X_venues = X_venues.loc[:, ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
X_venues['venue.categories'] = X_venues.apply(get_category_type, axis=1)
X_venues.columns = ['name', 'categories', 'lat', 'lng']
X_venues['city'] = 'Hong Kong'
X_venues['borough'] = 'X'

print(X_venues.shape)
X_venues.head()

(50, 6)


Unnamed: 0,name,categories,lat,lng,city,borough
0,Ichiran (一蘭),Ramen Restaurant,22.296779,114.173892,Hong Kong,X
1,Hyatt Regency Hong Kong Tsim Sha Tsui (香港尖沙咀凱悅酒店),Hotel,22.297452,114.173917,Hong Kong,X
2,Via Tokyo,Ice Cream Shop,22.299232,114.174669,Hong Kong,X
3,The Peninsula Hong Kong (香港半島酒店),Hotel,22.295102,114.171854,Hong Kong,X
4,Kowloon Shangri-La (九龍香格里拉大酒店),Hotel,22.297371,114.176921,Hong Kong,X
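To make the flattening step easier to follow, here is a minimal sketch that does the same extraction with plain dictionaries instead of `json_normalize`. The sample data imitate the shape of the response shown earlier; the second venue is invented to show the empty-category case, and `flatten_item` is a hypothetical helper, not part of the notebook.

```python
# Sample shaped like the 'items' list of the explore response; values are illustrative.
sample_items = [
    {"venue": {"name": "Ichiran (一蘭)",
               "categories": [{"name": "Ramen Restaurant"}],
               "location": {"lat": 22.296779, "lng": 114.173892}}},
    {"venue": {"name": "Unnamed venue",   # invented entry with no category
               "categories": [],
               "location": {"lat": 22.3, "lng": 114.17}}},
]

def flatten_item(item):
    """Pick the four fields the notebook keeps for each venue."""
    venue = item["venue"]
    categories = venue.get("categories", [])
    return {
        "name": venue["name"],
        # Use the first category name, or None when the list is empty
        "categories": categories[0]["name"] if categories else None,
        "lat": venue["location"]["lat"],
        "lng": venue["location"]["lng"],
    }

rows = [flatten_item(item) for item in sample_items]
print(rows)
```

A list of such dictionaries can be passed straight to `pd.DataFrame(rows)` to obtain a table with the same four columns.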


### Step 2: Extract neighborhood information for the three cities

In [9]:
from bs4 import BeautifulSoup

In [10]:
LD_url = 'https://en.wikipedia.org/wiki/London_boroughs'
LD_page = requests.get(LD_url).text
LD_soup = BeautifulSoup(LD_page, 'lxml')

In [11]:
LD_table = LD_soup.find('table', class_= 'wikitable')
#Extract the rows
rows = LD_table.find_all('tr')
print("Total numbers of rows: ", len(rows))

#Extract the columns
columns = [v.text for v in rows[0].find_all('th')]
print("Original Columns: ", columns)

#Delete the '\xa0' and '\n' symbols in columns
columns = [column.replace('\xa0','') for column in columns]
columns = [column.replace('\n','') for column in columns]
print("Modified Columns: ", columns)

#Keep only the first two columns
columns = columns[0:2]
print("Modified Columns: ", columns)

Total numbers of rows:  33
Original Columns:  ['London borough\n', 'Designation\n', 'Former areas\n']
Modified Columns:  ['London borough', 'Designation', 'Former areas']
Modified Columns:  ['London borough', 'Designation']


In [12]:
LD_df = pd.DataFrame(columns = columns)
row = [v.text for v in rows[1].find_all('td')]
print ("Original Row: ", row)

row = [v.text.replace('\n', '') for v in rows[1].find_all('td')]
print ("Modified Row: ", row, '\n')

#Now, insert all row information into the dataframe
for i in range(1, len(rows)): #Skip the first row because it holds the column names
    row_i = [v.text.replace('\n', '') for v in rows[i].find_all('td')]
    row_i = row_i[0:2] #Keep only the borough name and designation
    #Add the row at the next integer index
    LD_df.loc[len(LD_df)] = row_i
    
# Add column for more information later
LD_df['latitude'] = np.nan
LD_df['longitude'] = np.nan
LD_df.head()

Original Row:  ['Greenwich\n', 'Inner\n', 'Greenwich (22a)', 'Woolwich (part) (22b)', '', '', '\n']
Modified Row:  ['Greenwich', 'Inner', 'Greenwich (22a)', 'Woolwich (part) (22b)', '', '', ''] 



Unnamed: 0,London borough,Designation,latitude,longitude
0,Greenwich,Inner,,
1,Hackney,Inner,,
2,Hammersmith[notes 2],Inner,,
3,Islington,Inner,,
4,Kensington and Chelsea,Inner,,


In [13]:
LD_df = LD_df[['London borough','latitude','longitude']].copy()

#Clean the [notes] mark in some of the rows
LD_df.loc[2, 'London borough'] = 'Hammersmith'
LD_df.loc[11, 'London borough'] = 'Barking'

LD_df.head(15)


Unnamed: 0,London borough,latitude,longitude
0,Greenwich,,
1,Hackney,,
2,Hammersmith,,
3,Islington,,
4,Kensington and Chelsea,,
5,Lambeth,,
6,Lewisham,,
7,Southwark,,
8,Tower Hamlets,,
9,Wandsworth,,
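Rather than correcting footnote markers row by row, the bracketed markers could be stripped from every scraped name in one pass. The following is a small sketch using a regular expression; `strip_footnotes` is a hypothetical helper, and the sample names mirror values seen in the scraped table.

```python
import re

def strip_footnotes(text):
    """Remove bracketed markers such as '[notes 2]' or '[1]' and trim whitespace."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

names = ["Greenwich", "Hammersmith[notes 2]", "Kensington and Chelsea\n"]
cleaned = [strip_footnotes(name) for name in names]
print(cleaned)
```

Applied to a whole column, this avoids hard-coding row positions such as 2 and 11, which would silently break if the Wikipedia table were reordered.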


Next, we retrieve the latitude and longitude of every neighborhood using the geopy library.

This step is slow because each neighborhood requires a separate geocoding request. However, I could not find a single table listing both the areas and their coordinates, and the alternative would be to fill in the table by hand.

A progress bar may be added in the future to make the progress visible.
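One possible shape for such a loop is sketched below: it looks up each name once, pauses between requests, and prints a simple progress counter. The helper `geocode_all` and the stand-in geocoder are assumptions made for illustration; in the notebook, the callable would wrap `geolocator.geocode`.

```python
import time

def geocode_all(names, geocode, delay_seconds=1.0):
    """Look up each name once, pausing between calls and printing progress.

    `geocode` is any callable that returns coordinates or None.
    """
    cache = {}
    for position, name in enumerate(names, start=1):
        if name not in cache:
            if cache and delay_seconds > 0:
                time.sleep(delay_seconds)  # pause between consecutive lookups
            cache[name] = geocode(name)
        print(f"[{position}/{len(names)}] {name}")
    return cache

# Stand-in geocoder so the sketch runs offline; coordinates copied from the London output.
fake_coords = {"Greenwich": (51.482084, -0.004542), "Hackney": (51.54324, -0.049362)}
result = geocode_all(["Greenwich", "Hackney", "Atlantis"], fake_coords.get, delay_seconds=0)
```

Names that cannot be resolved map to `None`, which corresponds to leaving the coordinates as NaN in the dataframe. The one-second default delay is an assumption chosen to stay gentle on a public geocoding service.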

In [14]:
from geopy.geocoders import Nominatim
print("Total rows:", LD_df.shape)
geolocator = Nominatim(user_agent="battle-of-neighborhoods") #Recent geopy releases require a user_agent; this name is a placeholder
country ="UK"

for index, row in LD_df.iterrows():
    borough = row['London borough']
    print(index, borough) #Print progress so the long-running cell shows activity in Jupyter Notebook; can be deleted
    loc = geolocator.geocode(borough+','+ country)
    if loc is not None: #Some areas' coordinates cannot be found; we leave them as NaN
        LD_df.loc[index, 'latitude'] = loc.latitude
        LD_df.loc[index, 'longitude'] = loc.longitude

LD_df.head()

Total rows: (32, 3)
0 Greenwich
1 Hackney
2 Hammersmith
3 Islington
4 Kensington and Chelsea
5 Lambeth
6 Lewisham
7 Southwark
8 Tower Hamlets
9 Wandsworth
10 Westminster
11 Barking
12 Barnet
13 Bexley
14 Brent
15 Bromley
16 Croydon
17 Ealing
18 Enfield
19 Haringey
20 Harrow
21 Havering
22 Hillingdon
23 Hounslow
24 Kingston upon Thames
25 Merton
26 Newham
27 Redbridge
28 Richmond upon Thames
29 Sutton
30 Camden
31 Waltham Forest


Unnamed: 0,London borough,latitude,longitude
0,Greenwich,51.482084,-0.004542
1,Hackney,51.54324,-0.049362
2,Hammersmith,51.492038,-0.22364
3,Islington,51.538429,-0.099905
4,Kensington and Chelsea,51.498995,-0.199123


In [15]:
#Create a London Map, showing all the information
LD_loc = geolocator.geocode('London,UK')
LD_Lat = LD_loc.latitude
LD_Lng = LD_loc.longitude
LD_map = folium.Map(location=[LD_Lat, LD_Lng], zoom_start=10)
for index, row in LD_df.iterrows():
    folium.CircleMarker(
            location = [row.latitude, row.longitude],
            radius = 10,
            popup = row['London borough'],
            color = 'red',
            fill = True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(LD_map)
LD_map

Now that we have the latitude and longitude of each borough, we retrieve the venue information for London.

In [16]:
LD_fulllist = pd.DataFrame(columns = ['borough', 'name', 'categories', 'lat', 'lng'])

for index, row in LD_df.iterrows():
    print(index, row['London borough'])
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        row.latitude, 
        row.longitude, 
        radius, 
        limit)
    LD_results = requests.get(url).json()
    LD_venues = json_normalize(LD_results['response']['groups'][0]['items'])
    LD_venues = LD_venues.loc[:, ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    LD_venues['venue.categories'] = LD_venues.apply(get_category_type, axis=1)
    LD_venues.columns = ['name', 'categories', 'lat', 'lng']
    LD_venues['borough'] = row['London borough']
    LD_fulllist = pd.concat([LD_fulllist, LD_venues], ignore_index = True, sort = True)

print(LD_fulllist.shape)

0 Greenwich
1 Hackney
2 Hammersmith
3 Islington
4 Kensington and Chelsea
5 Lambeth
6 Lewisham
7 Southwark
8 Tower Hamlets
9 Wandsworth
10 Westminster
11 Barking
12 Barnet
13 Bexley
14 Brent
15 Bromley
16 Croydon
17 Ealing
18 Enfield
19 Haringey
20 Harrow
21 Havering
22 Hillingdon
23 Hounslow
24 Kingston upon Thames
25 Merton
26 Newham
27 Redbridge
28 Richmond upon Thames
29 Sutton
30 Camden
31 Waltham Forest
(1180, 5)
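Because the same retrieval loop is repeated for Montreal and Tokyo, it could be factored into one helper. The sketch below is an assumption about how that might look: `collect_venues` and `fake_fetch` are hypothetical names, and the network call is replaced by a stub so the example runs offline. It also skips boroughs whose geocoding failed, which the loop above does not do.

```python
import math

def collect_venues(boroughs, fetch_items, city):
    """Return one flat list of venue rows for every borough with coordinates.

    `boroughs` is an iterable of (name, lat, lng) tuples, and
    `fetch_items(lat, lng)` is a caller-supplied function that returns
    the 'items' list of an explore response.
    """
    rows = []
    for name, lat, lng in boroughs:
        if lat is None or lng is None or math.isnan(lat) or math.isnan(lng):
            continue  # geocoding failed for this borough, so skip the request
        for item in fetch_items(lat, lng):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "city": city,
                "borough": name,
                "name": venue["name"],
                "categories": categories[0]["name"] if categories else None,
                "lat": venue["location"]["lat"],
                "lng": venue["location"]["lng"],
            })
    return rows

def fake_fetch(lat, lng):
    """Stub that imitates one venue from the explore endpoint."""
    return [{"venue": {"name": "Greenwich Market",
                       "categories": [{"name": "Market"}],
                       "location": {"lat": 51.481624, "lng": -0.009092}}}]

rows = collect_venues(
    [("Greenwich", 51.482084, -0.004542), ("Nowhere", float("nan"), float("nan"))],
    fake_fetch,
    city="London",
)
print(rows)
```

With a real fetch function that calls the Foursquare endpoint, the three city loops would reduce to three calls to the same helper.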


In [17]:
LD_fulllist['city'] = 'London'
print(LD_fulllist.head())
print(LD_fulllist.tail())

     borough      categories        lat       lng  \
0  Greenwich   Historic Site  51.483234 -0.005579   
1  Greenwich          Museum  51.482889 -0.006420   
2  Greenwich          Garden  51.483007 -0.008362   
3  Greenwich  History Museum  51.481329 -0.005581   
4  Greenwich          Market  51.481624 -0.009092   

                              name    city  
0          Old Royal Naval College  London  
1                     Painted Hall  London  
2  Greenwich Naval College Gardens  London  
3         National Maritime Museum  London  
4                 Greenwich Market  London  
             borough            categories        lat       lng  \
1175  Waltham Forest         Grocery Store  51.561975 -0.010584   
1176  Waltham Forest              Pharmacy  51.562078 -0.010214   
1177  Waltham Forest  Gym / Fitness Center  51.559761 -0.014014   
1178  Waltham Forest         Grocery Store  51.561866 -0.015175   
1179  Waltham Forest          Dessert Shop  51.553808  0.005148   

        

In [18]:
MTL_url = 'https://en.wikipedia.org/wiki/Boroughs_of_Montreal'
MTL_page = requests.get(MTL_url).text
soup = BeautifulSoup(MTL_page, 'lxml')

MTL_table = soup.find('table', class_= 'wikitable')

#Extract the rows
rows = MTL_table.find_all('tr')
print("Total numbers of rows: ", len(rows))

#Extract the columns
columns = [v.text for v in rows[0].find_all('th')]
print("Original Columns: ", columns)

#Delete the '\xa0' and '\n' symbols in columns
columns = [column.replace('\xa0','') for column in columns]
columns = [column.replace('\n','') for column in columns]
print("Modified Columns: ", columns)

#Keep only the first two columns
columns = columns[0:2]
print("Modified Columns: ", columns)

MTL_df = pd.DataFrame(columns = columns)
row = [v.text for v in rows[1].find_all('td')]
print ("Original Row: ", row)

row = [v.text.replace('\n', '') for v in rows[1].find_all('td')]
print ("Modified Row: ", row, '\n')

#Now, insert all row information into the dataframe
for i in range(1, len(rows)): #Skip the first row because it holds the column names
    row_i = [v.text.replace('\n', '') for v in rows[i].find_all('td')]
    row_i = row_i[0:2] #Keep only the map number and borough name
    #Add the row at the next integer index
    MTL_df.loc[len(MTL_df)] = row_i
    
# Add column for more information later
MTL_df['latitude'] = np.nan
MTL_df['longitude'] = np.nan
MTL_df = MTL_df[['Borough','latitude','longitude']].copy()
MTL_df.head()

Total numbers of rows:  20
Original Columns:  ['Number(map)', 'Borough', 'Population Canada 2016 Census[1]', 'Area in km²', 'Density per km²\n']
Modified Columns:  ['Number(map)', 'Borough', 'Population Canada 2016 Census[1]', 'Area in km²', 'Density per km²']
Modified Columns:  ['Number(map)', 'Borough']
Original Row:  ['1.', 'Ahuntsic-Cartierville', '134,245', '24.2', '5,547.3\n']
Modified Row:  ['1.', 'Ahuntsic-Cartierville', '134,245', '24.2', '5,547.3'] 



Unnamed: 0,Borough,latitude,longitude
0,Ahuntsic-Cartierville,,
1,Anjou,,
2,Côte-des-Neiges–Notre-Dame-de-Grâce,,
3,Lachine,,
4,LaSalle,,


In [19]:
print("Total rows:", MTL_df.shape)
geolocator = Nominatim(user_agent="battle-of-neighborhoods") #Recent geopy releases require a user_agent; this name is a placeholder
country ="Canada"

for index, row in MTL_df.iterrows():
    borough = row['Borough']
    print(index, borough) #Print progress so the long-running cell shows activity in Jupyter Notebook; can be deleted
    loc = geolocator.geocode(borough+','+ country)
    if loc is not None: #Some areas' coordinates cannot be found; we leave them as NaN
        MTL_df.loc[index, 'latitude'] = loc.latitude
        MTL_df.loc[index, 'longitude'] = loc.longitude

MTL_df.head()

Total rows: (19, 3)
0 Ahuntsic-Cartierville
1 Anjou
2 Côte-des-Neiges–Notre-Dame-de-Grâce
3 Lachine
4 LaSalle
5 Le Plateau-Mont-Royal
6 Le Sud-Ouest
7 L'Île-Bizard–Sainte-Geneviève
8 Mercier–Hochelaga-Maisonneuve
9 Montréal-Nord
10 Outremont
11 Pierrefonds-Roxboro
12 Rivière-des-Prairies–Pointe-aux-Trembles
13 Rosemont–La Petite-Patrie
14 Saint-Laurent
15 Saint-Léonard
16 Verdun
17 Ville-Marie
18 Villeray–Saint-Michel–Parc-Extension


Unnamed: 0,Borough,latitude,longitude
0,Ahuntsic-Cartierville,45.541892,-73.680319
1,Anjou,45.618279,-73.596173
2,Côte-des-Neiges–Notre-Dame-de-Grâce,45.483575,-73.627053
3,Lachine,45.448697,-73.711054
4,LaSalle,45.432514,-73.629267


In [20]:
#Create a Montreal Map, showing all the information
MTL_loc = geolocator.geocode('Montreal,Canada')
MTL_Lat = MTL_loc.latitude
MTL_Lng = MTL_loc.longitude
MTL_map = folium.Map(location=[MTL_Lat, MTL_Lng], zoom_start=10)
for index, row in MTL_df.iterrows():
    folium.CircleMarker(
            location = [row.latitude, row.longitude],
            radius = 10,
            popup = row['Borough'],
            color = 'red',
            fill = True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(MTL_map)
MTL_map

In [21]:
MTL_fulllist = pd.DataFrame(columns = ['borough', 'name', 'categories', 'lat', 'lng'])

for index, row in MTL_df.iterrows():
    print(index, row['Borough'])
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        row.latitude, 
        row.longitude, 
        radius, 
        limit)
    MTL_results = requests.get(url).json()
    MTL_venues = json_normalize(MTL_results['response']['groups'][0]['items'])
    MTL_venues = MTL_venues.loc[:, ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    MTL_venues['venue.categories'] = MTL_venues.apply(get_category_type, axis=1)
    MTL_venues.columns = ['name', 'categories', 'lat', 'lng']
    MTL_venues['borough'] = row['Borough']
    MTL_fulllist = pd.concat([MTL_fulllist, MTL_venues], ignore_index = True, sort = True)

print(MTL_fulllist.shape)

0 Ahuntsic-Cartierville
1 Anjou
2 Côte-des-Neiges–Notre-Dame-de-Grâce
3 Lachine
4 LaSalle
5 Le Plateau-Mont-Royal
6 Le Sud-Ouest
7 L'Île-Bizard–Sainte-Geneviève
8 Mercier–Hochelaga-Maisonneuve
9 Montréal-Nord
10 Outremont
11 Pierrefonds-Roxboro
12 Rivière-des-Prairies–Pointe-aux-Trembles
13 Rosemont–La Petite-Patrie
14 Saint-Laurent
15 Saint-Léonard
16 Verdun
17 Ville-Marie
18 Villeray–Saint-Michel–Parc-Extension
(481, 5)


In [22]:
MTL_fulllist['city'] = 'Montreal'
print(MTL_fulllist.head())
print(MTL_fulllist.tail())

                 borough          categories        lat        lng  \
0  Ahuntsic-Cartierville                Park  45.540585 -73.685730   
1  Ahuntsic-Cartierville        Liquor Store  45.544110 -73.674498   
2  Ahuntsic-Cartierville  Italian Restaurant  45.540799 -73.685707   
3  Ahuntsic-Cartierville                Café  45.543601 -73.667883   
4  Ahuntsic-Cartierville      Breakfast Spot  45.544712 -73.674450   

                   name      city  
0  Parc Marcelin-Wilson  Montreal  
1         SAQ Sélection  Montreal  
2      Sapori Di Napoli  Montreal  
3            Le Brûloir  Montreal  
4   L'Oeuforie Matinale  Montreal  
                                  borough             categories        lat  \
476  Villeray–Saint-Michel–Parc-Extension          Grocery Store  45.538238   
477  Villeray–Saint-Michel–Parc-Extension                   Café  45.540551   
478  Villeray–Saint-Michel–Parc-Extension  Vietnamese Restaurant  45.538129   
479  Villeray–Saint-Michel–Parc-Extension      

In [23]:
TKO_url = 'https://en.wikipedia.org/wiki/Special_wards_of_Tokyo'
TKO_page = requests.get(TKO_url).text
soup = BeautifulSoup(TKO_page, 'lxml')
TKO_table = soup.findAll('table', class_= 'wikitable')
TKO_table = TKO_table[1]

#Extract the rows
rows = TKO_table.find_all('tr')
print("Total numbers of rows: ", len(rows))

#Extract the columns
columns = [v.text for v in rows[0].find_all('th')]
print("Original Columns: ", columns)

#Delete the '\xa0' and '\n' symbols in columns
columns = [column.replace('\xa0','') for column in columns]
columns = [column.replace('\n','') for column in columns]
print("Modified Columns: ", columns)

#Keep only the first three columns
columns = columns[0:3]
print("Modified Columns: ", columns)

TKO_df = pd.DataFrame(columns = columns)
row = [v.text for v in rows[1].find_all('td')]
print ("Original Row: ", row)

row = [v.text.replace('\n', '') for v in rows[1].find_all('td')]
print ("Modified Row: ", row, '\n')

#Now, insert all row information into the dataframe
for i in range(1, len(rows)): #Skip the first row because it holds the column names
    row_i = [v.text.replace('\n', '') for v in rows[i].find_all('td')]
    row_i = row_i[0:3] #Keep only the number, flag and ward name
    #Add the row at the next integer index
    TKO_df.loc[len(TKO_df)] = row_i
    
# Add column for more information later
TKO_df['latitude'] = np.nan
TKO_df['longitude'] = np.nan
TKO_df = TKO_df[['Name','latitude','longitude']].copy()

Total numbers of rows:  25
Original Columns:  ['No.\n', 'Flag\n', 'Name\n', 'Kanji\n', 'Population(as of October\xa02016[update])\n', 'Density(/km2)\n', 'Area(km2)\n', 'Major districts\n']
Modified Columns:  ['No.', 'Flag', 'Name', 'Kanji', 'Population(as of October2016[update])', 'Density(/km2)', 'Area(km2)', 'Major districts']
Modified Columns:  ['No.', 'Flag', 'Name']
Original Row:  ['01', '', 'Chiyoda', '千代田区\n', '0059,441', '05,100', '011.66\n', 'Nagatachō, Kasumigaseki, Ōtemachi, Marunouchi, Akihabara, Yūrakuchō, Iidabashi, Kanda\n']
Modified Row:  ['01', '', 'Chiyoda', '千代田区', '0059,441', '05,100', '011.66', 'Nagatachō, Kasumigaseki, Ōtemachi, Marunouchi, Akihabara, Yūrakuchō, Iidabashi, Kanda'] 



In [24]:
# DataFrame Cleaning
TKO_df.loc[10, 'Name'] = 'Ōta'
TKO_df = TKO_df.drop(23) # Drop the trailing row, leaving the 23 wards
TKO_df

Unnamed: 0,Name,latitude,longitude
0,Chiyoda,,
1,Chūō,,
2,Minato,,
3,Shinjuku,,
4,Bunkyō,,
5,Taitō,,
6,Sumida,,
7,Kōtō,,
8,Shinagawa,,
9,Meguro,,


In [26]:
print("Total rows:", TKO_df.shape)
geolocator = Nominatim(user_agent="battle-of-neighborhoods") #Recent geopy releases require a user_agent; this name is a placeholder
country ="Japan"

for index, row in TKO_df.iterrows():
    borough = row['Name']
    print(index, borough) #Print progress so the long-running cell shows activity in Jupyter Notebook; can be deleted
    loc = geolocator.geocode(borough+','+ country)
    if loc is not None: #Some areas' coordinates cannot be found; we leave them as NaN
        TKO_df.loc[index, 'latitude'] = loc.latitude
        TKO_df.loc[index, 'longitude'] = loc.longitude

TKO_df.head()

Total rows: (23, 3)
0 Chiyoda
1 Chūō
2 Minato
3 Shinjuku
4 Bunkyō
5 Taitō
6 Sumida
7 Kōtō
8 Shinagawa
9 Meguro
10 Ōta
11 Setagaya
12 Shibuya
13 Nakano
14 Suginami
15 Toshima
16 Kita
17 Arakawa
18 Itabashi
19 Nerima
20 Adachi
21 Katsushika
22 Edogawa


Unnamed: 0,Name,latitude,longitude
0,Chiyoda,35.69381,139.753216
1,Chūō,35.666255,139.775565
2,Minato,35.643227,139.740055
3,Shinjuku,35.693763,139.703632
4,Bunkyō,35.71881,139.744732


In [27]:
#Create a Tokyo Map, showing all the information
TKO_loc = geolocator.geocode('Tokyo,Japan')
TKO_Lat = TKO_loc.latitude
TKO_Lng = TKO_loc.longitude
TKO_map = folium.Map(location=[TKO_Lat, TKO_Lng], zoom_start=10)
for index, row in TKO_df.iterrows():
    folium.CircleMarker(
            location = [row.latitude, row.longitude],
            radius = 10,
            popup = row['Name'],
            color = 'red',
            fill = True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(TKO_map)
TKO_map

In [28]:
TKO_fulllist = pd.DataFrame(columns = ['borough', 'name', 'categories', 'lat', 'lng'])

for index, row in TKO_df.iterrows():
    print(index, row['Name'])
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        row.latitude, 
        row.longitude, 
        radius, 
        limit)
    TKO_results = requests.get(url).json()
    TKO_venues = json_normalize(TKO_results['response']['groups'][0]['items'])
    TKO_venues = TKO_venues.loc[:, ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    TKO_venues['venue.categories'] = TKO_venues.apply(get_category_type, axis=1)
    TKO_venues.columns = ['name', 'categories', 'lat', 'lng']
    TKO_venues['borough'] = row['Name']
    TKO_fulllist = pd.concat([TKO_fulllist, TKO_venues], ignore_index = True, sort = True)

print(TKO_fulllist.shape)

0 Chiyoda
1 Chūō
2 Minato
3 Shinjuku
4 Bunkyō
5 Taitō
6 Sumida
7 Kōtō
8 Shinagawa
9 Meguro
10 Ōta
11 Setagaya
12 Shibuya
13 Nakano
14 Suginami
15 Toshima
16 Kita
17 Arakawa
18 Itabashi
19 Nerima
20 Adachi
21 Katsushika
22 Edogawa
(1150, 5)


In [29]:
TKO_fulllist['city'] = 'Tokyo'
print(TKO_fulllist.head())
print(TKO_fulllist.tail())

   borough                 categories        lat         lng  \
0  Chiyoda                    Stadium  35.693356  139.749865   
1  Chiyoda                       Park  35.691653  139.751201   
2  Chiyoda  Japanese Curry Restaurant  35.695544  139.757356   
3  Chiyoda         Tempura Restaurant  35.695765  139.754682   
4  Chiyoda                 Art Museum  35.690541  139.754694   

                                        name   city  
0                     Nippon Budokan (日本武道館)  Tokyo  
1                    Kitanomaru Park (北の丸公園)  Tokyo  
2                         Bondy (欧風カレー ボンディ)  Tokyo  
3                     Kanda Tendonya (神田天丼家)  Tokyo  
4  National Museum of Modern Art (東京国立近代美術館)  Tokyo  
      borough          categories        lat         lng  \
1145  Edogawa   Convenience Store  35.685041  139.864712   
1146  Edogawa       Grocery Store  35.675274  139.871389   
1147  Edogawa        Noodle House  35.675267  139.871563   
1148  Edogawa  Donburi Restaurant  35.683460  139.8

### Step 3: Combine all the extracted information

In [30]:
# Stack the venue tables of the three cities and Mr.X's neighborhood
All_venues = pd.concat([LD_fulllist, MTL_fulllist, TKO_fulllist, X_venues], ignore_index = True)
All_venues = All_venues[['city', 'borough','categories','lat','lng','name']]

print(All_venues.shape)
All_venues.head()

(2861, 6)


Unnamed: 0,city,borough,categories,lat,lng,name
0,London,Greenwich,Historic Site,51.483234,-0.005579,Old Royal Naval College
1,London,Greenwich,Museum,51.482889,-0.00642,Painted Hall
2,London,Greenwich,Garden,51.483007,-0.008362,Greenwich Naval College Gardens
3,London,Greenwich,History Museum,51.481329,-0.005581,National Maritime Museum
4,London,Greenwich,Market,51.481624,-0.009092,Greenwich Market


In [31]:
# one hot encoding
venues_onehot = pd.get_dummies(All_venues[['categories']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venues_onehot['borough'] = All_venues['borough'] 

# move neighborhood column to the first column
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]

print(venues_onehot.shape)
venues_onehot.head()


(2861, 311)


Unnamed: 0,borough,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,...,Vietnamese Restaurant,Wagashi Place,Warehouse Store,Waterfront,Wine Bar,Wine Shop,Women's Store,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant
0,Greenwich,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Greenwich,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Greenwich,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Greenwich,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Greenwich,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
venues_grouped = venues_onehot.groupby('borough').mean().reset_index()
venues_grouped.head()

Unnamed: 0,borough,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,...,Vietnamese Restaurant,Wagashi Place,Warehouse Store,Waterfront,Wine Bar,Wine Shop,Women's Store,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant
0,Adachi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ahuntsic-Cartierville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Anjou,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arakawa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Barking,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
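The grouped table holds, for each borough, the share of its venues that fall into each category: taking the mean of a 0/1 column is the same as counting and dividing by the borough's total. The sketch below reproduces that arithmetic with the standard library on a handful of made-up rows, so the numbers are illustrative only.

```python
from collections import Counter, defaultdict

# Made-up (borough, category) pairs standing in for rows of All_venues
venues = [
    ("Greenwich", "Museum"),
    ("Greenwich", "Pub"),
    ("Greenwich", "Pub"),
    ("Greenwich", "Market"),
    ("Adachi", "Convenience Store"),
    ("Adachi", "Ramen Restaurant"),
]

# Count how often each category occurs within each borough
counts = defaultdict(Counter)
for borough, category in venues:
    counts[borough][category] += 1

# Divide by the borough's venue total, which equals the mean of the one-hot column
frequencies = {
    borough: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for borough, counter in counts.items()
}
print(frequencies["Greenwich"]["Pub"])  # 2 of 4 venues -> 0.5
```

Each borough's frequencies sum to 1, so boroughs with different venue counts remain comparable when clustered.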


In [33]:
num_top_venues = 5

for hood in venues_grouped['borough']:
    print("----"+hood+"----")
    temp = venues_grouped[venues_grouped['borough'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adachi----
                  venue  freq
0     Convenience Store  0.32
1           Supermarket  0.06
2            Restaurant  0.06
3  Fast Food Restaurant  0.06
4      Ramen Restaurant  0.06


----Ahuntsic-Cartierville----
            venue  freq
0            Café  0.15
1  Sandwich Place  0.08
2    Liquor Store  0.08
3     Pizza Place  0.08
4   Train Station  0.08


----Anjou----
                venue  freq
0         Coffee Shop  0.11
1          Restaurant  0.11
2  Italian Restaurant  0.11
3    Sushi Restaurant  0.05
4         Auto Garage  0.05


----Arakawa----
                  venue  freq
0     Convenience Store  0.38
1    Italian Restaurant  0.06
2                  Café  0.04
3  Fast Food Restaurant  0.04
4         Grocery Store  0.04


----Barking----
           venue  freq
0          Hotel  0.12
1  Grocery Store  0.12
2    Supermarket  0.09
3           Park  0.09
4    Gas Station  0.06


----Barnet----
           venue  freq
0            Pub  0.16
1    Coffee Shop  0.11
2    

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [51]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # No special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['borough'] = venues_grouped['borough']

for ind in np.arange(venues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | borough | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adachi | Convenience Store | Restaurant | Ramen Restaurant | Fast Food Restaurant | Café | Supermarket | BBQ Joint | Donburi Restaurant | Steakhouse | Discount Store |
| 1 | Ahuntsic-Cartierville | Café | Italian Restaurant | Park | Chinese Restaurant | Pizza Place | Train Station | Hockey Arena | Liquor Store | Middle Eastern Restaurant | Breakfast Spot |
| 2 | Anjou | Italian Restaurant | Coffee Shop | Restaurant | Furniture / Home Store | Bowling Alley | Pet Store | Thai Restaurant | Pizza Place | Liquor Store | Paper / Office Supplies Store |
| 3 | Arakawa | Convenience Store | Italian Restaurant | Sake Bar | Grocery Store | Café | Park | Fast Food Restaurant | Japanese Restaurant | Burger Joint | Chinese Restaurant |
| 4 | Barking | Hotel | Grocery Store | Supermarket | Park | Coffee Shop | Gas Station | Discount Store | Business Service | Breakfast Spot | Fast Food Restaurant |


### Step 4: Start Clustering

In [52]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# drop the label column so only the numeric category frequencies are clustered
venues_grouped_clustering = venues_grouped.drop(columns='borough')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [53]:
# take an explicit copy so later column assignments do not write to a slice
Final = neighborhoods_venues_sorted[['borough','Cluster Labels']].copy()
Final.head()

| | borough | Cluster Labels |
|---|---|---|
| 0 | Adachi | 1 |
| 1 | Ahuntsic-Cartierville | 4 |
| 2 | Anjou | 4 |
| 3 | Arakawa | 1 |
| 4 | Barking | 2 |
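
The notebook sets `kclusters = 5` without stating why. One common sanity check for the number of clusters is the silhouette score. The sketch below is an illustration only: it runs on a small synthetic frequency table (`toy_freq` and its three made-up venue profiles are assumptions, not the real `venues_grouped_clustering` data) and simply picks the k with the highest score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for venues_grouped_clustering: 30 "boroughs" x 4 venue
# categories, drawn around three distinct frequency profiles.
rng = np.random.default_rng(0)
centers = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])
toy_freq = np.vstack([c + rng.normal(0, 0.02, size=(10, 4)) for c in centers])

# Score each candidate k; a higher silhouette means tighter,
# better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_freq)
    scores[k] = silhouette_score(toy_freq, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Applied to the real frequency table, the same loop would show whether 5 clusters is a reasonable choice or whether a different k separates the boroughs more cleanly.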


In [54]:
# Map each borough name back to its city. If a name appears in more than one
# table, the later table wins, matching the order of the original checks.
city_lookup = {}
city_lookup.update(dict.fromkeys(LD_df['London borough'], 'London'))
city_lookup.update(dict.fromkeys(MTL_df['Borough'], 'Montreal'))
city_lookup.update(dict.fromkeys(TKO_df['Name'], 'Tokyo'))

# Any row not found in the three borough tables is Mr.X's own neighborhood.
# assign() returns a new dataframe, which avoids writing to a slice.
Final = Final.assign(city=Final['borough'].map(city_lookup).fillna('Mr.X'))

Final

| | borough | Cluster Labels | city |
|---|---|---|---|
| 0 | Adachi | 1 | Tokyo |
| 1 | Ahuntsic-Cartierville | 4 | Montreal |
| 2 | Anjou | 4 | Montreal |
| 3 | Arakawa | 1 | Tokyo |
| 4 | Barking | 2 | London |
| 5 | Barnet | 2 | London |
| 6 | Bexley | 2 | London |
| 7 | Brent | 4 | London |
| 8 | Bromley | 2 | London |
| 9 | Bunkyō | 1 | Tokyo |


In [55]:
Final_grouped = Final.groupby(['city', 'Cluster Labels']).count()
Final_grouped

| city | Cluster Labels | borough |
|---|---|---|
| London | 2 | 28 |
| London | 4 | 4 |
| Montreal | 2 | 1 |
| Montreal | 3 | 1 |
| Montreal | 4 | 17 |
| Mr.X | 2 | 1 |
| Tokyo | 0 | 15 |
| Tokyo | 1 | 8 |


### Final Result: Mr.X's neighborhood falls in cluster 2, which holds 28 of London's 32 boroughs, so London, UK is the closest match.
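
To make the comparison explicit, one could compute, for each city, the share of its boroughs that land in the same cluster as Mr.X's neighborhood. The sketch below is an illustration rather than part of the original notebook: it re-enters the counts from the `Final_grouped` output by hand, and the names `counts`, `x_cluster` and `share` are introduced here.

```python
import pandas as pd

# Borough counts per (city, cluster), copied from the Final_grouped output.
counts = pd.DataFrame(
    [("London", 2, 28), ("London", 4, 4),
     ("Montreal", 2, 1), ("Montreal", 3, 1), ("Montreal", 4, 17),
     ("Mr.X", 2, 1),
     ("Tokyo", 0, 15), ("Tokyo", 1, 8)],
    columns=["city", "Cluster Labels", "boroughs"],
)

# Cluster that Mr.X's neighborhood was assigned to.
x_cluster = counts.loc[counts["city"] == "Mr.X", "Cluster Labels"].iloc[0]

# For each candidate city: fraction of its boroughs sharing Mr.X's cluster.
cities = counts[counts["city"] != "Mr.X"]
total = cities.groupby("city")["boroughs"].sum()
in_cluster = (cities[cities["Cluster Labels"] == x_cluster]
              .groupby("city")["boroughs"].sum()
              .reindex(total.index, fill_value=0))
share = (in_cluster / total).sort_values(ascending=False)
print(share)
```

On these counts, 28 of London's 32 boroughs (87.5%) share Mr.X's cluster, against 1 of 19 for Montreal and none of Tokyo's 23.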

## Reflection

First, let's look at how much data was retrieved for each city.

In [49]:
print("Number of venues in Mr.X's neighborhood:", X_venues.shape[0], '\n')
print("Number of boroughs in London:", LD_df.shape[0])
print("Number of venues in London:", LD_fulllist.shape[0])
print("Average number of venues in each borough in London: ", LD_fulllist.shape[0]/LD_df.shape[0],'\n')
print("Number of boroughs in Montreal:", MTL_df.shape[0])
print("Number of venues in Montreal:", MTL_fulllist.shape[0])
print("Average number of venues in each borough in Montreal: ", MTL_fulllist.shape[0]/MTL_df.shape[0], '\n')
print("Number of boroughs in Tokyo:", TKO_df.shape[0])
print("Number of venues in Tokyo:", TKO_fulllist.shape[0])
print("Average number of venues in each borough in Tokyo: ", TKO_fulllist.shape[0]/TKO_df.shape[0])

Number of venues in Mr.X's neighborhood: 50 

Number of boroughs in London: 32
Number of venues in London: 1180
Average number of venues in each borough in London:  36.875 

Number of boroughs in Montreal: 19
Number of venues in Montreal: 481
Average number of venues in each borough in Montreal:  25.31578947368421 

Number of boroughs in Tokyo: 23
Number of venues in Tokyo: 1150
Average number of venues in each borough in Tokyo:  50.0


The number of venues collected varies across cities and neighborhoods. Some possible reasons:
* Poor choice of cities, given their differences in borough size, scale of economic activity and popularity
    * The cities were chosen from this list (https://www.leeabbamonte.com/travel-blog/30-best-cities-in-the-world.html), with Toronto, Canada replaced by Montreal, Canada for experimental purposes.
* Poor selection of search radius (100 km) and venue limit (50 per borough)
* Limited community contribution to the Foursquare platform in some cities / boroughs
* Very fine-grained venue labels, as the ranked venues dataframe shows

In [56]:
neighborhoods_venues_sorted.head()

| | Cluster Labels | borough | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adachi | Convenience Store | Restaurant | Ramen Restaurant | Fast Food Restaurant | Café | Supermarket | BBQ Joint | Donburi Restaurant | Steakhouse | Discount Store |
| 1 | 4 | Ahuntsic-Cartierville | Café | Italian Restaurant | Park | Chinese Restaurant | Pizza Place | Train Station | Hockey Arena | Liquor Store | Middle Eastern Restaurant | Breakfast Spot |
| 2 | 4 | Anjou | Italian Restaurant | Coffee Shop | Restaurant | Furniture / Home Store | Bowling Alley | Pet Store | Thai Restaurant | Pizza Place | Liquor Store | Paper / Office Supplies Store |
| 3 | 1 | Arakawa | Convenience Store | Italian Restaurant | Sake Bar | Grocery Store | Café | Park | Fast Food Restaurant | Japanese Restaurant | Burger Joint | Chinese Restaurant |
| 4 | 2 | Barking | Hotel | Grocery Store | Supermarket | Park | Coffee Shop | Gas Station | Discount Store | Business Service | Breakfast Spot | Fast Food Restaurant |


Take row 0 (Adachi, Tokyo) as an example. The 2nd most common venue is "Restaurant", yet several of its sub-categories are also counted as separate categories:
* 3rd: Ramen Restaurant
* 4th: Fast Food Restaurant
* 8th: Donburi Restaurant

This overlap may reduce the accuracy of the clustering, so the venue categories may need further cleaning.
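
One possible cleaning step is to collapse sub-categories into their parent category before computing frequencies. The sketch below is a rough illustration under stated assumptions: the sample rows, the column names `category` and `category_merged`, and the helper `merge_restaurants` are all hypothetical, and the suffix rule is a crude heuristic rather than a proper sub-category-to-parent mapping.

```python
import pandas as pd

# Hypothetical sample of venue rows; the column names are assumptions.
venues = pd.DataFrame({
    "borough": ["Adachi"] * 5,
    "category": ["Restaurant", "Ramen Restaurant", "Fast Food Restaurant",
                 "Donburi Restaurant", "Convenience Store"],
})

def merge_restaurants(category: str) -> str:
    """Collapse every '... Restaurant' label into the parent 'Restaurant'."""
    return "Restaurant" if category.endswith("Restaurant") else category

venues["category_merged"] = venues["category"].map(merge_restaurants)

# Share of each merged category per borough, analogous to the 'freq' output.
freq = (venues.groupby("borough")["category_merged"]
        .value_counts(normalize=True))
print(freq)
```

With the merged labels, the four restaurant rows count as a single category, so cuisine-level differences no longer split otherwise similar boroughs. Whether that loss of detail helps or hurts the clustering would need to be checked on the real data.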