# Capstone Project - The Battle of the Neighborhoods (Week 4 & 5)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Similarity of Neighborhoods](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Similarity of Neighborhoods <a name="introduction"></a>

In this project, we compare the neighborhoods of several major financial capitals. Inspired by the question asked in the Week 4 description, namely the similarity or dissimilarity of cities, we will try to group the neighborhoods and boroughs of different cities, as a preliminary attempt to identify their function and type. My hope is that such a categorization will shed light on the design of cities of similar type in the future.

## Data <a name="data"></a>

Aside from a couple of tables, which we will obtain by scraping web pages, we will use **geopy** to geocode the neighborhoods and the **Foursquare API** to access venue information.

First, let's load the libraries. lxml is used to parse the HTML content for table wrangling, and geopy is used to obtain geospatial locations from addresses. We will also use k-means clustering.

In [1]:
import numpy as np
import pandas as pd
import requests
import lxml.html as lh
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


The Wikipedia page below has a table of the 20 districts of Paris. Let's wrangle it. (Note: 20 is a smaller number of districts than in the New York data, although comparable to Toronto. However, I could not find more detailed information on smaller administrative zones. Please let me know if you know where to get it. For a preliminary attempt, 20 districts are enough.)

In [2]:
url = 'https://en.wikipedia.org/wiki/Arrondissements_of_Paris'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

Through trial and error, we find that entries 15 through 34 of `tr_elements` are the "arrondissements" (French word for county or district) that we need. We only need their names for the location lookup later, so we make a list of these names.

In [3]:
paris_dist = []
for x in range(15,35):
    paris_dist.append(tr_elements[x].text_content().split('\n')[2])
print(paris_dist)

['Louvre', 'Bourse', 'Temple', 'Hôtel-de-Ville', 'Panthéon', 'Luxembourg', 'Palais-Bourbon', 'Élysée', 'Opéra', 'Entrepôt', 'Popincourt', 'Reuilly', 'Gobelins', 'Observatoire', 'Vaugirard', 'Passy', 'Batignolles-Monceau', 'Butte-Montmartre', 'Buttes-Chaumont', 'Ménilmontant']


Now we will obtain the geospatial location from the district names using geopy. 

In [4]:
suffix = ', Paris, France'
lats = []
longs = []

geolocator = Nominatim(user_agent="ny_explorer")
for i in range(len(paris_dist)):
    location = geolocator.geocode(paris_dist[i] + suffix)
    latitude = location.latitude
    longitude = location.longitude
    lats.append(latitude)
    longs.append(longitude)
print(lats,longs)

[48.8611473, 48.8686296, 48.8665004, 48.856426299999995, 48.84619085, 48.8504333, 48.86159615, 48.8466437, 48.8706446, 48.876106, 48.858416, 48.8396154, 48.8323973, 48.8295667, 48.8413705, 48.8575047, 48.881452, 48.8900117, 48.8783961, 48.8667079] [2.33802768704666, 2.3414739, 2.360708, 2.3525275780116073, 2.346078521905153, 2.3329507, 2.3179092733655935, 2.3698297, 2.33233, 2.35991, 2.379703, 2.3957517, 2.3555829, 2.3239624642685364, 2.3003827, 2.2809828, 2.3166666, 2.3464668, 2.3812008, 2.3833739]
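
One caveat with the loop above: `geolocator.geocode` returns `None` when Nominatim finds no match, and `location.latitude` then raises an `AttributeError`. Below is a minimal sketch of a more defensive loop. The names `geocode_names` and `fake_geocode` are hypothetical and introduced only for illustration; the stand-in geocoder replaces the real one so that the example runs offline.

```python
from collections import namedtuple

# Stand-in for a geopy Location, which exposes .latitude and .longitude
Location = namedtuple('Location', ['latitude', 'longitude'])

def geocode_names(names, geocode, suffix=', Paris, France'):
    """Return (records, missing): coordinates for resolved names, and names with no match."""
    records, missing = [], []
    for name in names:
        location = geocode(name + suffix)
        if location is None:  # the geocoder found nothing for this query
            missing.append(name)
            continue
        records.append({'Hood': name,
                        'Latitude': location.latitude,
                        'Longitude': location.longitude})
    return records, missing

# Hypothetical offline geocoder: knows one address, returns None otherwise
def fake_geocode(query):
    known = {'Louvre, Paris, France': Location(48.8611473, 2.33802768704666)}
    return known.get(query)

records, missing = geocode_names(['Louvre', 'Nowhere'], fake_geocode)
print(records)
print(missing)
```

With the real geocoder, the corresponding call would be `geocode_names(paris_dist, geolocator.geocode)`, and `pd.DataFrame(records)` would give the same columns as the dataframe built in the next cell.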


Make a dataframe. **(Note: the 'Neighborhood' column is named 'Hood' in all three dataframes (Paris, Toronto, and New York) used in this project. Foursquare queries return venues in Toronto and New York whose category is 'Neighborhood', which makes it a problem to join the dataframes correctly after one-hot encoding.)**

In [5]:
paris_dict = {'Hood': paris_dist, 'Latitude': lats, 'Longitude':longs}
df_paris = pd.DataFrame(paris_dict)
df_paris

Unnamed: 0,Hood,Latitude,Longitude
0,Louvre,48.861147,2.338028
1,Bourse,48.86863,2.341474
2,Temple,48.8665,2.360708
3,Hôtel-de-Ville,48.856426,2.352528
4,Panthéon,48.846191,2.346079
5,Luxembourg,48.850433,2.332951
6,Palais-Bourbon,48.861596,2.317909
7,Élysée,48.846644,2.36983
8,Opéra,48.870645,2.33233
9,Entrepôt,48.876106,2.35991
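
The column-name clash described in the note above can be reproduced on a toy frame. This is only a sketch with made-up rows, not the project data: when a venue category is itself called 'Neighborhood', the one-hot frame already has a column of that name, so assigning the name column overwrites the indicator instead of adding a new column.

```python
import pandas as pd

# Hypothetical toy data: one venue is itself categorized as 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Christie'],
    'Venue Category': ['Trail', 'Neighborhood', 'Coffee Shop'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# The dummy columns already contain one called 'Neighborhood' ...
clash = 'Neighborhood' in onehot.columns

# ... so this assignment replaces that indicator column rather than adding a new one
n_cols_before = onehot.shape[1]
onehot['Neighborhood'] = venues['Neighborhood']
n_cols_after = onehot.shape[1]

print(clash, n_cols_before, n_cols_after)
```

Using a name that cannot occur as a venue category, such as 'Hood', avoids the overwrite.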


Let us also test the folium map by visualizing Paris and its arrondissements.

In [6]:
address = 'Paris, France'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566969, 2.3514616.


In [7]:
# create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_paris['Latitude'], df_paris['Longitude'], df_paris['Hood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  
    
map_paris

Now we will tackle the Toronto table. The process is similar to Week 3's assignment. 

In [8]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

In [9]:
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content().replace('\n','')
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postal Code"
2:"Borough"
3:"Neighborhood"


In [10]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content().replace('\n','')
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

The loop above builds a list of 3 tuples, one per column of the postal code table. Each tuple holds the name of a column and a list with the contents of that column.

In [11]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
# Somehow the last row of [[],['Canadian Postal Code'], []] is always included. 
# Couldn't figure out the cause, just deleted it directly.
df = df[:-1]
msk = (df.Borough == 'Not assigned')
df = df[~msk]
df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
df.rename(columns={'Neighborhood':'Hood'},inplace=True)

Again, we take only the assigned rows as we are only interested in the boroughs. Again, 'Neighborhood' is renamed as 'Hood', due to the potential issue that we already mentioned. 

In [12]:
ll = pd.read_csv("https://cocl.us/Geospatial_data")
df = df.merge(ll, on='Postal Code')
# keep only the boroughs whose name contains 'Toronto'
msk = df['Borough'].str.contains('Toronto')
df_toronto = df[msk]
df_toronto.reset_index(drop=True)

Unnamed: 0,Postal Code,Borough,Hood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


We do something similar to get the New York data. We will fetch the data for all of New York City instead of just Manhattan.

In [13]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [14]:
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Hood', 'Latitude', 'Longitude'] 

# collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Hood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

Again, 'Neighborhood' is named 'Hood'.
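
The loop above reads latitude from index 1 and longitude from index 0 because the coordinate pair is stored longitude first. A small sketch on a single toy feature makes the order explicit. The dictionary below mirrors only the fields the loop reads; it is not the full structure of the downloaded file, and the coordinate values are the rounded Wakefield values shown in this notebook.

```python
# Toy feature with only the fields the loop reads (not the full record)
feature = {
    'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
    'geometry': {'coordinates': [-73.847201, 40.894705]},
}

# Unpack in storage order: longitude first, then latitude
lon, lat = feature['geometry']['coordinates']

record = {
    'Borough': feature['properties']['borough'],
    'Hood': feature['properties']['name'],
    'Latitude': lat,
    'Longitude': lon,
}
print(record)
```

Swapping the two values is an easy mistake to make and would place the markers far from New York on the map.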

In [15]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
# Change the name
df_newyork = neighborhoods

The dataframe has 5 boroughs and 306 neighborhoods.


In [16]:
df_newyork.head()

Unnamed: 0,Borough,Hood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Now, let's set up Foursquare credentials. 

In [17]:
CLIENT_ID = 'VL3S4DCT1KRZF3FTAC2JZLNZYQLOHPQ4HMEEITVKAMZRMMOZ'
CLIENT_SECRET = 'QLJMPTMTLO5FUEYY4CYS22BEGVHUMWNFANPKCGRQYE4YXYQB'
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: VL3S4DCT1KRZF3FTAC2JZLNZYQLOHPQ4HMEEITVKAMZRMMOZ
CLIENT_SECRET:QLJMPTMTLO5FUEYY4CYS22BEGVHUMWNFANPKCGRQYE4YXYQB


In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        LIMIT = 100
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Hood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
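
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, so it assumes every venue has at least one category. Whether Foursquare can actually return an empty category list is an assumption here, not something this notebook demonstrates. If it can, a guarded extraction might look roughly like the sketch below. `extract_venues` and the toy `items` list are hypothetical and only mirror the response fields used above.

```python
def extract_venues(name, lat, lng, items):
    """Build one tuple per venue, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Toy items in the shape the function above reads; the second venue is made up
items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.654, 'lng': -79.361},
               'categories': []}},
]

rows = extract_venues('Regent Park, Harbourfront', 43.65426, -79.360636, items)
print(rows)
```

Rows with a `None` category could then be dropped before one-hot encoding.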

We will get venue information by querying the Foursquare database. **(Note that this part takes time. Because of the limits on the account, it is better to debug with only the Toronto or Paris data before going all in and loading the New York data, which requires many more queries. With a free account we can run the following only once.)**

In [19]:
toronto_venues = getNearbyVenues(names=df_toronto['Hood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )
paris_venues = getNearbyVenues(names=df_paris['Hood'],
                                   latitudes=df_paris['Latitude'],
                                   longitudes=df_paris['Longitude']
                                  )
newyork_venues = getNearbyVenues(names=df_newyork['Hood'],
                                   latitudes=df_newyork['Latitude'],
                                   longitudes=df_newyork['Longitude']
                                  )
print('Done loading venues.')

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West,  Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport


In [20]:
toronto_venues.to_csv(r'toronto_venues.csv', index = False)
paris_venues.to_csv(r'paris_venues.csv', index = False)
newyork_venues.to_csv(r'newyork_venues.csv', index = False)
print('Venue files saved.')

Venue files saved.
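
Because the queries are limited, the saved CSV files can double as a cache, so that re-running the notebook does not hit the API again. The sketch below shows one way to do this. `load_or_fetch` and `fake_fetch` are hypothetical helpers introduced for illustration, and the fetch function is a stand-in so that the example runs without network access.

```python
import os
import tempfile

import pandas as pd

def load_or_fetch(path, fetch):
    """Read venues from `path` if it exists; otherwise call `fetch()` and save the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df

calls = []  # records how often the (expensive) fetch actually runs

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'Hood': ['Louvre'], 'Venue Category': ['Art Museum']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'paris_venues.csv')
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the file, no second fetch

print(len(calls))
```

In this notebook the fetch argument could be something like `lambda: getNearbyVenues(df_paris['Hood'], df_paris['Latitude'], df_paris['Longitude'])`.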


In [21]:
print(toronto_venues.shape)
toronto_venues.head()

(1605, 7)


Unnamed: 0,Hood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


In [22]:
print(paris_venues.shape)
paris_venues.head()

(1308, 7)


Unnamed: 0,Hood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Louvre,48.861147,2.338028,Cour Carrée du Louvre,48.86036,2.338543,Pedestrian Plaza
1,Louvre,48.861147,2.338028,Musée du Louvre,48.860847,2.33644,Art Museum
2,Louvre,48.861147,2.338028,La Vénus de Milo (Vénus de Milo),48.859943,2.337234,Exhibit
3,Louvre,48.861147,2.338028,Place du Palais Royal,48.862523,2.336688,Plaza
4,Louvre,48.861147,2.338028,Palais Royal,48.863236,2.337127,Historic Site


In [23]:
print(newyork_venues.shape)
newyork_venues.head()

(9866, 7)


Unnamed: 0,Hood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
2,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
3,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


Let's also make sure that 'Neighborhood' is not in the column names any more. 

In [24]:
print('Neighborhood' in toronto_venues.columns.to_list())
print('Neighborhood' in paris_venues.columns.to_list())
print('Neighborhood' in newyork_venues.columns.to_list())

False
False
False


In [25]:
print('There are {} unique categories in Toronto venues.'.format(len(toronto_venues['Venue Category'].unique())))
print('There are {} unique categories in Paris venues.'.format(len(paris_venues['Venue Category'].unique())))
print('There are {} unique categories in New York venues.'.format(len(newyork_venues['Venue Category'].unique())))

There are 235 unique categories in Toronto venues.
There are 210 unique categories in Paris venues.
There are 430 unique categories in New York venues.


## Methodology <a name="methodology"></a>

We will apply k-means clustering to group similar neighborhoods in each of the three cities. We start by one-hot encoding all three dataframes, since k-means cannot handle categorical data directly. Again, 'Hood' is used in place of 'Neighborhood'; this is where I ran into the most trouble on my first attempt.

In [26]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Hood'] = toronto_venues['Hood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])

toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Hood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# one hot encoding
paris_onehot = pd.get_dummies(paris_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
paris_onehot['Hood'] = paris_venues['Hood'] 

# move neighborhood column to the first column
fixed_columns = [paris_onehot.columns[-1]] + list(paris_onehot.columns[:-1])
paris_onehot = paris_onehot[fixed_columns]

paris_onehot.head()

Unnamed: 0,Hood,Afghan Restaurant,African Restaurant,Alsatian Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Louvre,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# one hot encoding
newyork_onehot = pd.get_dummies(newyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
newyork_onehot['Hood'] = newyork_venues['Hood'] 

# move neighborhood column to the first column
fixed_columns = [newyork_onehot.columns[-1]] + list(newyork_onehot.columns[:-1])
newyork_onehot = newyork_onehot[fixed_columns]

newyork_onehot.head()

Unnamed: 0,Hood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group means will be taken to compute the frequencies of each category.

In [29]:
toronto_grouped = toronto_onehot.groupby('Hood').mean().reset_index()
paris_grouped = paris_onehot.groupby('Hood').mean().reset_index()
newyork_grouped = newyork_onehot.groupby('Hood').mean().reset_index()
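
To see why the group mean of one-hot rows is a frequency, here is a small sketch on made-up venues (not the project data). Each row has exactly one indicator set to 1, so the mean of a column within a neighborhood is the share of that neighborhood's venues in that category, and each grouped row sums to 1.

```python
import pandas as pd

# Hypothetical toy venues: neighborhood A has 4 venues, B has 2
venues = pd.DataFrame({
    'Hood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# dtype=int keeps the 0/1 indicators shown elsewhere in this notebook
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, 'Hood', venues['Hood'])

grouped = onehot.groupby('Hood').mean().reset_index()
row_sums = grouped.drop(columns='Hood').sum(axis=1)

print(grouped)
print(row_sums.tolist())
```

In neighborhood A, two of the four venues are coffee shops, so its 'Coffee Shop' entry is 0.5.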

In [30]:
toronto_grouped.head()

Unnamed: 0,Hood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.066667,0.066667,0.066667,0.133333,0.133333,0.133333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016129,0.0,0.0,0.016129,0.0,0.0,0.016129


In [31]:
paris_grouped.head()

Unnamed: 0,Hood,Afghan Restaurant,African Restaurant,Alsatian Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,Batignolles-Monceau,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0
1,Bourse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.07,0.01,0.02
2,Butte-Montmartre,0.0,0.0,0.0,0.0,0.014085,0.014085,0.0,0.0,0.014085,...,0.0,0.0,0.0,0.0,0.0,0.014085,0.0,0.014085,0.0,0.0
3,Buttes-Chaumont,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0
4,Entrepôt,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.02,0.0


In [32]:
newyork_grouped.shape

(300, 431)

Set up k-means. We will cluster each city into three groups of neighborhoods.

In [33]:
def kmeans(df, kclusters = 3):

    df_clustering = df.drop(columns='Hood')

    # run k-means clustering
    kmeansclustering = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

    # check cluster labels generated for each row in the dataframe
    return kmeansclustering.labels_

In [34]:
toronto_labels = kmeans(toronto_grouped)
paris_labels = kmeans(paris_grouped)
newyork_labels = kmeans(newyork_grouped)

In [35]:
toronto_labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [36]:
paris_labels

array([0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)

In [37]:
newyork_labels

array([0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
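
Label arrays like these are easier to judge as counts per cluster. As a quick sketch, `np.bincount` turns the Paris labels (copied from the output above) into cluster sizes:

```python
import numpy as np

# Paris labels copied from the output of In [36]
paris_labels = np.array([0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

# sizes[k] is the number of neighborhoods assigned to cluster k
sizes = np.bincount(paris_labels, minlength=3)
print(sizes)  # [15  3  2]
```

Fifteen of the 20 arrondissements fall into cluster 0, so one cluster dominates in Paris, as it also does in the Toronto and New York label arrays.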

The three functions below extract the top 10 venue categories of each neighborhood from the responses to our Foursquare queries, attach the k-means labels as the leftmost column, and then join the result onto the right of the three neighborhood tables.

In [38]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [39]:
def mcv(df_grouped, num_top_venues = 10):

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Hood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:  # past 'st', 'nd', 'rd', fall back to 'th'
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Hood'] = df_grouped['Hood']

    for ind in np.arange(df_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)
    return neighborhoods_venues_sorted

In [40]:
def assemble(df, df_grouped, df_labels):
    df_mcv = mcv(df_grouped)

    # add clustering labels
    df_mcv.insert(0, 'Cluster Labels', df_labels)

    df_merged = df

    # join the cluster labels and top venues onto the neighborhood table, which carries latitude/longitude
    df_merged = df_merged.join(df_mcv.set_index('Hood'), on='Hood')

    return df_merged
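
The core of `return_most_common_venues` is a descending sort of one row of category frequencies. The sketch below runs the same idea on a single made-up row. The values are hypothetical, and the cast to float is an addition for this example because the toy row mixes a string with numbers.

```python
import pandas as pd

# Toy row in the layout of the grouped frames: 'Hood' first, then category frequencies
row = pd.Series({'Hood': 'Louvre',
                 'Bakery': 0.1,
                 'French Restaurant': 0.5,
                 'Hotel': 0.3,
                 'Plaza': 0.1})

# Drop the 'Hood' label and keep the numeric frequencies
shares = row.iloc[1:].astype(float)

# The index labels of the largest values are the most common categories
top2 = shares.sort_values(ascending=False).index[:2].tolist()
print(top2)  # ['French Restaurant', 'Hotel']
```

Categories with equal frequency (here 'Bakery' and 'Plaza') have no guaranteed relative order, which is worth keeping in mind when reading the lower-ranked columns of the merged tables.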

In [41]:
toronto_merged = assemble(df_toronto, toronto_grouped, toronto_labels)
paris_merged = assemble(df_paris, paris_grouped, paris_labels)
newyork_merged = assemble(df_newyork, newyork_grouped, newyork_labels)

In [42]:
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Hood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Park,Pub,Theater,Breakfast Spot,Café,Restaurant,Beer Store,Spa
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Music Venue,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,Park,College Auditorium,College Cafeteria
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Cosmetics Shop,Restaurant,Café,Bubble Tea Shop,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Hotel
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Cocktail Bar,Gastropub,American Restaurant,Italian Restaurant,Restaurant,Beer Bar,Clothing Store,Moroccan Restaurant
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Neighborhood,Pub,Health Food Store,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Yoga Studio


In [43]:
paris_merged.head()

Unnamed: 0,Hood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Louvre,48.861147,2.338028,0,French Restaurant,Plaza,Hotel,Art Museum,Garden,Bakery,Italian Restaurant,Boutique,Cosmetics Shop,Historic Site
1,Bourse,48.86863,2.341474,0,French Restaurant,Wine Bar,Hotel,Cocktail Bar,Creperie,Bistro,Clothing Store,Bakery,Salad Place,Bookstore
2,Temple,48.8665,2.360708,0,French Restaurant,Hotel,Wine Bar,Restaurant,Art Gallery,Bakery,Cocktail Bar,Italian Restaurant,Bar,Sandwich Place
3,Hôtel-de-Ville,48.856426,2.352528,0,French Restaurant,Ice Cream Shop,Plaza,Wine Bar,Cosmetics Shop,Park,Art Gallery,Clothing Store,Restaurant,Bookstore
4,Panthéon,48.846191,2.346079,0,French Restaurant,Hotel,Bar,Bakery,Pub,Italian Restaurant,Indie Movie Theater,Café,Ice Cream Shop,Plaza


In [44]:
newyork_merged.head()

Unnamed: 0,Borough,Hood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,0.0,Pharmacy,Dessert Shop,Laundromat,Ice Cream Shop,Sandwich Place,Gas Station,Donut Shop,Factory,Eye Doctor,Exhibit
1,Bronx,Co-op City,40.874294,-73.829939,0.0,Bus Station,Baseball Field,Mattress Store,Discount Store,Pizza Place,Grocery Store,Pharmacy,Bagel Shop,Fast Food Restaurant,Ice Cream Shop
2,Bronx,Eastchester,40.887556,-73.827806,0.0,Caribbean Restaurant,Bus Station,Deli / Bodega,Diner,Pizza Place,Convenience Store,Metro Station,Seafood Restaurant,Fast Food Restaurant,Platform
3,Bronx,Fieldston,40.895437,-73.905643,0.0,Plaza,Athletics & Sports,River,Bus Station,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm
4,Bronx,Riverdale,40.890834,-73.912585,0.0,Park,Bus Station,Gym,Food Truck,Baseball Field,Medical Supply Store,Bank,Home Service,Playground,Plaza


The results seem okay, except for the float-typed cluster labels in the New York dataframe. Checking the unique values of the column, I found NaN. What's going on?

In [45]:
toronto_merged.isnull().values.any()

False

In [46]:
paris_merged.isnull().values.any()

False

In [47]:
newyork_merged.isnull().values.any()

True

Now we see the problem: joining the neighborhood table with the labels-and-venues table created NaN values. Digging deeper, I found that two neighborhoods in Staten Island got no venues back from our Foursquare queries. Since there are only two, we can drop them at this point. The remaining labels are safe: the earlier printout showed integer labels, so they are valid integers that are merely stored as floats.

In [48]:
newyork_merged.dropna(inplace = True)
newyork_merged.isnull().values.any()

False
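
The mechanism behind the NaN values and the float labels can be reproduced on a toy join. This is only a sketch with made-up neighborhoods: a left join keeps every row of the neighborhood table, fills the unmatched row with NaN, and pandas then stores the integer labels as floats because NaN has no integer representation in a plain integer column.

```python
import pandas as pd

# Hypothetical neighborhood table and labels; 'C' has no venues, hence no label
hoods = pd.DataFrame({'Hood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Hood': ['A', 'B'], 'Cluster Labels': [0, 1]})

merged = hoods.join(labels.set_index('Hood'), on='Hood')
label_dtype = str(merged['Cluster Labels'].dtype)

# Neighborhoods that found no match in the labels table
unmatched = merged.loc[merged['Cluster Labels'].isna(), 'Hood'].tolist()

# Drop them and restore the integer type
cleaned = merged.dropna().astype({'Cluster Labels': int})

print(label_dtype, unmatched)
print(cleaned)
```

Listing the unmatched names before dropping them, as done here, is a quick way to confirm which neighborhoods are being removed.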

## Results and Discussion <a name="results"></a>

Let's now visualize the neighborhoods and have a look at the groups. The function below creates a folium map of each city and marks the different groups in different colors.

In [49]:
def MapAndMarkers(address, kclusters, df):
    # look up the city's coordinates to center the map
    geolocator = Nominatim(user_agent="ny_explorer", timeout=3)
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme for the clusters: one evenly spaced rainbow color per cluster
    colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add one circle marker per neighborhood, colored by its cluster label
    for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Hood'], df['Cluster Labels'].astype(int)):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],  # label 0 wraps around to the last color
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    display(map_clusters)
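
To see how a cluster label is turned into a marker color, the color lookup can be run on its own. This is a small sketch that uses only the matplotlib calls already imported in this notebook (`cm.rainbow`, `colors.rgb2hex`); the helper name `cluster_palette` is made up here and is not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(kclusters):
    """Return one hex color per cluster, evenly spaced on the rainbow colormap."""
    colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
    return [colors.rgb2hex(rgba) for rgba in colors_array]


palette = cluster_palette(3)

# The map function indexes the palette with `cluster - 1`, so label 0 wraps
# around to the last color while labels 1 and 2 take the first two.
for cluster in range(3):
    print(cluster, palette[cluster - 1])
```

Because the wrap-around is still a one-to-one mapping, every cluster gets its own color; indexing with `cluster` directly would work equally well and only permute the colors.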

In [50]:
MapAndMarkers('Toronto, Canada', 3, toronto_merged)

In [51]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,0,Coffee Shop,Bakery,Park,Pub,Theater,Breakfast Spot,Café,Restaurant,Beer Store,Spa
4,Downtown Toronto,0,Coffee Shop,Music Venue,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,Park,College Auditorium,College Cafeteria
9,Downtown Toronto,0,Clothing Store,Coffee Shop,Cosmetics Shop,Restaurant,Café,Bubble Tea Shop,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Hotel
15,Downtown Toronto,0,Café,Coffee Shop,Cocktail Bar,Gastropub,American Restaurant,Italian Restaurant,Restaurant,Beer Bar,Clothing Store,Moroccan Restaurant
19,East Toronto,0,Trail,Neighborhood,Pub,Health Food Store,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Yoga Studio
20,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Bakery,Café,Cheese Shop,Beer Bar,Restaurant,Seafood Restaurant,Pub,Creperie
24,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Burger Joint,Japanese Restaurant,Ice Cream Shop,Bar,Thai Restaurant,Salad Place
25,Downtown Toronto,0,Grocery Store,Café,Park,Diner,Baby Store,Restaurant,Athletics & Sports,Italian Restaurant,Candy Store,Coffee Shop
30,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Deli / Bodega,Gym,Hotel,Clothing Store,Thai Restaurant,Sushi Restaurant,Concert Hall
31,West Toronto,0,Pharmacy,Bakery,Grocery Store,Pool,Brewery,Café,Bar,Bank,Supermarket,Middle Eastern Restaurant


The biggest group of neighborhoods in Toronto, as we can see, has many restaurants, bars, and cafés among its most popular venues. On the map, this group sits at the very center of the city, which is presumably its busiest part. The appearance of banks among the popular spots came as a surprise at first, but it is less surprising considering Toronto's role as the financial hub of Canada.
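
One way to back up this reading of the table is to tally how often each category appears across the ten "most common venue" columns of a cluster. The sketch below does this for only the first two rows printed above (indices 2 and 4), to keep it short. The helper name `tally_venues` is hypothetical; on the full data one would pass `toronto_merged[toronto_merged['Cluster Labels'] == 0]` instead of the small sample.

```python
import pandas as pd


def tally_venues(df):
    """Count category occurrences across all '... Most Common Venue' columns."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    return df[venue_cols].stack().value_counts()


# Rebuild the ten column names used in the tables above.
suffixes = ["st", "nd", "rd"] + ["th"] * 7
cols = [f"{i + 1}{suffixes[i]} Most Common Venue" for i in range(10)]

# Two rows copied from the printed Toronto cluster 0 table.
sample = pd.DataFrame(
    [
        ["Coffee Shop", "Bakery", "Park", "Pub", "Theater",
         "Breakfast Spot", "Café", "Restaurant", "Beer Store", "Spa"],
        ["Coffee Shop", "Music Venue", "Beer Bar", "Smoothie Shop", "Sandwich Place",
         "Burrito Place", "Café", "Park", "College Auditorium", "College Cafeteria"],
    ],
    columns=cols,
    index=[2, 4],
)

counts = tally_venues(sample)
# Categories that show up in more than one slot across the sampled rows.
print(counts[counts > 1])
```

A tally like this counts every top-ten slot equally, so it shows which categories recur in a cluster but says nothing about their rank within each neighborhood.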

In [52]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
61,Central Toronto,1,Park,Dim Sum Restaurant,Bus Line,Swim School,Event Space,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run
68,Central Toronto,1,Park,Jewelry Store,Trail,Bus Line,Sushi Restaurant,Yoga Studio,Department Store,Eastern European Restaurant,Donut Shop,Doner Restaurant
91,Downtown Toronto,1,Park,Playground,Trail,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run


The second biggest group of neighborhoods, as we can see, is located somewhat further from the center than the first group. The occurrence of parks, trails, and playgrounds hints at a more residential character.

In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,Central Toronto,2,Home Service,Garden,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run


The third group resembles the second group and even shares the same venues in the last few places. Geographically, it is also located further from the center.

In [54]:
MapAndMarkers('Paris, France', 3, paris_merged)

In [55]:
paris_merged.loc[paris_merged['Cluster Labels'] == 0, paris_merged.columns[[0] + list(range(3, paris_merged.shape[1]))]]

Unnamed: 0,Hood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Louvre,0,French Restaurant,Plaza,Hotel,Art Museum,Garden,Bakery,Italian Restaurant,Boutique,Cosmetics Shop,Historic Site
1,Bourse,0,French Restaurant,Wine Bar,Hotel,Cocktail Bar,Creperie,Bistro,Clothing Store,Bakery,Salad Place,Bookstore
2,Temple,0,French Restaurant,Hotel,Wine Bar,Restaurant,Art Gallery,Bakery,Cocktail Bar,Italian Restaurant,Bar,Sandwich Place
3,Hôtel-de-Ville,0,French Restaurant,Ice Cream Shop,Plaza,Wine Bar,Cosmetics Shop,Park,Art Gallery,Clothing Store,Restaurant,Bookstore
4,Panthéon,0,French Restaurant,Hotel,Bar,Bakery,Pub,Italian Restaurant,Indie Movie Theater,Café,Ice Cream Shop,Plaza
5,Luxembourg,0,Italian Restaurant,Wine Bar,French Restaurant,Plaza,Chocolate Shop,Clothing Store,Fountain,Seafood Restaurant,Café,Bistro
7,Élysée,0,Hotel,Sandwich Place,French Restaurant,Coffee Shop,Hotel Bar,Bar,Cosmetics Shop,Burger Joint,Fruit & Vegetable Store,Convenience Store
8,Opéra,0,Hotel,French Restaurant,Japanese Restaurant,Theater,Concert Hall,Bookstore,Plaza,Chocolate Shop,Taiwanese Restaurant,Korean Restaurant
9,Entrepôt,0,French Restaurant,Hotel,Coffee Shop,Bistro,Café,Restaurant,Japanese Restaurant,Indian Restaurant,Bar,Pizza Place
10,Popincourt,0,French Restaurant,Bar,Café,Pastry Shop,Bistro,Cocktail Bar,Italian Restaurant,Restaurant,Coffee Shop,Japanese Restaurant


The first group is the biggest. Again, it contains quite a lot of restaurants and bars. French restaurants and hotels take the first and second places in popularity in many neighborhoods, indicating that we are indeed looking at a tourist city.

In [56]:
paris_merged.loc[paris_merged['Cluster Labels'] == 1, paris_merged.columns[[0] + list(range(3, paris_merged.shape[1]))]]

Unnamed: 0,Hood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Palais-Bourbon,1,French Restaurant,Plaza,Food Truck,Italian Restaurant,Hotel,Pedestrian Plaza,Beer Garden,Cultural Center,Coffee Shop,Smoke Shop
13,Observatoire,1,French Restaurant,Hotel,Brasserie,Thai Restaurant,Fast Food Restaurant,Café,Food & Drink Shop,Modern European Restaurant,Bistro,Sushi Restaurant
18,Buttes-Chaumont,1,French Restaurant,Restaurant,Bar,Italian Restaurant,Pool,Park,Hotel,Gas Station,Latin American Restaurant,Beer Garden


In [57]:
paris_merged.loc[paris_merged['Cluster Labels'] == 2, paris_merged.columns[[0] + list(range(3, paris_merged.shape[1]))]]

Unnamed: 0,Hood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Butte-Montmartre,2,Bar,French Restaurant,Bistro,Italian Restaurant,Pizza Place,Middle Eastern Restaurant,Fast Food Restaurant,Sandwich Place,Coffee Shop,Convenience Store
19,Ménilmontant,2,Bar,Pizza Place,Cocktail Bar,Italian Restaurant,Hotel,Burger Joint,Brewery,French Restaurant,Beer Bar,Restaurant


The second and third groups are quite similar in their categories: both have many popular restaurants, among which French ones are the most dominant. They are also located further from the center than the first group. However, the top venues of neighborhoods in Paris are generally not as diverse in function as those in Toronto.

In [58]:
MapAndMarkers('New York, NY', 3, newyork_merged)

In [59]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 0, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

Unnamed: 0,Hood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Wakefield,Pharmacy,Dessert Shop,Laundromat,Ice Cream Shop,Sandwich Place,Gas Station,Donut Shop,Factory,Eye Doctor,Exhibit
1,Co-op City,Bus Station,Baseball Field,Mattress Store,Discount Store,Pizza Place,Grocery Store,Pharmacy,Bagel Shop,Fast Food Restaurant,Ice Cream Shop
2,Eastchester,Caribbean Restaurant,Bus Station,Deli / Bodega,Diner,Pizza Place,Convenience Store,Metro Station,Seafood Restaurant,Fast Food Restaurant,Platform
3,Fieldston,Plaza,Athletics & Sports,River,Bus Station,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm
4,Riverdale,Park,Bus Station,Gym,Food Truck,Baseball Field,Medical Supply Store,Bank,Home Service,Playground,Plaza
5,Kingsbridge,Pizza Place,Bar,Supermarket,Latin American Restaurant,Sandwich Place,Mexican Restaurant,Deli / Bodega,Pharmacy,Donut Shop,Bakery
6,Marble Hill,Sandwich Place,Gym,Coffee Shop,Steakhouse,Pizza Place,Tennis Stadium,Seafood Restaurant,Miscellaneous Shop,Bank,Pharmacy
7,Woodlawn,Deli / Bodega,Pub,Pizza Place,Playground,Grocery Store,Food Truck,Bar,Train Station,Donut Shop,Trail
8,Norwood,Pizza Place,Chinese Restaurant,Bank,Park,Pharmacy,Sandwich Place,Mobile Phone Shop,Caribbean Restaurant,Fast Food Restaurant,Coffee Shop
9,Williamsbridge,Caribbean Restaurant,Nightclub,Soup Place,Bar,Zoo Exhibit,Filipino Restaurant,Exhibit,Eye Doctor,Factory,Falafel Restaurant


The biggest group in NYC, as we can see, does not have as many restaurants and bars in the top places as the biggest groups in Toronto or Paris. The functions seem more diverse: every neighborhood seems to have a bit of everything. My experience in NYC agrees with this observation; each neighborhood functions like a city of its own.

In [60]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 1, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

Unnamed: 0,Hood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
192,Somerville,Park,Zoo Exhibit,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market
203,Todt Hill,Park,Zoo Exhibit,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market
303,Bayswater,Park,Playground,Zoo Exhibit,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm


The second group seems to have more space available for parks, zoos, and even farms and factories. Geographically, these neighborhoods are further from the center.

In [61]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 2, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

Unnamed: 0,Hood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
198,New Brighton,Bus Stop,Park,Daycare,Deli / Bodega,Playground,Bowling Alley,Discount Store,Zoo Exhibit,Fast Food Restaurant,Filipino Restaurant
202,Grymes Hill,Dog Run,Bus Stop,Zoo Exhibit,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
224,Park Hill,Bus Stop,Coffee Shop,Gym / Fitness Center,Hotel,Athletics & Sports,Zoo Exhibit,Filipino Restaurant,Exhibit,Eye Doctor,Factory
226,Graniteville,Bus Stop,Sandwich Place,Food Truck,Grocery Store,Zoo Exhibit,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
227,Arlington,Bus Stop,Deli / Bodega,Coffee Shop,Tree,Home Service,Grocery Store,Zoo Exhibit,Field,Event Space,Exhibit
256,Randall Manor,Bus Stop,Park,Deli / Bodega,Bagel Shop,Pizza Place,Playground,Zoo Exhibit,Filipino Restaurant,Exhibit,Eye Doctor
258,Elm Park,Bus Stop,Italian Restaurant,Deli / Bodega,American Restaurant,Ice Cream Shop,Pizza Place,Financial or Legal Service,Exhibit,Eye Doctor,Factory
285,Willowbrook,Bus Stop,Deli / Bodega,Spa,Pizza Place,Zoo Exhibit,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Factory
286,Sandy Ground,Bus Stop,Intersection,Market,Zoo Exhibit,Financial or Legal Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
305,Fox Hills,Bus Station,Bus Stop,Grocery Store,Sandwich Place,Zoo Exhibit,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Factory


The second biggest group lies somewhat in between the biggest group and the smallest one. We see more bus stops, food places, and grocery stores than in the previous group of neighborhoods, but also plenty of exhibits and factories, unlike in the biggest group. So this group seems to sit between residential and industrial.

## Conclusion <a name="conclusion"></a>

In conclusion, from what we have discovered so far, New York City has neighborhoods that are less uniform and more complete in the range of functions they offer, whereas Toronto and Paris both have more restaurants among their popular venues. From this preliminary study, we can conclude that Toronto probably shares more in common with Paris than with New York City. However, the popularity of French restaurants and hotels in Paris is unique: we do not see any cuisine being so dominant in the other two cities. This popularity demonstrates the important role of tourism in Paris.