# Part 1: Getting the data - Scraping

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import requests

We will use `requests` to download the Wikipedia page and BeautifulSoup to parse it.

In [2]:
data = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802')
soup = BeautifulSoup(data.content, "html.parser")

Now let's create a list for each future column of our dataframe, and a list of all cells that will be present in the table.

In [3]:
table = soup.find('table')
postal_code = []
borough = []
neighborhood = []
# text of every <td> cell, in document order (three cells per table row)
cells = [cell.get_text(strip=True) for cell in table.find_all('td')]

The table has three columns, so the flat list of cells can be split by position. For $i = 0, 1, 2, \dots$:
* Cell $3i$ is a postal code
* Cell $3i + 1$ is a borough
* Cell $3i + 2$ is a neighborhood

In [4]:
for i in range(len(cells)//3):
    # skip rows whose borough is not assigned
    if cells[3*i+1] != 'Not assigned':
        postal_code.append(cells[3*i])
        borough.append(cells[3*i+1])
        neighborhood.append(cells[3*i+2])
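
The same positional split can also be written with slices that step by 3. The sketch below is only an illustration on a small hand-made list (the first row is a made-up unassigned entry), not the scraped data.

```python
# Toy stand-in for the flat `cells` list: code, borough, neighborhood, repeated.
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M5A", "Downtown Toronto", "Regent Park, Harbourfront",
]

# Every third element, starting at offsets 0, 1 and 2.
codes = cells[0::3]
boroughs = cells[1::3]
neighborhoods = cells[2::3]

# Keep only the rows whose borough is assigned.
rows = [
    (code, boro, hood)
    for code, boro, hood in zip(codes, boroughs, neighborhoods)
    if boro != "Not assigned"
]
print(rows)
```

The resulting list of tuples could be passed directly to `pd.DataFrame(rows, columns=[...])`.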

Let's now create the dataframe.

In [5]:
df = pd.DataFrame(postal_code, columns=['PostalCode'])
df['Borough'] = borough
df['Neighborhood'] = neighborhood


print('Number of rows:', len(df))
print('Number unique postal codes:', df['PostalCode'].nunique())
print('Number of unassigned neighborhoods:', len(df[df['Neighborhood'] == 'Not assigned']))

Number of rows: 103
Number unique postal codes: 103
Number of unassigned neighborhoods: 0


All neighborhoods are assigned and there are no duplicate postal codes: neighborhoods that share a postal code are already combined into a single comma-separated cell, as shown below.<br>
To be fair, this preprocessing had already been done on the Wikipedia page that I used.

In [6]:
df[df['PostalCode'] == 'M5A']['Neighborhood'].iloc[0]

'Regent Park, Harbourfront'

In [7]:
print(df.shape)

(103, 3)


# Part 2: Getting the coordinates

Let's load the Latitude/Longitude dataframe and rename its `Postal Code` column to `PostalCode`, so that we can merge on it with pandas.<br>
The geocoder kept returning `None`, which is why I used the csv file instead.

In [8]:
coord = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
coord = coord.rename(columns={'Postal Code': 'PostalCode'})

Let's now merge the dataframes on the postal code to get the coordinates for each postal code.

In [9]:
df = df.merge(right=coord, on='PostalCode', how='left')
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
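
A left merge silently leaves `NaN` coordinates for any postal code missing from the csv file. As a minimal sketch on toy frames (the `M9Z` row and its borough are made up for illustration), the `validate` and `indicator` parameters of `DataFrame.merge` can make such cases visible:

```python
import pandas as pd

# Toy frames standing in for `df` and `coord`; the M9Z row is hypothetical.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a `_merge` column telling where each row came from.
merged = left.merge(right, on="PostalCode", how="left",
                    validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.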


Although we will not use it in the remainder of this notebook, we also create a dataframe limited to the boroughs whose name contains "Toronto".

In [10]:
# boolean mask: True for boroughs whose name contains 'Toronto'
is_toronto = df['Borough'].str.contains('Toronto')
toronto = df[is_toronto].reset_index(drop=True)
print('Shape of the Toronto subset:', toronto.shape)
toronto

Shape of the Toronto subset: (40, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


# Part 3: Segmenting and visualizing the clusters

## Getting the data ready

Let's import the various libraries for the analysis

In [11]:
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium

from geopy.geocoders import Nominatim

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


First, we geocode Toronto so that we can center the folium map on it.

In [12]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


I have decided to include neighborhoods that are not in Toronto but that are close nonetheless, since the exact geographical borders of the city might not have definite value when thinking about relocating.

In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The credentials below are placeholders, so the following cells will not run unless you enter your own Foursquare credentials.

In [18]:
CLIENT_ID = 'XXXX' 
CLIENT_SECRET = 'XXXX'
ACCESS_TOKEN = 'XXXX'
VERSION = 'XXXX'
LIMIT = 100

Let's map each neighborhood to its nearby venues.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
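
The function above indexes straight into the JSON response, so a response without groups or a venue without categories would raise an exception. The sketch below separates the parsing step into a helper that tolerates those cases. The name `parse_explore_items` and the two payloads are hypothetical; the key nesting simply mirrors the keys used in `getNearbyVenues`.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore-style JSON payload into row tuples.

    Missing groups yield no rows and a venue without categories gets
    a None category, instead of raising KeyError or IndexError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payloads shaped like the response indexed in the loop above.
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Some Park",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Park"}]}}]}]}}
empty = {"response": {}}

print(parse_explore_items(ok, "Parkwoods", 43.753259, -79.329656))
print(parse_explore_items(empty, "Upper Rouge", 43.836125, -79.205636))
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without credentials.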

In [20]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

The next few cells group the venues by neighborhood: we first one-hot encode the venue categories, then aggregate the encoded rows per neighborhood.

In [21]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
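
In the output above, `Yoga Studio` is the first column rather than `Neighborhood`. One possible explanation (an assumption, not checked against the data) is that one of the venue categories is itself called `Neighborhood`: the assignment would then overwrite that dummy column in place instead of appending a new last column, and the reordering would move `Yoga Studio` to the front. A minimal sketch on toy data that avoids such a collision by renaming the dummy column first (the new column name is arbitrary):

```python
import pandas as pd

# Toy venues; the category literally named "Neighborhood" is the tricky case.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Neighborhood", "Yoga Studio"],
})

onehot = pd.get_dummies(venues["Venue Category"])
# Rename a colliding dummy column so the insert below cannot clash with it.
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
# Put the neighborhood name in first position explicitly.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(list(onehot.columns))
```

`DataFrame.insert` raises if the column already exists, so a remaining collision would fail loudly instead of overwriting data.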


In [22]:
toronto_onehot['Neighborhood']

0                                               Parkwoods
1                                               Parkwoods
2                                               Parkwoods
3                                        Victoria Village
4                                        Victoria Village
                              ...                        
2114    Mimico NW, The Queensway West, South of Bloor,...
2115    Mimico NW, The Queensway West, South of Bloor,...
2116    Mimico NW, The Queensway West, South of Bloor,...
2117    Mimico NW, The Queensway West, South of Bloor,...
2118    Mimico NW, The Queensway West, South of Bloor,...
Name: Neighborhood, Length: 2119, dtype: object

In [23]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.035714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.029412,0.0,0.0,0.0,0.000000
91,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.142857,0.000000,0.0,0.0,0.0,0.000000
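
Taking the mean of the one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues falling in the category. A value such as 0.142857 in the table above is consistent with one venue out of seven. A small illustration on made-up data:

```python
import pandas as pd

# Neighborhood A has four venues (one park), B has a single coffee shop.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Park":        [1, 0, 0, 0, 0],
    "Coffee Shop": [0, 1, 1, 1, 1],
})

# Mean of a 0/1 column within a group = fraction of venues in that category.
shares = onehot.groupby("Neighborhood").mean()
print(shares)
```

Each row of `shares` sums to 1, which makes neighborhoods with very different venue counts comparable.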


In [24]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

The number of top venues could be changed depending on the aim of your analysis.<br>
In our case, since we only want a brief exploration, the arbitrary choice of 10 seems satisfying.

In [25]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Skating Rink,Latin American Restaurant,Clothing Store,Breakfast Spot,Yoga Studio,Middle Eastern Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Gym,Dance Studio,Coffee Shop,Pub,Sandwich Place,Miscellaneous Shop,Molecular Gastronomy Restaurant,Modern European Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pharmacy,Deli / Bodega,Park,Sandwich Place,Middle Eastern Restaurant,Mobile Phone Shop,Restaurant,Fried Chicken Joint
3,Bayview Village,Café,Japanese Restaurant,Bank,Chinese Restaurant,Motel,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Restaurant,Sandwich Place,Butcher,Sushi Restaurant,Juice Bar,Spa,Liquor Store,Fast Food Restaurant
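
Several rows above end with the same categories (`Molecular Gastronomy Restaurant`, `Modern European Restaurant`, `Mobile Phone Shop`, ...). This is what one would expect when a neighborhood has fewer than ten distinct categories and the remaining slots are filled by categories tied at a frequency of zero, although this has not been verified here. A sketch of a variant that keeps only categories actually present, on toy data (the helper name `top_nonzero_categories` is hypothetical):

```python
import pandas as pd


def top_nonzero_categories(row, num_top_venues):
    """Return up to `num_top_venues` category names whose share is above zero."""
    shares = row.iloc[1:].astype(float)  # skip the leading 'Neighborhood' cell
    shares = shares[shares > 0].sort_values(ascending=False)
    return shares.index[:num_top_venues].tolist()


# One toy neighborhood with three categories present and two absent.
grouped = pd.DataFrame({
    "Neighborhood": ["A"],
    "Bar": [0.0], "Lounge": [0.5], "Park": [0.25], "Pub": [0.25], "Zoo": [0.0],
})

print(top_nonzero_categories(grouped.iloc[0], 10))
```

The returned list can be shorter than `num_top_venues`, so it would need padding (for example with `None`) before being written into a fixed-width dataframe row.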


## Clustering

We will now build the model.

In [26]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [27]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Food & Drink Shop,Fast Food Restaurant,Park,Yoga Studio,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Metro Station
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Coffee Shop,Hockey Arena,Portuguese Restaurant,Yoga Studio,Mexican Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,Café,Pub,Event Space,Chocolate Shop,Cosmetics Shop
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Boutique,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Coffee Shop,Furniture / Home Store,Gift Shop,Moroccan Restaurant,Monument / Landmark
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Sandwich Place,Bar,Fried Chicken Joint,Beer Bar,Creperie,Mexican Restaurant,Burger Joint


Let's check that every location received a cluster label and its list of most common venues.

In [28]:
toronto_merged.isna().sum()

PostalCode                0
Borough                   0
Neighborhood              0
Latitude                  0
Longitude                 0
Cluster Labels            4
1st Most Common Venue     4
2nd Most Common Venue     4
3rd Most Common Venue     4
4th Most Common Venue     4
5th Most Common Venue     4
6th Most Common Venue     4
7th Most Common Venue     4
8th Most Common Venue     4
9th Most Common Venue     4
10th Most Common Venue    4
dtype: int64

Four rows have neither a cluster label nor venue data. Let's investigate.

In [29]:
toronto_merged[toronto_merged['Cluster Labels'].isna()]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,,,,,,,,,,,
11,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724,,,,,,,,,,,
45,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,,,,,,,,,,,
95,M1X,Scarborough,Upper Rouge,43.836125,-79.205636,,,,,,,,,,,


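These four neighborhoods never appeared in the venue data, so the Foursquare query returned no venues for them and they received no cluster label. We can safely drop these rows (they probably correspond to uninhabited neighborhoods).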

In [30]:
toronto_merged.dropna(inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype('int') # This will be useful to encode the colors

Let us now visualize the clusters.

In [31]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# one color per cluster, evenly spaced along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Cluster investigation

Let's now investigate each cluster.

In [32]:
c0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print('Number of points in cluster 0:', len(c0))
c0.head()

Number of points in cluster 0: 9


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0,Pizza Place,Coffee Shop,Hockey Arena,Portuguese Restaurant,Yoga Studio,Mexican Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
8,East York,0,Pizza Place,Athletics & Sports,Gastropub,Intersection,Flea Market,Pharmacy,Breakfast Spot,Bank,Gym / Fitness Center,Train Station
32,Scarborough,0,Pizza Place,Playground,Yoga Studio,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant
50,North York,0,Gym,Pizza Place,Home Service,Yoga Studio,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
70,Etobicoke,0,Pizza Place,Coffee Shop,Discount Store,Chinese Restaurant,Sandwich Place,Middle Eastern Restaurant,Intersection,Mexican Restaurant,Metro Station,Motel


In [33]:
c1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print('Number of points in cluster 1:', len(c1))
c1.head()

Number of points in cluster 1: 76


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,1,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,Café,Pub,Event Space,Chocolate Shop,Cosmetics Shop
3,North York,1,Clothing Store,Boutique,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Coffee Shop,Furniture / Home Store,Gift Shop,Moroccan Restaurant,Monument / Landmark
4,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Sandwich Place,Bar,Fried Chicken Joint,Beer Bar,Creperie,Mexican Restaurant,Burger Joint
6,Scarborough,1,Fast Food Restaurant,Print Shop,Metro Station,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Men's Store,Monument / Landmark
7,North York,1,Gym,Clothing Store,Restaurant,Coffee Shop,Asian Restaurant,Supermarket,Caribbean Restaurant,Sushi Restaurant,Shopping Mall,Sporting Goods Shop


In [34]:
c2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print('Number of points in cluster 2:', len(c2))
c2.head()

Number of points in cluster 2: 2


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,North York,2,Gym,Yoga Studio,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Men's Store
83,Central Toronto,2,Trail,Gym,Yoga Studio,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Metro Station,Men's Store


In [35]:
c3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print('Number of points in cluster 3:', len(c3))
c3.head()

Number of points in cluster 3: 1


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Scarborough,3,Bar,Yoga Studio,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Metro Station,Lounge


In [36]:
c4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print('Number of points in cluster 4:', len(c4))
c4.head()

Number of points in cluster 4: 11


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,4,Food & Drink Shop,Fast Food Restaurant,Park,Yoga Studio,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant,Metro Station
21,York,4,Park,Women's Store,Pool,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant
35,East York,4,Intersection,Park,Convenience Store,Mexican Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
61,Central Toronto,4,Park,Swim School,Bus Line,Yoga Studio,Mexican Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station
64,York,4,Park,Convenience Store,Yoga Studio,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mexican Restaurant
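
The five cells above repeat the same selection with a different label each time, which makes copy-paste slips easy. As a sketch on a toy frame (the rows are illustrative, not the real data), the cluster sizes and subsets can be obtained in one pass:

```python
import pandas as pd

# Toy stand-in for `toronto_merged`, reduced to three columns.
merged = pd.DataFrame({
    "Borough": ["North York", "York", "Scarborough", "York"],
    "Cluster Labels": [4, 4, 3, 0],
    "1st Most Common Venue": ["Park", "Park", "Bar", "Pizza Place"],
})

# Size of every cluster, ordered by label.
sizes = merged["Cluster Labels"].value_counts().sort_index()

# One sub-dataframe per cluster, keyed by its label.
subsets = {label: group for label, group in merged.groupby("Cluster Labels")}

for label, group in subsets.items():
    print(f"Number of points in cluster {label}: {len(group)}")
```

Because each count is computed from the group it describes, the printed sizes cannot get out of sync with the subsets.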


We will now try to characterize each cluster. This is a shallow first pass, meant as the starting point for a more thorough investigation.

In [37]:
def ten_most_common(cluster):
    
    dic = dict(cluster['1st Most Common Venue'].value_counts())
    venues = ['2nd Most Common Venue', '3rd Most Common Venue',
       '4th Most Common Venue', '5th Most Common Venue',
       '6th Most Common Venue', '7th Most Common Venue',
       '8th Most Common Venue', '9th Most Common Venue',
       '10th Most Common Venue']
    for venue in venues:
        dic_tmp = dict(cluster[venue].value_counts())
        for key in dic_tmp.keys():
            if key in dic.keys():
                dic[key] += dic_tmp[key]
            else:
                dic[key] = dic_tmp[key]
    
    sorted_list = sorted(dic.items(), key=lambda x:x[1], reverse=True)
    sorted_dict = dict(sorted_list[:10])
    return(sorted_dict)
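
The same tally can be written more compactly with `collections.Counter`. This is an alternative sketch, not the function used for the results below; the name `most_common_in_cluster` is hypothetical, and ties may be ordered differently than in `ten_most_common`.

```python
from collections import Counter

import pandas as pd


def most_common_in_cluster(cluster, n=10):
    """Count category mentions across all '... Most Common Venue' columns."""
    venue_cols = [c for c in cluster.columns if c.endswith("Most Common Venue")]
    counts = Counter(cluster[venue_cols].to_numpy().ravel())
    return dict(counts.most_common(n))


# Toy cluster with two neighborhoods and three ranked columns.
toy = pd.DataFrame({
    "Borough": ["X", "Y"],
    "1st Most Common Venue": ["Park", "Gym"],
    "2nd Most Common Venue": ["Gym", "Park"],
    "3rd Most Common Venue": ["Trail", "Park"],
})

print(most_common_in_cluster(toy, n=2))
```

Selecting the columns by suffix means the helper keeps working if `num_top_venues` is changed.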

In [39]:
df_cluster = pd.DataFrame()
for i, cluster in enumerate([c0, c1, c2, c3, c4]):
    most_common = ten_most_common(cluster)
    df_cluster['Cluster ' + str(i)] = [key + ' (' + str(val) + ')' for key, val in most_common.items()]
    
df_cluster

Unnamed: 0,Cluster 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4
0,Pizza Place (9),Coffee Shop (42),Gym (2),Bar (1),Park (11)
1,Pharmacy (5),Restaurant (29),Yoga Studio (2),Yoga Studio (1),Mobile Phone Shop (11)
2,Molecular Gastronomy Restaurant (5),Café (26),Metro Station (2),Molecular Gastronomy Restaurant (1),Mexican Restaurant (11)
3,Modern European Restaurant (5),Mobile Phone Shop (22),Modern European Restaurant (2),Modern European Restaurant (1),Miscellaneous Shop (11)
4,Miscellaneous Shop (5),Middle Eastern Restaurant (21),Mobile Phone Shop (2),Mobile Phone Shop (1),Middle Eastern Restaurant (11)
5,Coffee Shop (4),Yoga Studio (19),Miscellaneous Shop (2),Miscellaneous Shop (1),Modern European Restaurant (10)
6,Mexican Restaurant (4),Mexican Restaurant (19),Middle Eastern Restaurant (2),Middle Eastern Restaurant (1),Metro Station (8)
7,Mobile Phone Shop (4),Modern European Restaurant (19),Mexican Restaurant (2),Mexican Restaurant (1),Yoga Studio (7)
8,Yoga Studio (3),Miscellaneous Shop (18),Men's Store (2),Metro Station (1),Molecular Gastronomy Restaurant (6)
9,Intersection (3),Park (16),Trail (1),Lounge (1),Convenience Store (3)


It seems that:
* Cluster 0 groups lively neighborhoods located away from downtown
* Cluster 1 is downtown
* Cluster 2 is a small outlier cluster (two neighborhoods)
* Cluster 3 is a single outlier (one neighborhood)
* Cluster 4 is the quieter, less business-oriented side of downtown

Of course, this conclusion could be improved with a deeper analysis, and other readers may interpret the same data differently.<br>
Also, keep in mind that the results may vary greatly depending on the initialization of the clustering algorithm, which is fixed here by `random_state=0`.
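
One way to gauge that sensitivity is to refit k-means with several seeds and compare each labelling with a reference run using the adjusted Rand index (1.0 means identical partitions). The sketch below uses random synthetic data as a stand-in for the venue-frequency matrix, so its numbers say nothing about the Toronto results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for `toronto_grouped_clustering` (60 rows, 8 features).
rng = np.random.default_rng(0)
X = rng.random((60, 8))

# Reference labelling, then one refit per alternative seed.
reference = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
scores = [
    adjusted_rand_score(
        reference,
        KMeans(n_clusters=5, n_init=10, random_state=seed).fit_predict(X),
    )
    for seed in range(1, 6)
]

print(scores)
```

Scores well below 1.0 on the real matrix would indicate that the cluster descriptions above depend heavily on the seed.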