## Segmenting and Clustering Neighborhoods in Toronto

In the first step, we import all the libraries and modules that we need for creating the DataFrame.

In [1]:
#import library/module to open URLs
import urllib.request

#import library we later use to work with the html from the wikipedia page
from bs4 import BeautifulSoup

#import pandas and numpy for further data wrangling/cleaning/analysis
import pandas as pd
import numpy as np

We start with assigning the URL for the Wikipedia Page to a variable "url".
This variable then serves as the input for the function "urlopen" which gets the html-file for the Wikipedia Page.
The result is saved in a variable called "page".

In [2]:
#assign URL to variable
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#use urllib.request to save the html
page = urllib.request.urlopen(url)

In the next step, we use BeautifulSoup to parse the html-file into an easier-to-read tree format.

In [3]:
#parse html into BeatifulSoup format
soup = BeautifulSoup(page, "lxml")

Using the function "find_all" we extract all "table"-items from the Wikipedia Page.

In [4]:
#show all tables that belong to the wikipedia page
all_tables = soup.find_all("table")
all_tables

[<table class="wikitable sortable">
 <tbody><tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighbourhood
 </th></tr>
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>
 <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>
 <tr>
 <td>M3A
 </td>
 <td>North York
 </td>
 <td>Parkwoods
 </td></tr>
 <tr>
 <td>M4A
 </td>
 <td>North York
 </td>
 <td>Victoria Village
 </td></tr>
 <tr>
 <td>M5A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Regent Park, Harbourfront
 </td></tr>
 <tr>
 <td>M6A
 </td>
 <td>North York
 </td>
 <td>Lawrence Manor, Lawrence Heights
 </td></tr>
 <tr>
 <td>M7A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Queen's Park, Ontario Provincial Government
 </td></tr>
 <tr>
 <td>M8A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>
 <tr>
 <td>M9A
 </td>
 <td>Etobicoke
 </td>
 <td>Islington Avenue, Humber Valley Village
 </td></tr>
 <tr>
 <td>M1B
 </td>
 <td>Scarborough
 </td>
 <td>Malvern, Rouge
 </td></tr>
 <tr>
 <td>M2B

From the output above we see that the relevant table is called "wikitable sortable". Using the function "find" we extract the items belonging to this table only.

In [5]:
#store the relevant table into variable
source_table = soup.find('table', class_ ="wikitable sortable")

#look at the table
source_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

From the output above we see that inside the table 'wikitable sortable' every row starts with a '\<tr>' and end with a '\</tr>'. <br>
Also, inside a single row every item starts with a '\<td>' and ends with a '\</td>' (we do not need the header items as we know each of the three columns stands for). <br>
The following code uses this information to extract all items inside of the table and saves them in three list (one for each column).

In [6]:
#create empty lists for columns
A = [] #list for the Postal Code Items
B = [] #list for the Borough Items
C = [] #list for the Neighbourhood Items

for row in source_table.findAll('tr'): #we use this for-loop to loop over every single row of the table
    items = row.findAll('td') #we use this variable to distinguish between the different items that belong to a single row
    if len(items)==3: #check whether there are 3 items (Postal Code, Borough, Neighbourhood) in the current row
        A.append(items[0].find(text=True))
        B.append(items[1].find(text=True))
        C.append(items[2].find(text=True))

In [7]:
#we know use the list created above to create a DataFrame
PostalCodes = pd.DataFrame(A, columns = ['PostalCode'])
PostalCodes['Borough'] = B
PostalCodes['Neighborhood'] = C

#let#s look at the dataframe
PostalCodes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [8]:
#we see that there a '\n'-endings we do not want, so we get rid of them using the function split
number_of_rows = PostalCodes.shape[0]

for i in range(number_of_rows): #loop over each row
    for j in range(3): #loop over each column
        PostalCodes.iloc[i, j] = PostalCodes.iloc[i,j].split('\n')[0]

PostalCodes.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


In [9]:
#We now ignore rows with a borough that is Not assigned
msk = []
for i in range(PostalCodes.shape[0]):
    if PostalCodes.loc[i, 'Borough'] == 'Not assigned':
        msk.append(False)
    else:
        msk.append(True)

msk[0:10]

[False, False, True, True, True, True, True, False, True, True]

In [10]:
PostalCodes = PostalCodes[msk]

In [11]:
PostalCodes.index = np.arange(PostalCodes.shape[0])

In [12]:
PostalCodes

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [13]:
len(PostalCodes['PostalCode'].unique())

103

In [14]:
PostalCodes[PostalCodes['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood


In [15]:
#number of rows in dataframe
PostalCodes.shape[0]

103

In [16]:
type(PostalCodes)

pandas.core.frame.DataFrame

Now that we have a dataframe of the Postal Codes of Toronto along with the borough name and neighborhood name, we want to add additional columns that list the geographic coordinates (latitude and longitude) to each postal code. <br>
Unfortunately, both - the geocoder and the link to the csv file (geospatial data) - didn't worked in my case. So, I decided to gather the geoinformation for the postal codes using the web service I found on the website https://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/web-services/geolocation-service/17304.

In [17]:
#module for starting a get request
import requests
#import module for transforming the result of the get request
import json

In [18]:
#initialize the new columns for Latitude and Longitude
PostalCodes['Latitude'] = ["None" for item in PostalCodes['PostalCode']]
PostalCodes['Longitude'] = ["None" for item in PostalCodes['PostalCode']]

#create a list that is later used for looping
pc_list = PostalCodes['PostalCode']

#loop over all postal codes
for i, pc in enumerate(pc_list):
    #assign the API url that we use to find the coordinates of the current postal code
    url_pc = 'http://geogratis.gc.ca/services/geolocation/en/locate?q={}'.format(pc)
    #call the get request and save the result in a variable
    result = requests.get(url_pc)
    #transform the result of the get request into json-format
    result_json = result.json()
    #the following lines assign the coordinates to the corresponding Postal Code inside our dataframe
    #we need the "try: ... except: ..." as there is one Postal Code that does not gives a result using the web service
    try:
        location = result_json[0]['geometry']['coordinates']
        PostalCodes['Latitude'][i] = location[1]
        PostalCodes['Longitude'][i] = location[0]
    except:
        PostalCodes['Latitude'][i] = 0.00
        PostalCodes['Longitude'][i] = 0.00

PostalCodes

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7528,-79.3296
1,M4A,North York,Victoria Village,43.7234,-79.3129
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6474,-79.3529
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7248,-79.4514
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6637,-79.3921
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6544,-79.5112
99,M4Y,Downtown Toronto,Church and Wellesley,43.6672,-79.3816
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7218,-79.2841
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6347,-79.4917


Let's see for which postal codes we could not find the geocoordinates.

In [19]:
PostalCodes[PostalCodes['Latitude']== 0.00]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,0,0


In the output above, we see that luckily there's only a single postal code for which the coordinates could not be found. <br>
We verify that this is the only Postal Code with the borough Mississauga with the following line of code:

In [20]:
PostalCodes[PostalCodes['Borough']=='Mississauga'].shape[0]

1

We could easily let this line out of the dataframe, but I decided to add the geocoordinates manually (using the help of Google Maps).

In [21]:
PostalCodes.iloc[76, 3] = 43.5897
PostalCodes.iloc[76, 4] = -79.6453

Now that our dataframe for the postal codes / neighborhoods of Toronto is ready, we start to explore and cluster the neighborhoods like we did in the Hands-on-Lab with the neighborhoods of Manhattan, New York.

In the first step, we define our Foursquare Credentials.

In [22]:
#define Foursquare Credentials
CLIENT_ID = '2ZEUWWXANBTVKSR5ZZ1VA0B44PMXJEZYNDLZ1KSNRMJ4Q3P3'
CLIENT_SECRET = 'TFWU2LVU042GXAPJ1IEOHD22Y1AON1PUVRAORD1AJITQQTD0 '
VERSION = '20201030'
LIMIT = 100

Now we define s function a function that communicates with the Foursquare API and explores the venues are nearby the coordinates we assigned to the postal codes.

In [23]:
def getNearbyVenues(boroughs, names, latitudes, longitudes, radius = 500):
    venues_list = [] #empty list that will be filled by the following loop
    for borough, name, lat, lon, in zip(boroughs, names, latitudes, longitudes):
        #asign the url with the Foursquare Credentials and the current latitude and longitude
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID,
                                                                                                                                  CLIENT_SECRET,
                                                                                                                                  VERSION,
                                                                                                                                  lat,
                                                                                                                                  lon,
                                                                                                                                  radius,
                                                                                                                                  LIMIT)
        #now call the get-request and save the relevant data in a variable called results
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        #append the new venue names, location and category to the venues_list
        venues_list.append([(borough, name, lat, lon, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])
    #transform the data gathered in list venues_list into a pandas DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    #rename the columns of the new dataframe
    nearby_venues.columns = ['Borough','Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return(nearby_venues)                                   

We let the function above run for our PostalCodes-dataframe in order to retrieve venues for each PostalCode.

In [24]:
toronto_venues = getNearbyVenues(PostalCodes['Borough'], PostalCodes['Neighborhood'], PostalCodes['Latitude'], PostalCodes['Longitude'])

In [25]:
toronto_venues.head()

Unnamed: 0,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York,Parkwoods,43.752804,-79.32959,Brookbanks Park,43.751976,-79.33214,Park
1,North York,Parkwoods,43.752804,-79.32959,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,North York,Victoria Village,43.723358,-79.312927,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,North York,Victoria Village,43.723358,-79.312927,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,North York,Victoria Village,43.723358,-79.312927,Tim Hortons,43.725517,-79.313103,Coffee Shop


To categorize and cluster the neigborhoods, we will now do a One-Hot-Encoding to the Dataframe above which we will later group by Borough/Neighborhood.<br>
In the following part, we will focus on the Boroughs that contain the word Toronto.

In [26]:
#check for every borough entry if it contains the word "Toronto" and save the result in a list
msk = []
for bor in toronto_venues['Borough']:
    test = "Toronto" in bor
    msk.append(test)

#exclude the boroughs that do no contain the word "Toronto"
toronto_venues = toronto_venues[msk]

In [27]:
#one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix = "", prefix_sep = "")
#add a column which display the Neigborhood
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
#reorder columns so that the 'Neighborhood'-column is the first (currently on position 175)
fixed_columns = [toronto_onehot.columns[175]]+list(toronto_onehot.columns[:175])+list(toronto_onehot.columns[176:])
toronto_onehot = toronto_onehot[fixed_columns]

In [28]:
toronto_onehot.head()

Unnamed: 0,Poutine Place,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we will group the results in the dataframe above by neighborhood, in order to retrieve the percentage that each category makes up for each neighborhood.

In [29]:
#group by neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

In [30]:
toronto_grouped.head()

Unnamed: 0,Neighborhood,Poutine Place,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.012987,0.012987,0.012987,0.0,0.0,0.0,...,0.0,0.0,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.012987
1,"Brockton, Parkdale Village, Exhibition Place",0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857,...,0.0,0.017857,0.0,0.017857,0.0,0.0,0.0,0.017857,0.0,0.017857
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.013158,0.013158,0.0,0.0,...,0.0,0.0,0.0,0.0,0.013158,0.013158,0.013158,0.0,0.0,0.0


Let us analyze the result by printing each neighborhood along with its top 5 most common venues.

In [31]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----" + hood + "----")
    temp = toronto_grouped[toronto_grouped['Neighborhood']==hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq':2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.08
1          Restaurant  0.04
2               Hotel  0.04
3  Seafood Restaurant  0.04
4              Bakery  0.03


----Brockton, Parkdale Village, Exhibition Place----
                   venue  freq
0         Soccer Stadium  0.17
1  Performing Arts Venue  0.11
2          Poutine Place  0.06
3             Theme Park  0.06
4            Comedy Club  0.06


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
              venue  freq
0    Hardware Store  0.29
1    Discount Store  0.14
2      Dessert Shop  0.14
3               Pub  0.14
4  Asian Restaurant  0.14


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                  venue  freq
0           Coffee Shop  0.09
1                  Café  0.07
2    Italian Restaurant  0.07
3            Restaurant  0.05
4  Caribbean Restaurant  0.04


----Central Bay Stree

We now want to put these results in a dataframe. Therefore, we start by defining a function that sorts the venues for each neighborhood in descending order.

In [32]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [33]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd'] #extra indicators for 1st, 2nd and 3rd

columns = ['Neighborhood']

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

In [34]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Restaurant,Hotel,Seafood Restaurant,Creperie,Café,Bakery,Cocktail Bar,Beer Bar,Park
1,"Brockton, Parkdale Village, Exhibition Place",Soccer Stadium,Performing Arts Venue,Poutine Place,Athletics & Sports,Food Court,Food Truck,Intersection,Restaurant,Theater,Theme Park
2,"Business reply mail Processing Centre, South C...",Hardware Store,Discount Store,Sporting Goods Shop,Pub,Asian Restaurant,Dessert Shop,Falafel Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Café,Italian Restaurant,Restaurant,Sandwich Place,Caribbean Restaurant,Park,Market,Burrito Place,Seafood Restaurant
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Hotel,Middle Eastern Restaurant,Department Store,Café,Sushi Restaurant,Restaurant,Bubble Tea Shop


In the next step we cluster the Neighborhoods based on their top ten most common venues.

In [35]:
#import KMeans module
from sklearn.cluster import KMeans

#define the number of clusters we want to get
k_clusters = 5
#as we want to cluster the neighborhoods based on their venues, we drop the Neighborhood column from the dataframe to create the dataframe for clustering
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis = 1)
#initialize and fit the KMeans object
kmeans = KMeans(n_clusters = k_clusters, random_state = 0)
kmeans = kmeans.fit(toronto_grouped_clustering)

#look at the first five labels
kmeans.labels_[0:5]

array([2, 2, 2, 2, 2], dtype=int32)

We go on by adding the clustering labels to the dataframe with top 10 venues per neighborhood and merging the labels with the Postal Code dataframe which includes geo coordinates for the neighborhoods.

In [36]:
#neighborhoods_venues_sorted.drop('Cluster Labels', axis = 1, inplace = True)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = PostalCodes
toronto_merged = neighborhoods_venues_sorted.join(toronto_merged.set_index('Neighborhood'), on = 'Neighborhood')
toronto_merged.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,PostalCode,Borough,Latitude,Longitude
0,2,Berczy Park,Coffee Shop,Restaurant,Hotel,Seafood Restaurant,Creperie,Café,Bakery,Cocktail Bar,Beer Bar,Park,M5E,Downtown Toronto,43.6462,-79.3735
1,2,"Brockton, Parkdale Village, Exhibition Place",Soccer Stadium,Performing Arts Venue,Poutine Place,Athletics & Sports,Food Court,Food Truck,Intersection,Restaurant,Theater,Theme Park,M6K,West Toronto,43.6321,-79.4217
2,2,"Business reply mail Processing Centre, South C...",Hardware Store,Discount Store,Sporting Goods Shop,Pub,Asian Restaurant,Dessert Shop,Falafel Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,M7Y,East Toronto,43.7218,-79.2841
3,2,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Café,Italian Restaurant,Restaurant,Sandwich Place,Caribbean Restaurant,Park,Market,Burrito Place,Seafood Restaurant,M5V,Downtown Toronto,43.6421,-79.3979
4,2,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Hotel,Middle Eastern Restaurant,Department Store,Café,Sushi Restaurant,Restaurant,Bubble Tea Shop,M5G,Downtown Toronto,43.6568,-79.3856


In [37]:
toronto_merged.shape

(39, 16)

Let us visualize the different neighborhoods and their clusters in a map.

In [38]:
#import folium and cm
!pip install folium
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors



In [39]:
#we want the center of the map to be the center of Toronto, so let us assign this location first
latitude = 43.651644
longitude = -79.37167
map_clusters = folium.Map(location=[latitude, longitude], zoom_start = 12)

x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0,1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []

for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker([lat, lon],
                        radius = 5,
                       popup = label,
                       color = rainbow[cluster],
                       fill = True,
                       fill_color = rainbow[cluster],
                       fill_opacity = 0.7).add_to(map_clusters)
    
map_clusters

Now, we'll have at the clusters and see which venues are popular inside these clusters:

In [40]:
#examine clusters
cluster_0 = toronto_merged.loc[toronto_merged['Cluster Labels']==0, toronto_merged.columns[list(range(1, 12))]]
cluster_1 = toronto_merged.loc[toronto_merged['Cluster Labels']==1, toronto_merged.columns[list(range(1, 12))]]
cluster_2 = toronto_merged.loc[toronto_merged['Cluster Labels']==2, toronto_merged.columns[list(range(1, 12))]]
cluster_3 = toronto_merged.loc[toronto_merged['Cluster Labels']==3, toronto_merged.columns[list(range(1, 12))]]
cluster_4 = toronto_merged.loc[toronto_merged['Cluster Labels']==4, toronto_merged.columns[list(range(1, 12))]]
#cluster_5 = toronto_merged.loc[toronto_merged['Cluster Labels']==5, toronto_merged.columns[list(range(1, 12))]]
#cluster_6 = toronto_merged.loc[toronto_merged['Cluster Labels']==6, toronto_merged.columns[list(range(1, 12))]]
#cluster_7 = toronto_merged.loc[toronto_merged['Cluster Labels']==7, toronto_merged.columns[list(range(1, 12))]]
#cluster_8 = toronto_merged.loc[toronto_merged['Cluster Labels']==8, toronto_merged.columns[list(range(1, 12))]]
#cluster_9 = toronto_merged.loc[toronto_merged['Cluster Labels']==9, toronto_merged.columns[list(range(1, 12))]]

In [41]:
print('There are {} neighborhoods that belong to Cluster {}'.format(cluster_0.shape[0], '0'))
pd.DataFrame(cluster_0['1st Most Common Venue'].value_counts()).head()

There are 1 neighborhoods that belong to Cluster 0


Unnamed: 0,1st Most Common Venue
Boat or Ferry,1


In [42]:
print('There are {} neighborhoods that belong to Cluster {}'.format(cluster_1.shape[0], '1'))
pd.DataFrame(cluster_1['1st Most Common Venue'].value_counts()).head()

There are 4 neighborhoods that belong to Cluster 1


Unnamed: 0,1st Most Common Venue
Park,3
Playground,1


In [43]:
print('There are {} neighborhoods that belong to Cluster {}'.format(cluster_2.shape[0], '2'))
pd.DataFrame(cluster_2['1st Most Common Venue'].value_counts()).head()

There are 32 neighborhoods that belong to Cluster 2


Unnamed: 0,1st Most Common Venue
Coffee Shop,15
Park,4
Café,3
Playground,1
Dessert Shop,1


In [44]:
print('There are {} neighborhoods that belong to Cluster {}'.format(cluster_3.shape[0], '3'))
pd.DataFrame(cluster_3['1st Most Common Venue'].value_counts()).head()

There are 1 neighborhoods that belong to Cluster 3


Unnamed: 0,1st Most Common Venue
Pet Store,1


In [45]:
print('There are {} neighborhoods that belong to Cluster {}'.format(cluster_4.shape[0], '4'))
pd.DataFrame(cluster_4['1st Most Common Venue'].value_counts()).head()

There are 1 neighborhoods that belong to Cluster 4


Unnamed: 0,1st Most Common Venue
Photography Studio,1


Looking at the Map and the tables above, we quickly see two things: <br>
1 Most of the Neighborhoods belong to the same cluster. That means that many neighborhoods share the same type of venues. <br>
2 The Cluster 2 - that most of the neighborhoods belong to - consists of neighborhoods where mostly coffee shops are located.