## Toronto neighborhoods segmentation and clustering

In this notebook we will scrape data about Toronto neighborhoods from wikipedia, filter relevant information and divide neighborhoods into clusters.

In [224]:
import random
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt

# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets.samples_generator import make_blobs

from geopy.geocoders import Nominatim # import geocoder

import folium # map rendering library

CLIENT_ID = 'T333BSMX3WEFKRBFKD2APTRQWOXPQ4TFZGA52RBUBT5XBOTL'
CLIENT_SECRET = '1NK2F4NNK3LLFITZ5CG3G0PXISBXBE3TH2T4VTI4YX04DNEF'
VERSION = '20190531' # Foursquare API version

print('Libraries imported.')

Libraries imported.


First we need to scrape data from wikipedia page. To do so, we'll use libraries requests and bs4.

In [225]:
website = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(website, "lxml")

wiki_table = soup.find('table',{'class':'wikitable sortable'})
wiki_table = str(wiki_table)

And create a dataframe with pandas library.

In [226]:

nbh_table = pd.read_html(wiki_table)[0]
nbh_table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Let's filter data with not assigned borough

In [227]:
nbh_table.drop(nbh_table[nbh_table.Borough == "Not assigned"].index, inplace=True)
nbh_table.reset_index(drop=True, inplace=True)
nbh_table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


And modify neighbourhoods with Not assigned value. For this purpose we use values from coresponding boroughs.

In [228]:
nbh_table.loc[nbh_table.Neighbourhood == "Not assigned", "Neighbourhood"] = nbh_table["Borough"]
nbh_table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Finally let's group neighbourhoods with the same postcode.

In [229]:
nbh_table = nbh_table.groupby(['Postcode', 'Borough']).Neighbourhood.aggregate(lambda x: ", ".join(x)).reset_index()
nbh_table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


We get a table with dimensions:

In [230]:
print(nbh_table.shape)

(103, 3)


Another task is to add latitude and longitude coordinates. For this task we could use library geocoder, but csv file with geolocations were provided for this purpose.

In [231]:
geodata = requests.get('http://cocl.us/Geospatial_data').content.decode()
geodata_table = pd.DataFrame([x.split(",") for x in geodata.split('\r\n')])
geodata_table.set_axis(["Postcode", "Latitude", "Longitude"], axis=1, inplace=True)
geodata_table = geodata_table.iloc[1:]
geodata_table.head(10)

Unnamed: 0,Postcode,Latitude,Longitude
1,M1B,43.8066863,-79.1943534
2,M1C,43.7845351,-79.1604971
3,M1E,43.7635726,-79.1887115
4,M1G,43.7709921,-79.2169174
5,M1H,43.773136,-79.2394761
6,M1J,43.7447342,-79.2394761
7,M1K,43.7279292,-79.2620294
8,M1L,43.7111117,-79.2845772
9,M1M,43.716316,-79.2394761
10,M1N,43.692657,-79.2648481


And merge these two tables.

In [232]:
nbh_table = pd.merge(nbh_table, geodata_table)
nbh_table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.8066863,-79.1943534
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7845351,-79.1604971
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7635726,-79.1887115
3,M1G,Scarborough,Woburn,43.7709921,-79.2169174
4,M1H,Scarborough,Cedarbrae,43.773136,-79.2394761
5,M1J,Scarborough,Scarborough Village,43.7447342,-79.2394761
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.7279292,-79.2620294
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.7111117,-79.2845772
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.2394761
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.2648481


###Clustering neighbourhoods

Let's visualize neighbourhoods on a map of Toronto. To do so, fisrt we get the location of Toronto, that's where the center of the map is going to be.

In [233]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [234]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(nbh_table['Latitude'], nbh_table['Longitude'], nbh_table['Borough'], nbh_table['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

To ease the segmentation, and because of limits of requests against Foursquare API, 
we'll process just neighbourhoods that include Toronto in it's name. Let's drop the others.

In [235]:
nbh_table = nbh_table[nbh_table["Neighbourhood"].str.contains("Toronto")].reset_index(drop=True)
nbh_table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3K,North York,"CFB Toronto, Downsview East",43.7374732,-79.4647633
1,M4J,East York,East Toronto,43.685347,-79.3381065
2,M4R,Central Toronto,North Toronto West,43.7153834,-79.4056784
3,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.6408157,-79.3817523
4,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.6471768,-79.3815764


And visualize the data once again.

In [236]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(nbh_table['Latitude'], nbh_table['Longitude'], nbh_table['Borough'], nbh_table['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

For analysis of these neighbourhoods we will use Foursquare API. Let's define two methods that get category types of a venue and all nearby venues according to a location. These two methods were provided by the Capstone tutors.

In [237]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

And call the method.

In [238]:
toronto_venues = getNearbyVenues(names=nbh_table['Neighbourhood'],
                                   latitudes=nbh_table['Latitude'],
                                   longitudes=nbh_table['Longitude']
                                  )

CFB Toronto, Downsview East


East Toronto


North Toronto West


Harbourfront East, Toronto Islands, Union Station


Design Exchange, Toronto Dominion Centre


Harbord, University of Toronto


Humber Bay Shores, Mimico South, New Toronto


Now we have a table with nearby venues. Let's create a table that represents how many venues of a certain category are in each neighbourhood.

In [239]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

And print five most common ones from each neighbourhood.

In [240]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----CFB Toronto, Downsview East----
                     venue  freq
0                  Airport  0.33
1                     Park  0.33
2        Electronics Store  0.33
3  New American Restaurant  0.00
4                    Plaza  0.00


----Design Exchange, Toronto Dominion Centre----
                venue  freq
0         Coffee Shop  0.12
1                Café  0.07
2               Hotel  0.07
3          Restaurant  0.05
4  Italian Restaurant  0.03


----East Toronto----
               venue  freq
0               Park  0.50
1  Convenience Store  0.25
2      Metro Station  0.25
3        Yoga Studio  0.00
4          Nightclub  0.00




----Harbord, University of Toronto----
                venue  freq
0                Café  0.11
1           Bookstore  0.09
2              Bakery  0.06
3  Italian Restaurant  0.06
4                 Bar  0.06


----Harbourfront East, Toronto Islands, Union Station----
                venue  freq
0         Coffee Shop  0.12
1               Hotel  0.05
2            Aquarium  0.05
3  Italian Restaurant  0.04
4                Café  0.04


----Humber Bay Shores, Mimico South, New Toronto----
                  venue  freq
0              Pharmacy  0.07
1           Flower Shop  0.07
2   Fried Chicken Joint  0.07
3  Fast Food Restaurant  0.07
4    Seafood Restaurant  0.07


----North Toronto West----
                 venue  freq
0       Clothing Store  0.15
1          Coffee Shop  0.10
2          Yoga Studio  0.05
3                  Spa  0.05
4  Rental Car Location  0.05




In [241]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Finally we put this information into pandas dataframe format.

In [250]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"CFB Toronto, Downsview East",Airport,Park,Electronics Store,Wine Bar,Fast Food Restaurant,Concert Hall,Convenience Store,Dance Studio,Deli / Bodega,Dessert Shop
1,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Hotel,Café,Restaurant,Steakhouse,Italian Restaurant,Gastropub,Deli / Bodega,Seafood Restaurant,American Restaurant
2,East Toronto,Park,Convenience Store,Metro Station,Wine Bar,Fast Food Restaurant,Concert Hall,Dance Studio,Deli / Bodega,Dessert Shop,Diner
3,"Harbord, University of Toronto",Café,Bookstore,Japanese Restaurant,Restaurant,Bakery,Italian Restaurant,Bar,Beer Bar,Chinese Restaurant,Sandwich Place
4,"Harbourfront East, Toronto Islands, Union Station",Coffee Shop,Hotel,Aquarium,Italian Restaurant,Café,Brewery,Pizza Place,Fried Chicken Joint,Restaurant,Scenic Lookout


Now it remains to create clusters with k-means. We chose 3 clusters.

In [251]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 1, 0, 0, 0, 0])

In [252]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = nbh_table

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

The last step is to create a map with colorized neighbourhoods according to the cluster they belong to.

In [253]:
# create map
from matplotlib import colors, cm

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    if np.isnan(cluster-1):
        num = 2
    else:
        num = int(cluster-1)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[num],
        fill=True,
        fill_color=rainbow[num],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now we can see neighbourhoods devided into three clusters. Toronto downtown together with tho another neighbourhoods were evaluated to be similar. Airport neighbourhood has its own cluster and East Toronto was evaluated to be different as well.