# Segmenting and Clustering Neighborhoods in Toronto

In this Jupyter notebook we explore the neighborhoods of Toronto.
We start by importing the necessary libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import folium
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import pgeocode
from geopy.geocoders import Nominatim

This section groups the helper functions used throughout the notebook.

In [2]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
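
The list comprehension in `getNearbyVenues` assumes that every returned item has at least one category. A more defensive row extractor could look like the sketch below. `extract_venue_rows` and the sample payload are hypothetical, introduced only for illustration; the payload mirrors the fields read by the function above and is not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of an explore response into flat rows.

    Assumes each item has the shape read by getNearbyVenues:
    item['venue'] with 'name', 'location' and 'categories' keys.
    """
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to a label instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Illustrative payload, not real API output
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Place B', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = extract_venue_rows('Regent Park', 43.6555, -79.3626, sample_items)
```

Each tuple follows the same column order as the dataframe built in `getNearbyVenues`, so the rows could be appended to `venues_list` unchanged.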

In [3]:
CLIENT_ID = 'MB5ZSSZNHKCSHXYIQCYBKIRCCUAEB5EE2GRK5LVGZOE120US' # your Foursquare ID
CLIENT_SECRET = 'GLBR45QLOFUOGVVOJIQXO4ABYLM3NBPXAA3PNHDAZRNGS5FT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT = 100 

radius = 500 
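
The cell above hard-codes the API keys in the notebook. As a possible alternative, the sketch below reads them from environment variables and lets `urllib.parse.urlencode` assemble the query string. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions made for this example and are not part of the original notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, radius=500, limit=100, version='20180605'):
    """Build the explore URL, reading credentials from the environment."""
    params = {
        # Hypothetical variable names; any naming scheme works
        'client_id': os.environ['FOURSQUARE_CLIENT_ID'],
        'client_secret': os.environ['FOURSQUARE_CLIENT_SECRET'],
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value, so no manual string formatting is needed
    return EXPLORE_URL + '?' + urlencode(params)


# Placeholder values so the example runs without real credentials
os.environ.setdefault('FOURSQUARE_CLIENT_ID', 'my-client-id')
os.environ.setdefault('FOURSQUARE_CLIENT_SECRET', 'my-client-secret')
url = build_explore_url(43.6555, -79.3626)
```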

We use the BeautifulSoup library to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [4]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 

The result is an HTML table, so we convert it into a dataframe.
Neighborhoods that share a postal code are listed in the same row; the source table separates them with a slash, which we replace with a comma.

In [5]:
neighborhoods = pd.read_html(str(table))[0]
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.replace('/',',')
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned']
neighborhoods.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
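
The notebook keeps one row per postal code. If one row per neighborhood were preferred instead, the comma-separated names could be split with `str.split` and `DataFrame.explode`. The following is a minimal sketch on two rows copied from the table above, not a step of the original analysis.

```python
import pandas as pd

# Two rows copied from the table above
df = pd.DataFrame({
    'Postal code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Regent Park , Harbourfront'],
})

long_df = (
    df.assign(Neighborhood=df['Neighborhood'].str.split(','))
      .explode('Neighborhood')   # one row per list element
      .reset_index(drop=True)
)
# Remove the spaces left around each name by the split
long_df['Neighborhood'] = long_df['Neighborhood'].str.strip()
```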


We check whether any row has an assigned borough but a 'Not assigned' neighborhood. If that were the case, the neighborhood would be set to the borough name.

In [6]:
neighborhoods[neighborhoods['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood
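
The check returns no rows here, so no replacement is needed. If such rows did exist, the rule described above could be applied with a boolean mask, as in this sketch. The second row of the toy dataframe is made up to illustrate the case.

```python
import pandas as pd

# Toy data: the second row is a made-up example of the case being checked
df = pd.DataFrame({
    'Postal code': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Example Borough'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

mask = df['Neighborhood'] == 'Not assigned'
# Copy the borough name into the neighborhood column for those rows only
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']
```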


In [7]:
neighborhoods.shape

(103, 3)

Now that we have built a dataframe with the postal code, borough name and neighborhood name, we need the latitude and longitude of each postal code in order to use the Foursquare location data.

In [8]:
nomi = pgeocode.Nominatim('ca')
postal_code_lat, postal_code_lon = [], [] 

for post in neighborhoods['Postal code']:
    # query each postal code once and reuse the result
    result = nomi.query_postal_code(post)
    postal_code_lat.append(result.latitude)
    postal_code_lon.append(result.longitude)

neighborhoods['Latitude'], neighborhoods['Longitude'] = postal_code_lat, postal_code_lon

neighborhoods.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.7545,-79.33
3,M4A,North York,Victoria Village,43.7276,-79.3148
4,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.6555,-79.3626
5,M6A,North York,"Lawrence Manor , Lawrence Heights",43.7223,-79.4504
6,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.6641,-79.3889


In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


Now we explore and cluster the neighborhoods in Toronto. We work only with boroughs whose name contains the word Toronto; the same analysis can then be replicated with the remaining boroughs.

In [10]:
unique_borough = list(neighborhoods['Borough'].unique())
borough_toronto = [s for s in unique_borough if "Toronto" in s]
borough_toronto

['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']

To avoid repeating the same steps for each borough, we implement a function that takes a borough name as input, segments its neighborhoods and clusters them.

In [11]:
def map_clusters_borough(borough): 
    
    data = neighborhoods[neighborhoods['Borough'] == borough].reset_index(drop=True)
    address = borough

    geolocator = Nominatim(user_agent="tr_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinate of '+str(borough) + ' are {}, {}.'.format(latitude, longitude))
    # create map of toronto using latitude and longitude values
    map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(data['Latitude'], data['Longitude'], data['Neighborhood']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)  


    venues = getNearbyVenues(names=data['Neighborhood'],
                                       latitudes=data['Latitude'],
                                       longitudes=data['Longitude']
                                      )

    # one hot encoding
    toronto_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    toronto_onehot['Neighborhood'] = venues['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
    toronto_onehot = toronto_onehot[fixed_columns]

    toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
    num_top_venues = 5

    for hood in toronto_grouped['Neighborhood']:
        print("----"+hood+"----")
        temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        print('\n')



    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

    for ind in np.arange(toronto_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)


    # set number of clusters
    kclusters = 5

    toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)


    # add clustering labels
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    toronto_merged = data

    # merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
    toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    return toronto_merged, map_clusters
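
The core of `map_clusters_borough` is the sequence one-hot encoding, per-neighborhood mean, k-means. The sketch below reproduces that sequence on a tiny made-up dataset so the intermediate shapes are easy to inspect. The neighborhood names `A` to `D`, the venue list and the choice of two clusters are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venues for four hypothetical neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Venue Category': ['Coffee Shop', 'Café', 'Coffee Shop', 'Café',
                       'Park', 'Playground', 'Park', 'Playground'],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Mean of the indicator columns = share of each category per neighborhood
profiles = onehot.groupby('Neighborhood').mean()

# Cluster the frequency profiles; n_init is set explicitly for reproducibility
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = dict(zip(profiles.index, kmeans.labels_))
```

In this toy case, neighborhoods with identical venue profiles end up in the same cluster. The numeric label assigned to each cluster is arbitrary.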


In [12]:
borough_toronto

['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']

We now loop over the borough_toronto list and store the resulting dataframes and maps in two lists.

In [13]:
all_df_borough, all_map_borough = [], []

for i in borough_toronto:
    df, map_city = map_clusters_borough(i)
    all_df_borough.append(df)
    all_map_borough.append(map_city)

The geograpical coordinate of Downtown Toronto are 43.6541737, -79.38081164513409.
----Berczy Park----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.05
2        Hotel  0.05
3       Bakery  0.04
4   Restaurant  0.03


----CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport----
                venue  freq
0         Coffee Shop  0.09
1          Restaurant  0.07
2  Italian Restaurant  0.05
3                Café  0.05
4                 Bar  0.05


----Central Bay Street----
                       venue  freq
0                Coffee Shop  0.25
1         Italian Restaurant  0.04
2            Bubble Tea Shop  0.04
3  Middle Eastern Restaurant  0.04
4             Breakfast Spot  0.04


----Christie----
           venue  freq
0           Café  0.25
1  Grocery Store  0.25
2     Playground  0.08
3    Candy Store  0.08
4           Park  0.08


----Church and Wellesley----
                 venue  freq
0     Sushi Restaurant  

The results of the study can be displayed for each borough.

In [14]:
all_df_borough[0].head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.6555,-79.3626,3,Coffee Shop,Breakfast Spot,Yoga Studio,Distribution Center,Italian Restaurant,Beer Store,Food Truck,Spa,Playground,Bakery
1,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.6641,-79.3889,0,Sushi Restaurant,Gym,Italian Restaurant,Beer Bar,Café,Ramen Restaurant,Burrito Place,Coffee Shop,Mexican Restaurant,Bubble Tea Shop
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,0,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Café,Cosmetics Shop,Japanese Restaurant,Restaurant,Bar,Lingerie Store,Italian Restaurant
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,Coffee Shop,Café,Seafood Restaurant,Italian Restaurant,Cocktail Bar,American Restaurant,Gastropub,Restaurant,Creperie,Farmers Market
4,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,0,Coffee Shop,Hotel,Café,Bakery,Cocktail Bar,Japanese Restaurant,Restaurant,Seafood Restaurant,Beer Bar,Italian Restaurant


In [15]:
all_map_borough[0]