<H1>Notebook for Segmenting and clustering neighborhoods in Toronto</H1>

<H2>Part 1</H2>

<H3>Notes and assumptions</H3>
Note 1: The Wikipedia page is used to create the dataframe<br>
Note 2: Scraping of the page is done using BeautifulSoup<br>
<br>
Assumption 1: The table cells with content have a borough and a neighborhood<br>
Assumption 2: The first item in the cell is the postal code<br>
Assumption 3: The first hyperlink in the cell is the borough name<br>
Assumption 4: The subsequent hyperlinks in the cell are neighborhoods<br>
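
The four assumptions can be illustrated on a single table cell. The snippet below is only a sketch: `SAMPLE_HTML` is hand-written to match the assumed cell layout (it is not copied from Wikipedia), and `parse_cell` is a hypothetical helper, not the loop used in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two table cells: one assigned, one not assigned.
SAMPLE_HTML = """
<table><tbody><tr>
<td><p><b>M5A</b><br><span><a href="#">Downtown Toronto</a><br>
(<a href="#">Regent Park</a> / <a href="#">Harbourfront</a>)</span></p></td>
<td><p><b>M1A</b><br><span><i>Not assigned</i></span></p></td>
</tr></tbody></table>
"""

def parse_cell(cell):
    """Return (postal code, borough, neighborhoods) for one table cell."""
    # Assumption 2: the bold element holds the postal code.
    code = cell.find("b").get_text(strip=True)
    links = [a.get_text(strip=True) for a in cell.find_all("a")]
    if not links:
        # No hyperlinks: the postal code has no borough assigned.
        return code, "", []
    # Assumption 3: first link is the borough; Assumption 4: the rest are neighborhoods.
    return code, links[0], links[1:]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [parse_cell(td) for td in soup.find_all("td")]
print(rows)
```

If the page layout changes (for example, a borough without a hyperlink), these assumptions stop holding and the parsed rows should be checked by hand.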

In [1]:
# Libraries to be used
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [2]:
#Cell for creating the dataframe of Toronto neighborhoods

#Use BeautifulSoup to scrape the web page
webPage = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
#Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(webPage.text, "html.parser")
postalCodeTable = soup.tbody
postalCode = ''
borough = ''
neighborhood = ''
torontoNeig = []
cont=0
#The following loop reviews only the information within the table
for child in postalCodeTable.descendants:
#The b element contains the postal code
    if child.name == 'b':
        postalCode = child.string
#The span element contains the information within the cell
    if child.name == 'span':
#Check whether the cell contains a hyperlink
        if child.next_element.name == 'a':
            cont=0
            borough=''
            neighborhood=''
#As the cell contains a hyperlink, loop over its inner elements
            for neigs in child.descendants:
                if neigs.name == 'a':
#Assumption is made that the first element of the cell is the name of borough
                    if cont == 0:
                        borough = neigs.string
                        cont = cont+1
#The following elements of the cell are neighborhoods
                    else:
                        neighborhood = neighborhood+neigs.string+'/'
            torontoNeig.append([postalCode,borough,neighborhood])
        else:
            torontoNeig.append([postalCode,'',''])

#Clean the rows read from the Wikipedia table (remove postal codes without boroughs)
to_df = [row for row in torontoNeig if row[1] != '']

df_torontoNeigs = pd.DataFrame(to_df,columns=['PostalCode','Borough','Neighborhood'])
print(df_torontoNeigs.shape)
df_torontoNeigs.head()


(83, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods/
1,M4A,North York,Victoria Village/
2,M5A,Downtown Toronto,Regent Park/Harbourfront/
3,M6A,North York,Lawrence Manor/Lawrence Heights/
4,M7A,Queen's Park,Ontario Provincial Government/
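
The `Neighborhood` values above end with `/` because the scraping loop appends a separator after every name. If a cleaner label is wanted, one possible post-processing step is sketched below. The column names `NeighborhoodList` and `NeighborhoodLabel` are made up for this illustration, and the two sample rows are taken from the output above.

```python
import pandas as pd

df = pd.DataFrame({"Neighborhood": ["Parkwoods/", "Regent Park/Harbourfront/"]})

# Drop the trailing separator, then split into a list of individual names.
df["NeighborhoodList"] = df["Neighborhood"].str.rstrip("/").str.split("/")
# Join the names back with a more readable separator.
df["NeighborhoodLabel"] = df["NeighborhoodList"].str.join(", ")

print(df["NeighborhoodLabel"].tolist())
```

The rest of this notebook keeps the original strings, so this step is optional.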


<H2>Part 2</H2>
Adding latitude and longitude to the dataframe

In [3]:
#Pass csv into new dataframe
postCodeCoor = pd.read_csv("http://cocl.us/Geospatial_data")
print("CSV loaded")

CSV loaded


In [4]:
#Correct column name in postCodeCoor to allow merge
postCodeCoor = postCodeCoor.rename(columns={'Postal Code':'PostalCode'})
#Merge of dataframes
df_torontoNeigFull = pd.merge(df_torontoNeigs,postCodeCoor,on='PostalCode')
print(df_torontoNeigFull.shape)
df_torontoNeigFull.head()

(83, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods/,43.753259,-79.329656
1,M4A,North York,Victoria Village/,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park/Harbourfront/,43.65426,-79.360636
3,M6A,North York,Lawrence Manor/Lawrence Heights/,43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government/,43.662301,-79.389494
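
`pd.merge` performs an inner join by default, so a postal code without a match in the coordinates file would be dropped silently. The sketch below shows one way to check for that with a left join and `indicator=True`. The frames are toy data, and `M9Z` is a made-up postal code added only to trigger the unmatched case.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # M9Z is hypothetical, with no coordinates
    "Borough": ["North York", "North York", "Example Borough"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
}).rename(columns={"Postal Code": "PostalCode"})

# A left join keeps every neighborhood row; _merge reports where each row came from.
checked = neighborhoods.merge(coordinates, on="PostalCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

In the run above the row count is 83 both before and after the merge, which suggests that every postal code found a match.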


<H2>Part 3</H2>
Explore and cluster neighborhoods

In [5]:
#New libraries to add
import numpy as np 
import json5 as json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
print('Libraries imported.')

Libraries imported.


In [6]:
#Delete rows without neighborhoods
indexDel = df_torontoNeigFull[df_torontoNeigFull['Neighborhood'] == ''].index
df_torontoNeigFull.drop(indexDel,inplace=True)

#Filter neighborhoods from boroughs containing Toronto
toronto_data = df_torontoNeigFull[df_torontoNeigFull['Borough'].str.contains('Toronto')]
print(toronto_data.shape)
toronto_data.head()

(15, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,Regent Park/Harbourfront/,43.65426,-79.360636
9,M5B,Downtown Toronto,Garden District/Ryerson/,43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town/,43.651494,-79.375418
23,M5G,Downtown Toronto,Bay Street/,43.657952,-79.387383
29,M5H,Downtown Toronto,Richmond/King/,43.650571,-79.384568


The following cells extract venue information from Foursquare for the coordinates of each postal code, which are used as the neighborhood locations

In [7]:
# The code was removed by Watson Studio for sharing.

Variables for Foursquare calls and Toronto coordinates set


In [8]:
#This cell is used to get all the nearby venues using Foursquare API

#Reused procedure from lab Neighborhoods New York adapted to this Notebook
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("Data will be acquired for neighborhood "+name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )
print('')
print('Toronto venues acquired')

Data will be acquired for neighborhood Regent Park/Harbourfront/
Data will be acquired for neighborhood Garden District/Ryerson/
Data will be acquired for neighborhood St. James Town/
Data will be acquired for neighborhood Bay Street/
Data will be acquired for neighborhood Richmond/King/
Data will be acquired for neighborhood Harbourfront/Union Station/Toronto Islands/
Data will be acquired for neighborhood Toronto Dominion Centre/Design Exchange/
Data will be acquired for neighborhood Commerce Court/Victoria Hotel/
Data will be acquired for neighborhood University of Toronto/
Data will be acquired for neighborhood Kensington Market/Chinatown/Grange Park/
Data will be acquired for neighborhood CN Tower/King and Spadina/Railway Lands/Harbourfront/South Niagara/Island airport/
Data will be acquired for neighborhood Rosedale/
Data will be acquired for neighborhood St. James Town/Cabbagetown/
Data will be acquired for neighborhood First Canadian Place/Underground city/
Data w
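
The function above indexes straight into the JSON response, so a response without groups, or a venue without categories, would raise an exception. The sketch below shows a more defensive way to pull out the same seven fields. The helpers `build_explore_params` and `extract_venue_rows` are hypothetical names, and `sample_payload` is a hand-written stand-in shaped after the keys the function above reads; it is not real Foursquare output.

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Query parameters matching the URL built in getNearbyVenues."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }

def extract_venue_rows(payload, name, lat, lng):
    """Turn one response payload into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Made-up payload for illustration only.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe", "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Place", "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

rows = extract_venue_rows(sample_payload, "Regent Park/Harbourfront/", 43.65426, -79.360636)
print(rows)
```

The parameter dictionary could be passed as `requests.get(endpoint, params=...)`, which lets `requests` handle the URL encoding instead of string formatting.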

The following cell is used to analyze the neighborhoods

In [9]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
#Group rows by neighborhoods
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

#Function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bay Street/,Coffee Shop,Italian Restaurant,Burger Joint,Japanese Restaurant,Sandwich Place,Ice Cream Shop,Bubble Tea Shop,Thai Restaurant,Restaurant,Juice Bar
1,CN Tower/King and Spadina/Railway Lands/Harbou...,Airport Lounge,Airport Service,Airport Terminal,Boat or Ferry,Harbor / Marina,Boutique,Sculpture Garden,Rental Car Location,Coffee Shop,Plane
2,Church and Wellesley/,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Pub,Gastropub,Hotel,Café,Bubble Tea Shop
3,Commerce Court/Victoria Hotel/,Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Deli / Bodega,Gastropub,Seafood Restaurant,Japanese Restaurant
4,First Canadian Place/Underground city/,Coffee Shop,Café,Restaurant,American Restaurant,Seafood Restaurant,Gastropub,Asian Restaurant,Steakhouse,Gym,Hotel
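
The table above comes from three steps: one-hot encode the venue categories, average them per neighborhood to get category frequencies, and sort each row in descending order. A minimal sketch of the same idea on toy data follows; the neighborhoods `A` and `B` and their venues are invented for the example.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Pub", "Pub", "Park"],
})

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the one-hot columns is the share of each category in a neighborhood.
grouped = onehot.groupby("Neighborhood").mean()

# Keep the two most frequent categories per neighborhood.
top2 = {name: row.sort_values(ascending=False).index[:2].tolist()
        for name, row in grouped.iterrows()}
print(top2)
```

Categories with equal frequency may appear in either order after sorting, so ties in the real table should not be read as a ranking.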


The following cells cluster the neighborhoods and display the corresponding map of clusters

In [10]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = toronto_data
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Regent Park/Harbourfront/,43.65426,-79.360636,4,Coffee Shop,Park,Pub,Bakery,Mexican Restaurant,Café,Theater,Breakfast Spot,Restaurant,Chocolate Shop
9,M5B,Downtown Toronto,Garden District/Ryerson/,43.657162,-79.378937,0,Coffee Shop,Clothing Store,Japanese Restaurant,Café,Middle Eastern Restaurant,Cosmetics Shop,Electronics Store,Diner,Theater,Bookstore
15,M5C,Downtown Toronto,St. James Town/,43.651494,-79.375418,0,Coffee Shop,Restaurant,Café,Hotel,Italian Restaurant,Breakfast Spot,Clothing Store,Cosmetics Shop,Beer Bar,Bakery
23,M5G,Downtown Toronto,Bay Street/,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Burger Joint,Japanese Restaurant,Sandwich Place,Ice Cream Shop,Bubble Tea Shop,Thai Restaurant,Restaurant,Juice Bar
29,M5H,Downtown Toronto,Richmond/King/,43.650571,-79.384568,0,Coffee Shop,Restaurant,Thai Restaurant,Café,Bar,Steakhouse,Sushi Restaurant,Lounge,Hotel,Gastropub
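
The clustering cell fixes `kclusters = 5` without showing how that value was chosen. One common heuristic, which is an assumption here and not something this notebook does, is to fit k-means for several values of k and compare the inertia (the within-cluster sum of squared distances). The sketch below uses an invented frequency matrix with three obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy category frequencies (coffee, park, pub) for six made-up neighborhoods.
X = np.array([
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],   # coffee-heavy
    [0.1, 0.8, 0.1], [0.1, 0.7, 0.2],   # park-heavy
    [0.1, 0.1, 0.8], [0.2, 0.1, 0.7],   # pub-heavy
])

inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = float(model.inertia_)

# Labels for the value of k that matches the three groups in the toy data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_
print(inertias)
print(labels)
```

Inertia always falls as k grows, so the usual reading is to look for the point where the improvement levels off rather than for the minimum.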


In [11]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
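
The map cell indexes the `rainbow` list with the values in `Cluster Labels`, which only works when those values are integers. If a neighborhood returned no venues, it would be absent from the grouped data, the join would leave `NaN` in its `Cluster Labels`, and pandas would store the whole column as float. Whether that happens in this run is not shown, so the sketch below uses toy frames to illustrate one way to detect and handle the case before drawing markers.

```python
import pandas as pd

locations = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [43.65, 43.66, 43.67],
    "Longitude": [-79.36, -79.37, -79.38],
})
# Neighborhood C is deliberately missing, as if no venues were found for it.
labels = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [0, 1]})

merged = locations.join(labels.set_index("Neighborhood"), on="Neighborhood")

# Rows without a cluster, then a cleaned copy with integer labels.
unclustered = merged.loc[merged["Cluster Labels"].isna(), "Neighborhood"].tolist()
clean = merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})

print(unclustered)
print(clean["Cluster Labels"].tolist())
```

Dropping the unclustered rows is only one option; they could also be drawn in a neutral color so they remain visible on the map.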