# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this notebook we replicate the analysis we performed on the New York City dataset in order to explore and cluster the neighborhoods of Toronto. Unlike for New York, the Toronto neighborhood data is not readily available in a structured format. A Wikipedia page, however, contains all the information we need. We will therefore first scrape that page, then wrangle and clean the data, and finally read it into a pandas dataframe structured like the New York dataset.

## Imports

In [1]:
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium # map rendering library

import requests # library to handle requests

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

## Functions
Here we define the custom functions that will be used in the script.

### getNearbyVenues
A function that fetches from Foursquare the venues near each of a series of places:

In [2]:
def getNearbyVenues(CLIENT_ID, CLIENT_SECRET, VERSION, names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
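
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would fail for a venue that comes back with an empty category list. The following is only a sketch of how the flattening step could tolerate that case. The helper name `extract_venue_rows`, the `'Uncategorized'` label, and the sample payload are assumptions made for illustration and are not part of the original notebook or of the Foursquare API.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Assumption: venues without a category are labeled 'Uncategorized'
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hypothetical payload shaped like response['groups'][0]['items']
sample_items = [
    {'venue': {'name': 'Cafe A',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Mystery Spot',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]

rows = extract_venue_rows('Harbourfront', 43.65426, -79.360636, sample_items)
for row in rows:
    print(row)
```

The tuples have the same seven fields, in the same order, as the columns assigned to `nearby_venues` above.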

## Part 1: Scrape the Wikipedia page with the table of Toronto neighborhoods

### Scrape the Wikipedia page
The information about the Toronto neighborhoods is contained in a table on the Wikipedia page [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

We will proceed in several steps, first scraping the whole Wikipedia table into a pandas dataframe and then cleaning it:

1. Scrape the raw Wikipedia table

In [3]:
neighborhoods_raw = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

2. Set the column names to: PostalCode, Borough, and Neighbourhood

In [4]:
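# Note: this assignment creates an alias, not a copy, so the rename below
# also applies to neighborhoods_raw (which In [9] relies on)
neighborhoods = neighborhoods_raw
neighborhoods.columns = ['PostalCode', 'Borough', 'Neighbourhood']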

3. Select only the rows that have an assigned borough. Ignore rows whose *borough* is 'Not assigned'.

In [5]:
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned'].reset_index(drop=True)

4. Combine the neighborhoods with the same postal code into a single row, with the neighborhoods separated by commas.  
We assume that different *boroughs* have different postal codes.

In [6]:
neighborhoods = neighborhoods.groupby(
    ['PostalCode','Borough'], sort=False
)['Neighbourhood'].apply(lambda x: ', '.join(x)).reset_index()

5. If a row still has a 'Not assigned' neighborhood, set the neighborhood name to the name of the borough.



In [7]:
# use a boolean mask with .loc to avoid chained assignment
not_assigned = neighborhoods['Neighbourhood'] == 'Not assigned'
neighborhoods.loc[not_assigned, 'Neighbourhood'] = neighborhoods.loc[not_assigned, 'Borough']
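
Cleaning steps 3 to 5 can be checked end to end on a tiny table before trusting them on the scraped data. The sketch below uses made-up postal codes, boroughs, and neighborhoods (all hypothetical), and only mirrors the operations described above.

```python
import pandas as pd

# Hypothetical rows imitating the shape of the raw Wikipedia table
raw = pd.DataFrame({
    'PostalCode': ['X1A', 'X2B', 'X2B', 'X3C'],
    'Borough': ['Not assigned', 'Borough A', 'Borough A', 'Borough B'],
    'Neighbourhood': ['Not assigned', 'North', 'South', 'Not assigned'],
})

# Step 3: keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# Step 4: one row per postal code, neighborhoods joined with a comma
clean = (
    clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)

# Step 5: fall back to the borough name when the neighborhood is not assigned
mask = clean['Neighbourhood'] == 'Not assigned'
clean.loc[mask, 'Neighbourhood'] = clean.loc[mask, 'Borough']

print(clean)
```

On this toy input the unassigned row is dropped, the two `X2B` rows collapse into one, and the `X3C` row inherits its borough name.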

6. Check the result against the image provided in the course, under the *My Submission* instructions

In [8]:
pCodesToShow = ['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A']
neighborhoods[neighborhoods['PostalCode'].isin(pCodesToShow)]

,PostalCode,Borough,Neighbourhood
2,M5A,Downtown Toronto,Harbourfront
6,M1B,Scarborough,"Rouge, Malvern"
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
23,M4G,East York,Leaside
24,M5G,Downtown Toronto,Central Bay Street
27,M2H,North York,Hillcrest Village
32,M1J,Scarborough,Scarborough Village
50,M9L,North York,Humber Summit
54,M4M,East Toronto,Studio District
71,M1R,Scarborough,"Maryvale, Wexford"


The **M5A** row differs from the one in *My Submission*, but the result here matches the raw table:

In [9]:
neighborhoods_raw[neighborhoods_raw['PostalCode']=='M5A']

,PostalCode,Borough,Neighbourhood
4,M5A,Downtown Toronto,Harbourfront


**M5V** and **M9V** cannot be compared because not all the neighborhoods are shown.

The other rows are the same as in the *My Submission* instructions.

7. Print the number of rows in the dataframe

In [10]:
neighborhoods.shape[0]

103

## Part 2: get the latitude and longitude of each Toronto postal code.

In the instructions there are two options. We will get the data from the provided CSV file, at [https://cocl.us/Geospatial_data](https://cocl.us/Geospatial_data).

In [11]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.columns = ['PostalCode', 'Latitude', 'Longitude']

# inner join on the postal code (the default for pd.merge)
neighborhoods = pd.merge(neighborhoods, lat_lon, on='PostalCode')
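
Because `pd.merge` defaults to an inner join, a postal code without coordinates would silently disappear from the result. As a hedged sketch (toy data, hypothetical postal codes), a left join with `indicator=True` makes such rows visible, and `validate='one_to_one'` raises an error if a postal code appears more than once on either side.

```python
import pandas as pd

# Hypothetical neighborhood and coordinate tables
hoods = pd.DataFrame({
    'PostalCode': ['X1A', 'X2B', 'X3C'],
    'Borough': ['Borough A', 'Borough B', 'Borough C'],
})
coords = pd.DataFrame({
    'PostalCode': ['X1A', 'X2B'],
    'Latitude': [43.1, 43.2],
    'Longitude': [-79.1, -79.2],
})

# Left join keeps every neighborhood; '_merge' records where each row matched
checked = pd.merge(hoods, coords, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would indicate that the inner join used in the notebook loses no rows.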

The rows are not in the same order as in the image shown in the *My Submission* instructions. However, we can look at the same 12 postal codes (in a different order). The latitude and longitude values are the same as in the instructions.

In [12]:
neighborhoods[neighborhoods['PostalCode'].isin(pCodesToShow)]

,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
23,M4G,East York,Leaside,43.70906,-79.363452
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
27,M2H,North York,Hillcrest Village,43.803762,-79.363452
32,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
50,M9L,North York,Humber Summit,43.756303,-79.565963
54,M4M,East Toronto,Studio District,43.659526,-79.340923
71,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849


## Part 3: explore and cluster the neighborhoods in Toronto

1. As suggested in the instructions, we will **consider only the boroughs whose name contains the word 'Toronto'**.

Notice: we call the dataframe **manhattan_data** so that we can reuse the code from the lab unchanged.

In [13]:
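# .copy() gives an independent dataframe, so later changes do not trigger SettingWithCopyWarning
manhattan_data = neighborhoods[neighborhoods['Borough'].str.contains("Toronto")].copy()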
manhattan_data.head()

,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
5,M9A,Downtown Toronto,Queen's Park,43.667856,-79.532242
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


2. Get the coordinates of Toronto

In [14]:
address = 'Toronto, TO, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


3. Visualize the neighborhoods of Toronto on the map

In [15]:
# create map of Toronto using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
    ).add_to(map_manhattan)
    
map_manhattan

4. For each neighbourhood of Toronto, get the closest venues from the Foursquare API.

We use a radius of 1500 m instead of the default 500 m, because with a radius of 500 m some neighborhoods have only a few venues close by.

In [16]:
credentials = pd.read_csv('Foursquare_Credentials.csv')

CLIENT_ID = credentials.at[0, 'CLIENT_ID']
CLIENT_SECRET = credentials.at[0,'CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version

In [17]:
manhattan_venues = getNearbyVenues(
    CLIENT_ID=CLIENT_ID,
    CLIENT_SECRET=CLIENT_SECRET,
    VERSION=VERSION,    
    names=manhattan_data['Neighbourhood'],
    latitudes=manhattan_data['Latitude'],
    longitudes=manhattan_data['Longitude'],
    radius=1500
)

Harbourfront
Queen's Park
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
Fir

In [18]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,100,100,100,100,100,100
"Brockton, Exhibition Place, Parkdale Village",100,100,100,100,100,100
Business Reply Mail Processing Centre 969 Eastern,100,100,100,100,100,100
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",69,69,69,69,69,69
"Cabbagetown, St. James Town",100,100,100,100,100,100
Central Bay Street,100,100,100,100,100,100
"Chinatown, Grange Park, Kensington Market",100,100,100,100,100,100
Christie,100,100,100,100,100,100
Church and Wellesley,100,100,100,100,100,100


In [19]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 278 unique categories.


5. Get the distribution of the *Venue Categories* around each neighborhood.

This is done by first one-hot encoding the *Venue Category* column, then grouping by *Neighborhood* and computing the *mean* of each group. The resulting dataframe has one row for each neighborhood and one column for each venue category. Each row adds up to 1, meaning that it represents a distribution.

**Notice**: there is a *Venue Category* called *Neighborhood*! This causes a column name collision when the Neighborhood name is added to the one-hot encoded *Venue Category* columns. The pandas *merge* method handles this through its *suffixes* argument, which appends a suffix to columns that share a name. Here the suffixes rename the '*Neighborhood*' venue category column to '*Neighborhood Category*'.

In [20]:
# one hot encoding
manhattan_onehot = manhattan_venues[['Neighborhood']].merge(
    pd.get_dummies(manhattan_venues['Venue Category']),
    left_index=True, right_index=True,
    suffixes=('', ' Category')
)

# grouping
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
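
To see both the name collision and the row-sum property in isolation, here is a small sketch on made-up venue data (the neighborhoods `A` and `B` and their venues are hypothetical). It passes `dtype=float` to `pd.get_dummies`, an addition not in the notebook, so that the mean is computed on numeric columns regardless of the pandas version.

```python
import pandas as pd

# Hypothetical venues; note the venue category literally named 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Neighborhood', 'Park'],
})

# One indicator column per venue category
dummies = pd.get_dummies(venues['Venue Category'], dtype=float)

# The right-hand 'Neighborhood' column is renamed to 'Neighborhood Category'
onehot = venues[['Neighborhood']].merge(
    dummies, left_index=True, right_index=True, suffixes=('', ' Category')
)

# Mean of the indicators = share of each category within a neighborhood
grouped = onehot.groupby('Neighborhood').mean()
print(grouped)
print(grouped.sum(axis=1))
```

Each row of `grouped` sums to 1 because every venue contributes exactly one indicator.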

6. Get the 10 top venue categories for each neighborhood.

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'rd', fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Hotel,Japanese Restaurant,Theater,Gastropub,Pizza Place,Monument / Landmark,Movie Theater,Concert Hall
1,Berczy Park,Coffee Shop,Café,Hotel,Japanese Restaurant,Italian Restaurant,Gastropub,Restaurant,Beer Bar,Park,Farmers Market
2,"Brockton, Exhibition Place, Parkdale Village",Café,Coffee Shop,Restaurant,Bar,Furniture / Home Store,Bakery,Soccer Stadium,Theme Park,Theater,Tea Room
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Indian Restaurant,Café,Brewery,Italian Restaurant,Park,Pizza Place,Sushi Restaurant,Bakery,Beach
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Park,Coffee Shop,Café,Harbor / Marina,Gym,Track,Pizza Place,Restaurant,Scenic Lookout,Brewery
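
The ranking step can also be written without the `try`/`except` block and without an explicit loop over rows. The following is one possible variant, not the lab's method: the helpers `ordinal` and `top_categories` and the frequency table are hypothetical, and the toy values are chosen without ties so that the ordering is unambiguous.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, ..., 11th)."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


def top_categories(row, n):
    """Names of the n columns with the largest values in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()


# Hypothetical per-neighborhood category frequencies (each row sums to 1)
freqs = pd.DataFrame(
    {'Café': [0.5, 0.1], 'Park': [0.3, 0.2], 'Pub': [0.2, 0.7]},
    index=['A', 'B'],
)

n_top = 2
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(n_top)]

# Apply the ranking row by row; each row becomes a Series of category names
top = freqs.apply(lambda row: pd.Series(top_categories(row, n_top), index=labels), axis=1)
print(top)
```

Here the neighborhood names are kept in the index, so no `iloc[1:]` slicing is needed to skip a name column.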


7. Cluster the neighborhoods with KMeans and add the cluster labels to the dataframe

In [22]:
# set number of clusters
kclusters = 5

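# pass columns= explicitly; the positional axis argument is not accepted by recent pandas versions
manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')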

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# copy manhattan_data, renaming 'Neighbourhood' to 'Neighborhood' to match neighborhoods_venues_sorted
manhattan_merged = manhattan_data.rename(columns={'Neighbourhood': 'Neighborhood'})

# join the cluster labels and top venues onto the latitude/longitude of each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the 'Cluster Labels' columns!

,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,2,Coffee Shop,Italian Restaurant,Bakery,Café,Park,Farmers Market,Theater,Thai Restaurant,Bar,Pub
5,M9A,Downtown Toronto,Queen's Park,43.667856,-79.532242,4,Shopping Mall,Bank,Pharmacy,Liquor Store,Bakery,Golf Course,Japanese Restaurant,Supermarket,Grocery Store,Café
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,1,Coffee Shop,Gastropub,Japanese Restaurant,Café,Gym,Italian Restaurant,Cosmetics Shop,Theater,Ramen Restaurant,Hotel
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Hotel,Italian Restaurant,Pizza Place,Steakhouse,Beer Bar,Cosmetics Shop,Japanese Restaurant,Seafood Restaurant
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Coffee Shop,Pub,Breakfast Spot,Japanese Restaurant,Grocery Store,Beach,Bar,Bakery,BBQ Joint,Sandwich Place
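
A left join like the one above leaves `NaN` in *Cluster Labels* for any neighborhood that returned no venues, which also turns the column into floats and would break list indexing when coloring the map. Whether this happens depends on the Foursquare results, so the following is only a precautionary sketch on hypothetical data showing how such rows could be detected and dropped.

```python
import pandas as pd

# Hypothetical neighborhoods; 'C' has no venues and therefore no cluster label
data = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [43.1, 43.2, 43.3],
    'Longitude': [-79.1, -79.2, -79.3],
})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [0, 1]})

merged = data.join(labels.set_index('Neighborhood'), on='Neighborhood')

# Rows that did not receive a label
unlabeled = merged[merged['Cluster Labels'].isna()]
print(unlabeled['Neighborhood'].tolist())

# Drop them and restore an integer dtype for the labels
merged = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(merged)
```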


8. Visualize the clusters

In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
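
In the cell above, the values in `ys` are only used for their count (`len(ys)`), so the palette is fully determined by `kclusters`. A shorter way to build the same kind of palette is sketched below. The names `palette` and `color_for` are hypothetical, and indexing with `cluster` instead of `cluster-1` is a simplification that merely assigns the same colors to the clusters in a different order.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow color per cluster, as hex strings for folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]


def color_for(cluster):
    """Hex color of a cluster label in the range 0 .. kclusters-1."""
    return palette[int(cluster)]


for label in range(kclusters):
    print(label, color_for(label))
```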