# Segmenting and Clustering Neighbourhoods in Toronto

## Introduction

In this notebook, I first convert a collection of Toronto addresses (obtained by parsing a web page) into latitude and longitude values using a geocoding API. Next, I use the Foursquare API's explore function to find the most common venue categories in each neighbourhood, and use those categories as features to group the neighbourhoods into clusters with the k-means clustering algorithm. Finally, I use the Folium library to visualize the neighbourhoods in Toronto and the resulting clusters.

## Part 1: Web Parsing Postal Codes from Wikipedia

Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
import pandas as pd

In [2]:
postal_codes = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)[0]
postal_codes.head() # Display first 5 rows

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Size of the dataframe retrieved from the Wikipedia article:

In [3]:
postal_codes.shape

(287, 3)

We drop entries with an unassigned borough and replace any unassigned neighbourhood value with its borough.

In [4]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned'] 

def fill_unassigned_neighbourhoods(row): # Function to fill in any unassigned neighbourhood values with their borough
    if row['Neighbourhood'] == 'Not assigned':
        row['Neighbourhood'] = row['Borough']
    return row

postal_codes = postal_codes.apply(fill_unassigned_neighbourhoods, axis=1) # apply returns a new dataframe, so the result must be reassigned
postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
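As an aside, the same cleaning step can be written without a row-wise `apply`. The following is only a minimal sketch on a small hand-made dataframe: the `toy` data, the `cleaned` name and the `unassigned` mask are my own illustrative assumptions, not part of the scraped table or the notebook's pipeline.

```python
import pandas as pd

# Toy data standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose borough is assigned; copy so later writes do not
# touch a view of the original frame.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is unassigned, fall back to the borough name.
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighbourhood"] = cleaned.loc[unassigned, "Borough"]

print(cleaned)
```

The boolean mask makes the replacement explicit and modifies `cleaned` in place, so there is no return value that could be accidentally discarded.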


Since one postal code area can contain more than one neighbourhood, rows that share a postal code are combined into a single row with the neighbourhoods comma-separated, as shown in the following table.

In [5]:
postal_codes = postal_codes.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index() # One row per postal code, neighbourhoods joined with ", "
postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
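To make the grouping step easier to follow, here is a small sketch of the same idea on three rows taken from the tables above. The `toy` and `merged` names are assumptions made for illustration; the real notebook operates on the full `postal_codes` dataframe.

```python
import pandas as pd

# Three rows from the cleaned table: M6A appears twice, M1G once.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M1G"],
    "Borough": ["North York", "North York", "Scarborough"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Woburn"],
})

# Group by postal code and borough, then join each group's neighbourhood
# names into one comma-separated string.
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)

# After grouping, every postal code should appear exactly once.
assert merged["Postcode"].is_unique
```

Because `groupby` sorts its keys by default, the merged rows come out ordered by postal code, which matches the ordering of the table above.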


Size of cleaned dataframe:

In [6]:
postal_codes.shape

(103, 3)

## Part 2: Adding Latitude and Longitude Coordinates