# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we scrape postal code data for the city of Toronto from Wikipedia. In part 1, we wrangle and clean the data so that we can get information about venues in the city's neighborhoods using the __Foursquare API__. Then, in part 2, we use a _k_-means algorithm to cluster the neighborhoods by their most popular venues. Finally, in part 3, we visualize the clusters on a map of Toronto.

## Part 0 - Load Necessary Libraries

In [1]:
import pandas as pd
import numpy as np

import pgeocode  # get latitude and longitude from a postal code

import requests  # handle requests
from bs4 import BeautifulSoup  # web scraper

from pandas import json_normalize  # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium  # map rendering library

## Part 1 - Web Scraping

Scrape the Wikipedia page using an HTML parser.

In [2]:
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=862527922'

req = requests.get(url)

soup = BeautifulSoup(req.content, 'html.parser')

Find the table in the HTML, and extract the data in each column. Then create a DataFrame to store all the data.

In [3]:
soup_table = soup.find('table')

# function to get the stripped text from a list of HTML tags
def get_text_map(lst):
    return [tag.get_text(strip=True) for tag in lst]

# get all the column data
data = soup_table.find_all('td')
data = get_text_map(data)

# get the data for each column
p_codes = data[::3]
bghs = data[1::3]
nbhs = data[2::3]

toronto_df = pd.DataFrame(zip(p_codes, bghs, nbhs), columns=['PostalCode', 'Borough', 'Neighborhood'])
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
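
The stride slicing used above (`data[::3]`, `data[1::3]`, `data[2::3]`) only lines up if every table row contributes exactly three cells. Here is a minimal sketch of the same idea on a hand-written list, with a guard against a ragged table. The `cells` list and the `n_cols` name are illustrative assumptions, not part of the scraped data.

```python
# Hypothetical flattened cell texts, in the row-by-row order find_all('td') returns them
cells = [
    'M1A', 'Not assigned', 'Not assigned',
    'M3A', 'North York', 'Parkwoods',
    'M4A', 'North York', 'Victoria Village',
]

n_cols = 3  # assumed number of columns in the table

# Guard: if a cell were missing, the strides below would silently misalign the columns
if len(cells) % n_cols != 0:
    raise ValueError("cell count is not a multiple of the column count")

# Every n_cols-th element, starting at offsets 0, 1 and 2, gives one column each
postal_codes = cells[0::n_cols]
boroughs = cells[1::n_cols]
neighborhoods = cells[2::n_cols]

# Re-assemble the columns into one tuple per table row
rows = list(zip(postal_codes, boroughs, neighborhoods))
print(rows[1])
```

If the length check fails on real data, iterating over the `tr` elements and reading each row's cells together may be a more robust option.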


Filter out all the rows that have a value of 'Not assigned' in the __Borough__ column. Also, if a neighborhood has a value of 'Not assigned', replace it with its borough.

In [4]:
# drop rows with no boroughs
# .copy() makes the filtered frame independent, so later column assignments do not warn
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned'].copy()

# confirm that there are no boroughs with the value 'Not assigned'
print(toronto_df['Borough'].value_counts())

Etobicoke           45
Scarborough         38
North York          38
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64


In [5]:
# replace 'Not assigned' neighborhoods with their boroughs
num_na_before = toronto_df[toronto_df['Neighborhood'] == 'Not assigned'].shape[0]

toronto_df['Neighborhood'] = np.where(toronto_df['Neighborhood'] == 'Not assigned', 
                                      toronto_df['Borough'], 
                                      toronto_df['Neighborhood'])
toronto_df.reset_index(inplace=True, drop=True)

num_na_after = toronto_df[toronto_df['Neighborhood'] == toronto_df['Borough']].shape[0]

# confirm that there are no neighborhoods with the value 'Not assigned'
# and that the count before and after is the same
print("Number of not assigned neighborhoods before:", num_na_before)
print("Number of not assigned neighborhoods after:", num_na_after)
print('\n')
print(toronto_df['Neighborhood'].value_counts())

Number of not assigned neighborhoods before: 1
Number of not assigned neighborhoods after: 1


St. James Town             2
Runnymede                  2
CFB Toronto                1
Fairview                   1
Scarborough Town Centre    1
                          ..
Central Bay Street         1
Golden Mile                1
Islington                  1
Port Union                 1
Queen's Park               1
Name: Neighborhood, Length: 210, dtype: int64
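
To make the two cleaning rules easier to check by eye, here is a minimal sketch that applies them to a three-row toy table. The rows and the `toy` and `cleaned` names are made up for illustration. Taking a `.copy()` of the filtered frame keeps the later column assignment from triggering a `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative only): one unassigned borough, one unassigned neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop rows whose borough is not assigned
cleaned = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighborhood is not assigned, fall back to the borough name
cleaned['Neighborhood'] = np.where(cleaned['Neighborhood'] == 'Not assigned',
                                   cleaned['Borough'],
                                   cleaned['Neighborhood'])
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```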


For postal codes that appear in more than one row, combine all of their neighborhoods into a single comma-separated string.

In [6]:
toronto_df = toronto_df.groupby(['PostalCode', 'Borough']).apply(lambda df: ", ".join(df['Neighborhood'])).reset_index(name='Neighborhood')
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
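
The grouping step can also be written with `agg` on the single `Neighborhood` column, which avoids passing a whole sub-frame to a lambda. This is a sketch of an equivalent approach on toy rows, not a change to the cell above. The `toy` and `merged` names are illustrative.

```python
import pandas as pd

# Toy rows (illustrative only): postal code M1B appears twice
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Group by postal code and borough, then join each group's neighborhoods with ", "
merged = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(merged)
```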


Display the rows for the postal codes listed in the assignment instructions, in the same order as the instructions.

In [7]:
# this is for the purpose of the assignment only
p_codes_lst = ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']
toronto_df[toronto_df['PostalCode'].isin(p_codes_lst)].set_index('PostalCode').loc[p_codes_lst].reset_index()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street |
| 1 | M2H | North York | Hillcrest Village |
| 2 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 3 | M1J | Scarborough | Scarborough Village |
| 4 | M4G | East York | Leaside |
| 5 | M4M | East Toronto | Studio District |
| 6 | M1R | Scarborough | Maryvale, Wexford |
| 7 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 8 | M9L | North York | Humber Summit |
| 9 | M5V | Downtown Toronto | CN Tower, Bathurst Quay, Island airport, Harbo... |
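
Selecting rows in a custom order with `.loc[list]` raises a `KeyError` if any requested label is absent from the index. A more forgiving alternative is `reindex`, which keeps the requested order and fills absent labels with `NaN`. The sketch below uses toy rows and a made-up code `'M0Z'` to show that behavior.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M4G', 'M5G'],
    'Borough': ['Scarborough', 'East York', 'Downtown Toronto'],
})

# Desired display order; 'M0Z' is a made-up code that is not in the table
wanted = ['M5G', 'M4G', 'M0Z']

# reindex keeps the requested order and yields NaN for labels it cannot find
ordered = toy.set_index('PostalCode').reindex(wanted).reset_index()

print(ordered)
```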


In [8]:
print("Number of rows in the full table:", toronto_df.shape[0])

Number of rows in the full table: 103


## Part 2 - Getting the Coordinates of Each Neighborhood

In this part we use the __pgeocode__ package to look up the latitude and longitude of each postal code. These coordinates are what we will later pass to the __Foursquare API__ to find venues near each neighborhood.

In [9]:
# get the latitude and longitude of all the postal codes
country = 'CA'  # Canada
nomi = pgeocode.Nominatim(country)
lat_lng_df = nomi.query_postal_code(toronto_df.PostalCode.values)[['postal_code', 'latitude', 'longitude']]
lat_lng_df.drop_duplicates(inplace=True)
lat_lng_df.columns = ['PostalCode', 'Latitude', 'Longitude']

# join the two tables to append the latitude and longitude to our main table
toronto_df = toronto_df.join(lat_lng_df.set_index('PostalCode'), on='PostalCode')
toronto_df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.8113 | -79.193 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.7878 | -79.1564 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.7678 | -79.1866 |
| 3 | M1G | Scarborough | Woburn | 43.7712 | -79.2144 |
| 4 | M1H | Scarborough | Cedarbrae | 43.7686 | -79.2389 |
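
Because a lookup can fail for some postal codes, it may be worth checking the join for gaps before moving on. The sketch below does the same left join with `merge` on toy frames and lists the codes that came back without coordinates. The frames, the made-up code `'M9Z'`, and the `missing` name are assumptions for illustration. No geocoding call is made here.

```python
import pandas as pd

# Toy neighborhood table; 'M9Z' is a made-up code with no coordinates
neighborhoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})

# Toy coordinate table, standing in for the geocoding result
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.8113, 43.7878],
    'Longitude': [-79.1930, -79.1564],
})

# Left join keeps every neighborhood row; validate checks that each
# postal code appears at most once in the coordinate table
joined = neighborhoods.merge(coords, on='PostalCode', how='left',
                             validate='many_to_one')

# Postal codes whose coordinates could not be found
missing = joined.loc[joined['Latitude'].isna(), 'PostalCode'].tolist()
print(missing)
```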


Display the same postal codes again, in the order given by the assignment instructions, now with their coordinates.

In [10]:
toronto_df[toronto_df['PostalCode'].isin(p_codes_lst)].set_index('PostalCode').loc[p_codes_lst].reset_index()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street | 43.6564 | -79.386 |
| 1 | M2H | North York | Hillcrest Village | 43.8015 | -79.3577 |
| 2 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.7063 | -79.3094 |
| 3 | M1J | Scarborough | Scarborough Village | 43.7464 | -79.2323 |
| 4 | M4G | East York | Leaside | 43.7124 | -79.3644 |
| 5 | M4M | East Toronto | Studio District | 43.6561 | -79.3406 |
| 6 | M1R | Scarborough | Maryvale, Wexford | 43.7507 | -79.3003 |
| 7 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... | 43.7432 | -79.5876 |
| 8 | M9L | North York | Humber Summit | 43.7598 | -79.5565 |
| 9 | M5V | Downtown Toronto | CN Tower, Bathurst Quay, Island airport, Harbo... | 43.6404 | -79.3995 |
