# Segmenting and Clustering Neighbourhoods in the city of Toronto, Canada

In this notebook, geographical data about neighbourhoods will be loaded and the neighbourhoods clustered to find similar neighbourhoods.

First, import some Python packages for loading and processing the data.

In [1]:
from bs4 import BeautifulSoup # to read post code table
import numpy as np 
import pandas as pd
import requests
from urllib.request import urlopen # to allow opening of wikipedia page

### Load table data with beautiful soup

In [2]:
url = r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #post code info
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser') # read the page

### Process table data with beautiful soup and put it into a dataframe

In [3]:
table = soup.find('table', attrs={'class':'wikitable sortable'}) # the postal code table has class "wikitable sortable"
headings = [th.text.strip() for th in table.find_all('th')] # column headings taken from the th tags
rows = [] # collect one list of cell values per table row
for tr in table.find_all('tr'): # iterate through the rows, including the header row
    tds = tr.find_all('td') # td tags hold the table data
    if not tds: # the header row has no td tags, so skip it
        continue
    Postal_Code, Borough, Neighbourhood = [td.text.strip() for td in tds] # extract the three columns
    rows.append([Postal_Code, Borough, Neighbourhood])

# build the dataframe once from all rows (DataFrame.append was removed in pandas 2.0)
nbors = pd.DataFrame(rows, columns=headings)

In [4]:
print(nbors.shape)
nbors.head(10)

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


After iterating through the table, we now have a dataframe with 180 rows that replicates the table on the Wikipedia page. To clean the data, we can replace any Neighbourhood value of *Not assigned* with the name of the borough. From looking at Wikipedia, there is no case where a borough is not assigned and a neighbourhood is assigned. The cell below shows the code to do this anyway.

In [5]:
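nbors_temp = nbors.copy() # make a copy of the original dataframe
# rows where the neighbourhood is not assigned
not_assigned = nbors_temp['Neighbourhood'] == 'Not assigned'
# copy the borough name into the neighbourhood column for those rows
# (a row-wise apply that only mutates its argument does not reliably write back to the dataframe)
nbors_temp.loc[not_assigned, 'Neighbourhood'] = nbors_temp.loc[not_assigned, 'Borough']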
print(nbors_temp.shape) # should be the same as before
nbors_temp.head(10)

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


Neighbourhoods that were *Not assigned* are now named the same as the Borough. For this set of data, there have been no changes.
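Because the real table contains no row that this step changes, the effect is easier to see on toy data. The sketch below uses hypothetical postal codes and rows (they are not taken from the Wikipedia table) and assumes the same column names as `nbors_temp`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not from the Wikipedia table
toy = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['North York', 'Not assigned', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Not assigned'],
})

# Select the rows whose neighbourhood is unassigned
mask = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Only the third row changes in a visible way: its neighbourhood becomes the borough name, while the second row stays *Not assigned* because its borough is also unassigned.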

The cell below removes entries from the dataframe where a borough has not been assigned to a postal code.

In [6]:
nbors_temp.drop(nbors_temp[nbors_temp['Borough'] == 'Not assigned'].index, axis=0, inplace=True) # Drop entries where the borough is not assigned
# the index is not reset here, so the original row labels (2, 3, 4, ...) are kept; it is reset in a later cell
print(nbors_temp.shape)
nbors_temp.head(10)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
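The output above still carries the original row labels. As a small sketch on hypothetical toy data, the same filtering can be written with a boolean mask, and `reset_index` only renumbers the rows if its result is kept, since it returns a new dataframe by default.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'North York', 'Etobicoke'],
})

# Keep only rows with an assigned borough; the original labels survive
kept = toy[toy['Borough'] != 'Not assigned']
kept_labels = kept.index.tolist()

# reset_index returns a new dataframe, so the result must be assigned
renumbered = kept.reset_index(drop=True)
renumbered_labels = renumbered.index.tolist()

print(kept_labels, renumbered_labels)
```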


Now nbors_temp is a dataframe with only populated rows and all values of *Not assigned* have been removed. Now look for duplicates of the **Postal Code**.

In [7]:
nbors_temp['Postal Code'].value_counts(sort=True)

M7A    1
M5A    1
M4L    1
M6B    1
M5B    1
      ..
M4J    1
M3C    1
M1B    1
M6H    1
M5X    1
Name: Postal Code, Length: 103, dtype: int64

We've found that every postal code occurs only once. It seems that someone has edited the Wikipedia table so that each postal code has a single row listing all of its neighbourhoods, so the Coursera submission instructions for combining rows are no longer relevant. But for the sake of the assignment, we'll do it anyway.
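A more direct check for repeated codes, as a minimal sketch on a hypothetical series, is `Series.duplicated`, which avoids reading through the full `value_counts` output.

```python
import pandas as pd

# Hypothetical list of codes with one repeat, for illustration only
codes = pd.Series(['M5A', 'M1B', 'M5A', 'M3A'], name='Postal Code')

# True if any code appears more than once
has_duplicates = bool(codes.duplicated().any())

# The codes that are repeated (keep=False marks every occurrence)
repeated = codes[codes.duplicated(keep=False)].unique().tolist()

print(has_duplicates, repeated)
```

On the real dataframe, the equivalent call would be on `nbors_temp['Postal Code']`.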

Using the example of **M5A**, we'll make separate rows for **Harbourfront** and **Regent Park**

In [8]:
m5a = nbors_temp[nbors_temp['Postal Code'] == 'M5A']  # find the occurrence of postal code M5A
names = m5a['Neighbourhood'].values[0] # get the neighbourhoods as a single comma-separated string
print(names)
names = names.split(',')
print(names)
m5a = pd.concat([m5a, m5a]) # duplicate the row, one copy per neighbourhood (DataFrame.append was removed in pandas 2.0)
m5a['Neighbourhood'] = names
m5a # Dataframe to insert into the main dataframe

Regent Park, Harbourfront
['Regent Park', ' Harbourfront']


Unnamed: 0,Postal Code,Borough,Neighbourhood
4,M5A,Downtown Toronto,Regent Park
4,M5A,Downtown Toronto,Harbourfront


We now have one row per neighbourhood for M5A and can insert these rows back into nbors_temp. First, we have to drop the original entry for postal code M5A.

In [9]:
index = nbors_temp[nbors_temp['Postal Code'] == 'M5A'].index
index

Int64Index([4], dtype='int64')

In [10]:
nbors_temp.drop(index, axis=0, inplace=True)
nbors_temp= nbors_temp.reset_index(drop=True)
nbors_temp.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M6A,North York,"Lawrence Manor, Lawrence Heights"
3,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
4,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
5,M1B,Scarborough,"Malvern, Rouge"
6,M3B,North York,Don Mills
7,M4B,East York,"Parkview Hill, Woodbine Gardens"
8,M5B,Downtown Toronto,"Garden District, Ryerson"
9,M6B,North York,Glencairn


In [11]:
print(nbors_temp.shape)
nbors_temp = pd.concat([nbors_temp, m5a], ignore_index=True) # add the two M5A rows (DataFrame.append was removed in pandas 2.0)
print(nbors_temp.shape)
nbors_temp.sort_values('Postal Code', inplace=True)
nbors_temp=nbors_temp.reset_index(drop=True)
nbors_temp.head(15)

(102, 3)
(104, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


We now have a dataframe with two entries for postal code M5A. Notice in the previous cell that the number of rows increased by 2 when the m5a dataframe was added, so overall the dataframe is one row longer than it was before the original M5A entry was dropped (104 rows rather than 103).

In [12]:
nbors_temp[nbors_temp['Postal Code'] == 'M5A']

Unnamed: 0,Postal Code,Borough,Neighbourhood
53,M5A,Downtown Toronto,Regent Park
54,M5A,Downtown Toronto,Harbourfront
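The manual steps above handle a single postal code. As a sketch of how the same split could be applied to every row at once, the example below uses `str.split` followed by `DataFrame.explode` on a small hypothetical dataframe. It splits on `', '` (comma plus space), which is an assumption about how the neighbourhood strings are separated.

```python
import pandas as pd

# Small frame with the same columns as nbors_temp, for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park, Harbourfront', 'Parkwoods'],
})

long_form = (
    toy.assign(Neighbourhood=toy['Neighbourhood'].str.split(', '))  # string -> list of names
       .explode('Neighbourhood')                                    # one row per list element
       .reset_index(drop=True)                                      # renumber the rows
)

print(long_form)
```

Splitting on `', '` rather than `','` also avoids the leading space seen in `' Harbourfront'` earlier.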


Now for the code that replicates the process of combining postal codes, essentially the reverse of what we've just done.

In [13]:
#Check for duplicate values of Postal Code
nbors_temp['Postal Code'].value_counts()

M5A    2
M7A    1
M5G    1
M5J    1
M5S    1
      ..
M4T    1
M4J    1
M3C    1
M1B    1
M2J    1
Name: Postal Code, Length: 103, dtype: int64

We can see that there is a count of 2 for M5A, while the rest have a count of 1 (as expected).

In [14]:
pc = nbors_temp['Postal Code'].value_counts() # store counts as a series
pc=pc[pc > 1] # only keep values where the count is greater than 1
pc

M5A    2
Name: Postal Code, dtype: int64

Now we can iterate through pc, getting the values for all the neighbourhoods of each postal code and compiling them into a single string. In this example, only postal code M5A will be affected.

In [15]:
for i in pc.items(): # iterate through the postcodes stored in pc; here the only value of i is ('M5A', 2)
    # now get all the neighbourhoods for a single postcode
    nb = nbors_temp[nbors_temp['Postal Code'] == i[0]]['Neighbourhood'].values
    #print(nb)
    nb = ', '.join(nb) # create new entry for neighbourhoods
    #print(nb)
    # set all neighbourhood values to nb where the postcode matches i[0]
    # (.loc avoids chained assignment, which may not write back to the dataframe)
    nbors_temp.loc[nbors_temp['Postal Code'] == i[0], 'Neighbourhood'] = nb
    #print(nbors_temp[nbors_temp['Postal Code'] == i[0]]) 
#now remove duplicate rows
nbors_temp.drop_duplicates(inplace=True)
nbors_temp.sort_values('Postal Code', inplace = True)
nbors_temp.reset_index(drop=True, inplace =True)
print(nbors_temp.shape)
nbors_temp.head(10)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


We have now gone through the procedure of combining neighbourhoods into a single data entry for the case of M5A. Note that the shape of the dataframe is now back to (103, 3), which is what it was before the process.

In [16]:
nbors_temp[nbors_temp['Postal Code'] == i[0]].Neighbourhood

53    Regent Park,  Harbourfront
Name: Neighbourhood, dtype: object
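The loop above combines rows one postal code at a time. As an alternative sketch (not the method used in this notebook), the same result can be obtained with `groupby` and a string join. The data below is a small hypothetical dataframe with the same columns.

```python
import pandas as pd

# One row per neighbourhood, for illustration only
long_form = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park', 'Harbourfront'],
})

combined = (
    long_form.groupby(['Postal Code', 'Borough'])['Neighbourhood']
             .agg(', '.join)   # join all neighbourhoods in a group into one string
             .reset_index()    # turn the group keys back into columns
)

print(combined)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result; this assumes each postal code belongs to exactly one borough.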

## Final answer for segmenting the postal codes

In [17]:
print('The size of the final dataframe is: ' , nbors_temp.shape)
nbors_temp.head(20)

The size of the final dataframe is:  (103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [18]:
# save to .csv
nbors_temp.to_csv('Neighbourhoods_of_Toronto.csv')
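By default, `to_csv` also writes the row index as an unnamed first column. The sketch below, which uses an in-memory buffer and hypothetical two-row data instead of a file on disk, shows what happens when such a file is read back and how `index=False` avoids it.

```python
import io
import pandas as pd

# Hypothetical two-row frame for illustration only
df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index depends on whether the row numbers carry any meaning; here they are just positions after sorting.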