# Exploring and Clustering the Neighborhoods in Toronto

This notebook explores and clusters the neighborhoods in Toronto.

In [61]:
# Import the necessary libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen

We store the URL, open it, and parse the returned HTML with the BeautifulSoup library.

In [62]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' 
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')

The values we need sit inside the table whose class attribute is 'wikitable sortable', so we pass that class to find() to select the table and everything between its opening and closing tags.

In [63]:
my_table = soup.find('table',{'class':'wikitable sortable'})
my_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td
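
Passing the full string 'wikitable sortable' matches the class attribute as an exact string, so the lookup depends on the order in which the page lists the two classes. As a hedged alternative, a CSS selector matches both classes in any order. The snippet below is a minimal sketch on a made-up HTML fragment; `sample_html`, `exact_match` and `css_match` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: same classes as the Wikipedia table, but in reverse order
sample_html = """
<table class="sortable wikitable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: sensitive to class order
exact_match = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: requires both classes, in any order
css_match = sample_soup.select_one("table.wikitable.sortable")

print(exact_match is None, css_match is not None)
```

If the page layout changes, checking that the result is not None before using it gives a clearer error than a later AttributeError.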

Inside the table, each row is a 'tr' tag and each cell is a 'td' tag, so we collect all of these tags because they hold the actual values.

In [64]:
links = my_table.find_all('tr')

In [65]:
links

[<tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighbourhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>,
 <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>,
 <tr>
 <td>M3A
 </td>
 <td>North York
 </td>
 <td>Parkwoods
 </td></tr>,
 <tr>
 <td>M4A
 </td>
 <td>North York
 </td>
 <td>Victoria Village
 </td></tr>,
 <tr>
 <td>M5A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Regent Park, Harbourfront
 </td></tr>,
 <tr>
 <td>M6A
 </td>
 <td>North York
 </td>
 <td>Lawrence Manor, Lawrence Heights
 </td></tr>,
 <tr>
 <td>M7A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Queen's Park, Ontario Provincial Government
 </td></tr>,
 <tr>
 <td>M8A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>,
 <tr>
 <td>M9A
 </td>
 <td>Etobicoke
 </td>
 <td>Islington Avenue, Humber Valley Village
 </td></tr>,
 <tr>
 <td>M1B
 </td>
 <td>Scarborough
 </td>
 <td>Malvern, Rouge
 </td></tr>,
 <tr>
 <td>M2B
 </td>
 <td>Not assigned
 </td>

In [66]:
links2 = my_table.find_all('td')
links2

[<td>M1A
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M2A
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M3A
 </td>,
 <td>North York
 </td>,
 <td>Parkwoods
 </td>,
 <td>M4A
 </td>,
 <td>North York
 </td>,
 <td>Victoria Village
 </td>,
 <td>M5A
 </td>,
 <td>Downtown Toronto
 </td>,
 <td>Regent Park, Harbourfront
 </td>,
 <td>M6A
 </td>,
 <td>North York
 </td>,
 <td>Lawrence Manor, Lawrence Heights
 </td>,
 <td>M7A
 </td>,
 <td>Downtown Toronto
 </td>,
 <td>Queen's Park, Ontario Provincial Government
 </td>,
 <td>M8A
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M9A
 </td>,
 <td>Etobicoke
 </td>,
 <td>Islington Avenue, Humber Valley Village
 </td>,
 <td>M1B
 </td>,
 <td>Scarborough
 </td>,
 <td>Malvern, Rouge
 </td>,
 <td>M2B
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M3B
 </td>,
 <td>North York
 </td>,
 <td>Don Mills
 </td>,
 <td>M4B
 </td>,
 <td>East York
 </td>,
 <td>Parkview Hill, Woodbine Gardens
 </td>,


We strip the surrounding tags and whitespace from each row, keeping only its text.

In [67]:
# Each entry holds the text of one whole table row (all three cells), not just the postal code
postalCode = []
for link in links:
    postalCode.append(link.text.strip())

In [68]:
print(postalCode)

['Postal Code\n\nBorough\n\nNeighbourhood', 'M1A\n\nNot assigned\n\nNot assigned', 'M2A\n\nNot assigned\n\nNot assigned', 'M3A\n\nNorth York\n\nParkwoods', 'M4A\n\nNorth York\n\nVictoria Village', 'M5A\n\nDowntown Toronto\n\nRegent Park, Harbourfront', 'M6A\n\nNorth York\n\nLawrence Manor, Lawrence Heights', "M7A\n\nDowntown Toronto\n\nQueen's Park, Ontario Provincial Government", 'M8A\n\nNot assigned\n\nNot assigned', 'M9A\n\nEtobicoke\n\nIslington Avenue, Humber Valley Village', 'M1B\n\nScarborough\n\nMalvern, Rouge', 'M2B\n\nNot assigned\n\nNot assigned', 'M3B\n\nNorth York\n\nDon Mills', 'M4B\n\nEast York\n\nParkview Hill, Woodbine Gardens', 'M5B\n\nDowntown Toronto\n\nGarden District, Ryerson', 'M6B\n\nNorth York\n\nGlencairn', 'M7B\n\nNot assigned\n\nNot assigned', 'M8B\n\nNot assigned\n\nNot assigned', 'M9B\n\nEtobicoke\n\nWest Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale', 'M1C\n\nScarborough\n\nRouge Hill, Port Union, Highland Creek', 'M2C\n\nNot assigned\
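
Because each row's text comes back as one string with the cells joined by blank lines, the cells have to be separated again later. As a hedged alternative, the cells can be read one by one so that every row becomes a list of three strings. This is a minimal sketch on a made-up fragment that imitates the table's structure; `sample_html`, `sample_rows`, `cells`, `header` and `records` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure of the scraped table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park, Harbourfront
</td></tr>
</table>
"""

sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

# One list per row, one stripped string per cell
# (header cells are <th>, data cells are <td>)
cells = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in sample_rows
]
header, records = cells[0], cells[1:]
print(header)
print(records)
```

A list of lists like `records` can be passed to pd.DataFrame together with column names, which would make the later string split unnecessary.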

In [69]:
# Put the row strings into a single-column DataFrame
table=pd.DataFrame(postalCode)
table

,0
0,Postal Code\n\nBorough\n\nNeighbourhood
1,M1A\n\nNot assigned\n\nNot assigned
2,M2A\n\nNot assigned\n\nNot assigned
3,M3A\n\nNorth York\n\nParkwoods
4,M4A\n\nNorth York\n\nVictoria Village
...,...
176,M5Z\n\nNot assigned\n\nNot assigned
177,M6Z\n\nNot assigned\n\nNot assigned
178,M7Z\n\nNot assigned\n\nNot assigned
179,"M8Z\n\nEtobicoke\n\nMimico NW, The Queensway W..."


We need to split this single column into PostalCode, Borough and Neighborhood columns.

In [70]:
# Split the single text column into three parts (at most two splits per row)
new = table[0].str.split("\n\n", n=2, expand=True)

# Assign each part to its own named column
table["PostalCode"] = new[0]
table["Borough"] = new[1]
table["Neighborhood"] = new[2]

# Drop the original combined column
table.drop(columns=[0], inplace=True)

# Display the table
table

,PostalCode,Borough,Neighborhood
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
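
The split step can be tried in isolation on a couple of rows. The sketch below assumes the same "code, blank line, borough, blank line, neighbourhood" layout as the scraped text; `sample` and `parts` are illustrative names. The `n` argument caps the number of splits, so `n=2` yields at most three columns.

```python
import pandas as pd

# Toy rows in the same "code\n\nborough\n\nneighbourhood" shape as the scraped text
sample = pd.DataFrame({0: [
    "M3A\n\nNorth York\n\nParkwoods",
    "M5A\n\nDowntown Toronto\n\nRegent Park, Harbourfront",
]})

# n=2 allows at most two splits, i.e. at most three resulting columns
parts = sample[0].str.split("\n\n", n=2, expand=True)
parts.columns = ["PostalCode", "Borough", "Neighborhood"]
print(parts)
```

Commas inside a neighbourhood list are left alone because the separator is the blank line, not the comma.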


In [71]:
# Remove the first row, which holds the header text rather than data
table = table.iloc[1:]
table

,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
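
A frame produced by slicing can trigger pandas' SettingWithCopyWarning when it is later modified in place. A minimal sketch of one way to avoid that, on toy data, is to take an explicit copy of the slice; renumbering the rows with reset_index is optional and is not what the notebook does. `sample` and `body` are illustrative names.

```python
import pandas as pd

# Toy frame whose first row repeats the header text, as in the scraped table
sample = pd.DataFrame({
    "PostalCode": ["Postal Code", "M1A", "M3A"],
    "Borough": ["Borough", "Not assigned", "North York"],
    "Neighborhood": ["Neighbourhood", "Not assigned", "Parkwoods"],
})

# .copy() makes the result an independent DataFrame rather than a slice of `sample`;
# reset_index(drop=True) renumbers the remaining rows from 0
body = sample.iloc[1:].copy().reset_index(drop=True)
print(body)
```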


In [72]:
table.shape

(180, 3)

In [73]:
# Keep only the rows whose Borough is assigned
table = table[table['Borough'] != 'Not assigned'].copy()
table.head()

,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
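
The filtering step can be illustrated on toy data with a boolean mask, which builds a new frame instead of dropping rows in place. This is a sketch under the assumption that unassigned boroughs are marked with the exact string 'Not assigned'; `sample`, `keep` and `assigned` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# True for every row that should be kept
keep = sample["Borough"] != "Not assigned"

# Selecting with the mask and copying leaves `sample` itself unchanged
assigned = sample[keep].copy()
print(assigned)
```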


In [74]:
table.shape

(103, 3)

In [75]:
# Replace every Neighborhood that is not assigned with its Borough name
not_assigned = (table['Neighborhood'] == 'Not assigned') & table['Borough'].notnull()
table = table.assign(Neighborhood=table['Neighborhood'].mask(not_assigned, table['Borough']))

In [76]:
table

,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
161,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
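
The rows displayed above all have an assigned neighbourhood, so the effect of the replacement is not visible in them. The sketch below shows the intended behaviour on invented toy data, where one row has an assigned borough but a 'Not assigned' neighbourhood. The values and the names `sample`, `needs_fill` and `filled` are illustrative only.

```python
import pandas as pd

# Invented rows: the second one has a borough but no neighbourhood
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Rows whose neighbourhood is missing while the borough is known
needs_fill = (sample["Neighborhood"] == "Not assigned") & sample["Borough"].notnull()

# Series.mask replaces the values where the condition is True
filled = sample.assign(
    Neighborhood=sample["Neighborhood"].mask(needs_fill, sample["Borough"])
)
print(filled)
```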


In [77]:
# Check the shape
table.shape

(103, 3)
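
Beyond the shape, a few quick checks can confirm that the cleaning did what was intended. The helper below is a hypothetical sketch, not part of the notebook; `summarize_cleaning` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def summarize_cleaning(df: pd.DataFrame) -> dict:
    """Return simple sanity checks for a cleaned postal-code table."""
    return {
        "rows": len(df),
        "unassigned_boroughs": int((df["Borough"] == "Not assigned").sum()),
        "unassigned_neighborhoods": int((df["Neighborhood"] == "Not assigned").sum()),
        "duplicate_postal_codes": int(df["PostalCode"].duplicated().sum()),
    }

# Toy frame standing in for the cleaned table
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

report = summarize_cleaning(sample)
print(report)
```

Calling the same function on the notebook's table would show whether any 'Not assigned' values or repeated postal codes remain.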