### Week 3

# Clustering Neighborhoods in Toronto

## Table of Contents

1.  <a href="#item1">Part 1</a>    
2.  <a href="#item2">Part 2</a>
3.  <a href="#item3">Part 3</a>


<h3 id="item1">Part 1</h3>

To obtain the postal codes of Toronto neighborhoods, we will scrape the following Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

I decided to use the BeautifulSoup package for scraping.

In [1]:
#install the bs4 library
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=5cc60d76b0cb0f890ca238cf359bb944096e42e623d71b024eb04264f2cfe177
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [2]:
#import packages
from bs4 import BeautifulSoup
import requests
import urllib.request

In [3]:
#create a variable for the Wikipedia link
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [4]:
#create a session and make a GET request
s = requests.Session()
response = s.get(wiki_url, timeout=10)
#stop here with an HTTPError if the server returned a 4xx or 5xx status
response.raise_for_status()
print(response)
print("Success!")

<Response [200]>
Success!
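
Printing the response shows the status code, but a plain `print("Success!")` runs whatever the outcome of the request. The following is a minimal sketch of a stricter check, assuming we want to know whether the request really succeeded. The helper name `check_response` is hypothetical, and the `Response` objects are built by hand only so that the example runs without network access.

```python
import requests

def check_response(response):
    """Return True for a successful response, False for a 4xx/5xx one."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.HTTPError:
        return False
    return True

# Hand-built responses stand in for real requests (assumption: offline demo)
ok = requests.Response()
ok.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(check_response(ok))
print(check_response(missing))
```

In the notebook itself, calling `response.raise_for_status()` right after `s.get(...)` would have the same effect without the helper.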


In [5]:
#parse the HTML already downloaded in the previous cell, naming the parser explicitly
soup = BeautifulSoup(response.text, 'html.parser')

In [6]:
#get the page title
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

In [9]:
#retrieve the postal code table (the <table> tag with class 'wikitable sortable')
data_table = soup.find('table', {"class":'wikitable sortable'})
data_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

In [10]:
# find the column names of the table
header = [th.text.rstrip() for th in data_table.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']
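
Each header and cell in this table ends with a newline, which is why the text is stripped before use. As a small illustration of the same idea, the sketch below parses a hypothetical two-row sample in the same shape as the Wikipedia table and collects the header and the rows in one pass. The names `sample_html`, `sample_table`, `sample_header` and `sample_rows` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample in the same shape as the Wikipedia table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

# get_text(strip=True) removes the surrounding whitespace and trailing newline
sample_header = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# One inner list per data row; the header row has no <td> cells and is skipped
sample_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")
    if tr.find_all("td")
]

print(sample_header)
print(sample_rows)
```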

In [11]:
# collect the text of every cell in the table as one flat list
cell = [c.text.rstrip() for c in data_table.find_all('td')]
cell

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Regent Park, Harbourfront',
 'M6A',
 'North York',
 'Lawrence Manor, Lawrence Heights',
 'M7A',
 'Downtown Toronto',
 "Queen's Park, Ontario Provincial Government",
 'M8A',
 'Not assigned',
 'Not assigned',
 'M9A',
 'Etobicoke',
 'Islington Avenue, Humber Valley Village',
 'M1B',
 'Scarborough',
 'Malvern, Rouge',
 'M2B',
 'Not assigned',
 'Not assigned',
 'M3B',
 'North York',
 'Don Mills',
 'M4B',
 'East York',
 'Parkview Hill, Woodbine Gardens',
 'M5B',
 'Downtown Toronto',
 'Garden District, Ryerson',
 'M6B',
 'North York',
 'Glencairn',
 'M7B',
 'Not assigned',
 'Not assigned',
 'M8B',
 'Not assigned',
 'Not assigned',
 'M9B',
 'Etobicoke',
 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale',
 'M1C',
 'Scarborough',
 'Rouge Hill, Port Union, Highland Creek',
 'M2C',
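
The flat `cell` list interleaves the three columns, so every third element starts a new row. Here is a small sketch of regrouping such a list into rows, using a hypothetical `flat_cells` list with values taken from the output above.

```python
# Hypothetical flat list in the same layout as the `cell` output
flat_cells = ['M3A', 'North York', 'Parkwoods',
              'M4A', 'North York', 'Victoria Village']

n_cols = 3  # Postal Code, Borough, Neighbourhood

# Slice the flat list into consecutive chunks of n_cols elements
rows = [flat_cells[i:i + n_cols] for i in range(0, len(flat_cells), n_cols)]

print(rows)
```

This only works when every row has exactly three cells, which is why iterating over the `<tr>` tags row by row is the safer route.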


In [12]:
#create 3 lists, one per column, and fill them row by row
c0 = []
c1 = []
c2 = []
for row in data_table.find_all('tr'):
    cells = row.find_all('td')
    #the header row has no <td> cells, so only data rows pass this check
    if len(cells) == 3:
        c0.append(cells[0].find(string=True).rstrip())
        c1.append(cells[1].find(string=True).rstrip())
        c2.append(cells[2].find(string=True).rstrip())

In [13]:
#time to create a DataFrame, let's import pandas and numpy
import pandas as pd
import numpy as np

In [15]:
#build the DataFrame from the three column lists
df = pd.DataFrame(c0, columns = ["PostalCode"])
df["Borough"] = c1
df["Neighbourhood"] = c2
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [16]:
#drop the rows that contain 'Not assigned' in the 'Borough' column
index_names = df[ df['Borough'] == 'Not assigned' ].index 
df.drop(index_names, inplace = True) 
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [17]:
#reset the index
df.reset_index(inplace = True)
df

Unnamed: 0,index,PostalCode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [18]:
#delete the previous index
df.drop(columns=['index'], inplace = True)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [19]:
#determine the shape of the DataFrame
df.shape

(103, 3)
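
The drop, reset and delete steps above can also be written as one expression. The sketch below uses a hypothetical three-row frame named `toy` with the same columns as `df`. It filters with a boolean mask and passes `drop=True` to `reset_index`, so the old index is never added as a column.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as df
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only assigned boroughs, then renumber the rows from 0
# drop=True discards the old index instead of adding an 'index' column
toy_clean = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(toy_clean)
```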

<h3 id="item2">Part 2</h3>

To use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood.

In [21]:
#use the link to a csv file that has the geographical coordinates
url_csv = "https://cocl.us/Geospatial_data"
#convert the csv data to a pandas DataFrame
gc_df = pd.read_csv(url_csv)
gc_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


We now have a latitude and a longitude for every postal code in Toronto's boroughs. Note that the Postal Code column is sorted in ascending order!

In [22]:
#sort df by postal code so that its rows are in the same order as gc_df
sorted_df = df.sort_values(by = 'PostalCode')
sorted_df

Unnamed: 0,PostalCode,Borough,Neighbourhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
...,...,...,...
64,M9N,York,Weston
70,M9P,Etobicoke,Westmount
77,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [23]:
#retrieve the Latitude values from gc_df table
lat = gc_df['Latitude'].values
lat

array([43.8066863, 43.7845351, 43.7635726, 43.7709921, 43.773136 ,
       43.7447342, 43.7279292, 43.7111117, 43.716316 , 43.692657 ,
       43.7574096, 43.7500715, 43.7942003, 43.7816375, 43.8152522,
       43.7995252, 43.8361247, 43.8037622, 43.7785175, 43.7869473,
       43.7574902, 43.789053 , 43.7701199, 43.7527583, 43.7827364,
       43.7532586, 43.7459058, 43.7258997, 43.7543283, 43.7679803,
       43.7374732, 43.7390146, 43.7284964, 43.7616313, 43.7258823,
       43.7063972, 43.6953439, 43.6763574, 43.7090604, 43.7053689,
       43.685347 , 43.6795571, 43.6689985, 43.6595255, 43.7280205,
       43.7127511, 43.7153834, 43.7043244, 43.6895743, 43.6864123,
       43.6795626, 43.667967 , 43.6658599, 43.6542599, 43.6571618,
       43.6514939, 43.6447708, 43.6579524, 43.6505712, 43.6408157,
       43.6471768, 43.6481985, 43.7332825, 43.7116948, 43.6969476,
       43.6727097, 43.6626956, 43.6532057, 43.6289467, 43.6464352,
       43.6484292, 43.718518 , 43.709577 , 43.6937813, 43.6890

In [24]:
#retrieve the Longitude values from gc_df table
long = gc_df['Longitude'].values
long

array([-79.1943534, -79.1604971, -79.1887115, -79.2169174, -79.2394761,
       -79.2394761, -79.2620294, -79.2845772, -79.2394761, -79.2648481,
       -79.273304 , -79.2958491, -79.2620294, -79.3043021, -79.2845772,
       -79.3183887, -79.2056361, -79.3634517, -79.3465557, -79.385975 ,
       -79.3747141, -79.4084928, -79.4084928, -79.4000493, -79.4422593,
       -79.3296565, -79.352188 , -79.340923 , -79.4422593, -79.4872619,
       -79.4647633, -79.5069436, -79.4956974, -79.5209994, -79.3155716,
       -79.309937 , -79.3183887, -79.2930312, -79.3634517, -79.3493719,
       -79.3381065, -79.352188 , -79.3155716, -79.340923 , -79.3887901,
       -79.3901975, -79.4056784, -79.3887901, -79.3831599, -79.4000493,
       -79.3775294, -79.3676753, -79.3831599, -79.3606359, -79.3789371,
       -79.3754179, -79.3733064, -79.3873826, -79.3845675, -79.3817523,
       -79.3815764, -79.3798169, -79.4197497, -79.4169356, -79.4113072,
       -79.4056784, -79.4000493, -79.4000493, -79.3944199, -79.3

In [25]:
#add the Latitude and Longitude columns to sorted_df (relies on both tables listing the same postal codes in the same order)
sorted_df['Latitude'] = lat
sorted_df['Longitude'] = long
sorted_df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
22,M1G,Scarborough,Woburn,43.770992,-79.216917
26,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
64,M9N,York,Weston,43.706876,-79.518188
70,M9P,Etobicoke,Westmount,43.696319,-79.532242
77,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
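
Assigning the coordinate arrays by position works here because both tables cover the same postal codes in the same order. A minimal sketch of an alternative that does not depend on row order is to join the two tables on the postal code itself. The frames `neighbourhoods` and `coords` below are hypothetical stand-ins for `df` and `gc_df`, filled with three rows taken from the outputs above.

```python
import pandas as pd

# Hypothetical stand-in for df, deliberately not sorted by postal code
neighbourhoods = pd.DataFrame({
    "PostalCode": ["M1E", "M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})

# Hypothetical stand-in for gc_df, which names the key column 'Postal Code'
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Align the key column names, then match rows by postal code.
# validate="one_to_one" raises an error if a postal code is duplicated.
merged = neighbourhoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)

print(merged)
```

With `how="left"`, a postal code missing from the coordinates table would show up as `NaN` in `Latitude` and `Longitude` rather than silently shifting the other rows.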


In [31]:
#reset the index
sorted_df.reset_index(inplace = True)
sorted_df

Unnamed: 0,index,level_0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,0,0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,1,1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,2,2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,3,3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,4,4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...,...,...
98,98,98,M9N,York,Weston,43.706876,-79.518188
99,99,99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,100,100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,101,101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [32]:
#delete the previous index
sorted_df.drop(columns=['index'], inplace = True)
sorted_df

Unnamed: 0,level_0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...,...
98,98,M9N,York,Weston,43.706876,-79.518188
99,99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [33]:
#check the data type of each column
sorted_df.dtypes

level_0            int64
PostalCode        object
Borough           object
Neighbourhood     object
Latitude         float64
Longitude        float64
dtype: object

In [34]:
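#delete the leftover 'level_0' column (reset_index uses this name when an 'index' column already exists)
sorted_df.drop(columns=['level_0'], inplace = True)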
sorted_df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [35]:
#determine the shape of the final DataFrame
sorted_df.shape

(103, 5)