# Segmenting and Clustering Neighborhoods in Toronto | Oleksandr Tsapin
Peer-graded Assignment, 13.11.2020

### PART 1 (QUESTION 1)
#### Scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Step 1. Fetch the HTML from the URL using urllib.request

In [1]:
# import the library we will be using to connect to the Wikipedia page and fetch the contents of that page
import urllib.request

In [2]:
# specify the URL of the Wikipedia page we are looking to scrape
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

Step 2. Use the power of BeautifulSoup to extract and work with the data

In [4]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

In [5]:
# We use Beautiful Soup to parse the HTML we fetched into our 'page' variable and store the result in a new variable
# called 'soup'. We name the parser explicitly (here the "lxml" option) so that Beautiful Soup does not have to
# guess one and warn about it.

soup = BeautifulSoup(page, "lxml")

In [6]:
# To get an idea of the structure of the underlying HTML in our web page, we can view the code in 
# two ways: a) right click on the web page itself and click View Source 
# or b) use Beautiful Soup’s prettify function and check it out right there in our Jupyter Notebook.

#print(soup.prettify())

In [7]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML 
# and store in 'all_tables' variable
all_tables = soup.find_all("table")
#all_tables

In [8]:
# Looking through the output of 'all_tables' we can see that the class attribute of our chosen table
# is "wikitable sortable". We can use this to get Beautiful Soup to bring back only this particular table
# and keep it in a variable called 'right_table'
right_table = soup.find('table', class_='wikitable sortable')
#right_table

In [9]:
# We know that the table is set up in rows (starting with <tr> tags) with the data sitting 
# within <td> tags in each row. We aren’t too worried about the header row with the <th> elements 
# as we know what each of the columns represent by looking at the table.

# Let's start looping through the rows
# There are three columns in our table that we want to scrape the data from so we will set up 
# three empty lists (A, B, and C) to store our data in.

# To start with, we want to use the Beautiful Soup ‘find_all’ function again and set it to look for 
# the string ‘tr’. We will then set up a FOR loop for each row within that array and set Python to loop through 
# the rows, one by one.

# Within the loop we are going to use find_all again to search each row for <td> tags with the ‘td’ string. 
# We will add all of these to a variable called ‘cells’ and then check to make sure that there are 3 items 
# in our ‘cells’ array (i.e. one for each column).

# If there are, we use the find(text=True) option to extract the content string from within each <td> element
# in that row and add them to the A-C lists we created at the start of this step. Let's have a look at the code:

A = []
B = []
C = []

for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        A.append(cells[0].find(text = True))
        B.append(cells[1].find(text = True))
        C.append(cells[2].find(text = True))

In [10]:
print(A[:10], B[:10], C[:10])

['M1A\n', 'M2A\n', 'M3A\n', 'M4A\n', 'M5A\n', 'M6A\n', 'M7A\n', 'M8A\n', 'M9A\n', 'M1B\n'] ['Not assigned\n', 'Not assigned\n', 'North York\n', 'North York\n', 'Downtown Toronto\n', 'North York\n', 'Downtown Toronto\n', 'Not assigned\n', 'Etobicoke\n', 'Scarborough\n'] ['Not assigned\n', 'Not assigned\n', 'Parkwoods\n', 'Victoria Village\n', 'Regent Park, Harbourfront\n', 'Lawrence Manor, Lawrence Heights\n', "Queen's Park, Ontario Provincial Government\n", 'Not assigned\n', 'Islington Avenue, Humber Valley Village\n', 'Malvern, Rouge\n']


###### Attention!

We see an unwanted \n at the end of each item in the lists. This is the newline character. Let's remove it before converting the data into a pandas dataframe.

In [11]:
# strip('\n') removes leading and trailing newline characters from each item
A_clean = [x.strip('\n') for x in A]
B_clean = [x.strip('\n') for x in B]
C_clean = [x.strip('\n') for x in C]

print(A_clean[:10], B_clean[:10], C_clean[:10])

['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A', 'M1B'] ['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto', 'North York', 'Downtown Toronto', 'Not assigned', 'Etobicoke', 'Scarborough'] ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights', "Queen's Park, Ontario Provincial Government", 'Not assigned', 'Islington Avenue, Humber Valley Village', 'Malvern, Rouge']
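
As an alternative sketch, Beautiful Soup's `get_text(strip=True)` returns a cell's text with the surrounding whitespace already removed, so extraction and cleaning can happen in one pass. The example below runs on a small inline HTML snippet (made up for illustration, not the live Wikipedia page) and uses the built-in `html.parser`, so it does not depend on lxml. The names `sample_html`, `sample_soup`, `sample_table` and `rows` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample that mimics the structure of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
  <tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park, Harbourfront
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # skips the header row, which only has <th> cells
        # get_text(strip=True) drops the trailing newline of each cell
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting whole rows instead of three parallel lists also keeps each postal code next to its borough and neighbourhood.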


Step 3. Transform the data into a pandas dataframe

In [12]:
# Pandas lets us convert lists into dataframes which are 2 dimensional data structures with rows and 
# columns, very much like spreadsheets or SQL tables.

# We’ll import pandas and create a dataframe with it, assigning each of the lists A-C into a column 
# with the name of our source table columns i.e. Postal_Code, Borough, Neighbourhood.

import pandas as pd
df = pd.DataFrame(A_clean,columns=['Postal_Code'])
df['Borough'] = B_clean
df['Neighbourhood'] = C_clean
df

Unnamed: 0,Postal_Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
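
The same dataframe could probably also be built in a single call by passing a dictionary that maps column names to lists. This is only a sketch with two illustrative rows; in the notebook the lists would be `A_clean`, `B_clean` and `C_clean`, and the names `codes`, `boroughs`, `hoods` and `sample_df` are hypothetical.

```python
import pandas as pd

# Small illustrative lists standing in for A_clean, B_clean and C_clean.
codes = ["M3A", "M4A"]
boroughs = ["North York", "North York"]
hoods = ["Parkwoods", "Victoria Village"]

# Each dictionary key becomes a column name, each list becomes that column's values.
sample_df = pd.DataFrame({
    "Postal_Code": codes,
    "Borough": boroughs,
    "Neighbourhood": hoods,
})
print(sample_df.shape)
```

All lists must have the same length, otherwise pandas raises a `ValueError`.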


Step 4. Clean the data

In [13]:
# Drop rows if 'Borough' column is 'Not assigned'
Not_assigned = df[df['Borough'] == 'Not assigned'].index
df.drop(Not_assigned, inplace = True)
df

Unnamed: 0,Postal_Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [14]:
# Reset the index and assign the result to df2, the main dataframe we are going to work with.
df2 = df.reset_index(drop = True)
df2

Unnamed: 0,Postal_Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
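
The drop and reset steps above can also be sketched as one expression with a boolean mask. This version leaves the original dataframe untouched instead of modifying it in place. The three-row `sample` frame and the name `assigned` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
sample = pd.DataFrame({
    "Postal_Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows whose borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```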


In [15]:
# Assignment: If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be 
# the same as the borough.

# Let's check whether any 'Not assigned' values remain in the dataframe.
if 'Not assigned' not in df2.values:
    print('Element does not exist in Dataframe')

Element does not exist in Dataframe
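
No replacement was needed for this dataset, but if such rows did exist, the assignment rule could be applied roughly as follows. This is a sketch on a made-up two-row frame (`sample` and `missing` are hypothetical names), not data taken from the scraped table.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood name.
sample = pd.DataFrame({
    "Postal_Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask for rows whose neighbourhood is missing.
missing = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only.
sample.loc[missing, "Neighbourhood"] = sample.loc[missing, "Borough"]
print(sample["Neighbourhood"].tolist())
```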


In [16]:
# Use the .shape attribute to print the number of rows and columns of the dataframe.
df2.shape

(103, 3)

### PART 2 (QUESTION 2)
#### Use the Geocoder Python package to add the latitude and the longitude coordinates of each neighborhood to the dataframe.

Step 1. Get geospatial data

In [17]:
# import geocoder
import geocoder 

latitude=[]
longitude=[]

# The Google geocoding API did not work for me, so I use the ArcGIS provider instead.
# Each lookup is repeated until it returns coordinates.
for postal_code in df2['Postal_Code']:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    print(postal_code, g.latlng)
    while (g.latlng is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        print(postal_code, g.latlng)
    latlng = g.latlng
    latitude.append(latlng[0])
    longitude.append(latlng[1])

M3A [43.75245000000007, -79.32990999999998]
M4A [43.73057000000006, -79.31305999999995]
M5A [43.65512000000007, -79.36263999999994]
M6A [43.72327000000007, -79.45041999999995]
M7A [43.66253000000006, -79.39187999999996]
M9A [43.662630000000036, -79.52830999999998]
M1B [43.811390000000074, -79.19661999999994]
M3B [43.74923000000007, -79.36185999999998]
M4B [43.70718000000005, -79.31191999999999]
M5B [43.65739000000008, -79.37803999999994]
M6B [43.70687000000004, -79.44811999999996]
M9B [43.65034000000003, -79.55361999999997]
M1C [43.78574000000003, -79.15874999999994]
M3C [43.72168000000005, -79.34351999999996]
M4C [43.68970000000007, -79.30681999999996]
M5C [43.65215000000006, -79.37586999999996]
M6C [43.69211000000007, -79.43035999999995]
M9C [43.64857000000006, -79.57824999999997]
M1E [43.765750000000025, -79.17469999999997]
M4E [43.67709000000008, -79.29546999999997]
M5E [43.64536000000004, -79.37305999999995]
M6E [43.68784000000005, -79.45045999999996]
M1G [43.76812000000007, -79.2
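
The `while` loop above retries forever if a postal code never resolves. A bounded retry could be sketched as below. The helper `geocode_with_retry`, its parameters and the stand-in `flaky_lookup` are hypothetical; in the notebook the lookup would be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable that returns [lat, lng] or None.
    """
    for attempt in range(max_attempts):
        latlng = lookup(query)
        if latlng is not None:
            return latlng
        if attempt < max_attempts - 1:
            time.sleep(delay)  # pause before the next attempt
    return None  # the caller decides how to handle an unresolved query


# Stand-in lookup that fails twice before succeeding, so no network is needed.
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.75245, -79.32991]

result = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
print(result, calls["n"])
```

Returning `None` after the last attempt makes the failure visible instead of hanging the notebook cell.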

Step 2. Add new geospatial data to our dataframe df2

In [18]:
df2['Latitude'] = latitude
df2['Longitude'] = longitude
df2

Unnamed: 0,Postal_Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945


In [19]:
# Use the .shape attribute to print the number of rows and columns of the dataframe.
df2.shape

(103, 5)

### PART 3 (QUESTION 3)
#### Explore and cluster the neighborhoods in Toronto. 

Step 1. Create a map of Toronto with neighborhoods superimposed on top.

In [20]:
# Print number of unique boroughs and neighborhoods in Toronto
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df2['Borough'].unique()),
        len(df2['Neighbourhood'].unique())
    )
)

The dataframe has 10 boroughs and 99 neighborhoods.
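
The dataframe has 103 rows but only 99 distinct neighborhood values, so some names must appear under more than one postal code. A sketch of how such repeats could be found, on a made-up three-row frame (`sample` and `repeats` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows in which one neighbourhood name appears twice.
sample = pd.DataFrame({
    "Borough": ["North York", "North York", "Etobicoke"],
    "Neighbourhood": ["Downsview", "Downsview", "Mimico"],
})

# nunique() counts distinct values; here it matches len(series.unique()).
print(sample["Borough"].nunique(), sample["Neighbourhood"].nunique())

# duplicated() flags every repeat of a value after its first occurrence.
repeats = sample.loc[sample["Neighbourhood"].duplicated(), "Neighbourhood"].tolist()
print(repeats)
```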


In [21]:
# Use geopy library to get the latitude and longitude values of Toronto

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
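
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` would fail with an `AttributeError`. A small guard could look like the sketch below. The helper `coordinates_for` and the two `Fake...` classes are hypothetical stand-ins so the example runs without a network call; in the notebook the first argument would be the `Nominatim` geolocator.

```python
def coordinates_for(geolocator, address):
    """Return (latitude, longitude) or raise if the address cannot be resolved."""
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


# Stand-ins used only so the sketch runs offline.
class FakeLocation:
    latitude, longitude = 43.6534817, -79.3839347

class FakeGeolocator:
    def geocode(self, address):
        # Only one address is "known"; everything else returns None.
        return FakeLocation() if address == "Toronto, Ontario" else None

print(coordinates_for(FakeGeolocator(), "Toronto, Ontario"))
```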


In [22]:
# Install Folium to visualize the map of Toronto
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Step 2. I decided to segment and cluster only the boroughs whose names contain the word Toronto. So let's filter the original dataframe and create a new dataframe, 'toronto_data'.

In [23]:
toronto_data = df2[df2['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal_Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
3,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
4,M4E,East Toronto,The Beaches,43.67709,-79.29547
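
To show what the filter does, here is a minimal sketch of `str.contains` on a made-up four-row frame (`sample` and `mask` are hypothetical names). Passing `regex=False` is an optional choice that makes pandas treat the pattern as plain text.

```python
import pandas as pd

# Hypothetical borough names; two of them contain the word "Toronto".
sample = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "Etobicoke"],
})

# str.contains returns a boolean Series with one True/False per row.
mask = sample["Borough"].str.contains("Toronto", regex=False)
print(sample[mask]["Borough"].tolist())
```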


In [24]:
# Let's get the geographical coordinates of Downtown Toronto.
address = 'Downtown Toronto, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [25]:
# Let's visualize Downtown Toronto with the neighborhoods of all boroughs whose names contain the word Toronto superimposed on top.

# create map of Downtown Toronto using latitude and longitude values
map_DToronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DToronto)  
    
map_DToronto

Step 3. Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.