# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto - Notebook I

## Introduction

In this notebook we first scrape a Wikipedia page to get a table of neighborhoods in Toronto and create a data frame from it. In a second notebook, we will use the Foursquare location data to complete this table. In a third notebook, we will create maps to explore the data.

We scrape the data from "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" using the beautifulsoup4 package.

## Scraping the HTML Wiki page

We start by importing the basic Python libraries:

In [1]:
import pandas as pd
import numpy as np

We now download the whole HTML page with requests and parse it with BeautifulSoup4:

In [2]:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml') # We use the lxml parser

We can test a few BeautifulSoup features if we want; some sample commands are kept in the comment below.
<!--
# print(soup.prettify())
# print(soup.title)
# print(soup.title.text)
# print(soup.find('div', class_='printfooter').text)

// Nice video about how to use BeautifulSoup:
// https://www.youtube.com/watch?v=ng2o98k983k
-->

Let us capture the first sortable table of the page, which holds the postal codes:

In [3]:
table_on_page = soup.find('table', class_='wikitable sortable')

A few more test commands are kept in the comment below.
<!--
#print(table_on_page)
# Test:
#table_header1 = table_on_page.tbody.th
#print(table_header1.text)
-->

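We read through the HTML table headers and save their text into a list, removing the trailing '\n' when it appears. Because the search runs over the whole page rather than only table_on_page, the header of another table ('Canadian postal codes') is also collected:

In [4]:
#print(soup.find_all('th'))
# Collect the text of every <th> on the page, dropping a trailing newline if present
names_th = []
for word in soup.find_all('th'):
    text = word.text
    names_th.append(text[:-1] if text.endswith('\n') else text)
names_th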

['Postcode', 'Borough', 'Neighbourhood', 'Canadian postal codes']
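
As a side note, the extra header can be avoided by searching inside the captured table only. The following is a minimal sketch on a hypothetical miniature page (the HTML string and the variable names html, table and headers are assumptions made for illustration, not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one data table plus an unrelated second table
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
<table class="navbox"><tr><th>Canadian postal codes</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, used here instead of lxml
table = soup.find("table", class_="wikitable sortable")

# Searching inside `table` ignores the <th> of the second table;
# get_text(strip=True) removes the surrounding whitespace, including '\n'
headers = [th.get_text(strip=True) for th in table.find_all("th")]
print(headers)
```

With this scoping, slicing off the unwanted fourth header is no longer necessary.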

And we **create a pandas Data Frame** with the first three headers:

In [5]:
# pandas was already imported as pd in the first cell
df = pd.DataFrame(columns=names_th[:3])
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|


We now read through the HTML table cells and save their text into a list:

In [6]:
values_th = []
for val in soup.find_all('td'):
    values_th.append(val.text)
#values_th

We take that list and assign the values three at a time to the data frame until we reach the last row, which we add separately. Stopping at the last postcode ('M9Z') avoids picking up the cells of the other table at the bottom of the page, which we do not need. We also remove the trailing '\n' from the last element of each row.

In [7]:
# DataFrame.append was removed in pandas 2.0, so we collect the rows in a list first
rows = []
i = 0

# Looping through the list until we reach the last line (i < 500 is a safety bound)
while values_th[3*i] != 'M9Z' and i < 500:
    #print(values_th[3*i+0],values_th[3*i+1],values_th[3*i+2])
    rows.append({names_th[0]:values_th[3*i+0], names_th[1]:values_th[3*i+1], names_th[2]:values_th[3*i+2][:-1]})
    i += 1

# Dealing with the last line
rows.append({names_th[0]:values_th[3*i+0], names_th[1]:values_th[3*i+1], names_th[2]:values_th[3*i+2][:-1]})

# Build the data frame in one go, keeping the column order of the headers
df = pd.DataFrame(rows, columns=names_th[:3])
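
The "three at a time" grouping can also be illustrated with list slicing. This is only a sketch on a hypothetical flat list (cells and rows are made-up names, and the values are a small sample rather than the full scrape):

```python
# Hypothetical flat list of cell texts, in the order the <td> elements are read
cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "M4A", "North York", "Victoria Village\n",
]

# Slice the list in steps of three: one slice per table row
# (any incomplete group at the end is ignored)
usable = len(cells) - len(cells) % 3
rows = [cells[i:i + 3] for i in range(0, usable, 3)]

# Strip the trailing newline carried by the last cell of each row
rows = [[value.rstrip("\n") for value in row] for row in rows]

for row in rows:
    print(row)
```

Each inner list can then be used as one row of the data frame.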

Checking the head of the data frame:

In [8]:
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


And checking the tail to make sure we incorporated values until the last row:

In [9]:
df.tail()

|     | Postcode | Borough | Neighbourhood |
|-----|----------|---------|---------------|
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |
| 287 | M9Z | Not assigned | Not assigned |


The shape of the data frame is the following:

In [10]:
df.shape

(288, 3)

**We now have our scraped HTML table converted into a data frame!**

## Shaping the data frame

Here we shape the data frame to match the requirements of the assignment.

We rename the column 'Postcode' to 'PostCode':

In [11]:
df.rename(columns={'Postcode':'PostCode'}, inplace=True)
df.head()

|   | PostCode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


We remove rows for which "Not assigned" is present in the column 'Borough' (there will remain a "Not assigned" value in the column 'Neighbourhood'):

In [12]:
df = df[~df['Borough'].isin(["Not assigned"])]

In [13]:
df.head(10)

|    | PostCode | Borough | Neighbourhood |
|----|----------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [14]:
df.shape

(211, 3)

We now need to reset the index for the loop coming after.

In [15]:
df.reset_index(drop=True, inplace=True)
df.head(10)

|   | PostCode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
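
The filtering and index reset above can be tried on a small example. This is a minimal sketch on a hypothetical three-row frame (toy and assigned are illustrative names), using a plain comparison instead of isin, which is equivalent when only one value is excluded:

```python
import pandas as pd

# Hypothetical sample with one unassigned borough
toy = pd.DataFrame({
    "PostCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows whose borough is assigned, then renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Note that the row for Queen's Park is kept: only the 'Borough' column is tested, so its "Not assigned" neighbourhood remains for now.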


Let's do the loop. For each row, we compare it with the next one. If the postcode and the borough match, we prepend the current neighbourhood name to that of the next row and replace the postcode of the current row with the letter X, so that we can remove it afterwards. If they don't match, we move on to the next row.

In [16]:
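# Columns by position: 0 = PostCode, 1 = Borough, 2 = Neighbourhood
for i in range(len(df.index)-1):
    #print(i, df.iloc[i,0], df.iloc[i,1], df.iloc[i,2])

    # Same postcode and borough on two consecutive rows: merge into the next row
    if (df.iloc[i,0] == df.iloc[i+1,0] and df.iloc[i,1] == df.iloc[i+1,1]):
        # print("match")
        df.iloc[i+1,2] = df.iloc[i,2] + ', ' + df.iloc[i+1,2]
        df.iloc[i,0] = 'X'  # mark the current row for removal

df.head(10)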

|   | PostCode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | X | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 4 | X | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | X | Scarborough | Rouge |
| 9 | M1B | Scarborough | Rouge, Malvern |


We remove the rows marked with X and reset the index:

In [17]:
df = df[~df['PostCode'].isin(["X"])]
df.reset_index(drop=True, inplace=True)

In [18]:
df.head(10)

|   | PostCode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Not assigned |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [19]:
df.shape

(103, 3)
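
For comparison, the merge done by the loop and the removal of the marked rows can also be expressed with a pandas groupby. This is a sketch of an alternative technique on a hypothetical sample (toy and merged are illustrative names), not the method used in this notebook:

```python
import pandas as pd

# Hypothetical sample with repeated postcodes
toy = pd.DataFrame({
    "PostCode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Group rows sharing a postcode and borough, then join their neighbourhood names.
# sort=False keeps the groups in order of first appearance.
merged = (
    toy.groupby(["PostCode", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```

One difference to keep in mind: groupby also merges matching rows that are not adjacent, whereas the loop only compares each row with the one that follows it.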

We need to replace the Neighbourhood name "Not assigned" by the Borough name.

In [20]:
df = df.replace({'Neighbourhood':'Not assigned'},{'Neighbourhood':df['Borough']})
df.head(10)     

|   | PostCode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


We can see that the value for Queen's Park has been replaced.

In [21]:
df.shape

(103, 3)
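
The replacement of "Not assigned" neighbourhoods by the borough name can also be written with a boolean mask, which some readers may find easier to follow. This is a minimal sketch on a hypothetical two-row frame (toy and missing are illustrative names):

```python
import pandas as pd

# Hypothetical sample with one unassigned neighbourhood
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# True for the rows whose neighbourhood is missing
missing = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```

The number of rows is unchanged by this step, which matches the shape check above.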