# Capstone Project - week 3: Toronto Neighborhood

Import all the necessary libraries. In this notebook, the table is scraped twice: first with Pandas, then with BeautifulSoup. We start with Pandas.

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

## Dataframe with Pandas 

### Pandas - Obtaining the table

In [23]:
# Pass the URL to the read_html function.
toronto_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [24]:
# Let's obtain some information about the toronto_data
print(type(toronto_data))
print(len(toronto_data))
print("\n",toronto_data)

<class 'list'>
3

 [    Postal Code           Borough  \
0           M1A      Not assigned   
1           M2A      Not assigned   
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
..          ...               ...   
175         M5Z      Not assigned   
176         M6Z      Not assigned   
177         M7Z      Not assigned   
178         M8Z         Etobicoke   
179         M9Z      Not assigned   

                                          Neighborhood  
0                                         Not assigned  
1                                         Not assigned  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
..                                                 ...  
175                                       Not assigned  
176                                       Not assigned  
177                   

So we have a list of 3 items. The first item is the one we're interested in.
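Relying on position 0 works here, but the order of tables on a page can change. As a hedged sketch, the helper below (`pick_table` is a hypothetical name, and the frames are toy stand-ins for what `pd.read_html` returns) selects the table by its column names instead:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Toy stand-ins for the list returned by pd.read_html (illustrative data only).
tables = [
    pd.DataFrame({"Province": ["ON"], "Prefix": ["M"]}),
    pd.DataFrame({"Postal Code": ["M1A", "M3A"],
                  "Borough": ["Not assigned", "North York"],
                  "Neighborhood": ["Not assigned", "Parkwoods"]}),
]

selected = pick_table(tables, ["Postal Code", "Borough", "Neighborhood"])
print(selected.shape)
```

In the notebook this would be called as `pick_table(toronto_data, [...])`, assuming the column names stay as shown above.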

In [25]:
# Let's print it
toronto_data[0]

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [26]:
# let's check the type of the toronto_data[0]

type(toronto_data[0])

pandas.core.frame.DataFrame

It's a dataframe so we can assign it to a new variable

In [27]:
toronto_df = toronto_data[0]

In [28]:
toronto_df.shape

(180, 3)

There are 180 rows and 3 columns in the dataframe.

### Pandas - processing the dataframe

We have to remove the rows whose Borough is "Not assigned".

In [29]:
# Check whether any cells are "Not assigned"
print("Are there any Not assigned cells in Neighborhood?", (toronto_df["Neighborhood"] == "Not assigned").any())
print("Are there any Not assigned cells in Borough?", (toronto_df["Borough"] == "Not assigned").any())

Are there any Not assigned cells in Neighborhood? True
Are there any Not assigned cells in Borough? True


In [30]:
# Create a new dataframe without the rows where Borough is "Not assigned"
toronto_df_borough = toronto_df[toronto_df["Borough"] != "Not assigned"]

In [31]:
toronto_df_borough

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [32]:
(toronto_df_borough["Borough"] == "Not assigned").any()

False

There are no "Not assigned" cells left in the Borough column.
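The check above covers only the Borough column. A minimal sketch on a toy frame (the three rows are taken from the table above, the variable names are illustrative) counts leftover placeholders in every column at once:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Keep only the rows with an assigned borough (same boolean mask as above).
assigned = toy[toy["Borough"] != "Not assigned"]

# Count remaining placeholders per column instead of a single True/False.
leftover = (assigned == "Not assigned").sum()
print(leftover)
```

Applied to `toronto_df_borough`, the same expression would show whether any Neighborhood cell is still unassigned after the Borough filter.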

The filtered dataframe keeps the original row labels, so its index now starts at 2. Let's reset the index and discard the old one.

In [33]:
toronto_final_from_pandas = toronto_df_borough.reset_index(drop=True)
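A small sketch of the index reset on a toy frame: `reset_index()` moves the old index into a column named "index", while `reset_index(drop=True)` discards it directly. Both routes should give the same result:

```python
import pandas as pd

toy = pd.DataFrame({"Borough": ["Not assigned", "Not assigned", "North York", "Etobicoke"]})
filtered = toy[toy["Borough"] != "Not assigned"]  # the surviving labels are 2 and 3

# Route 1: keep the old index as a column, then drop that column.
two_steps = filtered.reset_index().drop(["index"], axis=1)

# Route 2: discard the old index while resetting.
one_step = filtered.reset_index(drop=True)

print(list(filtered.index))
print(one_step.equals(two_steps))
```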

### Pandas - The final dataframe: toronto_final_from_pandas

In [34]:
toronto_final_from_pandas.head(11)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


What about the size of the dataframe?

In [36]:
toronto_final_from_pandas.shape

(103, 3)

There are still 3 columns, and 103 rows are left.

### --------------------------------------------------------------------------------------------------------------------------------------------------------

## Dataframe with Beautifulsoup

### Beautifulsoup - Getting the table

The first step is to get the HTML code from the web page and convert it into a soup object.

In [37]:
web_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
web_html = urlopen(web_url)
web_text = web_html.read()
web_html.close()
page_soup = soup(web_text, "html.parser")
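The cell above closes the response by hand. As a sketch of an alternative, the response can be opened in a `with` block so it is closed even if `read()` raises. The helper name `fetch_html` and the `timeout` default are assumptions made for illustration:

```python
from urllib.request import urlopen

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes; the response is closed on exit."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()

# Usage (needs network access, so it is left commented out here):
# web_text = fetch_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
```

The returned bytes can be passed to the soup constructor exactly as `web_text` is above.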

Now that we have created the soup object, we can inspect it.

page_soup contains all of the HTML code of the page. To select only the table we're interested in, we use the findAll method to find the "table" elements with a "wikitable sortable" class (we can find this class by looking at the HTML code of the page).

In [62]:
postal_code_table = page_soup.findAll('table', {'class': 'wikitable sortable'})
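One detail worth knowing: passing the string "wikitable sortable" matches the class attribute as an exact string, so a table whose classes appear in another order would be missed. The sketch below uses made-up inline HTML to compare this with a CSS selector, which matches both classes in any order:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the real Wikipedia page.
html = """
<table class="wikitable sortable"><tr><th>Postal Code</th></tr></table>
<table class="sortable wikitable"><tr><th>Other</th></tr></table>
<table class="navbox"><tr><th>Nav</th></tr></table>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact attribute string: matches only the first table.
exact = doc.find_all("table", {"class": "wikitable sortable"})

# CSS selector: matches any table carrying both classes, in any order.
by_selector = doc.select("table.wikitable.sortable")

print(len(exact), len(by_selector))
```

`find_all` is the current spelling of `findAll`; both names refer to the same method.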


In [40]:
print(type(postal_code_table))

<class 'bs4.element.ResultSet'>


Now it is time to extract the data. The headers "Postal Code, Borough, Neighborhood" are in the "th" elements.

In [41]:
postal_code_table = postal_code_table[0]
headers = postal_code_table.findAll('th', {})
print(type(headers))
print(headers)

<class 'bs4.element.ResultSet'>
[<th>Postal Code
</th>, <th>Borough
</th>, <th>Neighborhood
</th>]


Now we extract the text of each header with the text attribute, drop the trailing newline with [:-1], and add the result to a list:

In [42]:
header_titles = []
for header in headers:
    header_titles.append(header.text[:-1])
print("header_titles:{}".format(header_titles))

header_titles:['Postal Code', 'Borough', 'Neighborhood']
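The slice `[:-1]` assumes every cell ends with exactly one newline. A hedged sketch on made-up markup shows how `get_text(strip=True)` trims surrounding whitespace without cutting a real character when the newline is absent:

```python
from bs4 import BeautifulSoup

# Illustrative cells: one ends with a newline, one is padded, one is clean.
html = "<table><tr><th>Postal Code\n</th><th> Borough </th><th>Neighborhood</th></tr></table>"
cells = BeautifulSoup(html, "html.parser").find_all("th")

sliced = [cell.text[:-1] for cell in cells]               # always drops the last character
stripped = [cell.get_text(strip=True) for cell in cells]  # trims whitespace only

print(sliced)
print(stripped)
```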


We have our list of column names; now we need all the data in the table, so we repeat the two previous steps.

The data are in the "tr" elements.

In [43]:
data_rows = postal_code_table.findAll("tr", {})
print(data_rows)

[<tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>, <tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>, <tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>, <tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>, <tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>, <tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>, <tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>, <tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>, <tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>, <tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>, <tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>, <tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>, <tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mill

In [44]:
len(data_rows)

181

The length of data_rows is 181, one more than the 180 rows of the dataframe, because the first row holds the headers. We just have to select data_rows from index 1 instead of 0.

In [45]:
data = data_rows[1:]

Let's extract the data from the first row in order to test the code

In [46]:
data[0]

<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>

In [47]:
first_row = data[0]
first_row_data = first_row.findAll('td', {})
print(first_row_data)

[<td>M1A
</td>, <td>Not assigned
</td>, <td>Not assigned
</td>]


In [48]:
text_data = []
for i in first_row_data:
    text_data.append(i.text[:-1])
print(text_data)

['M1A', 'Not assigned', 'Not assigned']


Now, we extract all the rows from the table and put them into a list.

In [49]:
table_rows = []
for row in data:
    table_row = []
    row_data = row.findAll('td', {})
    for data_point in row_data:
        table_row.append(data_point.text[:-1])
    table_rows.append(table_row)
print(table_rows)

# This nested loop has two purposes. The outer loop selects all the "td" cells of each row in the data variable.
# The inner loop extracts the text of each cell and adds it to that row's list.

[['M1A', 'Not assigned', 'Not assigned'], ['M2A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'], ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'], ['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"], ['M8A', 'Not assigned', 'Not assigned'], ['M9A', 'Etobicoke', 'Islington Avenue, Humber Valley Village'], ['M1B', 'Scarborough', 'Malvern, Rouge'], ['M2B', 'Not assigned', 'Not assigned'], ['M3B', 'North York', 'Don Mills'], ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'], ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'], ['M6B', 'North York', 'Glencairn'], ['M7B', 'Not assigned', 'Not assigned'], ['M8B', 'Not assigned', 'Not assigned'], ['M9B', 'Etobicoke', 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'], ['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek'], ['M2C', 'Not assigned', 'N
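The same nested loop can be written as a nested list comprehension. This is only a sketch on a small inline table (the markup and the names `table` and `rows` are illustrative), not a replacement for the cell above:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]  # one list per row
    for tr in table.find_all("tr")[1:]                         # skip the header row
]
print(rows)
```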

As we can see, table_rows is a list of lists. Each item of the outer list is one row, so pandas.DataFrame can build one dataframe row from each of them. We also use the columns parameter with our header_titles list to name each column.

### Beautifulsoup - Building the dataframe

In [50]:
toronto_df_from_soup = pd.DataFrame(table_rows, columns=header_titles)
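Scraped rows do not always have the expected number of cells, for example when a table uses merged cells. A minimal sketch of a width check before building the dataframe, on toy rows copied from the table above (`bad_rows` is an illustrative name):

```python
import pandas as pd

header_titles = ["Postal Code", "Borough", "Neighborhood"]
table_rows = [
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Every row should carry exactly one value per column title.
bad_rows = [row for row in table_rows if len(row) != len(header_titles)]
if bad_rows:
    raise ValueError(f"{len(bad_rows)} row(s) do not match the header width")

df = pd.DataFrame(table_rows, columns=header_titles)
print(df.shape)
```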

In [51]:
toronto_df_from_soup

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [52]:
toronto_df_from_soup = toronto_df_from_soup[toronto_df_from_soup["Borough"] != "Not assigned"]

In [53]:
toronto_final_from_soup = toronto_df_from_soup.reset_index(drop=True)

### Beautifulsoup - The final dataframe: toronto_final_from_soup

In [54]:
toronto_final_from_soup

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [56]:
toronto_final_from_soup.shape

(103, 3)

Both scraping methods produce a dataframe of the same shape, (103, 3), and the previews above show matching rows.
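Matching shapes do not prove that every cell matches. As a sketch, two frames can be compared cell by cell with `DataFrame.equals`, or with `pd.testing.assert_frame_equal`, which raises and reports the first difference. The frames below are toy stand-ins for the two results:

```python
import pandas as pd

columns = ["Postal Code", "Borough", "Neighborhood"]
rows = [["M3A", "North York", "Parkwoods"],
        ["M4A", "North York", "Victoria Village"]]

# Toy stand-ins for the Pandas result and the BeautifulSoup result.
from_pandas = pd.DataFrame(rows, columns=columns)
from_soup = pd.DataFrame(rows, columns=columns)

same = from_pandas.equals(from_soup)  # True only if values, dtypes and labels agree
print(same)

# Raises an AssertionError describing the first mismatch, if there is one.
pd.testing.assert_frame_equal(from_pandas, from_soup)
```

In the notebook the call would be `toronto_final_from_pandas.equals(toronto_final_from_soup)`. The result depends on both frames holding the same strings and dtypes, which is not verified here.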

## Thank you for taking the time to look at it