### Importing and installing the required libraries
This notebook scrapes the postcodes, boroughs and neighborhoods of Toronto from the Wikipedia page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [2]:
import requests
# !pip install beautifulsoup4  # for reading the data from the web page
# !pip install html5lib  # the parser used to parse the HTML
from bs4 import BeautifulSoup

In [4]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'html5lib')
# print(soup.prettify())  # shows the tag nesting; the table we need has the class 'wikitable sortable'

In [5]:
# The table with the class 'wikitable sortable' holds both the layout and the data we need
my_table = soup.find('table', {'class': 'wikitable sortable'})

#### Explanation of the arrangement of the table contents in the HTML file
Inspecting the markup shows that each row is enclosed between `<tr>` and `</tr>`, and each cell value between `<td>` and `</td>`.
The names of the boroughs and neighborhoods are links, enclosed between `<a>` and `</a>`.
Rows whose borough is 'Not Assigned' are ignored; where only the neighborhood is 'Not Assigned', the borough name is used as the neighborhood.
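
The structure described above can be tried on a small hand-written table before touching the real page. This is only a sketch: the sample rows, the names `sample_html` and `records`, and the use of the built-in `html.parser` (instead of `html5lib`) and `get_text(strip=True)` (instead of `find(text=True)`) are assumptions made so that the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the layout described above (illustrative only,
# not the actual Wikipedia markup).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a></td></tr>
  <tr><td>M7A</td><td><a href="#">Queen's Park</a></td><td>Not assigned</td></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

records = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:   # the header row only has <th> cells
        continue
    links = row.find_all("a")
    if not links:   # no linked names: the borough is unassigned, so skip the row
        continue
    code = cells[0].get_text(strip=True)
    boro = cells[1].get_text(strip=True)
    # With a single link only the borough is named, so reuse it as the neighborhood.
    hood = cells[2].get_text(strip=True) if len(links) == 2 else boro
    records.append((code, boro, hood))

print(records)
```

The third sample row has no links at all, so it is dropped, which matches how unassigned boroughs are handled in the cell below.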

In [8]:
# Looking for the rows by tracing 'tr' and getting the rows in the variable 'table_rows'
table_rows = my_table.find_all('tr')

# defining three lists, postcode, borough and neighborhood, to store the respective information extracted from the HTML file
postcode = []
borough = []
neighborhood = []

# running the for loop to get the contents of each row
for row in table_rows:
    row_values = row.find_all('td')  # each cell value is specified between <td> and </td>
    
    if len(row_values) != 0:  # skip the header row, which has no <td> cells
        cell_values = row.find_all('a')  # the borough and neighborhood names are links between <a> and </a>
        
        if len(cell_values) == 1:  # only the borough is linked, so the neighborhood is 'Not Assigned'
            postcode.append(row_values[0].find(text=True))  # find(text=True) returns the first plain-text string, without any HTML formatting
            borough.append(row_values[1].find(text=True))
            neighborhood.append(row_values[1].find(text=True))  # use the borough name, because the neighborhood in the table is 'Not Assigned'
        elif len(cell_values) == 2:  # both the borough and the neighborhood are linked
            postcode.append(row_values[0].find(text=True))
            borough.append(row_values[1].find(text=True))
            neighborhood.append(row_values[2].find(text=True))

print(postcode[:10])
print(borough[:10])
print(neighborhood[:10])

    

['M3A', 'M4A', 'M5A', 'M5A', 'M6A', 'M6A', 'M7A', 'M9A', 'M1B', 'M1B']
['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto', 'North York', 'North York', "Queen's Park", 'Etobicoke', 'Scarborough', 'Scarborough']
['Parkwoods', 'Victoria Village', 'Harbourfront', 'Regent Park', 'Lawrence Heights', 'Lawrence Manor', "Queen's Park", 'Islington Avenue', 'Rouge', 'Malvern']


#### Creating an empty dataframe and assigning columns with values from the above three lists

In [11]:
import pandas as pd
df = pd.DataFrame()
df['PostalCode'] = postcode
df['Borough'] = borough
df['Neighborhood'] = neighborhood

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
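
The same kind of frame can also be built in a single step by passing a dictionary of lists to `pd.DataFrame`, rather than starting from an empty dataframe. A minimal sketch, using the first three values printed earlier as stand-in data; the name `sample` is hypothetical:

```python
import pandas as pd

# Stand-in data: the first three entries of each list printed earlier.
postcode = ['M3A', 'M4A', 'M5A']
borough = ['North York', 'North York', 'Downtown Toronto']
neighborhood = ['Parkwoods', 'Victoria Village', 'Harbourfront']

# Each dictionary key becomes a column name, each list becomes that column's values.
sample = pd.DataFrame({
    'PostalCode': postcode,
    'Borough': borough,
    'Neighborhood': neighborhood,
})

print(sample)
```

Both approaches give the same columns; the dictionary form simply requires all three lists to have the same length at construction time.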


#### Merging the rows that share a postcode, so that the neighborhood names form a list separated by commas

In [13]:
# Grouping by postcode, keeping the first borough name and joining the neighborhood names with ', '
df = df.groupby('PostalCode').agg({'Borough':'first', 'Neighborhood': ', '.join}).reset_index()
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Scarborough, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Scarborough |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Scarborough, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough |
| 9 | M1N | Scarborough | Birch Cliff, Scarborough |
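
To see what the `groupby` and `agg` step does in isolation, it can help to run it on a toy frame. The sketch below reuses three rows from the earlier `df.head()` output; the names `toy` and `merged` are hypothetical.

```python
import pandas as pd

# Three rows taken from the un-merged frame: M5A appears twice.
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# 'first' keeps the first borough seen in each group;
# ', '.join concatenates the group's neighborhood names into one string.
merged = toy.groupby('PostalCode').agg(
    {'Borough': 'first', 'Neighborhood': ', '.join}
).reset_index()

print(merged)
```

The two `M5A` rows collapse into one row whose neighborhood is `Harbourfront, Regent Park`, and because `groupby` sorts the group keys by default, `M3A` comes first in the result.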


In [14]:
print('The number of rows in the dataframe is:', df.shape[0])

The number of rows in the dataframe is: 101
