# Question 1

# Significance of Imported Libraries

* ### Beautiful Soup for Web Scraping and Parsing the HTML Source of the Wikipedia Table
* ### Requests for Making the HTTP Request to the Wikipedia List Page
* ### Pandas and NumPy for Processing Tabular Data (Including .csv Files) in Python

In [33]:
from bs4 import BeautifulSoup
import requests

import pandas as pd
import numpy as np

### The Code Below Sends an HTTP Request to Wikipedia and Creates a Beautiful Soup Object to Parse the HTML Source of the List of Postal Codes of Canada

In [34]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
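
The request above assumes the page always loads successfully. A slightly more defensive sketch checks the HTTP status before parsing. The helper names `parse_html` and `fetch_soup` and the 10-second timeout are assumptions for illustration, not part of the original notebook:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def parse_html(html):
    """Parse an HTML string with Python's built-in 'html.parser' backend."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return parse_html(response.text)
```

Calling `fetch_soup(URL)` would give the same kind of `soup` object as the cell above, but a failed request stops the notebook early instead of producing an empty parse later on.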

### Find the Table with the Class wikitable sortable in the HTML Source and Take Its tbody Tag

In [35]:
table = soup.find('table',{'class':'wikitable sortable'}).tbody

### Fetch the Column Names of the Table from the th Tags in the HTML Source and Print Them

In [36]:
rows = table.find_all('tr')
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
print(columns)

['Postal code', 'Borough', 'Neighborhood']
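
To see the table lookup and header extraction without a network call, the same steps can be tried on a small inline snippet. The HTML below is made up to resemble the Wikipedia markup (an assumption), and `get_text(strip=True)` is used here as an alternative to `.text.replace('\n','')`:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia page (assumption for illustration)
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</tbody>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="wikitable" matches any table whose class list contains "wikitable"
sample_table = sample_soup.find("table", class_="wikitable")
sample_rows = sample_table.find_all("tr")

# get_text(strip=True) removes the trailing newline that .text keeps
sample_columns = [th.get_text(strip=True) for th in sample_rows[0].find_all("th")]
print(sample_columns)  # ['Postal code', 'Borough', 'Neighborhood']
```

Matching on the single class `wikitable` is a little more tolerant than matching the exact string `'wikitable sortable'`, which depends on the order of the classes in the page source.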


### Create an Empty DataFrame with the Column Names

In [37]:
df = pd.DataFrame(columns=columns)
df

| | Postal code | Borough | Neighborhood |
|---|---|---|---|


### Parse the HTML Source, Fill the Row Values from the td Tags, and Display the First Five Rows of the DataFrame

In [38]:
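# Collect the cleaned cell values of every data row (rows[0] is the header row)
records = []
for row in rows[1:]:
    values = [td.text.replace('\n','').replace(' /',',') for td in row.find_all('td')]
    records.append(values)

# DataFrame.append was removed in pandas 2.0, so build the DataFrame in one step
df = pd.DataFrame(records, columns=columns)
df.head()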

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
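
The two `replace` calls in the cell above do the cell clean-up: the first drops the newline that ends each cell, and the second turns the ` /` separator between neighbourhoods into a comma. A minimal sketch on made-up raw cell text shows the effect. The helper name `clean_cell` is hypothetical, and the ` / ` separator in the raw text is an assumption inferred from the cleaning code:

```python
import pandas as pd


def clean_cell(text):
    """Drop newlines and turn ' /' separators into commas."""
    return text.replace("\n", "").replace(" /", ",")


# Made-up raw cell texts (assumption: neighbourhoods are separated by " / ")
raw_rows = [
    ["M4A\n", "North York\n", "Victoria Village\n"],
    ["M5A\n", "Downtown Toronto\n", "Regent Park / Harbourfront\n"],
]

# Clean every cell, then build the DataFrame from the full list in one call
cleaned = [[clean_cell(cell) for cell in row] for row in raw_rows]
sample_df = pd.DataFrame(cleaned, columns=["Postal code", "Borough", "Neighborhood"])
print(sample_df)
```

Building the DataFrame once from a list of rows also avoids copying the whole frame on every iteration, which is what growing it row by row does.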


* ### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [39]:
df = df.loc[df['Borough'] != 'Not assigned']
df_processed = df.reset_index(drop=True)
df_processed.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
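
The filtering step is a boolean mask: the comparison yields one `True`/`False` value per row, and `.loc` keeps only the `True` rows. A small sketch on a four-row sample, with values taken from the output shown earlier and the names `sample`, `mask`, and `assigned` chosen for illustration:

```python
import pandas as pd

# Four rows copied from the scraped output shown earlier
sample = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["", "", "Parkwoods", "Victoria Village"],
})

# True for rows whose borough is assigned, False otherwise
mask = sample["Borough"] != "Not assigned"

# Keep the True rows and renumber the index from 0
assigned = sample.loc[mask].reset_index(drop=True)
print(assigned)
```

Without the index reset, the surviving rows would keep their old labels (2 and 3 here), which is why the notebook renumbers them after filtering.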


* ### Sort the DataFrame by Postal Code and Rename the Column Postal code to PostalCode

In [40]:
df_processed.sort_values(by=['Postal code'],inplace=True)
df_processed = df_processed.reset_index(drop=True)
df_processed.rename(columns={"Postal code": "PostalCode"},inplace=True)
df_processed[0:12]

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
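
The sort, index reset, and rename can also be written as one method chain that returns a new DataFrame instead of modifying the existing one with `inplace=True`. This is only an alternative sketch on a three-row sample, with the names `sample` and `result` chosen for illustration:

```python
import pandas as pd

# Three rows taken from the outputs above, deliberately out of order
sample = pd.DataFrame({
    "Postal code": ["M3A", "M1C", "M1B"],
    "Borough": ["North York", "Scarborough", "Scarborough"],
    "Neighborhood": [
        "Parkwoods",
        "Rouge Hill, Port Union, Highland Creek",
        "Malvern, Rouge",
    ],
})

result = (
    sample.sort_values(by="Postal code")            # order rows by postal code
    .reset_index(drop=True)                         # renumber the index from 0
    .rename(columns={"Postal code": "PostalCode"})  # rename the sorted column
)
print(result)
```

Because every step returns a new object, `sample` keeps its original order and column names, which makes the cell safe to re-run.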


In [41]:
df_processed.shape

(103, 3)