# Question 1: Scrape Wikipedia for the Data

Build the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, 
extract the table of postal codes, and load the data into a pandas dataframe.

#### Get the Wikipedia HTML Page

In [229]:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
html = BeautifulSoup(resp,'lxml')
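
The request above parses whatever body comes back, including an error page. A minimal sketch of a more defensive fetch is shown below; the helper names `fetch_html` and `parse_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook, and the built-in `html.parser` is used here only so the sketch does not depend on `lxml`.

```python
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper).

    raise_for_status() turns 4xx/5xx responses into exceptions, so an
    error page is never handed to the parser by mistake.
    """
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text

def parse_html(text):
    """Parse HTML text with the parser bundled with Python (hypothetical helper)."""
    return BeautifulSoup(text, 'html.parser')

# Intended use in the notebook (not executed here, since it needs network access):
# html = parse_html(fetch_html(URL))
```

Splitting the fetch from the parse also makes it possible to test the parsing step on a saved or hand-written HTML snippet.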

#### Extract the Data from the Table in the Wikipedia Page and Store It in a Dataframe

In [172]:
import pandas as pd
import numpy as np
import re

table = html.find('table',{'class':'wikitable sortable'})

colnames = []
for header in table.find_all('th'):
    # raw string so that \s is read as a regex class, not a string escape
    header = re.sub(r'<.*?>\s+', '', header.text)
    colnames.append(header.strip())

trs = table.find_all('tr')
row_count = len(trs)-1 # the first tr holds the header columns
df = pd.DataFrame(columns=colnames, index=range(0,row_count))

for i in range(1,len(trs)):
    tr = trs[i]
    row = []
    for field in tr.find_all('td'):
        # raw string so that \s is read as a regex class, not a string escape
        field = re.sub(r'<.*?>\s+', '', field.text).strip()
        row.append(field)
    df.loc[i-1] = row
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
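
The extraction above pre-allocates an empty dataframe and fills it one row at a time. As an alternative sketch, the rows can be collected in a plain list and the dataframe built once at the end. The `sample_html` snippet and the `table_to_frame` helper are made up for illustration; the real page is assumed to have the same header-row-then-data-rows layout.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption: same structure).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def table_to_frame(table):
    """Convert a parsed <table> into a DataFrame (hypothetical helper)."""
    rows = table.find_all('tr')
    # get_text(strip=True) removes the trailing newlines inside each cell
    colnames = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    records = [[td.get_text(strip=True) for td in tr.find_all('td')]
               for tr in rows[1:]]
    return pd.DataFrame(records, columns=colnames)

soup = BeautifulSoup(sample_html, 'html.parser')
sample_df = table_to_frame(soup.find('table', {'class': 'wikitable sortable'}))
print(sample_df)
```

Building the frame from a list avoids repeated `df.loc[...]` assignments, and `get_text(strip=True)` removes the need for the regular expression.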


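#### Clean-Up:
* Remove rows whose Borough is 'Not assigned'
* Where the value of Neighbourhood is 'Not assigned', set it to the value of Borough
* Combine rows that share a postal code and borough into a single row, with the neighbourhoods joined into a comma-delimited string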

In [255]:
# Get names of indexes for which column Borough has value "Not assigned"
na_locs = df[ df['Borough'] =='Not assigned'].index
# Delete the rows corresponding to the indexes from dataFrame
df.drop(na_locs , inplace=True)

# Where the value of Neighbourhood is 'Not assigned', set it to the value of 'Borough'
df.loc[df['Neighbourhood'] =='Not assigned' , 'Neighbourhood'] = df['Borough']

# Group the data by postal code and borough,
# so that each postal code has a single row whose neighbourhoods
# are stored as a comma-delimited string in the Neighbourhood field
df = df.groupby(['Postcode','Borough'], sort=False).agg(', '.join).reset_index()
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
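
To see what each clean-up step does in isolation, the same logic can be run on a small hand-made frame. The `toy` data below is invented for illustration and only mimics the shape of the scraped table; the filtering uses a boolean mask rather than `drop`, which is an equivalent formulation and not the notebook's exact code.

```python
import pandas as pd

# Invented rows that exercise all three clean-up rules.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Lawrence Heights',
                      'Lawrence Manor', 'Not assigned'],
})

# Step 1: keep only rows with an assigned borough.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Step 2: fall back to the borough name where the neighbourhood is missing.
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# Step 3: one row per (Postcode, Borough), neighbourhoods joined with ', '.
toy = (toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index())
print(toy)
```

Here the M1A row is dropped, the two M6A rows collapse into one, and the M7A neighbourhood takes the borough name, which mirrors rows 3 and 4 of the output above.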


In [174]:
df.shape

(103, 3)
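
Beyond checking the shape, a few assertions can confirm that the clean-up rules actually hold. The sketch below uses a hypothetical `check_clean` helper and a two-row example frame; in the notebook it would be called as `check_clean(df)`.

```python
import pandas as pd

def check_clean(frame):
    """Raise AssertionError if a clean-up rule is violated (hypothetical helper)."""
    assert not (frame['Borough'] == 'Not assigned').any(), 'unassigned borough left'
    assert not (frame['Neighbourhood'] == 'Not assigned').any(), 'unassigned neighbourhood left'
    assert frame['Postcode'].is_unique, 'postal code appears in more than one row'
    return frame.shape

# Invented example frame with the same columns as the cleaned data.
example = pd.DataFrame({
    'Postcode': ['M3A', 'M6A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Lawrence Heights, Lawrence Manor'],
})
print(check_clean(example))
```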