**Importing libraries needed.**

In [79]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np

**Fetching the wiki page with requests and parsing it with BeautifulSoup using the lxml parser.**

In [80]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(r,'lxml')
#print(soup.prettify())

**Getting the actual content of the table.**

In [81]:
tables = soup.find_all('table', class_='sortable')
#tables

**Getting the table headers and cells for our dataframe.**

In [82]:
for table in tables:
    ths = table.find_all('th')
    tds = table.find_all('td')
    headings = [th.text.strip() for th in ths]
    columns = [td.text.strip() for td in tds]
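
The loop above gathers every cell into one flat list, which is why the data has to be reshaped later. As an alternative sketch, each `<tr>` can be read as one record. The HTML snippet and the names `sample_html` and `rows` are made up for illustration, and `html.parser` stands in for `lxml` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, small enough to read inline.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="sortable")

# Header texts, stripped of surrounding whitespace.
headings = [th.get_text(strip=True) for th in table.find_all("th")]

# One inner list per table row; the header row has no <td> cells and is skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")
]

print(headings)
print(rows)
```

With the data already grouped by row, `pd.DataFrame(rows, columns=headings)` would build the frame directly, without counting rows or reshaping.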

**Loading the cells into a temporary single-column dataframe, which will later be reshaped to the dimensions we want. As part of pre-processing, we strip leading and trailing whitespace from each cell to make later string comparisons easier.**

In [85]:
df = pd.DataFrame(columns)
df = df.apply(lambda x: x.str.strip())
df.head()

Unnamed: 0,0
0,M1A
1,Not assigned
2,Not assigned
3,M2A
4,Not assigned


**Counting the number of rows by counting the Postcode cells, assuming every Postcode has the same format (the letter M, a digit, then an uppercase letter).**

In [86]:
df[0].str.contains(r'M\d[A-Z]').sum()

289
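
One caveat with `str.contains`: it matches the pattern anywhere inside a cell, so a Borough or Neighbourhood cell that happened to contain a postcode-like substring would also be counted. A minimal sketch of the difference against `str.fullmatch`, using a made-up series (the cell `"CFB Toronto M3K"` is invented for illustration and does not come from the page):

```python
import pandas as pd

# Hypothetical cells: two real-looking postcodes, one filler cell, and one
# invented cell that merely contains a postcode-like substring.
cells = pd.Series(["M1A", "Not assigned", "CFB Toronto M3K", "M2A"])

# contains() matches anywhere in the string.
loose = cells.str.contains(r"M\d[A-Z]").sum()

# fullmatch() requires the whole cell to be a postcode.
strict = cells.str.fullmatch(r"M\d[A-Z]").sum()

print(loose, strict)
```

Here the loose count includes the invented cell while the strict count does not. On the scraped data both approaches may well agree, so this is a safeguard rather than a correction.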

**Loading the final dataframe with the reshaped data and the correct column names.**

In [87]:
wiki_df = pd.DataFrame(data=np.reshape(df.values,(289,3)),columns=headings)
wiki_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
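
The hard-coded 289 ties the reshape to one snapshot of the page. A sketch of deriving the shape from the number of headings instead, on a small made-up list of cells (the variable `cells` is hypothetical and stands in for `df.values`):

```python
import numpy as np
import pandas as pd

headings = ["Postcode", "Borough", "Neighbourhood"]

# Flat list of cells in reading order, as produced by collecting every <td>.
cells = ["M1A", "Not assigned", "Not assigned",
         "M3A", "North York", "Parkwoods"]

# Guard: the flat list must divide evenly into rows of len(headings) cells.
assert len(cells) % len(headings) == 0

# -1 asks NumPy to infer the row count from the fixed column count.
wiki_df = pd.DataFrame(np.reshape(cells, (-1, len(headings))), columns=headings)

print(wiki_df.shape)
```

If the page gains or loses rows, the reshape still works as long as every row has one cell per heading.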


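**Per the requirement quoted below, we compare the Postcode of each row with that of the previous row and, when they match, join the two Neighbourhood values with a comma. We then keep only the last row for each Postcode. Note that this pairwise comparison merges at most two consecutive rows per Postcode.**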

>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [90]:
wiki_df['Neighbourhood'] = np.where(wiki_df['Postcode'] == wiki_df['Postcode'].shift(1),
                                      wiki_df['Neighbourhood']+','+wiki_df['Neighbourhood'].shift(1),
                                      wiki_df['Neighbourhood'])
# keep must be passed by keyword; recent pandas versions reject it positionally
wiki_df = wiki_df.drop_duplicates('Postcode', keep='last')
wiki_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park,Harbourfront"
7,M6A,North York,"Lawrence Manor,Lawrence Heights"
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
10,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,"Malvern,Rouge"
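
The shift-based comparison above only looks one row back, so a Postcode listed three or more times would keep just its last two neighbourhoods. A possible alternative, sketched here on made-up rows, is to group by Postcode and Borough and join all neighbourhoods in each group. The `M9Z` rows and the name `combined` are invented for illustration.

```python
import pandas as pd

# Small hypothetical frame: M5A appears twice, the invented M9Z three times.
wiki_df = pd.DataFrame({
    "Postcode": ["M4A", "M5A", "M5A", "M9Z", "M9Z", "M9Z"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "Example", "Example", "Example"],
    "Neighbourhood": ["Victoria Village", "Harbourfront", "Regent Park",
                      "A", "B", "C"],
})

# sort=False keeps postcodes in their original order of appearance;
# ",".join concatenates every neighbourhood in the group, however many there are.
combined = (
    wiki_df.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)

print(combined)
```

Unlike the shift-based version, the neighbourhoods come out in page order (`Harbourfront,Regent Park`), and the index is renumbered from 0.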


**Per the requirement quoted below, we find the rows where Borough is assigned but Neighbourhood is not, and copy the Borough value into Neighbourhood for those rows.**

>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [75]:
wiki_df['Neighbourhood'] = np.where((wiki_df['Neighbourhood']=='Not assigned')
                                    &(wiki_df['Borough']!='Not assigned'),
                                    wiki_df['Borough'],
                                    wiki_df['Neighbourhood'])
wiki_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park,Harbourfront"
7,M6A,North York,"Lawrence Manor,Lawrence Heights"
8,M7A,Queen's Park,Queen's Park
9,M8A,Not assigned,Not assigned
10,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,"Malvern,Rouge"
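
The same rule can also be written with a boolean mask and `.loc`, which some readers find easier to follow than nested `np.where` arguments. A minimal sketch on made-up rows (the name `needs_fill` is hypothetical):

```python
import pandas as pd

wiki_df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rows that have a borough but no neighbourhood.
needs_fill = (
    wiki_df["Neighbourhood"].eq("Not assigned")
    & wiki_df["Borough"].ne("Not assigned")
)

# Copy the borough into the neighbourhood for those rows only.
wiki_df.loc[needs_fill, "Neighbourhood"] = wiki_df.loc[needs_fill, "Borough"]

print(wiki_df)
```

Rows where both columns are `Not assigned`, such as M1A here, are left unchanged, matching the behaviour of the `np.where` version.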


**Per the requirement quoted below, the shape of the dataframe is printed.**

>In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [76]:
wiki_df.shape

(180, 3)
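
Since the requirement asks for the number of rows, it may help to note that `.shape` is an attribute returning a `(rows, columns)` tuple. A short sketch on a made-up two-row frame shows how to pull out just the row count:

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# shape unpacks into row and column counts; len(df) gives the rows as well.
n_rows, n_cols = df.shape
print(n_rows)
```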