### Scrape Wikipedia Data into a Pandas DataFrame with BeautifulSoup

#### Get data from Wikipedia and load it into a dictionary

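Here we fetch the Wikipedia page that lists the Canadian postal codes starting with M and use a combination of BeautifulSoup and regex/string operations to extract three fields from each table row: postal code, borough, and neighborhood. The results are first stored in a dictionary keyed by postal code, where each value is a `[borough, neighborhood]` list.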

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

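# Download the page HTML.
page_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(page_html, 'html.parser')

# Take the first table with class "wikitable sortable" and collect all of its cells.
my_table = soup.find('table', {'class': 'wikitable sortable'})
links = my_table.find_all('td')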
groups={}
# Iterate over the cells in groups of 3: each table row holds a postal code, a borough and a neighborhood.
for i in range(0,len(links),3):
    # Extract the postal code with a string slice that strips the surrounding <td> and </td>. All postal codes have the same length and format.
    postcode=str(links[i])[4:-5]
    # Extract the borough from the title attribute of its link. A cell without a title leaves the borough empty.
    try:
        borough = re.search(r'title="(.+?)"',str(links[i+1])).group(1)
    except AttributeError:
        borough = '' 
    # Extract the neighborhood from the title attribute of its link when one is present.
    try:
        neighborhood = re.search(r'title="(.+?)"',str(links[i+2])).group(1)
    except AttributeError:
        # Otherwise fall back to the plain text inside the <td> tag.
        try:
            neighborhood = re.search(r'<td>(.+?)\n</td>',str(links[i+2])).group(1)
        except AttributeError:
            neighborhood = ''
    # Skip rows whose borough could not be extracted.
    if borough != '':
        # Fall back to the borough name when the neighborhood is missing or 'Not assigned'.
        if neighborhood == '' or neighborhood == 'Not assigned':
            neighborhood = borough
        # When a postal code appears more than once, append the neighborhood to the one already stored.
        if postcode in groups:
            groups[postcode][1] = groups[postcode][1] + ',' + neighborhood
        else:
            groups[postcode] = [borough, neighborhood]
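
The extraction and grouping logic in the loop above can be tried without a network request. The following is a minimal sketch, not the notebook's own cell: the `cells` list is hypothetical sample data shaped like `str(tag)` for a few `<td>` cells, and the helper `cell_text` and the pattern names `TITLE_RE` and `PLAIN_RE` are introduced here for illustration only.

```python
import re

# Illustrative patterns: a link's title attribute, and the bare text of a <td> cell.
TITLE_RE = re.compile(r'title="(.+?)"')
PLAIN_RE = re.compile(r'<td>(.+?)\n?</td>')


def cell_text(cell_html):
    """Return the link title if present, else the bare cell text, else ''."""
    match = TITLE_RE.search(cell_html) or PLAIN_RE.search(cell_html)
    return match.group(1) if match else ''


# Hypothetical sample cells, three per table row (postal code, borough, neighborhood).
cells = [
    '<td>M1A</td>',
    '<td>Not assigned</td>',
    '<td>Not assigned\n</td>',
    '<td>M3A</td>',
    '<td><a href="/wiki/North_York" title="North York">North York</a></td>',
    '<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>\n</td>',
    '<td>M6A</td>',
    '<td><a href="/wiki/North_York" title="North York">North York</a></td>',
    '<td>Lawrence Heights\n</td>',
    '<td>M6A</td>',
    '<td><a href="/wiki/North_York" title="North York">North York</a></td>',
    '<td>Lawrence Manor\n</td>',
]

groups = {}
for i in range(0, len(cells), 3):
    postcode = cells[i][4:-5]  # strip the leading '<td>' and trailing '</td>'
    borough_match = TITLE_RE.search(cells[i + 1])
    if borough_match is None:
        continue  # no linked borough, so the row is skipped
    borough = borough_match.group(1)
    neighborhood = cell_text(cells[i + 2])
    if neighborhood in ('', 'Not assigned'):
        neighborhood = borough  # fall back to the borough name
    if postcode in groups:
        groups[postcode][1] += ',' + neighborhood  # merge repeated postal codes
    else:
        groups[postcode] = [borough, neighborhood]

print(groups)
```

With this sample input, the `M1A` row is dropped because its borough has no link, and the two `M6A` rows are merged into a single entry with a comma-separated neighborhood string.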

#### Convert the dictionary into a DataFrame with the required format

Here the dictionary is converted into a DataFrame with one row per postal code and three string columns: PostalCode, Borough, and Neighborhood.

In [2]:
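# The postal codes are the dictionary keys, so they become the index first.
df=pd.DataFrame.from_dict(groups,orient='index',columns=['Borough','Neighborhood'])
# Move the postal codes out of the index into a regular column named PostalCode.
df.reset_index(inplace=True)
df.rename(columns={'index':'PostalCode'},inplace=True)
# Store all three columns with the pandas string dtype.
df=df.astype({'PostalCode': 'string','Borough': 'string','Neighborhood':'string'})
print('Sample:',df.head(10))
print('Dataframe shape:',df.shape)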

Sample:   PostalCode                 Borough                     Neighborhood
0        M3A              North York                        Parkwoods
1        M4A              North York                 Victoria Village
2        M5A        Downtown Toronto                      Regent Park
3        M6A              North York  Lawrence Heights,Lawrence Manor
4        M7A        Downtown Toronto           Queen's Park (Toronto)
5        M9A  Queen's Park (Toronto)           Queen's Park (Toronto)
6        M1B    Scarborough, Toronto  Rouge, Toronto,Malvern, Toronto
7        M3B              North York                  Don Mills North
8        M4B               East York   Woodbine Gardens,Parkview Hill
9        M5B        Downtown Toronto          Ryerson,Garden District
Dataframe shape: (100, 3)
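
The same conversion can also be written as a single method chain. This is a hedged sketch on a small, hypothetical stand-in dictionary (`sample_groups`), not the scraped data: naming the index with `rename_axis` before `reset_index` is one possible alternative to the separate `rename` call.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dictionary: postal code -> [borough, neighborhood].
sample_groups = {
    'M3A': ['North York', 'Parkwoods'],
    'M6A': ['North York', 'Lawrence Heights,Lawrence Manor'],
}

sample_df = (
    pd.DataFrame.from_dict(sample_groups, orient='index',
                           columns=['Borough', 'Neighborhood'])
    .rename_axis('PostalCode')  # name the index so reset_index creates the column directly
    .reset_index()
)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

The chained form avoids `inplace=True` and produces the same three-column layout, with `PostalCode` first.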
