## Selectively scraping Wikipedia tables with Python


Use this notebook to build the code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the data in its table of postal codes, and transforms that data into a pandas dataframe.

In [1]:
# import the needed libraries

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


In [75]:
# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighbourhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [3]:
# Use the data from Wikipedia

# Collect each column in a plain list; the lists are assigned to the dataframe in a later cell
Postcode = []
Borough = []
Neighbourhood = []

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
# Skip the header row, then read the first three cells of every remaining row
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    # Skip rows with fewer than three cells so the three lists stay the same length
    if len(data) < 3:
        continue
    Postcode.append(data[0].text)
    Borough.append(data[1].text)
    Neighbourhood.append(data[2].text)
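
The row-by-row extraction above can be tried offline on a small inline table. This is only a sketch: `SAMPLE_HTML` and `extract_rows` are hypothetical names introduced for illustration, the sample rows are made up to mimic the page layout, and it uses the built-in `html.parser` backend instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not real scraped content.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M5A</td></tr>
</table>
"""


def extract_rows(html, n_columns=3):
    """Return the text of the first n_columns cells of each data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    # [1:] skips the header row
    for tr in table.find_all("tr")[1:]:
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        # Incomplete rows are dropped rather than partially recorded
        if len(cells) >= n_columns:
            rows.append(cells[:n_columns])
    return rows


sample_rows = extract_rows(SAMPLE_HTML)
print(sample_rows)
```

Because `get_text(strip=True)` trims surrounding whitespace, the trailing newline in the last cell of each row does not reach the result in this variant.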
   

In [76]:
# Fill the dataframe columns with the scraped values
neighborhoods['Postcode'] = Postcode 
neighborhoods[ 'Borough'] = Borough
neighborhoods[ 'Neighbourhood'] = Neighbourhood
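
As an alternative to filling an empty dataframe column by column, the three sequences could be passed to the `pd.DataFrame` constructor in one step. The sketch below assumes three equal-length lists; the names `postcodes`, `boroughs`, `neighbourhood_names` and `sample_df` are hypothetical, and the two rows are copied from the output shown at the end of this notebook.

```python
import pandas as pd

# Hypothetical lists standing in for the scraped columns
postcodes = ["M3A", "M4A"]
boroughs = ["North York", "North York"]
neighbourhood_names = ["Parkwoods", "Victoria Village"]

# A dict of equal-length lists becomes one column per key, in insertion order
sample_df = pd.DataFrame(
    {
        "Postcode": postcodes,
        "Borough": boroughs,
        "Neighbourhood": neighbourhood_names,
    }
)
print(sample_df)
```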

In [77]:
# Ignore rows whose borough is 'Not assigned'; copy so later assignments do not act on a view
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned'].copy()
# Remove '\n' from the Neighbourhood text
neighborhoods['Neighbourhood'] = neighborhoods['Neighbourhood'].replace(r'\n', '', regex=True)
# Rows that share a postcode are combined into one row, with the neighbourhoods separated by commas
neighborhoods = neighborhoods.groupby('Postcode', sort=False).agg({'Borough': 'first', 'Neighbourhood': ', '.join}).reset_index()
# If a row has a borough but a 'Not assigned' neighbourhood, the neighbourhood is set to the borough
neighborhoods['Neighbourhood'] = np.where(neighborhoods['Neighbourhood'] == 'Not assigned', neighborhoods['Borough'], neighborhoods['Neighbourhood'])
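
The four cleaning rules can be checked on a tiny hand-built dataframe before running them on the scraped data. This is a minimal sketch: `toy` and `cleaned` are hypothetical names, and the rows are constructed to hit each rule once (an unassigned borough, a postcode listed twice, and an unassigned neighbourhood).

```python
import numpy as np
import pandas as pd

# Hypothetical input that mimics the raw scraped columns
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned\n", "Harbourfront\n", "Regent Park\n", "Not assigned\n"],
    }
)

# 1. Drop rows without a borough
cleaned = toy[toy["Borough"] != "Not assigned"].copy()
# 2. Strip the trailing newline from the neighbourhood text
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].str.replace("\n", "", regex=False)
# 3. Merge rows that share a postcode, joining the neighbourhoods with commas
cleaned = (
    cleaned.groupby("Postcode", sort=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
# 4. Fall back to the borough when the neighbourhood is unassigned
cleaned["Neighbourhood"] = np.where(
    cleaned["Neighbourhood"] == "Not assigned",
    cleaned["Borough"],
    cleaned["Neighbourhood"],
)
print(cleaned)
```

Passing `sort=False` to `groupby` keeps the postcodes in the order they first appear, which matches the order of the source table.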



In [78]:
neighborhoods.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
