Install bs4 and lxml if the environment doesn't include those packages.

In [1]:
pip install bs4

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


Import packages

In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Scrape the postal codes from Wikipedia

In [4]:
pc_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
origin = requests.get(pc_url).text

In [5]:
soup = BeautifulSoup(origin, "lxml")

In [6]:
table=soup.find('table')

Setting up the column names

In [7]:
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

Extract the postal code, borough, and neighborhood from each table row

In [8]:
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data
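The row-by-row loop above can also be written by collecting the rows in a plain list and building the dataframe once at the end, which avoids growing the dataframe one row at a time. The following is only a sketch: `sample_html`, `sample_table`, and `sample_df` are hypothetical names, and the inline HTML is a small stand-in for the Wikipedia table so that the example runs without a network connection. It uses Python's built-in `html.parser` rather than `lxml` only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    # The header row contains only <th> cells, so it yields an empty list here.
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # keep only complete three-column rows
        rows.append(cells)

# Build the dataframe once from the collected rows.
sample_df = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighborhood"])
print(sample_df)
```

The header row and the incomplete last row are both dropped by the length check, just as in the loop above.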

Check out the dataframe

In [9]:
df.head(10)

,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


Ignoring cells with a borough that is "Not assigned"

In [10]:
df=df[df['Borough']!='Not assigned']

As of 30.11.2020, 15:31 GMT+3, once the cells with a borough that is "Not assigned" are ignored, the Wikipedia page already presents the data as it was laid out on the assignment page. The remaining cleanup is therefore skipped, and the notebook continues with the shape check below.

If the data on the page had been laid out as described in the instructions, the code below would have worked.

df.loc[df['Neighborhood']=='Not assigned', 'Neighborhood'] = df['Borough']
df2=df.groupby('Postalcode')['Neighborhood'].apply(lambda x: ', '.join(x))
df2=df2.reset_index(drop=False)
df2.rename(columns={'Neighborhood':'Neighborhood_joined'},inplace=True)
df3 = pd.merge(df, df2, on='Postalcode')
df3.drop(['Neighborhood'],axis=1,inplace=True)
df3.drop_duplicates(inplace=True)
df3.rename(columns={'Neighborhood_joined':'Neighborhood'},inplace=True)
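To illustrate what this cleanup is meant to do, here is a minimal sketch on a toy dataframe. The rows in `toy` are made up for illustration (one postal code spread over two rows, and one row whose neighborhood is "Not assigned"), and the names `toy`, `mask`, and `combined` are hypothetical. It groups on both the postal code and the borough, which is a slightly shorter route than the merge-and-drop steps above and should give the same kind of result under the assumption that each postal code belongs to a single borough.

```python
import pandas as pd

# Made-up rows in the layout the instructions describe.
toy = pd.DataFrame({
    "Postalcode": ["M5A", "M5A", "M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Not assigned", "Islington Avenue"],
})

# Where the neighborhood is missing, fall back to the borough name.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

# Join the neighborhoods that share a postal code into one comma-separated string.
combined = (
    toy.groupby(["Postalcode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```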

In [11]:
df.shape

(103, 3)

Load the geo data from a .csv file, since the geocoder package is unreliable and did not work

In [17]:
df_g=pd.read_csv('http://cocl.us/Geospatial_data')

Check out the geo dataframe

In [18]:
df_g.head(10)

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


Rename the postal code column so that both dataframes share the same key for the merge

In [19]:
df_g.rename(columns={'Postal Code':'Postalcode'},inplace=True)

Merge the two dataframes

In [20]:
df_cons = pd.merge(df_g, df, on='Postalcode')
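The default merge is an inner join, so any postal code present in only one of the two dataframes is dropped silently. One possible way to check for that is an outer merge with `indicator=True`, sketched below on small hand-made frames. The names `geo`, `hoods`, `checked`, and `unmatched` are hypothetical, and the `M9Z` row is invented purely to produce a mismatch. The `validate="one_to_one"` argument additionally makes pandas raise an error if a postal code is duplicated on either side.

```python
import pandas as pd

# Three coordinate rows taken from the geo data shown above.
geo = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Neighborhood rows; "M9Z" is an invented code with no coordinates.
hoods = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek", "Example"],
})

# Outer merge keeps every key; the _merge column records where each row came from.
checked = pd.merge(
    geo, hoods, on="Postalcode",
    how="outer", indicator=True, validate="one_to_one",
)

# Rows that did not find a partner in the other dataframe.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Postalcode", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge used above loses nothing.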

Check out the merged dataframe

In [21]:
df_cons.head(10)

,Postalcode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
5,M1J,43.744734,-79.239476,Scarborough,Scarborough Village
6,M1K,43.727929,-79.262029,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,43.711112,-79.284577,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,43.716316,-79.239476,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,43.692657,-79.264848,Scarborough,"Birch Cliff, Cliffside West"


Reorder the columns as specified in the instructions

In [23]:
df_cons2=df_cons[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]

Check out the reordered dataframe

In [24]:
df_cons2.head(10)

,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
