# IBM Capstone Project

### This notebook is for the IBM Capstone project offered through Coursera.

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

### Retrieving wikipage and extracting data into a dataframe

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
page_soup = BeautifulSoup(source, 'lxml')
#print(page_soup.prettify())

Inspecting the wiki page shows that the desired data sits under the table > tbody > tr > td tags.
Using BeautifulSoup, we extract all of the 'td' tags, append their text to a list, and then split that list
into the three columns (postcode, borough, neighbourhood). This also picks up unwanted cells from a
second table at the bottom of the wiki page, but because that table is small we can keep the code simple by
slicing those cells off the ends of the lists.

In [4]:
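# collect the text of every table cell, in page order
data = []
for e in page_soup.find_all('td'):
    data.append(e.text)

# each row has 3 cells, so a stride of 3 selects one column;
# the negative slices drop the trailing cells that come from the second table
postcode = data[0::3][:-12]
borough = data[1::3][:-11]
neighbourhoods = data[2::3][:-11]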
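
To make the stride slicing above concrete, here is a minimal sketch on a made-up list of cell texts. The `cells` list and the variable names are assumptions for illustration only; they are not taken from the actual wiki page.

```python
# Hypothetical flat list of cell texts, in the row-major order
# that find_all('td') would return for a 3-column table.
cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "M4A", "North York", "Victoria Village\n",
]

# A stride of 3 picks every third cell, starting at each column's offset.
postcodes = cells[0::3]
boroughs = cells[1::3]
hoods = cells[2::3]

# Grouping the same iterator three times rebuilds whole rows instead.
rows = list(zip(*[iter(cells)] * 3))

print(postcodes)
print(rows[1])
```

Rebuilding rows with `zip` can make it easier to spot where the cells of a second table begin, since a misaligned row stands out immediately.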

In [5]:
len(postcode), len(borough), len(neighbourhoods) # checking the num of values to make sure they are the same

(289, 289, 289)

In [6]:
df = pd.DataFrame({'Postcode': postcode, 'Borough': borough, 'Neighbourhoods': neighbourhoods})
df['Neighbourhoods'] = df['Neighbourhoods'].str.replace('\n', '', regex=False) # removing the newline character from neighbourhoods
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhoods
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
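
The newline cleanup can be checked on a small example. This is a sketch on a made-up Series: passing `regex=False` makes the literal-match behaviour explicit, which matters because the default of the `regex` parameter differs between pandas versions, and `str.strip()` is shown as a possible alternative.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
s = pd.Series(["Parkwoods\n", "Victoria Village\n"])

# regex=False treats '\n' as a plain newline character, not a pattern.
cleaned = s.str.replace("\n", "", regex=False)

# str.strip() removes leading and trailing whitespace, including newlines.
stripped = s.str.strip()

print(cleaned.tolist())
```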


### Data cleaning

Remove rows whose Borough is 'Not assigned', as they are not useful. This also removes every row where both Borough and Neighbourhoods are 'Not assigned': a postcode without a borough never has a neighbourhood, although a postcode with a borough can still have no neighbourhood assigned.

In [7]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhoods
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
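
The filtering step can be sketched on a toy frame. The three rows below are illustrative assumptions, chosen so that one row has an assigned borough but an unassigned neighbourhood.

```python
import pandas as pd

# Toy data: one row with no borough, one fully assigned row,
# and one row with a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhoods": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Boolean mask: True for rows whose borough is assigned.
keep = toy["Borough"] != "Not assigned"

# Keep only those rows and renumber the index from 0.
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```

The row for M7A survives the filter with its neighbourhood still 'Not assigned', which is why a separate replacement step is needed later.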


In [8]:
print(f"num of samples: {len(df)}\n")
print(f"unique postcodes: {df['Postcode'].nunique()}")
print(f"unique boroughs: {df['Borough'].nunique()}")

num of samples: 212

unique postcodes: 103
unique boroughs: 11


Next, we group the rows by postcode and borough, joining the neighbourhoods of each group with commas.

In [9]:
df_grp = df.groupby(['Postcode', 'Borough']).agg(lambda x: ', '.join(x)).reset_index()
df_grp.head()

Unnamed: 0,Postcode,Borough,Neighbourhoods
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
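
As a small illustration of the grouping step, the sketch below joins neighbourhoods on a made-up three-row frame. It selects the `Neighbourhoods` column explicitly before aggregating, which is one possible variant rather than the exact call used in the cell above.

```python
import pandas as pd

# Toy data: M5A appears twice with different neighbourhoods.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhoods": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhoods joined with ', '.
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhoods"]
       .agg(lambda x: ", ".join(x))
       .reset_index()
)
print(grouped)
```

By default `groupby` sorts by the group keys, so M3A comes before M5A in the result.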


Check the number of samples and unique postcodes after grouping

In [10]:
print(f"num of samples: {len(df_grp)}\n")
print(f"unique postcodes: {df_grp['Postcode'].nunique()}")
print(f"unique boroughs: {df_grp['Borough'].nunique()}")

num of samples: 103

unique postcodes: 103
unique boroughs: 11
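
The same check can be written as assertions, so that a notebook run stops if grouping ever leaves duplicate postcodes. This is a sketch on a made-up frame; `toy` is a hypothetical stand-in for the grouped DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

# After grouping, every postcode should appear exactly once.
one_row_per_postcode = toy["Postcode"].is_unique
assert one_row_per_postcode
assert len(toy) == toy["Postcode"].nunique()
```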


Find all values in the "Neighbourhoods" column that are labeled "Not assigned" and replace each with the value from its respective "Borough".

In [11]:
df_grp[df_grp['Postcode'] == 'M7A'] # taking a 'before' operation sample 

Unnamed: 0,Postcode,Borough,Neighbourhoods
85,M7A,Queen's Park,Not assigned


In [12]:
for i in df_grp.index[df_grp['Neighbourhoods'] == 'Not assigned'].tolist():
    df_grp.loc[i, 'Neighbourhoods'] = df_grp.loc[i, 'Borough']

In [13]:
df_grp[df_grp['Postcode'] == 'M7A'] # 'after' operation sample

Unnamed: 0,Postcode,Borough,Neighbourhoods
85,M7A,Queen's Park,Queen's Park
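
The replacement can also be done without an explicit loop. The sketch below uses a boolean mask with `.loc` on a made-up two-row frame; it is an alternative formulation under the same assumptions, not the code the notebook ran.

```python
import pandas as pd

# Toy data: the second row has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhoods": ["Harbourfront, Regent Park", "Not assigned"],
})

# Select the unassigned rows and copy the borough over in one step.
mask = toy["Neighbourhoods"] == "Not assigned"
toy.loc[mask, "Neighbourhoods"] = toy.loc[mask, "Borough"]
print(toy)
```

The assignment aligns on the index, so each masked row receives the borough from the same row.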


In [18]:
df_grp

Unnamed: 0,Postcode,Borough,Neighbourhoods
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Number of samples in final df

In [15]:
len(df_grp)

103