# Data Science Capstone

This notebook is part of the online Data Science course I am taking.  
See the [course page](https://www.coursera.org/learn/applied-data-science-capstone) for details.

Import libraries

In [0]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Let's use BeautifulSoup to scrape the Toronto postal code table from Wikipedia

In [29]:
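from io import StringIO

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res.raise_for_status()  # stop here if the page could not be fetched
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # the first table on the page holds the postal codes
# read_html returns a list of dataframes; recent pandas versions expect a file-like object
df = pd.read_html(StringIO(str(table)))
print(df[0].to_json(orient='records'))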

[{"Postal code":"M1A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M2A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M3A","Borough":"North York","Neighborhood":"Parkwoods"},{"Postal code":"M4A","Borough":"North York","Neighborhood":"Victoria Village"},{"Postal code":"M5A","Borough":"Downtown Toronto","Neighborhood":"Regent Park \/ Harbourfront"},{"Postal code":"M6A","Borough":"North York","Neighborhood":"Lawrence Manor \/ Lawrence Heights"},{"Postal code":"M7A","Borough":"Downtown Toronto","Neighborhood":"Queen's Park \/ Ontario Provincial Government"},{"Postal code":"M8A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M9A","Borough":"Etobicoke","Neighborhood":"Islington Avenue"},{"Postal code":"M1B","Borough":"Scarborough","Neighborhood":"Malvern \/ Rouge"},{"Postal code":"M2B","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M3B","Borough":"North York","Neighborhood":"Don Mills"},{"Postal code":"M4B","Borough":"East York",
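`pd.read_html` returns a list with one dataframe per `<table>` element it finds, which is why the cells here index `df[0]`. A minimal offline sketch of that behaviour, assuming a hypothetical two-row HTML snippet in place of the Wikipedia page and an installed `lxml` parser:

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the Wikipedia table, so the sketch runs without network access.
html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html gives back a list, even when the markup contains a single table.
tables = pd.read_html(StringIO(html))
postal = tables[0]
print(type(tables).__name__, postal.shape)
print(postal)
```

The names `html`, `tables` and `postal` are only for illustration and are not used elsewhere in this notebook.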

In [30]:
data = df[0].to_json(orient='records')
data

'[{"Postal code":"M1A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M2A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M3A","Borough":"North York","Neighborhood":"Parkwoods"},{"Postal code":"M4A","Borough":"North York","Neighborhood":"Victoria Village"},{"Postal code":"M5A","Borough":"Downtown Toronto","Neighborhood":"Regent Park \\/ Harbourfront"},{"Postal code":"M6A","Borough":"North York","Neighborhood":"Lawrence Manor \\/ Lawrence Heights"},{"Postal code":"M7A","Borough":"Downtown Toronto","Neighborhood":"Queen\'s Park \\/ Ontario Provincial Government"},{"Postal code":"M8A","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M9A","Borough":"Etobicoke","Neighborhood":"Islington Avenue"},{"Postal code":"M1B","Borough":"Scarborough","Neighborhood":"Malvern \\/ Rouge"},{"Postal code":"M2B","Borough":"Not assigned","Neighborhood":null},{"Postal code":"M3B","Borough":"North York","Neighborhood":"Don Mills"},{"Postal code":"M4B","Borough":"East 

Now we pull each column out as a list so we can build the dataframe

In [0]:
code = df[0]["Postal code"].tolist()
borough = df[0]["Borough"].tolist()
neigh = df[0]["Neighborhood"].tolist()

In [0]:
canada = pd.DataFrame({'Postal Code': code, 'Borough': borough, 'Neighborhood': neigh})
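Going through three lists works, but the same frame could probably be built straight from the parsed table by renaming a single column. A sketch, assuming a hypothetical two-row stand-in `raw` for `df[0]`; `canada_alt` is an illustrative name, not one used in this notebook:

```python
import pandas as pd

# Hypothetical stand-in for df[0], the table parsed from Wikipedia.
raw = pd.DataFrame({
    "Postal code": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighborhood": [None, "Parkwoods"],
})

# rename returns a new frame and leaves raw untouched; only one label changes.
canada_alt = raw.rename(columns={"Postal code": "Postal Code"})
print(canada_alt.columns.tolist())
```

This avoids keeping three intermediate lists in sync, at the cost of being a little less explicit about which columns are kept.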

In [50]:
canada.head(10)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | NaN |
| 1 | M2A | Not assigned | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 7 | M8A | Not assigned | NaN |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern / Rouge |


In [40]:
canada.shape

(180, 3)

In [41]:
canada.tail(10)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 170 | M9Y | Not assigned | NaN |
| 171 | M1Z | Not assigned | NaN |
| 172 | M2Z | Not assigned | NaN |
| 173 | M3Z | Not assigned | NaN |
| 174 | M4Z | Not assigned | NaN |
| 175 | M5Z | Not assigned | NaN |
| 176 | M6Z | Not assigned | NaN |
| 177 | M7Z | Not assigned | NaN |
| 178 | M8Z | Etobicoke | Mimico NW / The Queensway West / South of Bloo... |
| 179 | M9Z | Not assigned | NaN |


Replace 'Not assigned' with NaN, because pandas treats NaN as a missing value and can filter it out directly (for example with `dropna`)

In [51]:
canada.replace("Not assigned", np.nan, inplace = True)
canada.head(5)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
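Once the placeholder string is gone, the missing values can be counted per column. A small sketch of the replacement step, assuming a hypothetical three-row excerpt named `sample` rather than the full `canada` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of the scraped table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A"],
    "Borough": ["Not assigned", "Not assigned", "North York"],
    "Neighborhood": [np.nan, np.nan, "Parkwoods"],
})

# Without inplace=True, replace returns a new frame and leaves sample unchanged.
cleaned = sample.replace("Not assigned", np.nan)

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Returning a new frame instead of using `inplace=True` is a design choice here: it keeps the original data available for comparison.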


We don't want rows without a Borough name, so we drop them and reset the index

In [53]:
canada.dropna(subset=["Borough"], axis=0, inplace=True)
canada.reset_index(drop=True, inplace=True)
canada.head(10)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern / Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill / Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District / Ryerson |
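The drop and the index reset can also be chained into one expression. A sketch, assuming a hypothetical four-row frame `sample` in which the first two rows have no Borough:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: the first two rows have no Borough.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": [np.nan, np.nan, "North York", "North York"],
    "Neighborhood": [np.nan, np.nan, "Parkwoods", "Victoria Village"],
})

# subset limits the NaN check to Borough; reset_index(drop=True) renumbers from 0
# and discards the old index instead of keeping it as a column.
kept = sample.dropna(subset=["Borough"]).reset_index(drop=True)
print(kept)
```

Without the reset, the surviving rows would keep their original labels 2 and 3, which is why the notebook output starts again at 0.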


After dropping the 77 rows without a Borough, we have 103 rows and 3 columns

In [55]:
print(canada.shape)

(103, 3)


Save the data to a CSV file for further use

In [0]:
canada.to_csv('canada.csv', index = False)
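It can be worth checking that the saved file reads back unchanged. A sketch of such a round trip, assuming a hypothetical two-row frame and an in-memory buffer instead of `canada.csv` so that nothing is written to disk:

```python
import io

import pandas as pd

# Hypothetical two-row excerpt of the cleaned table.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Write to an in-memory text buffer; index=False keeps the row labels out of the file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new frame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

With `index=False` the restored frame has exactly the three data columns; leaving it out would add an extra `Unnamed: 0` column on reload.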