<h1>Applied Data Science Capstone</h1>

<p>This notebook is used to complete the Capstone Project, which is part of the Applied Data Science Specialization by IBM.</p>

In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [3]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


<h2>Scraping Canadian postal codes from Wikipedia</h2>

In [4]:
source_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', timeout=30).text  # timeout so the request cannot hang indefinitely

In [5]:
soup = BeautifulSoup(source_page, 'lxml') #lxml is the parser


## Extract the table from the HTML

In [6]:
postal_table = soup.find('table', class_='wikitable')
postal_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>
<tr>
<td>M4B
</td>
<td>Ea

## Creating a DataFrame from the table

In [7]:
# Print the text of the header row (the first <tr> in the table)
print(postal_table.tr.text)


Postal code

Borough

Neighborhood



In [8]:
#Make headers for later use
headers = ['Postal code', 'Borough', 'Neighborhood']

In [9]:
# Collect every <td> cell in the table as one flat list
table_datas = postal_table.find_all('td')

In [10]:
# Split the flat list of cells into rows of 3 (postal code, borough, neighborhood)

t_rows = [table_datas[n:n+3] for n in range(0, len(table_datas), 3)]
postal = []
borough = []
neighborhood = []

for row in t_rows:
    postal_text = row[0].text.rstrip() #removes '\n' at the end
    borough_text = row[1].text.rstrip() #removes '\n' at the end
    neighborhood_text = row[2].text.rstrip() #removes '\n' at the end

    postal.append(postal_text)
    borough.append(borough_text)
    neighborhood.append(neighborhood_text)
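
Slicing the flat list of `<td>` cells into groups of three works only while every row has exactly three cells. As a minimal sketch of a row-wise alternative, the snippet below walks the table one `<tr>` at a time. The `sample_html` string and the `parse_rows` helper are hypothetical and exist only for illustration; the built-in `html.parser` is used here instead of `lxml` so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable">
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_rows(table):
    """Return (headers, rows) by walking the table one <tr> at a time."""
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


sample_table = BeautifulSoup(sample_html, 'html.parser').find('table', class_='wikitable')
sample_headers, sample_rows = parse_rows(sample_table)
print(sample_headers)
print(sample_rows)
```

Because each row is read from its own `<tr>`, a row with a missing or extra cell would show up as a list of the wrong length instead of silently shifting every later row.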

## Make DataFrame

In [11]:
# Build the DataFrame in one step from the three column lists
df = pd.DataFrame({'PostalCode': postal, 'Borough': borough, 'Neighborhood': neighborhood})
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


Drop the rows where the borough is "Not assigned".

In [12]:
indexes = df[df['Borough'] =='Not assigned'].index

df.drop(indexes , inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
11,M3B,North York,Don Mills
12,M4B,East York,Parkview Hill / Woodbine Gardens
13,M5B,Downtown Toronto,"Garden District, Ryerson"
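
The output above keeps the original row labels (2, 3, 4, ...), because dropping rows does not renumber the index. The same filter can also be written as a boolean mask; the sketch below does that on a small made-up frame (`demo_rows` is hypothetical) and resets the index afterwards.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the scraped table.
demo_rows = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose borough is assigned, then renumber the index from 0.
assigned = demo_rows[demo_rows['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

The mask returns a new frame and leaves `demo_rows` unchanged, which can be easier to reason about than an in-place drop when cells are re-run out of order.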


Neighborhoods that share a postal code are grouped into one row, separated by commas.

In [13]:
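# Join the neighborhoods of each (PostalCode, Borough) group with commas.
# dict.fromkeys drops duplicates while keeping the original order (set() does not guarantee order).
df = df.groupby(['PostalCode','Borough']).agg(lambda x: ','.join(dict.fromkeys(x))).reset_index()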
df.head(50)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West
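
None of the rows shown above has a repeated postal code, so the effect of the grouping is not visible in this output. The sketch below uses made-up rows (`demo_dup` is hypothetical) in which `M5A` occurs twice, to show what the join produces. `dict.fromkeys` is used here to drop duplicates while keeping first-seen order, which a `set` would not guarantee.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighborhoods.
demo_dup = pd.DataFrame({
    'PostalCode': ['M5A', 'M3A', 'M5A'],
    'Borough': ['Downtown Toronto', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Regent Park', 'Parkwoods', 'Harbourfront'],
})

# One row per (PostalCode, Borough); neighborhoods joined in first-seen order.
grouped = (
    demo_dup.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(lambda names: ', '.join(dict.fromkeys(names)))
    .reset_index()
)
print(grouped)
```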


In [14]:
df.shape

(103, 3)

In [15]:
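# index=False keeps the row index out of the file, so no "Unnamed: 0" column appears when it is read back
df.to_csv('TorontoData.csv', index=False)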
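
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The round trip below is a small sketch of that behavior on a made-up one-row frame; the file name `demo.csv` and the temporary directory are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

demo_csv = pd.DataFrame({'PostalCode': ['M1B'], 'Borough': ['Scarborough']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')  # hypothetical file name

    demo_csv.to_csv(path)               # default: the index is written as a first column
    with_index = pd.read_csv(path)

    demo_csv.to_csv(path, index=False)  # the index is left out of the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```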