# Applied Data Science Capstone Project

This notebook is used for the capstone project of the "Applied Data Science Capstone" course on
[Coursera](https://coursera.org).

In [1]:
import pandas as pd
import numpy as np

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


### Week 03 assignment: Segmenting and Clustering Neighborhoods in Toronto

In this part of the notebook, we extract data from a Wikipedia page, then wrangle and clean it.

First, we install the `lxml` package and download the HTML page.

In [2]:
!pip install --user lxml
!wget https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O wikipage.html
import lxml
# You may need to restart the kernel for lxml to import correctly

--2019-10-29 13:55:24--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79018 (77K) [text/html]
Saving to: ‘wikipage.html’


2019-10-29 13:55:24 (1.13 MB/s) - ‘wikipage.html’ saved [79018/79018]



We'll use `pd.read_html` to extract all tables from the HTML page.

Then, we select the table we want by matching its column headings.

In [3]:
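# Parse all tables in the Wikipedia page
tables = pd.read_html('wikipage.html', header=0)
headings = ['Postcode', 'Borough', 'Neighbourhood']
for table in tables:
    # If all headings match, this is the wanted table.
    # Comparing plain lists stays safe when a table has a different number of columns.
    if list(table.columns) == headings:
        break
else:
    # Without this, `table` would silently be the last table on the page
    raise ValueError('No table with the expected headings was found')
table.head()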

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
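
The heading-matching step can also be isolated and tried without downloading anything. The following is a minimal sketch, assuming a hypothetical helper name `find_table` (not part of pandas) and two toy DataFrames standing in for the list that `pd.read_html` returns:

```python
import pandas as pd

def find_table(tables, headings):
    """Return the first DataFrame whose column labels equal `headings` exactly.

    `find_table` is a hypothetical helper written for this illustration.
    """
    for candidate in tables:
        if list(candidate.columns) == list(headings):
            return candidate
    raise ValueError(f"No table with headings {headings}")

# Toy stand-ins for the parsed tables (not the real Wikipedia content)
other = pd.DataFrame({"Code": ["X"], "Name": ["Y"]})
wanted = pd.DataFrame({
    "Postcode": ["M3A"],
    "Borough": ["North York"],
    "Neighbourhood": ["Parkwoods"],
})

selected = find_table([other, wanted], ["Postcode", "Borough", "Neighbourhood"])
print(selected)
```

Raising an error when nothing matches makes a changed page layout visible immediately, instead of letting a wrong table flow into the later steps.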


Next, let's drop all rows whose 'Borough' entry is 'Not assigned'.

This assumes that missing values are consistently represented by the string 'Not assigned'!

> Also note that the index is no longer contiguous after filtering (the dropped rows leave gaps)

In [4]:
table = table[table['Borough'] != 'Not assigned']
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
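
To see why the index ends up with gaps, here is a small sketch on toy rows modelled on the table head shown earlier. It only illustrates the behaviour of boolean filtering and `reset_index(drop=True)`; the variable names `toy`, `kept` and `renumbered` are made up for this example:

```python
import pandas as pd

# Toy rows modelled on the first entries of the table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean filtering keeps the original row labels
kept = toy[toy["Borough"] != "Not assigned"]
print(kept.index.tolist())        # [2, 3]

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```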


We need to group table entries by 'Postcode' (together with 'Borough') and join the `Neighbourhood` values of each group into one comma-separated string.

We also need to fix the indexing (`.reset_index()`)

In [5]:
table = pd.DataFrame(table.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)).reset_index()
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
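
The grouping step can be checked on a few rows in isolation. This is a sketch on toy data taken from the rows shown earlier, using `agg` with `','.join` as one possible spelling of the same idea:

```python
import pandas as pd

# Two rows share postcode M5A and should collapse into one
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group by postcode and borough, then join each group's neighbourhoods with commas
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)
print(merged)
```

Here `reset_index()` turns the group keys back into ordinary columns, so the result again has the three columns `Postcode`, `Borough` and `Neighbourhood`.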


At this stage, every row has a Borough! So, replace any 'Not assigned' `Neighbourhood` with that row's `Borough` value.

In [6]:
# Vectorized replacement: covers every row (including row 0) and avoids chained assignment
mask = table['Neighbourhood'] == 'Not assigned'
table.loc[mask, 'Neighbourhood'] = table.loc[mask, 'Borough']
any(table['Neighbourhood'] == 'Not assigned')

False
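
The fallback rule itself can be tried on toy data. The sketch below uses made-up example rows (they are not taken from the output above) with the unassigned neighbourhood deliberately placed in the first row, and fills it through a boolean mask and `.loc`:

```python
import pandas as pd

# Made-up example rows; the first one has no neighbourhood assigned
toy = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Select the rows to fix, then copy the borough name into them
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy["Neighbourhood"].tolist())  # ["Queen's Park", 'Parkwoods']
```

Assigning through a single `.loc` call writes to the DataFrame directly, whereas chained indexing such as `toy['Neighbourhood'][i] = ...` may write to a temporary copy.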

The final dataframe has the following dimensions:

In [7]:
table.shape

(103, 3)