# Segmenting and Clustering Data (Week 3 Assignment)

### Part 1: Getting the Data

First, install the necessary libraries:

In [1]:
!pip install beautifulsoup4
!pip install lxml
#!pip install requests

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/ec/be/5ab8abdd8663c0386ec2dd595a5bc0e23330a0549b8a91e32f38c20845b6/lxml-4.4.1-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.4.1


<hr>
Import all necessary libraries:

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

print("Libraries imported")

Libraries imported


<hr>
Store the HTML and table data in Python variables:

In [4]:
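# Download the page and keep its HTML source as text
html = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

# Parse the HTML with the lxml parser
wikiPage = BeautifulSoup(html, "lxml")

# find() returns the first <table> element on the page
postalTable = wikiPage.find("table")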

<hr>

Create a list of headers for the column names.<br>
The following code loops through all the <code>\<th></code> tags in the first table row, which contain the column names, and stores the names in a list.<br>
(It also strips the newline character <code>\n</code> that trails the last header.)

In [5]:
headers = []

for headName in postalTable.tbody.tr.find_all("th"):
    headers.append(headName.text.replace("\n", ""))
    
print(headers)

['Postcode', 'Borough', 'Neighbourhood']
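
As a self-contained illustration of the header loop, the sketch below runs the same idea on a small hand-written table instead of the live Wikipedia page. The `sample_html` string and the `sample_*` names are made up for this example, and `get_text(strip=True)` is used here in place of `replace("\n", "")` as one possible way to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  </tbody>
</table>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml"
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# get_text(strip=True) removes leading/trailing whitespace, including "\n"
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]
print(sample_headers)
```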


<hr>

Build a list of lists, one inner list per table row, to populate the table.<br>
The following code loops through all the <code>\<tr></code> tags, which contain the values for the rows.<br>
Within each <code>\<tr></code> tag, it loops through every <code>\<td></code> tag, which holds an individual cell of that row.<br>
Lastly, it deletes the first entry, which is empty because the header row contains <code>\<th></code> cells and no <code>\<td></code> cells.<br>
(It also strips the <code>\n</code> that trails the last cell of each row.)

In [6]:
rows = []

for row in postalTable.tbody.find_all("tr"):
    rows.append([])
    for cell in row.find_all("td"):
        rows[-1].append(cell.text.replace("\n", ""))
        
del(rows[0])
print(len(rows), "rows")
print(rows[0:5])

288 rows
[['M1A', 'Not assigned', 'Not assigned'], ['M2A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Harbourfront']]
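
A possible variation on the row loop, shown on made-up HTML: instead of deleting the first entry afterwards, rows that contain no `<td>` cells are skipped while looping. The `sample_html` content and the `sample_*` names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the first <tr> holds only <th> cells
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

sample_rows = []
for tr in sample_table.find_all("tr"):
    # Collect the text of every data cell in this row, without surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # a header row has no <td> cells, so it yields an empty list
        sample_rows.append(cells)

print(sample_rows)
```

Skipping empty rows during the loop avoids relying on the header being the first row of the table.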


<hr>

Create a data frame using the <code>headers</code> list for the column names and the <code>rows</code> list for the rows.<br>
The shorter name <code>nht</code> is then bound to the same data frame to make the later code easier to read.

In [7]:
neighborhoodTable = pd.DataFrame(columns=headers, data=rows)

nht = neighborhoodTable

nht

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |


<hr>

The following code cleans the data frame.<br>
<ul>
    <li>It renames the first column from "Postcode" to "PostalCode"</li>
    <li>It replaces all the "Not assigned" cells with <code>NaN</code> values</li>
    <li>It drops the rows where "Borough" has a <code>NaN</code> value and resets the index</li>
    <li>It replaces the <code>NaN</code> values in "Neighbourhood" with the corresponding value in "Borough"</li>
</ul>

In [8]:
nht.rename(columns={"Postcode":"PostalCode"}, inplace=True)

nht.replace("Not assigned", np.nan, inplace=True)

nht.dropna(subset=["Borough"], inplace=True)
nht.reset_index(drop=True, inplace=True)

# Fill each missing neighbourhood with the borough from the same row.
# (Replacing NaN across the whole frame would give every missing cell the same borough.)
nht["Neighbourhood"] = nht["Neighbourhood"].fillna(nht["Borough"])
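
The four cleaning steps can also be tried out on a few made-up rows. This is only a sketch: the `sample` data is invented, and using `fillna` with the "Borough" column assumes the intended behaviour is that each missing neighbourhood takes the borough of its own row.

```python
import numpy as np
import pandas as pd

# Made-up rows that cover the cases the cleaning step handles
sample = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

cleaned = (
    sample.rename(columns={"Postcode": "PostalCode"})  # rename the first column
    .replace("Not assigned", np.nan)                   # mark unassigned cells as missing
    .dropna(subset=["Borough"])                        # drop rows without a borough
    .reset_index(drop=True)
)

# fillna with a Series aligns on the index, so each missing neighbourhood
# receives the borough from its own row
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].fillna(cleaned["Borough"])
print(cleaned)
```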

<hr>

The following code merges all the neighborhoods that share a postal code into a single comma-separated string.<br>
It loops through the unique postal codes, and each iteration scans every row of the data frame.<br>
When a row's postal code matches the current unique postal code, its neighborhood is added to the string for that postal code.<br>
This is done for all the postal codes to group the neighborhoods.<br>
Each postal code then yields one merged row (postal code, borough, joined neighborhoods), and these rows are collected in another list of lists.<br>
I think there is probably an easier way of doing this, but I couldn't figure it out.

In [20]:
mergedRows = []

for postcode in nht["PostalCode"].unique():
    neighborhoods = []
    # Scan every row and collect the neighbourhoods that share this postal code
    for index in nht.index:
        if nht["PostalCode"][index] == postcode:
            neighborhoods.append(nht["Neighbourhood"][index])
            newIndex = index  # remember the last matching row for its borough
    mergedRows.append([postcode, nht["Borough"][newIndex], ", ".join(neighborhoods)])
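
One shorter route that may fit here is pandas' `groupby`, sketched below on made-up rows in the cleaned layout. The `sample` data is invented, and taking the `"first"` borough of each group assumes that every postal code belongs to a single borough (the loop above takes the borough of the last matching row instead).

```python
import pandas as pd

# Made-up rows in the cleaned layout (one row per postal code / neighbourhood pair)
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M5A", "M5A", "M7A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park", "Queen's Park"],
    }
)

# sort=False keeps postal codes in order of first appearance, like unique()
merged = (
    sample.groupby("PostalCode", sort=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
print(merged)
```

The result is already a data frame with one row per postal code, so no intermediate list of rows is needed.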

<hr>

The following code creates a data frame with the same headers as before, in which each row holds one postal code, its borough, and all of its neighborhoods.<br>
Now there is exactly one row per postal code.

In [23]:
mergedTable = pd.DataFrame(columns=headers, data=mergedRows)

nht2 = mergedTable

nht2

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | Downtown Toronto | Lawrence Heights, Lawrence Manor |
| 4 | M7A | North York | Queen's Park |
| ... | ... | ... | ... |
| 98 | M8X | North York | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | East Toronto | Church and Wellesley |
| 100 | M7Y | North York | Business Reply Mail Processing Centre 969 Eastern |
| 101 | M8Y | North York | Humber Bay, King's Mill Park, Kingsway Park So... |


<hr>

Finally, I print the shape of the resulting data frame.

In [24]:
nht2.shape

(103, 3)

<hr>
This is the end of Part 1.
<hr>