## Segmenting and Clustering Neighborhoods in Toronto
### Part 1




Import the required packages.

In [1]:
import requests

import pandas as pd
from bs4 import BeautifulSoup

Use BeautifulSoup to scrape the neighbourhood data from the Wikipedia page
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and extract the postal code table.

In [2]:
wikiSiteHTML = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wikiSiteParsed = BeautifulSoup(wikiSiteHTML, 'lxml')
wikiTable = wikiSiteParsed.find('table', class_='wikitable sortable')
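
The table lookup can be tried offline before it is run against the live page. The following is only a sketch: `sample_html` is a hand-written stand-in for the downloaded markup (the real Wikipedia page may be structured differently), and it uses the built-in `html.parser` so that `lxml` is not required. Note that `class_='wikitable sortable'` matches the class attribute as an exact string, whereas `class_='wikitable'` also matches if the class order changes or further classes are added.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Searching for the single class "wikitable" is less brittle than the full string.
table = soup.find("table", class_="wikitable")
if table is None:
    raise ValueError("no table with class 'wikitable' found")

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:  # header rows contain only <th> cells, so they are skipped
        rows.append(cells)

print(rows)
```

Checking for `None` right after `find` gives a clear error if the page layout changes, instead of an `AttributeError` later in the notebook.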

Process the extracted table to bring the data into the required format:
- keep only rows whose borough is not "Not assigned"
- if a neighbourhood is "Not assigned", set it to the name of its borough

In [17]:
wikiTableRows = wikiTable.tbody.find_all('tr')

processedTableList = []
for tr in wikiTableRows:
    tds = tr.find_all('td')

    # Header rows contain only <th> cells, so skip them
    if not tds:
        continue
    # Strip trailing whitespace (such as \n) from each cell with rstrip()
    wikiRow = [cell.text.rstrip() for cell in tds]

    if wikiRow[1] != 'Not assigned':
        if wikiRow[2] == 'Not assigned':
            wikiRow[2] = wikiRow[1]
        processedTableList.append(wikiRow)
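
The two cleaning rules can also be checked in isolation, without any scraping. This is a minimal sketch: `clean_rows` is a hypothetical helper name and `sample_rows` holds hand-written illustrative rows, not data taken from the page.

```python
def clean_rows(rows):
    """Apply both rules to [postal_code, borough, neighbourhood] rows."""
    cleaned = []
    for postal_code, borough, neighbourhood in rows:
        if borough == "Not assigned":
            continue  # rule 1: skip rows without an assigned borough
        if neighbourhood == "Not assigned":
            neighbourhood = borough  # rule 2: fall back to the borough name
        cleaned.append([postal_code, borough, neighbourhood])
    return cleaned


# Illustrative rows only (assumption): one per case handled above.
sample_rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]

print(clean_rows(sample_rows))
```

The first row is dropped, the second passes through unchanged, and the third gets its borough name as the neighbourhood.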

Create a Pandas dataframe

In [16]:
# Build the DataFrame from the cleaned rows
wikiDF = pd.DataFrame(processedTableList, columns=["PostalCode", "Borough", "Neighbourhood"])
wikiDF.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Combine all neighbourhoods that share a postal code into one row, separated by commas.

In [26]:
# Group Neighbourhoods
wikiDF = wikiDF.groupby(["PostalCode", "Borough"])["Neighbourhood"].apply(", ".join)
wikiDF = wikiDF.reset_index()
wikiDF.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
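
To see what the grouping step does, it can be applied to a tiny DataFrame. This sketch uses three rows taken from the `head()` output further up; the names `toy` and `grouped` are illustrative, and `agg` is used here as one possible alternative to `apply`.

```python
import pandas as pd

# Three rows copied from the head() output; M5A appears twice.
toy = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# Join the neighbourhood names of each (PostalCode, Borough) group with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code, which is why the grouped output starts with the M1 codes rather than with M3A.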


Shape of the pandas dataframe wikiDF:

In [27]:
wikiDF.shape

(103, 3)

### Part 2

Load the coordinate data from a CSV file into a dataframe, since Geocoder works unreliably.<br>
The file is loaded from IBM storage.

In [28]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [None]:
# `body` is not defined in this cell; presumably it comes from the hidden Watson Studio cell above
coordinatesDF = pd.read_csv(body)
coordinatesDF.head()

Merge the two dataframes.
The merge keys are the postal codes, which are stored in differently named columns ("PostalCode" and "Postal Code").

In [29]:
complDF = pd.merge(wikiDF, coordinatesDF, how='left', left_on='PostalCode', right_on='Postal Code')
# Drop the redundant "Postal Code" column; "PostalCode" already holds the key
complDF.drop("Postal Code", axis=1, inplace=True)
complDF.head(10)



Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
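
A left merge keeps every postal code from the left dataframe, so a code without coordinates would silently end up with `NaN` values. One possible way to check for this is sketched below, assuming small stand-in dataframes: `left`, `right` and the postal code `M9Z` are made up for illustration, while `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Stand-in data (assumption): "M9Z" is an invented code with no coordinates.
left = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    }
)
right = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

checked = pd.merge(
    left,
    right,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="one_to_one",   # raises MergeError if a key is duplicated on either side
)

# Postal codes that found no match in the coordinates table.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postal code received a latitude and longitude.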
