# Segmenting and Clustering Neighborhoods in Toronto

## 1. Scraping Data

#### Installing beautifulsoup4

##### Beautiful Soup is a Python package for parsing HTML and XML, which makes it well suited to web scraping

In [2]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/cb/a1/c698cf319e9cfed6b17376281bd0efc6bfc8465698f54170ef60a485ab5d/beautifulsoup4-4.8.2-py3-none-any.whl (106kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.2 soupsieve-1.9.5


In [94]:
import urllib.request  # opens the page over HTTP
import bs4 as bs       # Beautiful Soup, parses the HTML
import numpy as np
import pandas as pd

##### Here we define the URL, open the web page and create a soup object to parse its HTML

In [95]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(page, 'html.parser')

##### Here we look for the HTML element containing the data we need and store its raw text in a variable

In [96]:
content = ""
# find_all is the current name of the older findAll alias
for table in soup.find_all('table', {'class': 'wikitable'}):
    content += table.text
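
Concatenating `table.text` relies on every cell ending up on its own line. As a hedged alternative, the cells can be collected row by row from the `tr`, `th` and `td` elements, which keeps the row structure explicit. The sketch below is not the notebook's method; `sample_html`, `sample_soup` and `rows` are hypothetical names, and the HTML is a small hand-written stand-in for the Wikipedia table so the example runs offline.

```python
import bs4 as bs

# Hypothetical, hand-written HTML standing in for the Wikipedia table.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = bs.BeautifulSoup(sample_html, 'html.parser')

rows = []
for table in sample_soup.find_all('table', {'class': 'wikitable'}):
    for tr in table.find_all('tr'):
        # Collect header (th) and data (td) cells, stripping surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)

print(rows)
```

Each inner list is one table row, so the header is `rows[0]` and the data rows are `rows[1:]`, with no need to know the column count in advance.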

## 2. Creating the dataframe

##### Here we split the string into lines, drop the empty ones and put the result in a NumPy array

In [97]:
content = content.split("\n")
content = list(filter(None, content))
content = np.asarray(content)

##### We know we need 3 columns but not how many rows, so we compute the row count by dividing the length of the array by 3

In [98]:
len(content) // 3

288

##### We reshape our variable so that we have an x * 3 matrix

In [99]:
content = content.reshape(len(content) // 3, 3)
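
The reshape only works when the number of cells is an exact multiple of 3. A minimal sketch of the same step on toy data, with an explicit check and with `-1` so that NumPy infers the row count, could look like this. The names `cells`, `n_cols` and `table` are hypothetical and the values are a hand-written stand-in for the scraped cells.

```python
import numpy as np

# Toy stand-in for the cleaned cell list: one header row plus two data rows.
cells = np.asarray([
    'Postcode', 'Borough', 'Neighbourhood',
    'M3A', 'North York', 'Parkwoods',
    'M4A', 'North York', 'Victoria Village',
])

n_cols = 3

# Fail early if the flat list cannot be split into complete rows.
if cells.size % n_cols != 0:
    raise ValueError(f"{cells.size} cells do not divide evenly into {n_cols} columns")

# -1 tells NumPy to infer the number of rows from the array size.
table = cells.reshape(-1, n_cols)
print(table.shape)
```

The check matters because a single missing or extra cell would otherwise either raise an opaque reshape error or, worse, shift values into the wrong columns if the total still happened to divide by 3.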

##### Now we can create our dataframe

In [100]:
# The first row holds the column names, the remaining rows hold the data
df = pd.DataFrame(data=content[1:], columns=content[0])
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


##### Here we drop the rows whose Borough is 'Not assigned'

In [101]:
df = df[df.Borough != "Not assigned"].reset_index(drop=True)

In [102]:
df.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Queen's Park | Not assigned |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |


##### Then we group by Postcode and Borough so that the Neighbourhoods sharing a postcode are joined into one comma-separated string

In [107]:
df = df.groupby(['Postcode', 'Borough']).agg(', '.join).reset_index()
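
To make the effect of the grouping concrete, here is a small sketch on toy data. The frame `toy` and the result `grouped` are hypothetical names, and the rows are copied from the preview above. Selecting the `Neighbourhood` column explicitly before aggregating is a choice made for this sketch, not something the notebook does.

```python
import pandas as pd

# Toy rows taken from the preview above: M6A appears twice.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)   # concatenate the neighbourhood names within each group
       .reset_index()    # turn the group keys back into ordinary columns
)
print(grouped)
```

The two M6A rows collapse into a single row whose Neighbourhood is `Lawrence Heights, Lawrence Manor`, while M7A is left unchanged.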

##### Now we need to find any remaining cells where Neighbourhood is not assigned:

In [108]:
check = df['Neighbourhood'] == 'Not assigned'
# Print the position of every row whose Neighbourhood is not assigned
for ctr, i in enumerate(check):
    if i:
        print(ctr, " : ", i)

93  :  True
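
The same lookup can be done without an explicit loop by indexing the frame's index with the boolean mask. This is a sketch on toy data rather than the notebook's own code; `toy`, `mask` and `unassigned_index` are hypothetical names.

```python
import pandas as pd

# Toy frame with a single unassigned neighbourhood at index 1.
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park", 'Scarborough'],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Rouge'],
})

mask = toy['Neighbourhood'] == 'Not assigned'

# Index labels of the matching rows; with the default RangeIndex
# these are also the row positions.
unassigned_index = toy.index[mask].tolist()

print(unassigned_index)
print(int(mask.sum()), "row(s) not assigned")
```

Summing the mask gives the number of matches directly, which is a quick way to confirm that only one row needs fixing.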


In [109]:
# This is the only case where a Neighbourhood is not assigned
df.iloc[93]

Postcode                  M9A
Borough          Queen's Park
Neighbourhood    Not assigned
Name: 93, dtype: object

##### And we can replace the 'Not assigned' value with the Borough value

In [110]:
# There is one case where Neighbourhood is not assigned
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
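
An equivalent way to express this replacement is `Series.mask`, which swaps in values from another Series wherever a condition holds. The sketch below uses toy data and is offered as an alternative formulation, not as the notebook's method; `toy` is a hypothetical name.

```python
import pandas as pd

# Toy frame: the second row has no neighbourhood assigned.
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the condition is True, take the value from the Borough column
# (aligned on the index); elsewhere keep the original Neighbourhood.
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough']
)
print(toy)
```

Rows that already have a neighbourhood are untouched, so the operation is safe to run more than once.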

In [111]:
df.iloc[93]

Postcode                  M9A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 93, dtype: object

In [112]:
df.shape

(103, 3)