# CAPSTONE PROJECT - Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd
import numpy as np
import urllib3
from bs4 import BeautifulSoup

We store the Wikipedia URL in a variable.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Then, we use a PoolManager object from the urllib3 package to open an HTTP connection. <br/>
Given the 'GET' method and the URL, the request method retrieves the content of the web page. <br/>
Finally, we use *BeautifulSoup* to parse and manipulate the HTML content.

In [3]:
http = urllib3.PoolManager()
response = http.request('GET', url)
soup = BeautifulSoup(response.data.decode('utf-8'), 'html.parser')
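As a small offline illustration of the parsing step, the sketch below wraps the decode-and-parse call in a hypothetical helper, `parse_page`, and runs it on a made-up HTML fragment instead of the live Wikipedia page. The explicit `'html.parser'` argument selects Python's built-in parser; the helper name and the sample bytes are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup


def parse_page(raw: bytes, encoding: str = "utf-8") -> BeautifulSoup:
    """Decode raw response bytes and parse them with the built-in HTML parser."""
    return BeautifulSoup(raw.decode(encoding), "html.parser")


# Made-up stand-in for response.data, so the sketch runs without a network call
sample_bytes = b"<html><body><table><tr><td>M1B</td></tr></table></body></html>"
sample_soup = parse_page(sample_bytes)

# Text of the first table cell in the sample document
first_cell = sample_soup.table.find("td").text
print(first_cell)
```

Naming the parser explicitly keeps the result the same regardless of which optional parsers happen to be installed.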



We initialize three lists that will store the future columns of the dataframe.

In [4]:
codes_list=[]
borough_list=[]
neighborhood_list=[]

Then, I use the BeautifulSoup object to find all the 'td' tags in order to focus on the cells of the table on the Wikipedia page. <br/>
The result is like an array of 'td' tags. <br/>
I noticed that the structure is always the same: first the PostalCode, then the Borough, then the Neighborhood. Therefore I initialize a counter *i* that cycles through these three positions.<br/>
- The first tag's text is stored in the PostalCode list
- The second tag's text is stored in the Borough list
- The third tag's text is stored in the Neighborhood list
- After the third tag, the counter reaches 4 and is reset to 1


In [5]:
i=1
for tag in soup.table.find_all('td'):
    if i == 1:
        codes_list.append(tag.text)
    if i == 2:
        borough_list.append(tag.text)
    if i == 3: 
        neighborhood_list.append(tag.text)
    i = i+1
    if i==4:
        i=1
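An alternative to counting cells is to walk the table row by row and keep only the rows that contain exactly three 'td' cells. The sketch below shows this idea on a made-up HTML fragment; it is not the approach used in this notebook, and `sample_html`, `sample_soup` and `rows` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the layout described above
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned\n</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.table.find_all("tr"):
    # get_text(strip=True) also removes the trailing newline in the last cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has only 'th' cells and is skipped
        rows.append(cells)

print(rows)
```

Grouping by row means a missing or extra cell affects only that row, whereas a running counter would shift every value that follows.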

Clean the data by removing the "\n" characters from the neighborhood names

In [6]:
neighborhood_list = [n.replace('\n', '') for n in neighborhood_list]

Creating a pandas DataFrame for each list...

In [7]:
dfC = pd.DataFrame(codes_list, columns=["PostalCode"])
dfB = pd.DataFrame(borough_list, columns=["Borough"])
dfN = pd.DataFrame(neighborhood_list, columns=["Neighborhood"])

...which I then concatenate to create the main dataframe, df

In [8]:
df = pd.concat([dfC, dfB, dfN], axis=1).reindex(dfC.index)
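The same table can also be built in a single step by passing a dictionary of lists to `pd.DataFrame`. The sketch below uses short made-up lists in place of the scraped ones; the `sample_*` names are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for codes_list, borough_list and neighborhood_list
sample_codes = ["M1A", "M3A"]
sample_boroughs = ["Not assigned", "North York"]
sample_neighborhoods = ["Not assigned", "Parkwoods"]

# Each dictionary key becomes a column; the lists must have equal length
sample_df = pd.DataFrame({
    "PostalCode": sample_codes,
    "Borough": sample_boroughs,
    "Neighborhood": sample_neighborhoods,
})
print(sample_df)
```

With this form, pandas raises an error if the lists have different lengths, which can catch a misaligned scrape early.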

Filter out the rows whose Borough has the value "Not assigned"

In [9]:
df = df[df['Borough']!="Not assigned"]

In [10]:
df.reset_index(drop=True, inplace=True)

For the Neighborhoods that have a "Not assigned" value, I assign the value of the corresponding Borough using df.loc

In [11]:
df.loc[df['Neighborhood'] == "Not assigned", 'Neighborhood'] = df.loc[df['Neighborhood'] == "Not assigned", 'Borough']
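To make the effect of the filter and the fill easier to see, the sketch below applies the same two steps to a three-row made-up dataframe. The data and the names `sample` and `mask` are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# Step 1: drop rows without a borough, then renumber the index
sample = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where the neighborhood is missing, reuse the borough name
mask = sample["Neighborhood"] == "Not assigned"
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]

print(sample)
```

Storing the boolean mask in a variable avoids evaluating the same condition twice and keeps both sides of the assignment aligned on the same rows.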

Here, I group the Neighborhoods that share the same PostalCode and the same Borough. They are concatenated on a single row using a ', ' separator, and the result is converted back to a dataframe. <br />
Finally, we reset the index to have a cleanly indexed dataframe.

In [12]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda tags: ', '.join(tags)).to_frame()

In [13]:
df.reset_index(inplace=True)
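The grouping step can be sketched on a small made-up dataframe, here using `agg` with `', '.join` as a variation on the `apply` call above. The sample data and the names `sample` and `grouped` are hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Join the neighborhoods of each (PostalCode, Borough) pair into one string,
# then turn the group keys back into ordinary columns
grouped = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The two M5A rows collapse into a single row, while M6A is left unchanged because it has only one neighborhood.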

In [14]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


In [15]:
df.shape 

(103, 3)

In [16]:
coor = pd.read_csv("http://cocl.us/Geospatial_data")

In [17]:
coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [18]:
df = df.merge(coor, how='inner', left_on='PostalCode', right_on='Postal Code')

In [19]:
df = df.drop(['Postal Code'], axis=1)
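The merge and the column drop can also be combined by renaming the key column first, so that no duplicate key is left over. The sketch below is a variation on made-up two-row data, not the notebook's method; `postal`, `coords` and `merged` are hypothetical names, and `validate="one_to_one"` is an optional check that assumes each postal code appears once on each side.

```python
import pandas as pd

postal = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Rename the key so both frames share one column name, then merge on it;
# validate raises MergeError if a postal code is duplicated on either side
merged = postal.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="inner",
    validate="one_to_one",
)
print(merged)
```

Because the inner join keeps only matching keys, comparing `len(merged)` with `len(postal)` is a quick way to see whether any postal code lacked coordinates.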

In [20]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
