<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

<h2>Part 1</h2>

<h3>Dataframe of the postal code of each neighborhood</h3>

<p>First, we import the packages we need.</p>

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

<p>We will use Beautiful Soup to pull the data from the Wikipedia page and tabulate it in a useful form. The BS4 package gives us what we need to parse the HTML of a web page so that it can be loaded into a pandas dataframe. But first we need to set up a couple of things.</p>

<p>We request the source URL that we want to use; then we pass the response content to Beautiful Soup to parse.</p>

In [2]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(source.content,'lxml')

<p>We then select our table by its HTML tag and CSS class name.</p>

In [3]:
table = soup.find_all('table', class_="wikitable sortable")[0]
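<p>As a small illustration of how this lookup behaves, the sketch below parses a hypothetical two-table HTML snippet (not the real Wikipedia page) with the built-in html.parser. It suggests that a class_ string containing a space is compared against the whole class attribute, while a CSS selector only needs both classes to be present. The snippet and the variable names are assumptions made for the example.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables that carry the
# same two classes, listed in a different order.
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A class_ string with a space is matched against the full attribute value.
exact = soup.find_all("table", class_="wikitable sortable")

# A CSS selector matches any table that has both classes, in any order.
by_css = soup.select("table.wikitable.sortable")

print(len(exact), len(by_css))
```

<p>If the page ever reorders or extends the class list, the selector form may be the more tolerant choice.</p>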

<p>We create a dataframe by reading the table.</p>

In [4]:
df = pd.read_html(str(table))[0]

<p>We set the first row to be the headers.</p>

In [5]:
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)
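<p>The header promotion can be tried in isolation. The sketch below uses a hypothetical three-row frame, shaped like a table whose header was read as an ordinary data row; the sample values are assumptions, not the scraped data.</p>

```python
import pandas as pd

# Hypothetical rows: the first one holds what should be the column names.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

headers = raw.iloc[0]                                  # first row as labels
promoted = pd.DataFrame(raw.values[1:], columns=headers)
promoted.columns.name = None                           # drop the leftover axis label

print(promoted)
```

<p>This step is only needed when the header arrives as data; pd.read_html also accepts a header argument that may make it unnecessary, depending on the pandas version in use.</p>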

<p>Drop the rows whose borough is not assigned.</p>

In [6]:
df = df[df.Borough != 'Not assigned']
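<p>The filter is a boolean mask: the comparison yields one True or False per row, and indexing with it keeps only the True rows. A minimal sketch on hypothetical sample rows:</p>

```python
import pandas as pd

# Hypothetical sample rows, one of them without an assigned borough.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
})

mask = sample["Borough"] != "Not assigned"   # one boolean per row
kept = sample[mask]                          # rows where the mask is True

print(kept)
```

<p>The surviving rows keep their original index labels, which is why the index is reset later on.</p>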

<p>Group the rows by postal code and combine the neighborhoods that share a code into one comma-separated cell.</p>

In [7]:
df = df.groupby('Postcode').agg({'Postcode':'first', 'Borough':'first', 'Neighbourhood':', '.join})
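<p>To see what the aggregation does, here is a sketch on hypothetical rows in which one postal code appears twice. It is a variant of the cell above, not the same call: it passes as_index=False so that the grouping key stays an ordinary column instead of being aggregated with 'first'.</p>

```python
import pandas as pd

# Hypothetical rows: M5A is listed once per neighbourhood.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Keep the first borough per code and join the neighbourhood names.
grouped = sample.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)

print(grouped)
```

<p>Groups come back sorted by postal code, so M3A precedes M5A in the result.</p>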

<p>Replace any not assigned neighborhood with the borough name.</p>

In [8]:
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned', df['Borough'], df['Neighbourhood'])
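<p>np.where picks, element by element, from the second argument where the condition holds and from the third where it does not. A minimal sketch on two hypothetical rows:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood name.
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is missing, fall back to the borough name.
sample["Neighbourhood"] = np.where(
    sample["Neighbourhood"] == "Not assigned",
    sample["Borough"],
    sample["Neighbourhood"],
)

print(sample)
```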

<p>Reset the index so that it runs consecutively from 0.</p>

In [9]:
df = df.reset_index(drop=True)
df.head(10)

Unnamed: 0,Borough,Postcode,Neighbourhood
0,Scarborough,M1B,"Rouge, Malvern"
1,Scarborough,M1C,"Highland Creek, Rouge Hill, Port Union"
2,Scarborough,M1E,"Guildwood, Morningside, West Hill"
3,Scarborough,M1G,Woburn
4,Scarborough,M1H,Cedarbrae
5,Scarborough,M1J,Scarborough Village
6,Scarborough,M1K,"East Birchmount Park, Ionview, Kennedy Park"
7,Scarborough,M1L,"Clairlea, Golden Mile, Oakridge"
8,Scarborough,M1M,"Cliffcrest, Cliffside, Scarborough Village West"
9,Scarborough,M1N,"Birch Cliff, Cliffside West"


In [10]:
df.shape

(103, 3)

<h2>Part 2</h2>

<h3>Latitude and the longitude coordinates of each neighborhood</h3>

<p>Import the geocode data from a CSV file.</p>

In [15]:
# The code was removed by Watson Studio for sharing.

In [16]:
df_geocode = pd.read_csv(body)
df_geocode.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848
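<p>Because the cell that defines body was removed, the read cannot be reproduced as is. As a sketch under that limitation, the example below builds a hypothetical in-memory stand-in for body from the first two rows shown above; pd.read_csv accepts any file-like object, so the call itself stays the same.</p>

```python
import io

import pandas as pd

# Hypothetical stand-in for the removed `body` object: a file-like buffer
# holding the first two rows of the table shown above.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""
body = io.StringIO(csv_text)

geo = pd.read_csv(body)
print(geo)
```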


In [22]:
df_geocode.shape

(103, 3)

In [25]:
hoods = pd.concat([df,df_geocode], axis=1)
hoods = hoods.drop(['Postal Code'], axis=1) 
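<p>The concatenation above pairs rows by position, so it relies on both dataframes listing the postal codes in the same order. As a hedged alternative, not the method used in this notebook, the sketch below joins on the postal code itself. The two small frames are hypothetical and deliberately ordered differently.</p>

```python
import pandas as pd

# Hypothetical frames whose rows are in a different order.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Join on the key; validate raises if a code appears more than once.
merged = codes.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

<p>With a key-based join, each coordinate stays attached to its own postal code even if one of the inputs is re-sorted.</p>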

In [26]:
hoods.head(10)

Unnamed: 0,Borough,Postcode,Neighbourhood,Latitude,Longitude
0,Scarborough,M1B,"Rouge, Malvern",43.806686,-79.194353
1,Scarborough,M1C,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,Scarborough,M1E,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,Scarborough,M1G,Woburn,43.770992,-79.216917
4,Scarborough,M1H,Cedarbrae,43.773136,-79.239476
5,Scarborough,M1J,Scarborough Village,43.744734,-79.239476
6,Scarborough,M1K,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,Scarborough,M1L,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,Scarborough,M1M,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,Scarborough,M1N,"Birch Cliff, Cliffside West",43.692657,-79.264848
