<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

Let's first prepare the Toronto data by scraping the postal code table from Wikipedia and cleaning it.

In [1]:
!pip install lxml

import pandas as pd
import numpy as np
import lxml

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
data = pd.read_html(url, header=0)
df = data[0]

#rename the columns
df.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'},inplace=True)

#delete the rows with unassigned Borough
df=df[df['Borough']!='Not assigned']

df=df.groupby(['PostalCode', 'Borough']).agg({'Neighborhood' : ','.join})

#PostalCode and Borough have become the index of the dataframe, so we reset it to turn them back into columns
df.reset_index(inplace=True)

#Next we replace the neighborhoods with unassigned values with the name of the borough as per assignment
#(.loc with a boolean mask avoids pandas' chained-assignment warning)
not_assigned = df['Neighborhood']=='Not assigned'
df.loc[not_assigned,'Neighborhood']=df.loc[not_assigned,'Borough']

df.head()

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc02364d08e5b84/lxml-4.5.0-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.0


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
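
The cleaning steps above can also be checked offline on a small hand-made table. The sketch below is only an illustration: the sample rows and the helper name `clean_postal_table` are assumptions, not part of the scraped data or the original notebook. It repeats the same four steps (rename, drop unassigned boroughs, join neighborhoods per postal code, fall back to the borough name).

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of the scraped Wikipedia table
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M7A'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Not assigned'],
})

def clean_postal_table(table):
    """Apply the same cleaning steps as the cell above (illustrative helper)."""
    # rename the columns
    out = table.rename(columns={'Postcode': 'PostalCode',
                                'Neighbourhood': 'Neighborhood'})
    # drop the rows with an unassigned borough
    out = out[out['Borough'] != 'Not assigned']
    # one row per postal code, neighborhoods joined with commas;
    # as_index=False keeps PostalCode and Borough as ordinary columns
    out = (out.groupby(['PostalCode', 'Borough'], as_index=False)
              .agg({'Neighborhood': ','.join}))
    # an unassigned neighborhood takes the name of its borough
    missing = out['Neighborhood'] == 'Not assigned'
    out.loc[missing, 'Neighborhood'] = out.loc[missing, 'Borough']
    return out

sample = clean_postal_table(raw)
print(sample)
```

Passing `as_index=False` to `groupby` is one way to skip the separate `reset_index` call; either approach should give the same three columns.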


Next, let's get the coordinates of each postal code and add them to the dataframe as Latitude and Longitude columns.

In [14]:
!pip install geocoder
import geocoder

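def get_geocoder(postal_code_from_df, max_attempts=10):
    # initialize your variable to None
    lat_lng_coords = None
    attempts = 0
    # retry until we get the coordinates, but stop after max_attempts
    # so a failing lookup cannot loop forever
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code_from_df))
        lat_lng_coords = g.latlng
        attempts += 1
    if lat_lng_coords is None:
        # no result: leave the coordinates missing
        return np.nan, np.nan
    return lat_lng_coords[0], lat_lng_coords[1]

# create the coordinate columns first, then fill them row by row
df['Latitude'] = np.nan
df['Longitude'] = np.nan
for i in range(len(df)):
    df.loc[i, 'Latitude'], df.loc[i, 'Longitude'] = get_geocoder(df.loc[i, 'PostalCode'])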
    
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
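
The retry logic used for geocoding can be tried out without any network access. The following is a minimal sketch under stated assumptions: `geocode_with_retry` and `flaky_lookup` are hypothetical names introduced here, and the lookup is passed in as a plain function so that a stand-in can replace the real geocoding service.

```python
import math

def geocode_with_retry(postal_code, lookup, max_attempts=5):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable that returns [lat, lng] or None
    (an assumption made for this sketch).
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
    # give up: NaN marks a postal code that could not be resolved
    return math.nan, math.nan

# Stand-in for a real service: fails twice, then answers.
# The coordinates are the M1B values from the table above.
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.806686, -79.194353]

lat, lng = geocode_with_retry('M1B', flaky_lookup)
print(lat, lng, calls['n'])
```

With the real service, the same helper could presumably be called with `lookup=lambda q: geocoder.google(q).latlng`, which mirrors the call used in the cell above.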
