## Segmenting and Clustering Neighborhoods in Toronto - Part 2

#### The overall objective is to group Toronto's neighborhoods into clusters. In this part, we build the postal code dataframe and add geographical coordinates to it.

#### Import required libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Install website scraping libraries and packages in Python from BeautifulSoup 
#!conda install -c conda-forge beautifulsoup4 --yes  # uncomment this line if you haven't completed 
from bs4 import BeautifulSoup as bs

print('Libraries imported.')

Libraries imported.


#### Download dataset

We can get the list of Canadian postal codes from Wikipedia by downloading the page with the `wget` command and then parsing it with the Beautiful Soup package:
     https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_

We save the page locally, open the file, and iterate through its HTML elements to extract the postal codes. To segment and explore the neighborhoods, we need a dataset that contains the boroughs, the neighborhoods that exist in each borough, and the latitude and longitude coordinates of each neighborhood.

In [2]:
!wget -q -O 'canada_postal_code_list_from_wikipedia.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('HTML Postal Code page downloaded!')

HTML Postal Code page downloaded!


#### Load the data

In [3]:
# Get HTML content
with open("canada_postal_code_list_from_wikipedia.html") as fp:
    soup = bs(fp, 'lxml')

# Get the HTML table codes
tagTable = soup.table
#Get table body
body = tagTable.tbody
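
To make the parsing steps concrete, here is a minimal sketch that runs the same kind of extraction on a tiny inline HTML table instead of the downloaded page. The sample HTML, the `parse_table` helper, and the use of the built-in `html.parser` backend are assumptions made for illustration; the notebook itself reads the saved Wikipedia file with `lxml`.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the downloaded page (not the real Wikipedia HTML).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td><a href="#">Parkwoods</a></td></tr>
  </tbody>
</table>
"""


def parse_table(html):
    """Return (header, rows) for the first table found in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Header cells are the 'th' tags; get_text also handles nested tags such as links.
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return header, rows


header, rows = parse_table(SAMPLE_HTML)
print(header)
print(rows)
```

Using `get_text(strip=True)` rather than `.string` is a design choice in this sketch: `.string` returns `None` when a cell contains more than one child node, whereas `get_text` concatenates all the text inside the cell.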

#### Transform the data into a *pandas* dataframe.
The aim is to transform the HTML data into a _pandas_ dataframe.
We start by creating an empty dataframe with just the column names.

In [4]:
# Define the dataframe columns 
# get table column names -> the 'th' tags in the first 'tr' row of the body
colTab = (body.tr).find_all('th')
#print (colTab)
# read the text of the first three header cells directly from the parsed tags
colNames = [th.get_text(strip=True) for th in colTab[:3]]

# instantiate the dataframe
postcode_df = pd.DataFrame(columns=colNames)
postcode_df

Unnamed: 0,Postcode,Borough,Neighbourhood


#### We loop through the table rows, collect the cleaned records, and build the dataframe.

In [5]:
# extract all 'tr' tagged fields except the first one (column names)
codesTab = body.find_all('tr')[1:]

# Collect the rows in a list and build the dataframe once at the end;
# DataFrame.append was removed in pandas 2.0 and copies the frame on every call.
rows = []
for code in codesTab:
    # each row holds three text fields: postcode, borough, neighbourhood
    tabc = ["", "", ""]
    for i, value in enumerate(code.stripped_strings):
        tabc[i] = value.strip()
    postcode, borough, neighbourhood = tabc

    # Ignore rows with a borough that is 'Not assigned'
    if borough != 'Not assigned':
        # if no neighbourhood is assigned, use the borough name instead
        if neighbourhood == 'Not assigned':
            neighbourhood = borough
        rows.append({'Postcode': postcode,
                     'Borough': borough,
                     'Neighbourhood': neighbourhood})

postcode_df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighbourhood'])

# Combine rows with same postal code into one row with the neighborhoods separated with a comma 
df = postcode_df.groupby('Postcode', as_index=False).agg({'Borough':'first', 'Neighbourhood':', '.join})
print ("Toronto postal codes dataframe dimensions = ", df.shape)
df.head(15)

Toronto postal codes dataframe dimensions =  (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
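
The cleaning rules and the `groupby` step above can also be written with vectorized pandas operations. The sketch below applies them to a small hand-made dataframe; the sample rows and the variable names `raw`, `cleaned`, and `combined` are illustrative assumptions, not data taken from the notebook.

```python
import pandas as pd

# Hand-made sample rows standing in for the scraped table.
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Rule 1: drop rows whose borough is 'Not assigned'.
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighbourhood is 'Not assigned', reuse the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Rule 3: one row per postal code, neighbourhoods joined with a comma.
combined = cleaned.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(combined)
```

In this sample, the two `M5A` rows collapse into a single row, and `M7A` takes its borough name as its neighbourhood.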


In [6]:
# Store dataframe locally as a csv file to be easily used later
df.to_csv('canada_postal_code_list.csv')
print ('CSV file exported')

CSV file exported
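
One detail worth knowing when the file is read back later: `to_csv` writes the row index by default, and `read_csv` then loads it as an extra column named `Unnamed: 0`. The sketch below shows the difference using an in-memory buffer instead of a file; `df_demo` and the buffer names are illustrative assumptions.

```python
import io

import pandas as pd

df_demo = pd.DataFrame(
    {"Postcode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]}
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

If the exported file is only ever consumed by pandas, passing `index=False` on export (or `index_col=0` on import) keeps the column layout unchanged across the round trip.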


### Add Geographical Coordinates

To use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood. One way to make sure that every postal code gets its coordinates is to query a geocoder in a while loop, retrying each postal code until a result comes back.
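
The while-loop idea can be sketched as follows. This is only an illustration under stated assumptions: `geocode_with_retry` and `make_flaky_lookup` are hypothetical helpers, and `lookup` stands in for whatever geocoder call would be used; no real geocoding service is contacted here.

```python
import time


def geocode_with_retry(lookup, postal_code, max_attempts=5, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning (latitude, longitude) or None.
    """
    attempts = 0
    coords = None
    while coords is None and attempts < max_attempts:
        coords = lookup(postal_code)
        attempts += 1
        if coords is None and delay_seconds:
            time.sleep(delay_seconds)  # be polite to the service between retries
    return coords


def make_flaky_lookup(failures_before_success):
    """Build a fake lookup that fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def lookup(postal_code):
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            return None
        return (43.806686, -79.194353)

    return lookup


result = geocode_with_retry(make_flaky_lookup(2), "M1B")
gave_up = geocode_with_retry(make_flaky_lookup(10), "M1B", max_attempts=3)
print(result, gave_up)
```

Capping the number of attempts is a deliberate choice in this sketch: an unbounded `while` loop would never finish for a postal code that the service cannot resolve.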

Because Geocoder and Nominatim could not be used here, we instead take the coordinates from the CSV file at the following link:

http://cocl.us/Geospatial_data

In [7]:
!wget -q -O 'Geospatial_Coordinates.csv' http://cocl.us/Geospatial_data   
print('Postal Code Coordinates downloaded!')

Postal Code Coordinates downloaded!


In [8]:
df_coordinates = pd.read_csv('Geospatial_Coordinates.csv')

# Rename the Postal Code column to allow merging
df_coordinates.rename(columns={'Postal Code':'Postcode'}, inplace=True)
print(df_coordinates[0:3])

# Attach the latitude and longitude values to each postal code
# by merging the two dataframes on the Postcode column
df_latlg = pd.merge(df, df_coordinates, on='Postcode')
df_latlg.head(15)

  Postcode   Latitude  Longitude
0      M1B  43.806686 -79.194353
1      M1C  43.784535 -79.160497
2      M1E  43.763573 -79.188711


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
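
By default `pd.merge` performs an inner join, so a postal code that is missing from the coordinates file would silently disappear from the result. The sketch below shows one way to check for that with a left join and the `indicator` flag; the toy dataframes (including the made-up code `M9Z`) and the variable names are assumptions for illustration.

```python
import pandas as pd

codes = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C", "M9Z"],  # 'M9Z' is a made-up code with no coordinates
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
coords = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# A left join keeps every postal code; `indicator` adds a `_merge` column that
# records where each row came from, and `validate` guards against duplicate keys.
merged = codes.merge(
    coords, on="Postcode", how="left", indicator=True, validate="one_to_one"
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates, which is the same guarantee as comparing the shapes of `df` and `df_latlg` after the inner join.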
