## Segmenting and Clustering Neighborhoods in Toronto

_By T.J. Griesenbrock_

The purpose of this assignment is to demonstrate my ability to use pandas and other libraries to convert a table on a web page into a usable, cleaned dataframe.

**Please Note**: This code is not written in a defensive programming style. It assumes the webpage exists and is always correctly formatted. Do not emulate it or use it in production code without adequate hardening against failure scenarios. The only guard is in _Step 9_, where the geocoding retry loop exits after a fixed number of attempts so that a runaway loop cannot spam the service.

### Step 1 - Import libraries.

Libraries used are: NumPy, pandas, BeautifulSoup, geocoder, requests, sys, Matplotlib, scikit-learn, and Folium.

In [1]:
# Functions to handle data within dataframes.
import numpy as np

# Used to build and parse the dataframe.
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# BeautifulSoup is used to scrape web pages.
from bs4 import BeautifulSoup as bs

# Geocoder
import geocoder as geo

# HTML request
import requests as req

# Sys for error handling
import sys

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Step 2 - Get raw data

Request the HTML of the specified Wikipedia page.

Using BeautifulSoup, parse the page and take the first table.

In [2]:
source = req.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(source, 'lxml')

curr_table = soup.find_all('table')[0]

#print(curr_table.prettify())

### Step 3 - Parse the data into a dataframe.

First, get the column names (from the 'th' tags) and count the rows in the table (from the 'tr' tags).

Second, create the dataframe with the size we obtained.

Third, populate the dataframe with the actual data, stripping any trailing newline characters (the last cell of each row ends with one).

Finally, drop the first record: the population loop does not skip the header row, which has no 'td' cells and is therefore left empty.

In [3]:
column_names = []
row_count = 0

for row in curr_table.find_all('tr'):
    row_count += 1
    
    tag_th = row.find_all('th') 
    if len(tag_th) > 0 and len(column_names) == 0:
        for column in tag_th:
            column_names.append(column.get_text().rstrip())

raw_postal = pd.DataFrame(columns=column_names, index=range(0, row_count))

counter_rows = 0
for table_row in curr_table.find_all('tr'):
    counter_columns = 0
    for table_column in table_row.find_all('td'):
        raw_postal.iat[counter_rows, counter_columns] = table_column.get_text().rstrip()
        counter_columns += 1
    counter_rows += 1

raw_postal.drop(raw_postal.index[0], inplace=True)
#raw_postal
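
The same 'th'/'td' walk can be tried without network access on a small inline table. The following is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup; the sample HTML and the `TableParser` class are made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser

# Made-up sample standing in for the Wikipedia table (an assumption for illustration).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect header names from <th> cells and data rows from <td> cells."""

    def __init__(self):
        super().__init__()
        self.header = []
        self.rows = []
        self._cell_tag = None   # 'th' or 'td' while inside a cell
        self._cell_text = []
        self._row = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td'):
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell_tag == tag:
            # Strip surrounding whitespace, including the trailing newline.
            text = ''.join(self._cell_text).strip()
            if tag == 'th':
                self.header.append(text)
            else:
                self._row.append(text)
            self._cell_tag = None
        elif tag == 'tr' and self._row:
            # Header rows have no <td> cells, so they never reach this branch.
            self.rows.append(self._row)


parser = TableParser()
parser.feed(SAMPLE_HTML)
print(parser.header)
print(parser.rows)
```

Because rows without 'td' cells are skipped while parsing, this variant has no empty header record to drop afterwards.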

### Step 4 - Drop all Boroughs that are Not Assigned

In [4]:
raw_postal = raw_postal[raw_postal.Borough != 'Not assigned']
#raw_postal

### Step 5 - Correct Neighbourhood values.

If Neighbourhood has a "Not assigned" value, replace it with the Borough's name.

In [5]:
raw_postal['Neighbourhood'] = np.where(raw_postal['Neighbourhood'] == 'Not assigned', 
                                       raw_postal['Borough'],
                                       raw_postal['Neighbourhood'])
#raw_postal

### Step 6 - Join all rows with duplicate Postcode.

Each Postcode maps to a single Borough, so we group by both Postcode and Borough and merge the Neighbourhood values into one comma-separated string.

Then display the resulting dataframe.

In [6]:
clean_postal = raw_postal.groupby(['Postcode',"Borough"])['Neighbourhood'].apply(', '.join).reset_index()
clean_postal

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
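
Steps 4 to 6 can be tried end to end on a few made-up rows. This is a minimal sketch, assuming the same three column names as the scraped table; the `toy` data is invented, and `Series.mask` is used here in place of `np.where` only to keep the example pandas-only.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table (assumption for illustration).
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Step 4: drop rows without a borough; copy so the later assignment acts on a new frame.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Step 5: fall back to the borough name where the neighbourhood is not assigned.
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough'])

# Step 6: one row per postcode, neighbourhoods joined with commas.
grouped = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(', '.join)
              .reset_index())
print(grouped)
```

The two `M5A` rows collapse into a single row, and the `M7A` row takes its borough name as its neighbourhood.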


### Step 7 - Show the shape of the dataframe.

In [7]:
clean_postal.shape

(103, 3)

### Step 8 - Create new columns for Latitude and Longitude

In [8]:
clean_postal['Latitude'] = None
clean_postal['Longitude'] = None

### Step 9 - Acquire Latitude and Longitude for Postal Code

Now that we have the two new columns, we iterate through the dataframe. A while loop runs from the first row up to the dataframe's row count. For each row, we read the postal code and request its latitude and longitude, retrying until the geocoder returns a result (with a cap of roughly 100 attempts per postal code). The coordinates are then written to the same row as the Postcode.

In [9]:
row_count = 0
loop_count = 0
max_count = clean_postal.shape[0]

while row_count < max_count:
    tmp_lat_lng = None
    current_postal = clean_postal.loc[row_count, 'Postcode']
    
    # loop until you get the coordinates
    while(tmp_lat_lng is None):
        g = geo.arcgis(f'{current_postal}, Toronto, Ontario')
        tmp_lat_lng = g.latlng
        if loop_count > 100:
            sys.exit("Fatal Error - " + str(current_postal))
        else:
            loop_count += 1
    
    clean_postal.loc[row_count, 'Latitude'] = tmp_lat_lng[0]
    clean_postal.loc[row_count, 'Longitude'] = tmp_lat_lng[1]
    row_count += 1
    loop_count = 0

clean_postal

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.8115 | -79.1955 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.7857 | -79.1587 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.7657 | -79.1753 |
| 3 | M1G | Scarborough | Woburn | 43.7684 | -79.2176 |
| 4 | M1H | Scarborough | Cedarbrae | 43.7697 | -79.2394 |
| 5 | M1J | Scarborough | Scarborough Village | 43.7431 | -79.2317 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.7262 | -79.2637 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.7131 | -79.2851 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.7236 | -79.235 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.6967 | -79.2602 |
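
The capped retry in this step can be isolated into a small helper that raises an exception instead of calling `sys.exit`. This is a sketch under assumptions: `lookup_with_retry` and `flaky_lookup` are hypothetical names, and the stand-in geocoder below returns made-up values rather than calling any service.

```python
def lookup_with_retry(lookup, query, max_attempts=100):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None; in the
    notebook it would wrap geo.arcgis(query).latlng.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    raise RuntimeError(f'No coordinates for {query!r} after {max_attempts} attempts')


# Stand-in geocoder that fails twice before answering (made-up values).
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.81, -79.19]


result = lookup_with_retry(flaky_lookup, 'M1B, Toronto, Ontario', max_attempts=5)
print(result, calls['count'])
```

Raising an exception leaves the caller free to decide whether to skip the postal code, log it, or stop, which `sys.exit` does not allow.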


### Step 10 - Get the central location of Toronto for mapping purpose.

In [10]:
g = geo.arcgis('Toronto, Ontario')
city_latitude = g.latlng[0]
city_longitude = g.latlng[1]

print('The geographical coordinates of Toronto are {}, {}.'.format(city_latitude, city_longitude))

The geographical coordinates of Toronto are 43.648690000000045, -79.38543999999996.


### Step 11 - Visualize Neighborhoods in Toronto

Using Folium, we display all of the neighborhoods in Toronto. On some browsers, *zoom_start=11* gives a good view, but on my browser *zoom_start=10*, the default used in another lab, looks better.

### Step 12 - Cluster the neighborhoods

It is not clear exactly what the instructor wants me to do, so I will run a basic k-means clustering on the neighborhoods based on location alone.

In [12]:
# set number of clusters
kclusters = 5

toronto_clustering = clean_postal.drop(columns=['Postcode', 'Borough', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:2] 

# add clustering labels
clean_postal.insert(0, 'Cluster Labels', kmeans.labels_)

clean_postal.head() # check the first column!

| | Cluster Labels | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 1 | M1B | Scarborough | Rouge, Malvern | 43.8115 | -79.1955 |
| 1 | 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.7857 | -79.1587 |
| 2 | 1 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.7657 | -79.1753 |
| 3 | 1 | M1G | Scarborough | Woburn | 43.7684 | -79.2176 |
| 4 | 1 | M1H | Scarborough | Cedarbrae | 43.7697 | -79.2394 |
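
To show what clustering on location alone does, here is a plain-Python sketch of the basic k-means (Lloyd's) iteration on latitude/longitude pairs. It is an illustration, not scikit-learn's implementation: `kmeans_points` is a hypothetical helper, it seeds with the first k points instead of k-means++, and the three western points are made up (the three eastern ones come from the table above).

```python
def kmeans_points(points, k, iterations=20):
    """Plain Lloyd's iteration on (lat, lng) pairs; returns (labels, centroids).

    Seeds with the first k points, a simplification compared with the
    k-means++ seeding that scikit-learn's KMeans uses by default.
    """
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids


points = [
    (43.8115, -79.1955), (43.7857, -79.1587), (43.7657, -79.1753),  # from the table
    (43.65, -79.55), (43.64, -79.56), (43.66, -79.54),              # made-up western points
]
labels, centroids = kmeans_points(points, k=2)
print(labels)
```

Like the `KMeans` call above, this treats latitude and longitude as flat Euclidean coordinates, which is only an approximation of distance on the ground.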


### Step 13 - Visualize the clusters.

In [13]:
# create map
map_clusters = folium.Map(location=[city_latitude, city_longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(clean_postal['Latitude'], 
                                  clean_postal['Longitude'], 
                                  clean_postal['Neighbourhood'], 
                                  clean_postal['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Observing the map, there are five clusters: one clearly marks downtown, one covers the region to the west, one covers the region to the east, and the remaining two lie north of downtown and due east of downtown.

I am not sure what other observations to make at this time.