<h1>Segmenting and Clustering Neighbourhoods in Toronto</h1>

In [11]:
!pip install beautifulsoup4
from bs4 import BeautifulSoup # library for scraping from a website

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

!pip install geocoder
import geocoder # to get longitude and latitude



<h3>1. Table of neighbourhoods in Toronto</h3>

To create a table of neighborhoods in Toronto, the postcode table on the Wikipedia page was scraped using the Beautiful Soup package.

In [12]:
# Getting the webpage (a fixed revision of the Wikipedia article)
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'
webpage = requests.get(url).text

# Extracting only the table from the webpage
soup = BeautifulSoup(webpage, 'html.parser')
table = soup.find('table', class_='sortable')

# Getting the values from the Wikipedia table and storing them in a dataframe
row = [] # initialize row list

for tr in table.find_all('tr'):                                           # for every row in the original table
    if tr.find_all('th') == []:                                           # unless it's a header
        row.append([td.get_text(strip=True) for td in tr.find_all('td')]) # every item w/i 'td' tag appended to row list

# Assign columns names and turn row into a dataframe of neighborhoods
column_names = ['Postcode', 'Borough', 'Neighborhood']

neighborhoods_raw = pd.DataFrame(row, columns=column_names) # create a raw table of neighborhoods
neighborhoods_raw.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue




Next, the table was cleaned.

    1. Rows whose borough is 'Not assigned' were dropped.
    2. Neighborhoods that are 'Not assigned' were given the same name as their borough.
    3. If two or more neighborhoods share the same postcode, they were combined into one cell, separated by commas.
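
The three rules can be illustrated on a handful of rows before they are applied to the full table. This is only a minimal sketch: the toy rows and the names `raw`, `clean`, and `unassigned` are made up for illustration, and the boolean-mask style is one possible way to express the rules.

```python
import pandas as pd

# Toy rows (illustrative values only) that exercise all three rules
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['Postcode', 'Borough', 'Neighborhood'],
)

# Rule 1: drop rows without an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: an unassigned neighborhood takes the name of its borough
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Rule 3: one row per postcode, neighborhoods joined by commas
clean = (clean.groupby(['Postcode', 'Borough'], as_index=False)['Neighborhood']
              .agg(lambda names: ', '.join(names)))

print(clean)
```

The toy table ends up with one row per postcode, which is the same shape the full table should have after cleaning.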



In [15]:
#dropping the cells having borough as not assigned.
drop_index = neighborhoods_raw[neighborhoods_raw['Borough'] == 'Not assigned'].index # getting the indexes of rows containing Borough as "Not Assigned"
neighborhoods = neighborhoods_raw.drop(drop_index, axis=0)                           # dropping the indexes with "Not Assigned" Borough
neighborhoods.reset_index(drop=True, inplace=True)                                   # resets the index after dropping the rows

In [16]:
# Neighborhoods that are 'Not assigned' take the name of their borough
nhna = neighborhoods['Neighborhood'] == 'Not assigned' # boolean mask of rows where the neighborhood is not assigned
neighborhoods.loc[nhna, 'Neighborhood'] = neighborhoods.loc[nhna, 'Borough']

In [17]:
# Group the neighborhoods that share a postal code and join their names with ', '
neighborhoods = neighborhoods.groupby(['Postcode', 'Borough'], as_index=False).agg(lambda x: ', '.join(x))
neighborhoods # display the resulting dataframe

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [18]:
neighborhoods.shape

(103, 3)


<h3>2. Longitude and latitude of the neighborhoods</h3>

In [19]:
# Initialization of variables
lat = []
lng = []
lat_lng_coords = None

# Get postcodes from neighborhoods table
postal_code = neighborhoods['Postcode']

# Store latitude and longitude values in lat and lng
for pc in postal_code:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(pc))
    lat_lng_coords = g.latlng
    lat.append(lat_lng_coords[0])
    lng.append(lat_lng_coords[1])
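
A geocoding service can occasionally return no result, in which case `g.latlng` is `None` and indexing it fails. The following is a minimal sketch of a retry wrapper. The helper `latlng_with_retry`, its parameters, and the stand-in `fake_lookup` are hypothetical names introduced here for illustration and are not part of the geocoder package.

```python
import time

def latlng_with_retry(lookup, query, max_tries=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or the tries run out."""
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before asking again
    return None            # caller decides how to handle a missing result

# Stand-in for the real service: fails twice, then succeeds
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.38]

coords = latlng_with_retry(fake_lookup, 'M5A, Toronto, Ontario')
print(coords, calls['n'])
```

In the loop above, the stand-in could be replaced by something like `lambda q: geocoder.arcgis(q).latlng`, assuming the service is reachable.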



The latitude and longitude values, each stored in its own list, were added to the neighborhood dataframe as new columns.


In [None]:
nh_complete = neighborhoods.copy() # copy, so the original neighborhoods table is not modified
nh_complete['Latitude'] = lat
nh_complete['Longitude'] = lng
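
Assigning the lists directly relies on them being in the same order as the rows of the dataframe. As an alternative sketch, the coordinates can be keyed by postcode and merged, which does not depend on row order. The names `neighborhoods_demo`, `coords_by_postcode`, `coords`, and `merged` are illustrative; the two coordinate pairs are taken from the output table in this notebook.

```python
import pandas as pd

# Two rows of the cleaned table
neighborhoods_demo = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# Coordinates keyed by postcode, deliberately in a different order
coords_by_postcode = {
    'M1C': (43.785665, -79.158725),
    'M1B': (43.811525, -79.195517),
}

# Turn the mapping into a dataframe indexed by postcode
coords = pd.DataFrame.from_dict(coords_by_postcode, orient='index',
                                columns=['Latitude', 'Longitude'])

# Left join on the postcode so every neighborhood row is kept
merged = neighborhoods_demo.merge(coords, left_on='Postcode',
                                  right_index=True, how='left')
print(merged)
```

A postcode without coordinates would show up as `NaN` in the merged table rather than shifting the values of the rows below it.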

In [9]:
nh_complete

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944
5,M1J,Scarborough,Scarborough Village,43.743125,-79.23175
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.726276,-79.263625
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.713054,-79.285055
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.724235,-79.227925
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.69677,-79.259967


<h3>3. Explore and cluster the neighborhoods</h3>

I decided to look only at the neighborhoods that were part of the original city of Toronto before the amalgamation of 1998. Thus, only the boroughs with 'Toronto' in their name were kept in the new dataframe. This leaves 39 postcodes in the original city of Toronto.


In [10]:
toronto = nh_complete[nh_complete['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto.shape

(39, 5)

A map of Toronto was created to visualize the city and all the neighborhoods (locations corresponding to the postcodes).


In [12]:
# Get the latitude and longitude of Toronto
g = geocoder.arcgis('Toronto, Ontario')
lat_tor = g.latlng[0]
lng_tor = g.latlng[1]

# Create a map of Toronto
map_toronto = folium.Map(location=[lat_tor, lng_tor], zoom_start=11)

# Add markers to map
for lat, lng, bor, postcode in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Postcode']):
    label = '{}, {}'.format(postcode, bor)        # popup labels with postcode and borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7).add_to(map_toronto)

map_toronto