<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br><br><br><br><br>
<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

> _by Jack Daoud_
>
> _May 16<sup>th</sup>, 2021_


<br><br><br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br><br><br><br><br><br>

# Table of Contents
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<font size="4">1. </font>[<font size="4">Packages</font>](#packages)

<font size="4">2. </font>[<font size="4">Data</font>](#data)

>
> <font size="3">i. </font>[<font size="3">Basic Neighborhood Data</font>](#neighborhood)
>
> <font size="3">ii. </font>[<font size="3">Geographical Data</font>](#geographical)
>
> <font size="3">iii. </font>[<font size="3">FourSquare Data</font>](#foursquare)

<font size="4">3. </font>[<font size="4">Clustering</font>](#clustering)

>
> <font size="3">i. </font>[<font size="3">Data Preparation</font>](#dataprep)
>
> <font size="3">ii. </font>[<font size="3">Clustering Algorithm</font>](#algorithm)
>
> <font size="3">iii. </font>[<font size="3">Results</font>](#results)

<br><br><br><br><br><br>

# 1) Packages
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<a id='packages'></a>

In [1]:
import json
import folium
import requests
import numpy as np
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

<br><br>

# 2) Data
<a id='data'></a>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>

## 2.1) Neighborhood Data

<a id='neighborhood'></a>

This section covers extracting and wrangling data from this [list of postal codes of Canada from Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). Three fields are collected for each postal code area:

1. `Postal Codes`
2. `Boroughs`
3. `Neighborhoods`

In [2]:
# Web Scraping Configurations
##############################

# Source of Postal Codes in Canada
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Pulling the text data from the above page
html = requests.get(page).text

# Instantiate a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')




# Get the data
###############

# Placeholder
table_contents = []

# Find the table on the wikipedia page
table = soup.find('table')

# Loop through each row of the table
for row in tqdm(table.findAll('td')):
    
    # Placeholder
    cell = {}
    
    # Skip any rows where Not assigned is found
    if row.span.text=='Not assigned':
        pass
    
    else:
        
        # Get the first 3 characters for the postal code from the paragraph
        cell['Postal Code'] = row.p.text[:3]
        
        # Get the borough name from the text before the parentheses ()
        cell['Borough'] = (row.span.text).split('(')[0]
        
        # Get the neighborhoods from between the parentheses
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# Check results
#print(table_contents)

# Create DF of the table contents
neighborhoods = pd.DataFrame(table_contents)

# Clean anomalies (borough names fused with extra text during scraping)
neighborhoods['Borough'] = neighborhoods['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
    'EtobicokeNorthwest':'Etobicoke Northwest',
    'East YorkEast Toronto':'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre':'Mississauga'
})

neighborhoods.shape

100%|██████████| 180/180 [00:00<00:00, 22773.13it/s]


(103, 3)
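
The chained string operations in the scraping loop can be hard to follow. As a minimal sketch, the same splitting logic can be written as a standalone function. The name `split_borough_neighborhoods` is a hypothetical helper, not part of the notebook, and the sample string assumes the cell text has the form `Borough(Neighborhood / Neighborhood)`.

```python
def split_borough_neighborhoods(span_text):
    """Split 'Borough(Hood A / Hood B)' into ('Borough', 'Hood A, Hood B').

    Hypothetical helper that mirrors the split/strip/replace chain
    used in the scraping loop; it is an illustration only.
    """
    # Everything before the first '(' is the borough name
    borough, _, rest = span_text.partition('(')
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhoods


example = split_borough_neighborhoods('Downtown Toronto(Regent Park / Harbourfront)')
print(example)  # -> ('Downtown Toronto', 'Regent Park, Harbourfront')
```

Keeping the parsing in one small function makes it easier to check edge cases, such as a cell with no parentheses, before running the full scrape.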

<br><br>

## 2.2) Geographical Coordinates Data
<a id='geographical'></a>

This section extracts latitude and longitude coordinates and merges them with the neighborhood data from [Section 2.1](#neighborhood).
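
A default `pd.merge` is an inner join, so any postal code without a coordinate row is dropped silently. The sketch below shows one way to check for that on toy data. The frames `neighborhoods_demo` and `coordinates_demo`, the postal code `M0X`, and all coordinate values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file (made-up rows)
neighborhoods_demo = pd.DataFrame({
    'Postal Code': ['M5A', 'M4L', 'M0X'],
    'Neighborhood': ['Area one', 'Area two', 'Area without coordinates'],
})
coordinates_demo = pd.DataFrame({
    'Postal Code': ['M5A', 'M4L'],
    'Latitude': [43.65, 43.66],
    'Longitude': [-79.36, -79.32],
})

# A left join with indicator=True records where each row came from;
# validate raises if a postal code appears more than once on either side
checked = pd.merge(neighborhoods_demo, coordinates_demo,
                   on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that would vanish in an inner join
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()

# Equivalent of the inner-join result
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(unmatched)
```

If `unmatched` is empty, the inner join loses nothing.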

In [3]:
# Get Data
############

# Pull coordinate data from link provided by course
coordinates = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')

# Merge datasets
df = pd.merge(neighborhoods, coordinates, on='Postal Code')

# Get longitude and latitude of Toronto
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude




# Plot map
###########

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10.3)

# Add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<BR><BR><BR><BR>

In [4]:
# Filter & Map Boroughs within Toronto
####################################################


# Filter data
toronto = df[df['Borough'].str.contains("Toronto")]

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], 
                                           toronto['Longitude'],
                                           toronto['Borough'],
                                           toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
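
The filter in this cell keeps every borough whose name contains the substring "Toronto", which also matches combined names such as `East York/East Toronto`. A small sketch on toy data makes the behavior explicit. The frame `boroughs_demo` is made up, and the `na=False` and `regex=False` arguments are additions not used in the notebook.

```python
import pandas as pd

# Toy borough column, including a missing value
boroughs_demo = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York',
                'East York/East Toronto', 'Mississauga', None]
})

# na=False treats missing names as non-matches instead of propagating NaN;
# regex=False does a plain substring match
mask = boroughs_demo['Borough'].str.contains('Toronto', na=False, regex=False)
kept = boroughs_demo.loc[mask, 'Borough'].tolist()
print(kept)  # -> ['Downtown Toronto', 'East York/East Toronto']
```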

<br><br>

## 2.3) FourSquare Data

<a id='foursquare'></a>

In this section, we'll retrieve neighborhood venue data using the FourSquare API.
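
API credentials are usually kept out of a shared notebook. One possible approach, sketched here, reads them from environment variables and builds the query parameters as a dictionary. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_params` are assumptions for illustration; the parameter keys match the query string used in this notebook.

```python
import os


def build_explore_params(lat, lng, radius=500, limit=100, version='20180605'):
    """Build query parameters for the venues/explore endpoint.

    Hypothetical helper: credentials come from environment variables
    (assumed names) rather than literals in the notebook.
    """
    return {
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }


params = build_explore_params(43.65, -79.38)

# The dictionary can then be passed to requests, which handles URL encoding:
# requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
```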

In [5]:
# FourSquare Configurations
##############################

CLIENT_ID = 'WNU2DFNXQJAOKHG3OGIUHW02MWRT2251NE5LY1DWCJU0KMD3' # Foursquare ID
CLIENT_SECRET = '3J5AGI0ATN3K5AOZPRPZH5UHRQ3Z3LY3BY2MWPFZCQRR3GQO' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value



# Get Venue data from FourSquare
#################################

# Define helper function to pull venue data per neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in tqdm(zip(names, latitudes, longitudes)):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


# Use function
toronto_venues = getNearbyVenues(names      = toronto['Neighborhood'],
                                 latitudes  = toronto['Latitude'],
                                 longitudes = toronto['Longitude'])

# Print data
toronto_venues

39it [00:15,  2.59it/s]


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Regent Park, Harbourfront | 43.654260 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 1 | Regent Park, Harbourfront | 43.654260 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 2 | Regent Park, Harbourfront | 43.654260 | -79.360636 | Cooper Koo Family YMCA | 43.653249 | -79.358008 | Distribution Center |
| 3 | Regent Park, Harbourfront | 43.654260 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
| 4 | Regent Park, Harbourfront | 43.654260 | -79.360636 | Impact Kitchen | 43.656369 | -79.356980 | Restaurant |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | Enclave of M4L | 43.662744 | -79.321558 | Olliffe On Queen | 43.664503 | -79.324768 | Butcher |
| 1496 | Enclave of M4L | 43.662744 | -79.321558 | Greenwood Cigar & Variety | 43.664538 | -79.325379 | Smoke Shop |
| 1497 | Enclave of M4L | 43.662744 | -79.321558 | ONE Academy | 43.662253 | -79.326911 | Gym / Fitness Center |
| 1498 | Enclave of M4L | 43.662744 | -79.321558 | Revolution Recording | 43.662561 | -79.326940 | Recording Studio |
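
The helper above indexes straight into the response (`["response"]['groups'][0]['items']` and `categories[0]`), so an empty result or a venue with no category raises an error. A more defensive version of the row extraction might look like the sketch below. The function `extract_venue_rows` and the `fake_payload` dictionary are hypothetical; the payload only imitates the nesting that the notebook's code relies on.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore response into rows, tolerating missing pieces.

    Hypothetical helper; the key names follow the nesting used by
    getNearbyVenues in this notebook.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Made-up payload with one categorized and one uncategorized venue
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Demo Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Demo Spot',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Demo Neighborhood', 43.65, -79.36, fake_payload)
```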


<br><br>

# 3) Clustering
<a id='clustering'></a>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>

## 3.1) Data Preparation
<a id='dataprep'></a>

In [17]:
# One hot encoding
###################

# convert each venue category into its own indicator (dummy) column
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], 
                                prefix = "", prefix_sep = "")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# group records by neighborhood & average the values per venue
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()






# Build dataframe of most common venues per neighborhood
########################################################

# Create a function to get the most frequent/common venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Build dataframe
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in tqdm(np.arange(num_top_venues)):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in tqdm(np.arange(toronto_grouped.shape[0])):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

100%|██████████| 10/10 [00:00<00:00, 137068.76it/s]
100%|██████████| 39/39 [00:00<00:00, 2057.32it/s]


| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berczy Park | Cocktail Bar | Bakery | Coffee Shop | Sandwich Place | Farmers Market | Beer Bar | Seafood Restaurant | Vegetarian / Vegan Restaurant | French Restaurant | Spa |
| 1 | Brockton, Parkdale Village, Exhibition Place | Sandwich Place | Café | Bakery | Coffee Shop | Breakfast Spot | Convenience Store | Japanese Restaurant | Stadium | Restaurant | Bar |
| 2 | CN Tower, King and Spadina, Railway Lands, Har... | Airport Service | Airport Lounge | Boat or Ferry | Harbor / Marina | Coffee Shop | Rental Car Location | Sculpture Garden | Boutique | Bar | Plane |
| 3 | Central Bay Street | Coffee Shop | Sandwich Place | Italian Restaurant | Sushi Restaurant | Café | Japanese Restaurant | Burger Joint | Salad Place | Restaurant | Bank |
| 4 | Christie | Grocery Store | Café | Park | Restaurant | Baby Store | Nightclub | Coffee Shop | Athletics & Sports | Italian Restaurant | Movie Theater |
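
The preparation step turns the venue list into one row per neighborhood, where each column holds the share of that neighborhood's venues in a given category. The toy example below illustrates the idea. The frame `venues_demo` is made up, and grouping by the original `Neighborhood` series (rather than adding it back as a column) is a variation on the notebook's code that avoids a clash if a venue category is itself named "Neighborhood".

```python
import pandas as pd

# Made-up venue list: three venues in each of two neighborhoods
venues_demo = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bakery', 'Bakery', 'Park'],
})

# One indicator column per category (1.0 where the venue has that category)
onehot = pd.get_dummies(venues_demo['Venue Category'], dtype=float)

# The mean of the indicators per neighborhood is the category frequency
freq = onehot.groupby(venues_demo['Neighborhood']).mean()

# Top two categories per neighborhood, most frequent first
top = {name: row.nlargest(2).index.tolist() for name, row in freq.iterrows()}
print(top)  # -> {'A': ['Café', 'Park'], 'B': ['Bakery', 'Park']}
```

Each row of `freq` sums to 1, so neighborhoods with many venues and neighborhoods with few are compared on the same scale during clustering.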


<br>

## 3.2) Cluster Neighborhoods
<a id='algorithm'></a>

In [18]:
# set number of clusters
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronton_merged = toronto

# join cluster labels and top venues onto the Toronto data, which holds latitude/longitude for each neighborhood
toronton_merged = toronton_merged.join(neighborhoods_venues_sorted
                                       .set_index('Neighborhood'), 
                                       on='Neighborhood')


# Map the clusters
###################
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronton_merged['Latitude'], 
                                  toronton_merged['Longitude'], 
                                  toronton_merged['Neighborhood'], 
                                  toronton_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
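
The cell above fixes `kclusters = 10` by hand. One common way to sanity-check a choice of k is to compare silhouette scores across several values. The sketch below does this on synthetic data, not on the Toronto features. The array `features_demo`, the three cluster centers, and the candidate range of k are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features_demo = np.vstack([c + rng.normal(scale=0.3, size=(20, 2))
                           for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features_demo)
    scores[k] = silhouette_score(features_demo, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, the same loop could be run over `toronto_grouped_clustering`. The scores there may be much flatter than in this toy case, so the result is a guide rather than a definitive answer.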