<h1 align=center><font size = 5>Segmentation and Clustering of Neighbourhoods in the City of Toronto</font></h1>

## Introduction

In this notebook, neighbourhood data for the city of Toronto is scraped from the web, wrangled, cleaned and read into a *pandas* dataframe.

Subsequently, the corresponding latitude and longitude values of each postal code are merged into the dataframe. The Foursquare API is then used to explore the neighbourhoods of Toronto.

Finally, the *k*-means clustering algorithm is used to cluster the neighbourhoods, and the Folium library is used to visualize the neighbourhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Scrape and Read Data into pandas Dataframe</a>
    

2. <a href="#item2">Update Dataframe with Latitude and Longitude Values</a>
    

3. <a href="#item3">Explore Neighbourhoods in Toronto City</a>
    

4. <a href="#item4">Analyse and Cluster Neighbourhoods</a>
</font>
</div>

### 1. Scrape and Read Data into pandas Dataframe

**Importing all the needed dependencies.**

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # to convert an address into latitude and longitude values

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


**Scraping the data from Wikipedia and loading it into a dataframe.**

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_postcodes = pd.read_html(url)  # Extract every table on the page into a list of dataframes
toronto_postcodes = toronto_postcodes[0]   # Keep the first table
toronto_postcodes.head()

Unnamed: 0,Postal Code,Community,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


**Cleaning the dataframe.**

In [3]:
toronto_postcodes.rename(columns={'Community':'Borough'}, inplace=True)
toronto_postcodes = toronto_postcodes[toronto_postcodes.Borough != 'Not assigned']  #Drop rows with unassigned Boroughs
toronto_postcodes.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
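The cleaning step relies on a boolean mask. A minimal sketch of the same idea on a made-up three-row frame follows; the rows are illustrative rather than the scraped table, and the `reset_index` call is an optional extra that the cell above does not perform.

```python
import pandas as pd

# Illustrative rows only; the real table comes from the Wikipedia page.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True where the borough has been assigned.
assigned_mask = sample['Borough'] != 'Not assigned'

# Keep the assigned rows and renumber the index from zero.
sample_clean = sample[assigned_mask].reset_index(drop=True)
print(sample_clean)
```

Resetting the index avoids the gaps (2, 3, 4, ...) visible in the output above, which can matter if later code assumes a contiguous index.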


**Checking the size of the dataframe.**

In [4]:
toronto_postcodes.shape

(103, 3)

### 2. Update Dataframe with Latitude and Longitude Values

**Loading geographical coordinates of the neighbourhoods in Toronto.**

In [5]:
toronto_coordinates = pd.read_csv('http://cocl.us/Geospatial_data')
toronto_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Updating dataframe of postal codes with corresponding coordinates.**

In [6]:
toronto_neighbourhoods = pd.merge(toronto_postcodes,toronto_coordinates,on='Postal Code')
toronto_neighbourhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
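`pd.merge` performs an inner join by default, so a postal code without coordinates would be dropped without warning. The sketch below shows one way to check for that, using small made-up frames; `M9Z` and its borough are hypothetical values chosen to force a mismatch.

```python
import pandas as pd

# Illustrative frames; 'M9Z' is a made-up code with no coordinates.
postcodes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every postal code; indicator=True adds a '_merge' column
# recording whether each row found a match in the coordinates frame.
checked = pd.merge(postcodes, coordinates, on='Postal Code', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)

# The default inner join, as used in the notebook, drops the unmatched row.
inner = pd.merge(postcodes, coordinates, on='Postal Code')
print(len(inner))
```

If `unmatched` is empty for the real data, the inner join in the notebook loses nothing.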


### 3. Explore Neighbourhoods in Toronto City

In [7]:
# Get the coordinates of Toronto
address = 'Toronto City'
geolocator = Nominatim(user_agent="tn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.
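`geolocator.geocode` returns `None` when it finds no match, in which case `location.latitude` raises an `AttributeError`. The following is a minimal sketch of a guard, assuming a hypothetical helper `resolve_coordinates` and a stand-in geocoder so that no network call is needed.

```python
from typing import Callable, Optional, Tuple


def resolve_coordinates(
    geocode: Callable[[str], Optional[object]],
    address: str,
    fallback: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (latitude, longitude) for address, or fallback if nothing is found."""
    location = geocode(address)
    if location is None:  # the geocoder found no match for the address
        return fallback
    return (location.latitude, location.longitude)


# Stand-in geocoder that never finds anything (hypothetical, avoids a network call).
def no_result_geocoder(address: str) -> None:
    return None


# Fallback values taken from the output printed above.
toronto_fallback = (43.6534817, -79.3839347)
coords = resolve_coordinates(no_result_geocoder, 'Toronto City', toronto_fallback)
print(coords)
```

In the notebook, `geolocator.geocode` would be passed in place of `no_result_geocoder`.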


**Creating a map of Toronto with neighbourhoods superimposed on it**

In [8]:
# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_neighbourhoods['Latitude'], toronto_neighbourhoods['Longitude'], toronto_neighbourhoods['Borough'], toronto_neighbourhoods['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

**Taking a closer look at Downtown Toronto.**

In [9]:
downtownToronto_data = toronto_neighbourhoods[toronto_neighbourhoods['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtownToronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


**Defining Foursquare Credentials.**

In [10]:
CLIENT_ID = 'TB0RZPPHZJ1QTQKIOLTDHLGBYLG1RL2IH04EMKHMVJWHCMGZ' # your Foursquare ID
CLIENT_SECRET = 'M0VSGBRLRJGDZU3PLIAHVD0ILSC1F1F40GRNZQ2UW5X5JZBN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: TB0RZPPHZJ1QTQKIOLTDHLGBYLG1RL2IH04EMKHMVJWHCMGZ
CLIENT_SECRET: M0VSGBRLRJGDZU3PLIAHVD0ILSC1F1F40GRNZQ2UW5X5JZBN


**Creating a function that returns the top 100 venues within a 500-metre radius of each neighbourhood.**

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
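The function above indexes straight into the JSON response, so it raises a `KeyError` if the response has no `groups` entry and an `IndexError` if a venue has an empty `categories` list. The sketch below shows a more defensive extraction step on an offline, made-up payload. The helper name `extract_venues` and the sample values are hypothetical; the payload shape mirrors only the fields that `getNearbyVenues` reads.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue carries no category.
        category = categories[0]['name'] if categories else None
        rows.append((
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Made-up payload for illustration; not real API output.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

rows = extract_venues(sample_payload)
print(rows)
```

Rows with a `None` category can then be dropped or relabelled before the one-hot encoding step.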

**Top 100 venues within 500 metres of each neighbourhood in Downtown Toronto.**

In [12]:
downtownToronto_venues = getNearbyVenues(names=downtownToronto_data['Neighbourhood'],
                                   latitudes=downtownToronto_data['Latitude'],
                                   longitudes=downtownToronto_data['Longitude']
                                        )


Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley


**Unique categories from all the venues.**

In [13]:
print('There are {} unique categories.'.format(len(downtownToronto_venues['Venue Category'].unique())))

There are 212 unique categories.


### 4. Analyse and Cluster Neighbourhoods

**Next, let's group rows by neighbourhood, taking the mean of the frequency of occurrence of each category.**

In [15]:
# one hot encoding
downtownToronto_onehot = pd.get_dummies(downtownToronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtownToronto_onehot['Neighbourhood'] = downtownToronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [downtownToronto_onehot.columns[-1]] + list(downtownToronto_onehot.columns[:-1])
downtownToronto_onehot = downtownToronto_onehot[fixed_columns]

downtownToronto_grouped = downtownToronto_onehot.groupby('Neighbourhood').mean().reset_index()
#downtownToronto_grouped
downtownToronto_grouped.shape

(19, 213)
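To see why the mean of the one-hot columns gives a frequency, consider a toy example with made-up venues: each venue contributes a single 1 in its category column, so averaging within a neighbourhood yields the share of venues in each category. The neighbourhood names `A` and `B` and the venue lists are illustrative assumptions.

```python
import pandas as pd

# Made-up venues: neighbourhood A has four venues, B has two.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Coffee Shop', 'Park', 'Park', 'Bar'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of each 0/1 column is the fraction of venues in that category.
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighbourhoods with very different venue counts remain comparable.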

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third column
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = downtownToronto_grouped['Neighbourhood']

for ind in np.arange(downtownToronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtownToronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Cheese Shop,Beer Bar,Farmers Market,Restaurant,Sandwich Place,Breakfast Spot
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Boat or Ferry,Harbor / Marina,Sculpture Garden,Boutique,Rental Car Location,Bar,Coffee Shop,Plane
2,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Thai Restaurant,Juice Bar,Department Store,Japanese Restaurant,Burger Joint,Bubble Tea Shop
3,Christie,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Restaurant,Candy Store,Baby Store,Nightclub,Coffee Shop
4,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Yoga Studio,Café,Men's Store,Mediterranean Restaurant,Hotel
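The `try`/`except` approach to column names works for ten venues, but it would label a 21st column as "21th" if `num_top_venues` were ever raised that far. A small sketch of a more general suffix helper follows; the name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) always take 'th'.
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the cell above.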


**Run *k*-means to group the neighbourhoods into 5 clusters.**

In [17]:
# set number of clusters
kclusters = 5

downtownToronto_grouped_clustering = downtownToronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtownToronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 3, 2, 4, 2, 2, 2, 2, 2, 2])
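As a sanity check on what `KMeans` does with frequency vectors, here is a toy run on four made-up rows that fall into two obvious groups. The data and the choice of two clusters are assumptions for illustration; `n_init` is set explicitly because its default has changed across scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: rows 0-1 lean towards the first category,
# rows 2-3 towards the second.
features = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = model.labels_

print(labels)          # cluster index assigned to each row
print(model.inertia_)  # sum of squared distances to the nearest centroid
```

The numeric label values are arbitrary; only the grouping they induce is meaningful. Comparing `inertia_` across several values of `n_clusters` is one common way to judge whether 5 is a reasonable choice for the real data.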

**Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood.**

In [18]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtownToronto_merged = downtownToronto_data

# join the cluster labels and top venues onto downtownToronto_data, which holds latitude/longitude for each neighbourhood
downtownToronto_merged = downtownToronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

downtownToronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Yoga Studio,Event Space,Performing Arts Venue
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Yoga Studio,Creperie,Distribution Center,Sandwich Place,Diner,Music Venue,Portuguese Restaurant,Beer Bar,Italian Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Pizza Place,Bookstore,Middle Eastern Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Coffee Shop,Café,Restaurant,Cocktail Bar,Beer Bar,Gastropub,American Restaurant,Seafood Restaurant,Lingerie Store,Moroccan Restaurant
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Cheese Shop,Beer Bar,Farmers Market,Restaurant,Sandwich Place,Breakfast Spot
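In this run every neighbourhood returned venues, so the join leaves no gaps. If a neighbourhood had none, its `Cluster Labels` value would be `NaN` and the column would become floating-point, which cannot be used to index a colour list. A hedged sketch of the cleanup on made-up frames (neighbourhood `C` is a hypothetical unmatched row):

```python
import pandas as pd

locations = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
# 'C' is absent here, as it would be if no venues had been returned for it.
clustered = pd.DataFrame({'Neighbourhood': ['A', 'B'], 'Cluster Labels': [0, 1]})

merged = locations.join(clustered.set_index('Neighbourhood'), on='Neighbourhood')
# The unmatched row gets NaN, which turns the whole column into floats.
print(merged['Cluster Labels'].dtype)

# Drop unmatched rows and restore integer labels.
merged = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(merged)
```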


**Finally, let's visualize the resulting clusters.**

In [19]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, indexing the colour list directly by cluster label (0 to kclusters-1)
for lat, lon, poi, cluster in zip(downtownToronto_merged['Latitude'], downtownToronto_merged['Longitude'], downtownToronto_merged['Neighbourhood'], downtownToronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
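The colour assignment can also be written as an explicit mapping from cluster label to hex colour, which makes it easy to build a legend or to check which colour belongs to which cluster. This is a small sketch, assuming 5 clusters as above; the names `palette` and `cluster_colour` are hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced colour from the rainbow colormap per cluster.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Explicit label -> colour lookup.
cluster_colour = {label: palette[label] for label in range(kclusters)}
print(cluster_colour)
```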