# Segmenting and Clustering Neighborhoods in Toronto

Importing the libraries required for web scraping

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup 

Extracting the table from the web page using BeautifulSoup

In [3]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table', class_= 'wikitable')
table_rows = table.find_all('tr')

In [4]:
table_rows

[<tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>,
 <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>,
 <tr>
 <td>M3A
 </td>
 <td>North York
 </td>
 <td>Parkwoods
 </td></tr>,
 <tr>
 <td>M4A
 </td>
 <td>North York
 </td>
 <td>Victoria Village
 </td></tr>,
 <tr>
 <td>M5A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Regent Park, Harbourfront
 </td></tr>,
 <tr>
 <td>M6A
 </td>
 <td>North York
 </td>
 <td>Lawrence Manor, Lawrence Heights
 </td></tr>,
 <tr>
 <td>M7A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Queen's Park, Ontario Provincial Government
 </td></tr>,
 <tr>
 <td>M8A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>,
 <tr>
 <td>M9A
 </td>
 <td>Etobicoke
 </td>
 <td>Islington Avenue
 </td></tr>,
 <tr>
 <td>M1B
 </td>
 <td>Scarborough
 </td>
 <td>Malvern, Rouge
 </td></tr>,
 <tr>
 <td>M2B
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>,
 <tr>
 <td>M3B
 </td>
 <td>North York
 <

Rows whose Borough is 'Not assigned' are skipped. If a row has a Borough but its Neighborhood is 'Not assigned', the Neighborhood is set to the Borough name.

In [5]:
temp=[]
for tr in table_rows:
    td = tr.find_all('td')
    row = [d.text.strip() for d in td]
    
    if row and row[1] != 'Not assigned':
        if row[2] == 'Not assigned':
            row[2]=row[1]
        temp.append(row)
temp

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
 ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'],
 ['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Malvern, Rouge'],
 ['M3B', 'North York', 'Don Mills'],
 ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'],
 ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'],
 ['M6B', 'North York', 'Glencairn'],
 ['M9B',
  'Etobicoke',
  'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'],
 ['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek'],
 ['M3C', 'North York', 'Don Mills'],
 ['M4C', 'East York', 'Woodbine Heights'],
 ['M5C', 'Downtown Toronto', 'St. James Town'],
 ['M6C', 'York', 'Humewood-Cedarvale'],
 ['M9C',
  'Etobicoke',
  'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood'],
 ['M1E', 'Scarborou
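
The cleaning rule can be checked on a few sample rows without fetching the page. This is a minimal sketch, not the notebook's exact cell: `clean_rows` and `sample_rows` are hypothetical names, the third sample row is made up for illustration, and treating an empty neighborhood cell as unassigned is an added assumption.

```python
import pandas as pd

# Sample rows shaped like the scraped table: [postal code, borough, neighborhood].
sample_rows = [
    ['M1A', 'Not assigned', ''],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],  # made-up row for illustration
]

def clean_rows(rows):
    """Drop rows without a borough; fill a missing neighborhood from the borough."""
    cleaned = []
    for code, borough, neighborhood in rows:
        if borough == 'Not assigned':
            continue  # no borough, so the row is skipped
        # Assumption: an empty cell is treated the same as 'Not assigned'.
        if neighborhood in ('', 'Not assigned'):
            neighborhood = borough
        cleaned.append([code, borough, neighborhood])
    return cleaned

cleaned = clean_rows(sample_rows)
print(pd.DataFrame(cleaned, columns=['Postalcode', 'Borough', 'Neighbourhood']))
```

Keeping the rule in a small function makes it easy to re-run on the real `table_rows` data after the cell text has been stripped.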

In [6]:
pd.set_option('display.max_rows', None)

The rows extracted from the web page are converted into a dataframe with the columns Postalcode, Borough, and Neighbourhood

In [25]:
df = pd.DataFrame(data=temp, columns=['Postalcode', 'Borough', 'Neighbourhood'])
df

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


The same postal code can appear in more than one row, once per neighborhood. To keep one row per postal code, the neighborhoods that share a postal code are joined into a single comma-separated string.

In [26]:
df1 = df.groupby('Postalcode')['Neighbourhood'].apply(','.join).reset_index().set_index('Postalcode')
df1

Postalcode,Neighbourhood
M1B,"Malvern, Rouge"
M1C,"Rouge Hill, Port Union, Highland Creek"
M1E,"Guildwood, Morningside, West Hill"
M1G,Woburn
M1H,Cedarbrae
M1J,Scarborough Village
M1K,"Kennedy Park, Ionview, East Birchmount Park"
M1L,"Golden Mile, Clairlea, Oakridge"
M1M,"Cliffside, Cliffcrest, Scarborough Village West"
M1N,"Birch Cliff, Cliffside West"
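
As a rough illustration of the joining step, the sketch below collapses duplicate postal codes in one pass by grouping on both `Postalcode` and `Borough`. The split rows in `sample` are hypothetical, and this single `groupby` is an alternative to the notebook's groupby-then-join approach, not a reproduction of it.

```python
import pandas as pd

# Hypothetical input where one postal code is split across two rows.
sample = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# sort=False keeps postal codes in order of first appearance.
combined = (
    sample.groupby(['Postalcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```

Grouping on both columns keeps `Borough` in the result, so no separate `drop_duplicates` and `join` are needed in this variant.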


In [27]:
df = df.drop('Neighbourhood', axis=1).drop_duplicates().set_index('Postalcode')
df = df.join(df1)
df

Postalcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Malvern, Rouge"
M3B,North York,Don Mills
M4B,East York,"Parkview Hill, Woodbine Gardens"
M5B,Downtown Toronto,"Garden District, Ryerson"


In [31]:
df = df.reset_index()
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Finally printing the shape of the cleaned dataframe

In [32]:
df.shape

(103, 3)

Loading the coordinates from the CSV file provided by Coursera

In [43]:
coord = pd.read_csv('Geospatial_Coordinates.csv')
coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merging the two dataframes to get the coordinates

In [44]:
data = pd.merge(df,coord, left_on='Postalcode', right_on='Postal Code')
data.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M7A,43.662301,-79.389494
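
An inner merge silently drops postal codes that have no coordinates. A small sketch of how unmatched codes could be detected with the `indicator` option of `pd.merge` follows; the frames `left` and `right` and the code `M9Z` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left merge keeps every row of `left`; `_merge` records where each row was found.
checked = pd.merge(left, right, how='left',
                   left_on='Postalcode', right_on='Postal Code',
                   indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner merge used in the notebook loses no rows.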


Final data of Toronto Neighborhoods with their coordinates

In [45]:
data = data.drop('Postal Code', axis=1)
data

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Importing the required libraries

In [46]:
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import requests
# json_normalize is exposed at the top level of pandas from version 1.0 onwards
from pandas import json_normalize

Using the geopy library to obtain the latitude and longitude of Toronto

In [47]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Selecting the boroughs whose names contain the word Toronto

In [48]:
dataset = data[data['Borough'].str.contains('Toronto')].reset_index(drop = True)

In [49]:
dataset.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Creating a map of Toronto with its neighborhoods

In [52]:
map_of_toronto = folium.Map(location = [latitude, longitude], zoom_start = 11)

# lat/lng are used as loop variables so that latitude/longitude keep the city-centre coordinates
for borough, neighborhood, lat, lng in zip(dataset['Borough'], 
                                           dataset['Neighbourhood'], 
                                           dataset['Latitude'], 
                                           dataset['Longitude']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'green',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_of_toronto)  
    
map_of_toronto

Using the Foursquare API to explore the venues in the neighborhoods

In [54]:
CLIENT_ID = '<Your client ID>'
CLIENT_SECRET = '<Your client secret>' 
VERSION = '20200522'
LIMIT = 100
radius = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
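
Printing credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables and prints only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions for illustration, not part of the original notebook or of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the credentials from a mapping (the process environment by default)."""
    # Hypothetical variable names chosen for this sketch.
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret

def mask(secret, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    return '*' * max(len(secret) - visible, 0) + secret[-visible:]

# A stand-in mapping so the sketch runs without real credentials.
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'abcdefgh12',
    'FOURSQUARE_CLIENT_SECRET': 'zyxwvuts98',
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

Note that `mask` shows a secret in full when it is no longer than `visible` characters, so it is only meant for reasonably long keys.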

Getting venues in the neighborhoods within a radius of 500 meters

In [55]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
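
The function above indexes straight into the JSON response, so a neighborhood with no results or a venue without a category would raise an error. The sketch below shows one defensive way to extract the same fields. `extract_venue_rows` is a hypothetical helper, and the payload shape is assumed from the indexing used in `getNearbyVenues`, not taken from the Foursquare documentation.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response dict into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# A hand-written payload that mimics the assumed response shape.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Regent Park, Harbourfront', 43.65426, -79.360636, fake_payload)
print(rows)
```

Separating parsing from the HTTP request also makes the parsing logic testable without network access or API credentials.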

In [56]:
venues = getNearbyVenues(dataset['Neighbourhood'],
                                 dataset['Latitude'],
                                 dataset['Longitude'], radius)

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town,

In [57]:
venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


Checking the venue count for each neighborhood

In [58]:
venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
Business reply mail Processing Centre,19,19,19,19,19,19
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,66,66,66,66,66,66
Christie,17,17,17,17,17,17
Church and Wellesley,76,76,76,76,76,76
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,34,34,34,34,34,34
Davisville North,9,9,9,9,9,9


Unique Categories

In [60]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 242 unique categories.


Analyzing each neighborhood

In [61]:
toronto_onehot = pd.get_dummies(venues[['Venue Category']], prefix = "", prefix_sep = "")
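# a venue category that is itself named 'Neighborhood' would clash with the label column, so drop it if present
toronto_onehot = pd.concat([venues['Neighborhood'], toronto_onehot.drop(columns = ['Neighborhood'], errors = 'ignore')], axis = 1)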
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
toronto_onehot.shape

(1623, 242)

Grouping by neighborhood and taking the mean frequency of occurrence of each category

In [63]:
toronto_grouped = toronto_onehot.groupby(['Neighborhood']).mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015152,0.0,0.015152,0.0,0.0,0.0,0.0,0.015152
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.026316
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [64]:
toronto_grouped.shape

(39, 242)
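
To make the meaning of these frequencies concrete, here is a small sketch on made-up data. Each venue becomes a one-hot row, and the per-neighborhood mean of a column is the share of that neighborhood's venues in that category. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bakery', 'Park', 'Park'],
})

# One column per category, one row per venue; dtype=float gives 0.0/1.0 values.
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# The mean of a 0/1 column within a group is the fraction of venues in that category.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable before clustering.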

Exploring the top 10 venues for each neighborhood

In [65]:
num_top_venues = 10

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop = True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.07
1        Cocktail Bar  0.05
2            Beer Bar  0.04
3         Cheese Shop  0.04
4              Bakery  0.04
5                Café  0.04
6          Restaurant  0.04
7  Seafood Restaurant  0.04
8       Shopping Mall  0.02
9              Museum  0.02


----Brockton, Parkdale Village, Exhibition Place----
                venue  freq
0                Café  0.14
1      Breakfast Spot  0.09
2         Coffee Shop  0.09
3       Burrito Place  0.05
4  Italian Restaurant  0.05
5           Nightclub  0.05
6                 Bar  0.05
7        Climbing Gym  0.05
8              Bakery  0.05
9             Stadium  0.05


----Business reply mail Processing Centre----
                  venue  freq
0    Light Rail Station  0.11
1           Yoga Studio  0.05
2      Recording Studio  0.05
3            Smoke Shop  0.05
4         Burrito Place  0.05
5                   Spa  0.05
6               Butcher  0.05
7  Fast Food Restaura

9         Concert Hall  0.02


----University of Toronto, Harbord----
                 venue  freq
0                 Café  0.14
1           Restaurant  0.06
2                  Bar  0.06
3  Japanese Restaurant  0.06
4   Italian Restaurant  0.06
5               Bakery  0.06
6            Bookstore  0.06
7       Sandwich Place  0.03
8           Beer Store  0.03
9            Nightclub  0.03




Creating a dataframe to show the top 10 venues

In [66]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [68]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Restaurant,Beer Bar,Seafood Restaurant,Greek Restaurant,Pub
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Furniture / Home Store,Burrito Place,Restaurant,Italian Restaurant,Stadium,Intersection,Bar
2,Business reply mail Processing Centre,Light Rail Station,Yoga Studio,Auto Workshop,Park,Comic Shop,Pizza Place,Recording Studio,Restaurant,Butcher,Burrito Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Terminal,Airport Lounge,Airport Service,Harbor / Marina,Coffee Shop,Plane,Bar,Boutique,Boat or Ferry,Airport Gate
4,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bar,Middle Eastern Restaurant,Japanese Restaurant,Salad Place,Bubble Tea Shop,Thai Restaurant
5,Christie,Grocery Store,Café,Park,Restaurant,Candy Store,Athletics & Sports,Diner,Italian Restaurant,Nightclub,Coffee Shop
6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Bubble Tea Shop,Pub,Café,Hotel
7,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,American Restaurant,Gym,Italian Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant
8,Davisville,Dessert Shop,Sandwich Place,Coffee Shop,Pizza Place,Gym,Café,Sushi Restaurant,Italian Restaurant,Park,Japanese Restaurant
9,Davisville North,Gym,Hotel,Breakfast Spot,Food & Drink Shop,Dog Run,Sandwich Place,Department Store,Park,Convenience Store,Convention Center
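
The column names above rely on a three-item suffix list with a fallback to 'th', which works for 10 columns but would label a 21st column as '21th'. A hedged sketch of a general ordinal helper follows; `ordinal` is a hypothetical function, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

With this helper the column list no longer needs a `try`/`except` around the suffix lookup.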


## Cluster Analysis

Run K-Means to cluster the Toronto areas into 5 clusters

In [70]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis = 1)
kMeans = KMeans(n_clusters = 5, random_state = 0).fit(toronto_grouped_clustering)
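
The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice is to compare silhouette scores over a range of k. The sketch below does this on synthetic points rather than on the Toronto data, so the data, the range of k, and the explicit `n_init=10` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette score:', best_k)
```

On the real data the same loop could be run over `toronto_grouped_clustering`. A clustering where most points land in one cluster usually shows up as a weak score.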

Adding the cluster labels

In [71]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kMeans.labels_)

# Combine the data
toronto_merged = dataset.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on = 'Neighbourhood')
toronto_merged

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Yoga Studio,Mexican Restaurant,Shoe Store
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burger Joint,Burrito Place,Café
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Italian Restaurant,Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Japanese Restaurant,Café,Cosmetics Shop,Diner
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Cosmetics Shop,Restaurant,Italian Restaurant,Clothing Store,Gym
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Health Food Store,Pub,Yoga Studio,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Restaurant,Beer Bar,Seafood Restaurant,Greek Restaurant,Pub
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bar,Middle Eastern Restaurant,Japanese Restaurant,Salad Place,Bubble Tea Shop,Thai Restaurant
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park,Restaurant,Candy Store,Athletics & Sports,Diner,Italian Restaurant,Nightclub,Coffee Shop
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,0,Coffee Shop,Café,Restaurant,Thai Restaurant,Gym,Clothing Store,Deli / Bodega,Hotel,Concert Hall,Steakhouse
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,0,Bakery,Pharmacy,Grocery Store,Art Gallery,Café,Bar,Bank,Supermarket,Middle Eastern Restaurant,Furniture / Home Store


Visualizing the resulting clusters

In [75]:
# create map
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for neighborhood, cluster, lat, lng in zip(toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels'], toronto_merged['Latitude'], toronto_merged['Longitude']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = rainbow[cluster - 1],
        fill = True,
        fill_color = rainbow[cluster - 1],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters

Cluster-1

In [76]:
# For Cluster 0
result = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print("For cluster {}, the distribution of venues is as:\n{}".format(0, result['1st Most Common Venue'].value_counts()))
result

For cluster 0, the distribution of venues is as:
Coffee Shop           12
Café                   7
Sandwich Place         2
Clothing Store         2
Trail                  1
Pub                    1
Breakfast Spot         1
Grocery Store          1
Airport Terminal       1
Light Rail Station     1
Greek Restaurant       1
Gym                    1
Dessert Shop           1
Bakery                 1
Bar                    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Yoga Studio,Mexican Restaurant,Shoe Store
1,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burger Joint,Burrito Place,Café
2,Downtown Toronto,0,Clothing Store,Coffee Shop,Italian Restaurant,Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Japanese Restaurant,Café,Cosmetics Shop,Diner
3,Downtown Toronto,0,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Cosmetics Shop,Restaurant,Italian Restaurant,Clothing Store,Gym
4,East Toronto,0,Trail,Health Food Store,Pub,Yoga Studio,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
5,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Restaurant,Beer Bar,Seafood Restaurant,Greek Restaurant,Pub
6,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bar,Middle Eastern Restaurant,Japanese Restaurant,Salad Place,Bubble Tea Shop,Thai Restaurant
7,Downtown Toronto,0,Grocery Store,Café,Park,Restaurant,Candy Store,Athletics & Sports,Diner,Italian Restaurant,Nightclub,Coffee Shop
8,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Thai Restaurant,Gym,Clothing Store,Deli / Bodega,Hotel,Concert Hall,Steakhouse
9,West Toronto,0,Bakery,Pharmacy,Grocery Store,Art Gallery,Café,Bar,Bank,Supermarket,Middle Eastern Restaurant,Furniture / Home Store


Cluster-2

In [77]:
# For Cluster 1
result = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print("For cluster {}, the distribution of venues is as:\n{}".format(1, result['1st Most Common Venue'].value_counts()))
result

For cluster 1, the distribution of venues is as:
Playground    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,1,Playground,Trail,Yoga Studio,Department Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop


Cluster-3

In [78]:
# For Cluster 2
result = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print("For cluster {}, the distribution of venues is as:\n{}".format(2, result['1st Most Common Venue'].value_counts()))
result

For cluster 2, the distribution of venues is as:
Garden    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,2,Garden,Health & Beauty Service,Home Service,Department Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop


Cluster-4

In [79]:
# For Cluster 3
result = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print("For cluster {}, the distribution of venues is as:\n{}".format(3, result['1st Most Common Venue'].value_counts()))
result

For cluster 3, the distribution of venues is as:
Park    2
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,3,Park,Swim School,Bus Line,Yoga Studio,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
21,Central Toronto,3,Park,Jewelry Store,Trail,Sushi Restaurant,Bus Line,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


Cluster-5

In [80]:
# For Cluster 4
result = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
print("For cluster {}, the distribution of venues is as:\n{}".format(4, result['1st Most Common Venue'].value_counts()))
result

For cluster 4, the distribution of venues is as:
Park    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
33,Downtown Toronto,4,Park,Playground,Trail,Yoga Studio,Doner Restaurant,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run
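
The five cells above repeat the same selection with a different cluster label each time. A compact alternative, sketched here on a made-up frame, loops over the groups once. `toy_merged` and its values are hypothetical stand-ins for `toronto_merged`.

```python
import pandas as pd

toy_merged = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 3, 3],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Café',
                              'Playground', 'Park', 'Park'],
})

# For each cluster, count how often each category is the most common venue.
summary = {}
for label, group in toy_merged.groupby('Cluster Labels'):
    counts = group['1st Most Common Venue'].value_counts().to_dict()
    summary[int(label)] = counts
    print('Cluster {} ({} neighborhoods): {}'.format(label, len(group), counts))
```

The resulting dictionary gives both the size of each cluster and its dominant venue types in one place.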


### Observation:
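From the cluster analysis above, most of the neighborhoods fall into cluster-1 (label 0). Their most common venues are coffee shops, cafés, and clothing stores, which suggests mostly business areas. The remaining clusters are small: cluster-2 is a single neighborhood whose most common venue is a playground, cluster-3 is a single neighborhood whose most common venue is a garden, and cluster-4 and cluster-5 together hold three neighborhoods whose most common venue is a park.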