# Segmenting and Clustering Neighborhoods in Toronto

This project scrapes the Wikipedia page listing the Canadian postal codes that begin with "M" (the Toronto area), preprocesses the resulting dataset, and then explores and clusters the neighborhoods of Toronto. K-means is the primary clustering tool, and the resulting clusters are visualized with the Folium library.

## Primary goals of the project:

- Create the dataframe
- Explore and cluster the neighborhoods in Toronto

## Dataframe Creation

In [13]:
import json  # library to handle JSON files
import random  # library for random number generation

import numpy as np  # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests  # library to handle HTTP requests
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

# Matplotlib color utilities used for the cluster map
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans  # k-means clustering

from IPython.display import Image, HTML, display_html

# install the third-party packages; comment these lines out if they are already installed
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
!pip install beautifulsoup4
!pip install lxml

from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
import folium  # map rendering library
from bs4 import BeautifulSoup  # HTML parsing

print('Folium installed')
print('Libraries imported.')
 

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Folium installed
Libraries imported.


In [18]:
# scrape the given Wikipedia page
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
print(soup.title)

# the dataframe consists of three columns: PostalCode, Borough, and Neighborhood
from IPython.display import display_html
post_cd = str(soup.table)

<title>List of postal codes of Canada: M - Wikipedia</title>


In [24]:
# define the dataframe columns
#column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
#neighborhoods = pd.DataFrame(columns=column_names)

toronto_borough_df = pd.read_html(post_cd)[0]
# keep only the rows that have an assigned borough
toronto_borough_df = toronto_borough_df[toronto_borough_df['Borough'] != 'Not assigned']
toronto_borough_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


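Four of the boroughs in this table have "Toronto" in their name (Downtown, East, West, and Central Toronto); the clustering later in the notebook is restricted to them.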

In [31]:
# combine neighborhoods into one row
# toronto_borough_df=toronto_borough_df.drop_duplicates()
tdf = toronto_borough_df.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
tdf.reset_index(inplace=True)

# if a cell has a borough but a Not assigned  neighborhood, then the neighborhood is the same as the borough.
tdf['Neighbourhood'] = np.where(tdf['Neighbourhood'] == 'Not assigned',tdf['Borough'], tdf['Neighbourhood'])
tdf

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


The resulting number of the postal codes in the city of Toronto is 103.

In [32]:
tdf.shape

(103, 3)

The resulting dataframe has 103 rows (one per postal code) and 3 columns.
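The cleaning steps used to build this dataframe (drop unassigned boroughs, combine neighbourhoods per postal code, fall back to the borough name) can be sketched on a toy table. The rows below are illustrative only and are not taken from the scraped page; the names `raw`, `assigned`, and `grouped` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows in the shape of the scraped table (illustrative values only).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park",
                      "Harbourfront", "Not assigned"],
})

# 1. Drop the rows without an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"]

# 2. Keep one row per postal code, joining neighbourhood names with ", ".
#    sort=False keeps the postal codes in order of first appearance.
grouped = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Where the neighbourhood is still "Not assigned", use the borough name.
grouped["Neighbourhood"] = np.where(
    grouped["Neighbourhood"] == "Not assigned",
    grouped["Borough"],
    grouped["Neighbourhood"],
)

print(grouped)
```

Selecting the `Neighbourhood` column before aggregating keeps the join from being applied to any other text column that the scraped table might contain.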

The next step is to import the geographical coordinates for each postal code from the provided CSV file and add them to the dataframe:

In [35]:
coord = pd.read_csv('https://cocl.us/Geospatial_data')
coord
# the total number of the coordinates matches the total number of the postal codes in the dataframe

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [39]:
tdf2 = pd.merge(tdf,coord,on='Postal Code')
tdf2.head(15)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


This concludes the dataframe creation task of this project.
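As a hedged sketch, the merge on `Postal Code` can also be made to check itself. The example below assumes that each postal code should appear exactly once in both tables; the frames `codes` and `coords` are small stand-ins that reuse three rows from the output above.

```python
import pandas as pd

# Three rows copied from the merged output above, split into two stand-in tables.
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M5A", "M3A"],
    "Latitude": [43.725882, 43.654260, 43.753259],
    "Longitude": [-79.315572, -79.360636, -79.329656],
})

# validate="one_to_one" raises if a postal code is duplicated on either side;
# indicator=True adds a "_merge" column that flags rows without coordinates.
merged = pd.merge(codes, coords, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
merged = merged.drop(columns="_merge")

print(len(unmatched), "postal codes without coordinates")
print(merged)
```

A left join keeps every postal code from the first table, so a missing coordinate shows up as an unmatched row instead of silently disappearing, as it would with the default inner join.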

## Exploration and Clustering of the Neighborhoods in Toronto

In [45]:
# limit to only boroughs that contain the word Toronto
tdf3 = tdf2[tdf2['Borough'].str.contains('Toronto',regex=False)]
tdf3

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [46]:
tdf3.shape

(39, 5)

We are now down to 39 postal codes with their corresponding boroughs, neighborhoods, and coordinates.

In [48]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(tdf3['Borough'].unique()),
        tdf3.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.


In [70]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the city of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the city of Toronto are 43.6534817, -79.3839347.


#### Creating a map of Toronto with neighborhoods superimposed on top.

In [54]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(tdf3['Latitude'], tdf3['Longitude'], tdf3['Borough'], tdf3['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, I use the Foursquare API to explore the neighborhoods and segment them.

In [150]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # maximum number of venues requested per call

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [74]:
tdf3.loc[:, 'Neighbourhood']

2                              Regent Park, Harbourfront
4            Queen's Park, Ontario Provincial Government
9                               Garden District, Ryerson
15                                        St. James Town
19                                           The Beaches
20                                           Berczy Park
24                                    Central Bay Street
25                                              Christie
30                              Richmond, Adelaide, King
31                          Dufferin, Dovercourt Village
36     Harbourfront East, Union Station, Toronto Islands
37                              Little Portugal, Trinity
41                          The Danforth West, Riverdale
42              Toronto Dominion Centre, Design Exchange
43          Brockton, Parkdale Village, Exhibition Place
47                        India Bazaar, The Beaches West
48                        Commerce Court, Victoria Hotel
54                             

In [76]:
tdf3.loc[24, 'Neighbourhood']

'Central Bay Street'

The first row covers two neighbourhoods, Regent Park and Harbourfront. However, I would like to explore Central Bay Street (row label 24).

In [77]:
neighborhood_latitude = tdf3.loc[24, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = tdf3.loc[24, 'Longitude'] # neighborhood longitude value

neighborhood_name = tdf3.loc[24, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Central Bay Street are 43.6579524, -79.3873826.


Next, I request up to 100 venues located within a 500-meter radius of Central Bay Street.

In [148]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
#url # display URL


In [87]:
results = requests.get(url).json()
#results
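The request URL above is assembled with string formatting. A minimal alternative sketch, using only the standard library, builds the query string from a dictionary so that each value is URL-encoded and the stray `&` after `?` disappears. The helper name `build_explore_url` and the demo credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the venues/explore URL with URL-encoded query parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude as a single parameter
        "radius": radius,
        "limit": limit,
    }
    return f"{base}?{urlencode(params)}"


# Demo values only; no request is sent here.
example_url = build_explore_url("DEMO_ID", "DEMO_SECRET", "20180605",
                                43.6579524, -79.3873826)

# Decode the query string again to check what the server would receive.
query = parse_qs(urlsplit(example_url).query)
print(query["ll"], query["radius"], query["limit"])
```

With `requests`, the same dictionary can be passed as `requests.get(base, params=params)`, which performs the encoding itself.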

In [83]:
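# function that extracts the category name of a venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories',
    # depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']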

In [84]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Jimmy's Coffee,Coffee Shop,43.658421,-79.385613
1,Somethin' 2 Talk About,Middle Eastern Restaurant,43.658395,-79.385338
2,Hailed Coffee,Coffee Shop,43.658833,-79.383684
3,NEO COFFEE BAR,Coffee Shop,43.66013,-79.38583
4,Tim Hortons,Coffee Shop,43.65857,-79.385123


Number of venues returned by Foursquare:

In [85]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

61 venues were returned by Foursquare.


## Explore Neighborhoods in Toronto

In [98]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
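The function above indexes `['response']['groups'][0]['items']` and `['categories'][0]` directly, so it would raise an error if a response had no groups or a venue had no category. A defensive parsing step might look like the sketch below. The helper `extract_venue_rows` is hypothetical, the payload only mimics the nesting used in this notebook, and the second venue is invented to show the empty-category case.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Return one tuple per venue found in a venues/explore-style payload.

    Missing groups or items give an empty list, and a venue without
    categories gets None as its category instead of raising IndexError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Stand-in payload: the first venue is from the output above,
# the second one is hypothetical and has no category.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Jimmy's Coffee",
                   "location": {"lat": 43.658421, "lng": -79.385613},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Venue without category (hypothetical)",
                   "location": {"lat": 43.6580, "lng": -79.3850},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows("Central Bay Street", 43.6579524, -79.3873826,
                          sample_payload)
empty_rows = extract_venue_rows("Nowhere", 0.0, 0.0, {"response": {}})
print(rows)
print(empty_rows)
```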

In [99]:
t_venues = getNearbyVenues(names=tdf3['Neighbourhood'],
                                   latitudes=tdf3['Latitude'],
                                   longitudes=tdf3['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [92]:
print(t_venues.shape)
t_venues.head(15)

(1585, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
5,"Regent Park, Harbourfront",43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub
6,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
7,"Regent Park, Harbourfront",43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
8,"Regent Park, Harbourfront",43.65426,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
9,"Regent Park, Harbourfront",43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site


In [100]:
t_venues.groupby('Neighbourhood').count()
# how many venues were returned for each neighborhood

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,54,54,54,54,54,54
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,61,61,61,61,61,61
Christie,15,15,15,15,15,15
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,34,34,34,34,34,34
Davisville North,8,8,8,8,8,8


In [101]:
# how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(t_venues['Venue Category'].unique())))

There are 229 unique categories.


## Analyze Each Neighborhood

In [105]:
# one hot encoding
t_onehot = pd.get_dummies(t_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
t_onehot['Neighbourhood'] = t_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [t_onehot.columns[-1]] + list(t_onehot.columns[:-1])
t_onehot = t_onehot[fixed_columns]

#t_onehot.head(20)

In [106]:
t_onehot.shape

(1585, 230)

In [109]:
# group rows by neighborhood, taking the mean of the frequency of occurrence of each category
t_grouped = t_onehot.groupby('Neighbourhood').mean().reset_index()
#t_grouped

In [110]:
t_grouped.shape

(39, 230)
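The shapes above follow from the encoding: 229 category columns plus the neighbourhood column give 230 columns, and grouping reduces the 1585 venue rows to one row per neighbourhood. A toy sketch of the same two steps, with invented neighbourhoods `A` and `B`, shows that each grouped row holds the share of venues in each category:

```python
import pandas as pd

# Invented venues for two hypothetical neighbourhoods.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bakery", "Park",
                       "Park", "Gym"],
})

# One indicator column per category (1.0 where the venue has that category).
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of the indicators is the fraction of venues in each category.
freq = onehot.groupby("Neighbourhood").mean().reset_index()
print(freq)
```

In neighbourhood `A`, two of the four venues are coffee shops, so its `Coffee Shop` value is 0.5, and each row of category frequencies sums to 1.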

In [111]:
# print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in t_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = t_grouped[t_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0     Coffee Shop  0.09
1    Cocktail Bar  0.06
2     Cheese Shop  0.04
3  Farmers Market  0.04
4          Bakery  0.04


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.13
1     Coffee Shop  0.09
2       Nightclub  0.09
3  Breakfast Spot  0.09
4             Gym  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venue  freq
0  Light Rail Station  0.12
1                 Spa  0.06
2    Recording Studio  0.06
3      Farmers Market  0.06
4         Pizza Place  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0  Airport Service  0.20
1   Airport Lounge  0.13
2         Boutique  0.07
3          Airport  0.07
4              Bar  0.07


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.18
1      Sandwich Place 

In [112]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
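Several categories often share the same frequency, and the order of tied categories after sorting is not guaranteed. This is one possible reason why the top-5 printout and the top-10 table in this notebook list tied categories in different orders. A sketch of a variant with an explicit tie-break (alphabetical by category name) is shown below; `top_categories` is a hypothetical helper, and the frequencies are the rounded Berczy Park values printed above.

```python
import pandas as pd


def top_categories(row, n):
    """Return the n most frequent categories, breaking ties alphabetically."""
    scores = row.iloc[1:].astype(float)  # skip the neighbourhood name
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ranked[:n]]


# Rounded frequencies taken from the Berczy Park printout above.
berczy = pd.Series({
    "Neighbourhood": "Berczy Park",
    "Coffee Shop": 0.09,
    "Cocktail Bar": 0.06,
    "Cheese Shop": 0.04,
    "Farmers Market": 0.04,
    "Bakery": 0.04,
})

print(top_categories(berczy, 5))
```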

In [147]:
# create the new dataframe and display the top 10 venues for each neighborhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
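for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd use the explicit suffixes
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later position (up to the 10 used here) takes 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))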

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = t_grouped['Neighbourhood']

for ind in np.arange(t_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(t_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Beer Bar,Bakery,Farmers Market,Cheese Shop,Seafood Restaurant,Restaurant,Park,Jazz Club
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Nightclub,Coffee Shop,Climbing Gym,Burrito Place,Stadium,Restaurant,Italian Restaurant,Intersection
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Auto Workshop,Pizza Place,Comic Shop,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park,Spa
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Rental Car Location,Airport,Airport Food Court,Airport Gate,Bar,Sculpture Garden,Harbor / Marina,Boutique
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Salad Place,Bubble Tea Shop,Burger Joint,Yoga Studio,Business Service,Bike Rental / Bike Share
5,Christie,Grocery Store,Café,Park,Nightclub,Italian Restaurant,Candy Store,Restaurant,Baby Store,Coffee Shop,Cosmetics Shop
6,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Gay Bar,Pub,Men's Store,Mediterranean Restaurant,Hotel,Dance Studio
7,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,American Restaurant,Italian Restaurant,Gym,Cocktail Bar,Seafood Restaurant,Japanese Restaurant
8,Davisville,Pizza Place,Sandwich Place,Dessert Shop,Gym,Coffee Shop,Italian Restaurant,Café,Sushi Restaurant,Thai Restaurant,Seafood Restaurant
9,Davisville North,Gym,Breakfast Spot,Hotel,Food & Drink Shop,Department Store,Park,Sandwich Place,Pizza Place,Antique Shop,Dessert Shop
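The column labels in this table rely on a three-item suffix list with a fallback to "th", which is correct for the ten columns used here but would produce "21th" for larger values. A small stand-alone sketch of a general ordinal helper follows; the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```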


Interestingly, dining locations, especially restaurants and coffee shops, are the most common venues in most of the neighbourhoods listed above.

## Clustering Neighborhoods

In [133]:
# run k-means to cluster the neighborhoods into 5 clusters
# set number of clusters
kclusters = 5

# note: the features used here are the latitude and longitude of each postal code,
# not the venue frequencies in t_grouped
#t_grouped_clustering = t_grouped.drop(columns='Neighbourhood')
t_grouped_clustering = tdf3.drop(columns=['Postal Code', 'Borough', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(t_grouped_clustering)

# add the cluster label of each row as the first column of the dataframe
tdf3.insert(0, 'Cluster Labels', kmeans.labels_)
tdf3

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,3,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


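The *k-means* algorithm has assigned each postal code to one of the five requested clusters, labeled from 0 to 4. Because the feature matrix contains only latitude and longitude, the clusters are purely geographic.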
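The commented-out line in the cell above points to an alternative: clustering on the venue-frequency table `t_grouped` instead of on coordinates. The sketch below illustrates that variant on an invented four-row frequency table. It is not the configuration used in this notebook, and `toy_grouped`, `features`, and `model` are hypothetical names.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented frequency table: A and B are coffee-heavy, C and D are park-heavy.
toy_grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Coffee Shop": [0.50, 0.45, 0.00, 0.05],
    "Park": [0.00, 0.05, 0.50, 0.45],
    "Bakery": [0.50, 0.50, 0.50, 0.50],
})

# Cluster on the category frequencies only.
features = toy_grouped.drop(columns="Neighbourhood")
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Store the labels next to the neighbourhood names.
toy_grouped.insert(0, "Cluster Labels", model.labels_)
print(toy_grouped[["Cluster Labels", "Neighbourhood"]])
```

With venue frequencies as features, neighbourhoods end up together because they have similar venue profiles, whether or not they are close to each other on the map.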

## Visualizing the Clusters

In [138]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(tdf3['Latitude'], tdf3['Longitude'], tdf3['Neighbourhood'], tdf3['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examining Clusters

Now, let's examine the neighborhoods assigned to each cluster.


In [141]:
# cluster 1
tdf3.loc[tdf3['Cluster Labels'] == 0, tdf3.columns[[2] + list(range(3, tdf3.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,Downtown Toronto,St. James Town,43.651494,-79.375418
20,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,Downtown Toronto,Central Bay Street,43.657952,-79.387383
30,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
36,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
42,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576
48,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817


In [142]:
# cluster 2
tdf3.loc[tdf3['Cluster Labels'] == 1, tdf3.columns[[2] + list(range(3, tdf3.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
31,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
69,West Toronto,"High Park, The Junction South",43.661608,-79.464763
75,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325
81,West Toronto,"Runnymede, Swansea",43.651571,-79.48445


In [143]:
# cluster 3
tdf3.loc[tdf3['Cluster Labels'] == 2, tdf3.columns[[2] + list(range(3, tdf3.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
61,Central Toronto,Lawrence Park,43.72802,-79.38879
62,Central Toronto,Roselawn,43.711695,-79.416936
67,Central Toronto,Davisville North,43.712751,-79.390197
68,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307
73,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
79,Central Toronto,Davisville,43.704324,-79.38879
83,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
86,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [144]:
# cluster 4
tdf3.loc[tdf3['Cluster Labels'] == 3, tdf3.columns[[2] + list(range(3, tdf3.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
25,Downtown Toronto,Christie,43.669542,-79.422564
37,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
43,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191
74,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
80,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049
84,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049


In [145]:
# cluster 5
tdf3.loc[tdf3['Cluster Labels'] == 4, tdf3.columns[[2] + list(range(3, tdf3.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
19,East Toronto,The Beaches,43.676357,-79.293031
41,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
47,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
54,East Toronto,Studio District,43.659526,-79.340923
100,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558


The clustering algorithm identified clusters that closely resemble the boroughs of the city of Toronto, which is expected because the features are the latitude and longitude of each postal code. The main exception is the fourth cluster: its six entries, each containing one to three neighborhoods, come from Downtown, West, and Central Toronto.

The majority of the most common venues in the first cluster (Downtown Toronto) are coffee shops. The second and smallest cluster (West Toronto) is characterized by several different venues, such as pharmacies, bakeries, and Thai, Mexican, and sushi restaurants. The third cluster (Central Toronto) includes a variety of venues, from parks, gardens, and gyms to breakfast spots, restaurants, and coffee shops. The fourth, mixed cluster is characterized by grocery stores, cafes, bars, and restaurants. The final cluster (East Toronto) includes an auto workshop, parks and trails, and a variety of dining locations, from breweries, gastropubs, and pizza places to Greek and Italian restaurants.
