# Segmenting and Clustering Neighbourhoods in Toronto

## Table of Contents

* [Download and Explore Dataset](#chapter1)
* [Creating a Pandas DataFrame](#chapter2)
* [Cleaning data values](#chapter3)
* [Getting the coordinates of each borough](#chapter4)
* [Create a map of Toronto](#chapter5)
* [Explore the neighborhoods and segment them](#chapter6)
* [Explore venues in Toronto](#chapter7)
* [Analyze venues in each borough in Toronto](#chapter8)
* [Cluster Boroughs](#chapter9)
* [Examine Clusters](#chapter10)

## Download and Explore Dataset <a class="anchor" id="chapter1"></a>

In [1]:
# import the dependencies for web scraping Wikipedia
import requests
import lxml.html as lh
import pandas as pd

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

The first 12 rows each have exactly 3 cells, which suggests that these rows come from the postal code table. Rows further down the page are not covered by this check.
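
Checking only the first 12 rows can miss stray rows further down the page. The following is a minimal sketch, using Python's standard-library `html.parser` on a made-up HTML snippet, that tallies the number of cells in every `<tr>`. The `CellCounter` class and the sample markup are illustrative assumptions, not part of the notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Hypothetical helper: counts <td>/<th> cells in each <tr>."""

    def __init__(self):
        super().__init__()
        self.row_widths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)   # start a new row
        elif tag in ("td", "th") and self.row_widths:
            self.row_widths[-1] += 1    # one more cell in the current row


# Made-up sample: a 3-column table followed by an unrelated 2-cell row
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table>
  <tr><td>footer</td><td>links</td></tr>
</table>
"""

counter = CellCounter()
counter.feed(sample_html)
width_counts = Counter(counter.row_widths)
print(width_counts)
```

A tally with more than one distinct width signals that some `<tr>` elements belong to another table.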

### Parse the first row as our header

In [4]:
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d: %s'%(i,name))
    col.append((name,[]))

1: Postal Code

2: Borough

3: Neighbourhood



## Creating a Pandas DataFrame <a class="anchor" id="chapter2"></a>

#### Each header was stored in a tuple along with an empty list; the loop below fills those lists row by row.

In [5]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Append the cell text to the list of column i
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [6]:
[len(C) for (title,C) in col]

[181, 181, 181]

In [7]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
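
As a small illustration of this step, the sketch below builds a DataFrame from a list of `(title, values)` pairs. The sample `col` list is a made-up stand-in for the one filled by the parsing loop, and `df_demo` is a hypothetical name.

```python
import pandas as pd

# Made-up stand-in for the `col` list built while parsing the table
col = [
    ("Postal Code", ["M3A", "M4A"]),
    ("Borough", ["North York", "North York"]),
    ("Neighbourhood", ["Parkwoods", "Victoria Village"]),
]

# dict() accepts an iterable of (key, value) pairs directly
df_demo = pd.DataFrame(dict(col))
print(df_demo)
```

Each tuple becomes one column, so all value lists must have the same length.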

In [8]:
df.head()

Unnamed: 0,Postal Code\n,Borough\n,Neighbourhood\n
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [9]:
df.tail()

Unnamed: 0,Postal Code\n,Borough\n,Neighbourhood\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z\n,Not assigned\n,Not assigned\n
180,\n,Canadian postal codes\n,\n


## Cleaning data values <a class="anchor" id="chapter3"></a>

In [10]:
# strip the trailing "\n" from the values
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip('\n'))

In [11]:
#drop the last row as it is not part of the postal codes table
df.drop([180], inplace=True)

In [12]:
df.tail()

Unnamed: 0,Postal Code\n,Borough\n,Neighbourhood\n
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,Not assigned


In [13]:
#strip the trailing "\n" from the column names
df.rename(columns=lambda x: x.strip('\n'), inplace=True)

In [14]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
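
The two stripping steps can also be written together. The sketch below assumes every column holds strings and uses made-up sample data; note that `str.strip()` without arguments removes all surrounding whitespace, not only `"\n"`.

```python
import pandas as pd

# Made-up sample with newline residue in names and values
raw = pd.DataFrame({
    "Postal Code\n": ["M3A\n", "M4A\n"],
    "Borough\n": ["North York\n", "North York\n"],
})

# Strip whitespace from the column names, then from every string column
clean = raw.rename(columns=str.strip)
clean = clean.apply(lambda s: s.str.strip())
print(clean)
```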


In [15]:
#Check how many rows have a "Not assigned" Borough
NABor_df = df.loc[df['Borough'] == "Not assigned"] 
  
print(NABor_df)

    Postal Code       Borough Neighbourhood
0           M1A  Not assigned  Not assigned
1           M2A  Not assigned  Not assigned
7           M8A  Not assigned  Not assigned
10          M2B  Not assigned  Not assigned
15          M7B  Not assigned  Not assigned
..          ...           ...           ...
174         M4Z  Not assigned  Not assigned
175         M5Z  Not assigned  Not assigned
176         M6Z  Not assigned  Not assigned
177         M7Z  Not assigned  Not assigned
179         M9Z  Not assigned  Not assigned

[77 rows x 3 columns]


There are 77 postal codes whose borough is "Not assigned".
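
A count like this can be obtained without printing the whole subset. The sketch below uses made-up rows; `demo`, `n_unassigned` and `assigned` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M8A"],
    "Borough": ["Not assigned", "North York", "Not assigned"],
})

# A boolean mask sums to the number of True entries
n_unassigned = (demo["Borough"] == "Not assigned").sum()

# Keep only the assigned rows and renumber the index
assigned = demo[demo["Borough"] != "Not assigned"].reset_index(drop=True)
print(n_unassigned, len(assigned))
```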

In [16]:
#delete all rows from the dataset where Borough has a "Not assigned" value
df.drop(df[df.Borough == "Not assigned"].index, inplace=True)

In [17]:
#Check how many rows have a "Not assigned" Neighbourhood
NANeigh_df = df.loc[df['Neighbourhood'] == "Not assigned"] 
  
print(NANeigh_df)

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


In [18]:
#resetting the index
df.reset_index(drop=True, inplace=True)

In [19]:
df.shape

(103, 3)

## Getting the coordinates of each borough <a class="anchor" id="chapter4"></a>

In [20]:
geodata = pd.read_csv('https://cocl.us/Geospatial_data')

In [21]:
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [22]:
geodata.shape

(103, 3)

#### Merge the dataframes with the coordinates and the Toronto Boroughs and Neighbourhoods into one dataframe. We use the Postal Code as key.

In [23]:
Tor_geo=df.merge(geodata, how='left', on='Postal Code')

In [24]:
Tor_geo.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Postal Code is the key used to merge the table of boroughs/neighbourhoods with their respective coordinates. The segmentation and analysis that follow are carried out at the borough level.
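
A left merge silently leaves `NaN` coordinates when a key has no match. As a hedged sketch on a two-row sample (coordinates copied from the table above), the pandas `validate` and `indicator` options can make such problems visible; `boroughs`, `coords` and `unmatched` are hypothetical names.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

merged = boroughs.merge(
    coords,
    how="left",
    on="Postal Code",
    validate="one_to_one",  # raises MergeError if a postal code repeats
    indicator=True,         # adds a `_merge` column: both / left_only
)
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))
```

An empty `unmatched` frame means every postal code found its coordinates.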

## Create a map of Toronto with boroughs superimposed on top <a class="anchor" id="chapter5"></a>

In [25]:
# Importing the dependencies to create a map
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [26]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [27]:
#initiate by getting the coordinates of Toronto center
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [28]:
# create map of Toronto using latitude and longitude values
map_Tor = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, Borough, Neighbourhood in zip(Tor_geo['Latitude'], Tor_geo['Longitude'], Tor_geo['Borough'], Tor_geo['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Tor)  
    
map_Tor

For illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods of Boroughs that contain the word Toronto. So let's slice the original dataframe and create a new dataframe Toronto_data.

In [29]:
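# keep rows whose Borough contains the whole word "Toronto"; reset_index returns a new frame, which avoids modifying a slice in place
Toronto_data = Tor_geo[Tor_geo["Borough"].str.contains(r'\bToronto\b', regex=True, case=False)].reset_index(drop=True)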
Toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [30]:
# create map of a segment of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, Borough, Neighbourhood in zip(Toronto_data['Latitude'], Toronto_data['Longitude'], Toronto_data['Borough'], Toronto_data['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

## Explore the neighborhoods and segment them <a class="anchor" id="chapter6"></a>

In [31]:
# @hidden_cell
CLIENT_ID = 'XXX' 
CLIENT_SECRET = 'XXX'
VERSION = 'XXX' 
LIMIT = 100

Let's explore the first row in our dataframe.

Get the borough's name.

In [32]:
Toronto_data.loc[0, 'Borough']

'Downtown Toronto'

Get the Borough's latitude and longitude values.

In [33]:
Borough_latitude = Toronto_data.loc[0, 'Latitude'] # Borough latitude value
Borough_longitude = Toronto_data.loc[0, 'Longitude'] # Borough longitude value

Borough_name = Toronto_data.loc[0, 'Borough'] # Borough name

print('Latitude and longitude values of {} are {}, {}.'.format(Borough_name, 
                                                               Borough_latitude, 
                                                               Borough_longitude))

Latitude and longitude values of Downtown Toronto are 43.6542599, -79.3606359.


#### Now, let's get the top 100 venues that are in Downtown Toronto within a radius of 300 meters.

First, let's create the GET request URL. Name your URL **url**.

In [34]:
radius=300
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Borough_latitude, Borough_longitude, VERSION, radius, LIMIT)
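
As an alternative to formatting the query string by hand, the sketch below builds the same kind of URL with the standard library's `urlencode`, which also escapes special characters. The credential values are placeholders, and `params`, `base`, `demo_url` and `decoded` are hypothetical names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder values; real credentials are deliberately not shown
params = {
    "client_id": "XXX",
    "client_secret": "XXX",
    "v": "XXX",
    "ll": "43.6542599,-79.3606359",  # latitude,longitude
    "radius": 300,
    "limit": 100,
}

base = "https://api.foursquare.com/v2/venues/explore"
demo_url = base + "?" + urlencode(params)
print(demo_url)

# Round-trip check: the query string decodes back to the same values
decoded = parse_qs(urlsplit(demo_url).query)
```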

Send the GET request and examine the results.

In [35]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '6043f168be0c1d16d09e683e'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 17,
  'suggestedBounds': {'ne': {'lat': 43.6569599027, 'lng': -79.35691110008916},
   'sw': {'lat': 43.6515598973, 'lng': -79.36436069991085}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',


Let's extract the category of the venue

In [36]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a dataframe.

In [37]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
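
To see what the flattening does, the sketch below runs `pd.json_normalize` on two hand-made items shaped like the `items` list in the response. The second item, "Unnamed Spot", is invented to show a venue without categories; `flat` and the `category` column are hypothetical names.

```python
import pandas as pd

# Made-up items shaped like the `items` list in the response above
items = [
    {"venue": {"name": "Roselle Desserts",
               "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.653447, "lng": -79.362017}}},
    {"venue": {"name": "Unnamed Spot",
               "categories": [],
               "location": {"lat": 43.65, "lng": -79.36}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
flat["category"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
print(flat[["venue.name", "category", "venue.location.lat"]])
```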


And how many venues were returned by Foursquare?

In [38]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

17 venues were returned by Foursquare.


## Explore venues in Toronto <a class="anchor" id="chapter7"></a>

#### Let's create a function to repeat the same process for all the boroughs

In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
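
Indexing `categories[0]` raises an `IndexError` for a venue with an empty category list. The sketch below shows one possible defensive variant of the flattening step, tested on made-up items; `flatten_items` is a hypothetical helper and "Unnamed Spot" is invented for illustration.

```python
def flatten_items(name, lat, lng, items):
    """Hypothetical helper: one tuple per venue, tolerant of missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items; the second one has no category
sample_items = [
    {"venue": {"name": "Tandem Coffee", "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.653559, "lng": -79.361809}}},
    {"venue": {"name": "Unnamed Spot", "categories": [],
               "location": {"lat": 43.65, "lng": -79.36}}},
]
rows = flatten_items("Downtown Toronto", 43.65426, -79.360636, sample_items)
print(rows)
```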

#### Run the above function on each borough that contains the word Toronto and create a new dataframe called _toronto_venues_.

In [40]:
toronto_venues = getNearbyVenues(names=Toronto_data['Borough'],
                                   latitudes=Toronto_data['Latitude'],
                                   longitudes=Toronto_data['Longitude']
                                  )

Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
East Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
West Toronto
Downtown Toronto
West Toronto
East Toronto
Downtown Toronto
West Toronto
East Toronto
Downtown Toronto
East Toronto
Central Toronto
Central Toronto
Central Toronto
Central Toronto
West Toronto
Central Toronto
Central Toronto
West Toronto
Central Toronto
Downtown Toronto
West Toronto
Central Toronto
Downtown Toronto
Central Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
East Toronto


#### Let's check the size of the resulting dataframe

In [41]:
print(toronto_venues.shape)
toronto_venues.head()

(1602, 7)


Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown Toronto,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Downtown Toronto,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Downtown Toronto,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Downtown Toronto,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Downtown Toronto,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Let's check how many venues were returned for each borough

In [42]:
toronto_venues.groupby('Borough').count()

Unnamed: 0_level_0,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Central Toronto,107,107,107,107,107,107
Downtown Toronto,1219,1219,1219,1219,1219,1219
East Toronto,122,122,122,122,122,122
West Toronto,154,154,154,154,154,154


#### Let's find out how many unique categories can be curated from all the returned venues

In [43]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 230 unique categories.
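
Both summaries, venues per borough and the number of distinct categories, have compact pandas equivalents. The sketch below uses three made-up rows; `demo`, `venues_per_borough` and `n_categories` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    "Borough": ["East Toronto", "East Toronto", "West Toronto"],
    "Venue Category": ["Brewery", "Coffee Shop", "Coffee Shop"],
})

venues_per_borough = demo.groupby("Borough").size()   # one count per borough
n_categories = demo["Venue Category"].nunique()       # distinct categories
print(venues_per_borough)
print(n_categories)
```

Unlike `count()`, `size()` returns a single number per group instead of repeating it for every column.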


## Analyze venues in each borough in Toronto <a class="anchor" id="chapter8"></a>

In [44]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Downtown Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Downtown Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Downtown Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Downtown Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Downtown Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's examine the new dataframe size

In [45]:
toronto_onehot.shape

(1602, 231)

#### Group rows by borough and take the mean of the frequency of occurrence of each category

In [46]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.009346,0.0,0.0,0.0,...,0.0,0.009346,0.018692,0.0,0.0,0.0,0.009346,0.0,0.0,0.009346
1,Downtown Toronto,0.00082,0.00082,0.001641,0.001641,0.001641,0.013946,0.001641,0.004102,0.009024,...,0.0,0.0,0.00082,0.002461,0.012305,0.001641,0.003281,0.006563,0.00082,0.005742
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.02459,0.0,0.0,0.0,...,0.008197,0.0,0.016393,0.0,0.0,0.0,0.0,0.0,0.0,0.02459
3,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.006494,0.0,0.006494,...,0.0,0.0,0.0,0.0,0.019481,0.0,0.006494,0.006494,0.0,0.019481


In [47]:
toronto_grouped.shape

(4, 231)
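
To make the meaning of these means concrete, the sketch below repeats the one-hot and group-by steps on four made-up venues. Boroughs "A" and "B" and the names `demo`, `onehot` and `freq` are illustrative assumptions.

```python
import pandas as pd

demo = pd.DataFrame({
    "Borough": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bar"],
})

# One indicator column per category, stored as floats
onehot = pd.get_dummies(demo["Venue Category"], dtype=float)
onehot.insert(0, "Borough", demo["Borough"])

# Mean of a 0/1 column = share of the borough's venues in that category
freq = onehot.groupby("Borough").mean()
print(freq)
```

Each row of `freq` sums to 1, because every venue falls into exactly one category.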

#### Print each borough along with the top 5 most common venues

In [48]:
num_top_venues = 5

for hood in toronto_grouped['Borough']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Toronto----
            venue  freq
0     Coffee Shop  0.08
1  Sandwich Place  0.07
2            Café  0.06
3            Park  0.06
4     Pizza Place  0.05


----Downtown Toronto----
                 venue  freq
0          Coffee Shop  0.10
1                 Café  0.05
2                Hotel  0.03
3           Restaurant  0.03
4  Japanese Restaurant  0.03


----East Toronto----
                venue  freq
0         Coffee Shop  0.07
1    Greek Restaurant  0.07
2             Brewery  0.04
3  Italian Restaurant  0.04
4      Ice Cream Shop  0.03


----West Toronto----
                venue  freq
0                Café  0.07
1                 Bar  0.07
2         Coffee Shop  0.05
3          Restaurant  0.04
4  Italian Restaurant  0.04




#### Put that into a dataframe

In [49]:
import numpy as np # library to handle data in a vectorized manner

In [50]:
#function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [51]:
#create the new dataframe and display the top 10 venues for each borough
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

boroughs_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Café,Park,Pizza Place,Restaurant,Sushi Restaurant,Clothing Store,Dessert Shop,Mexican Restaurant
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
2,East Toronto,Coffee Shop,Greek Restaurant,Italian Restaurant,Brewery,Ice Cream Shop,Yoga Studio,Restaurant,Park,Pizza Place,Pub
3,West Toronto,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
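
The ranking and the column labels can each be expressed more directly. The sketch below is one possible alternative on a made-up frequency row: `Series.nlargest` picks the top categories, and `ordinal` is a hypothetical helper that replaces the `try`/`except` suffix lookup.

```python
import pandas as pd


def ordinal(n):
    """Hypothetical helper: 1 -> '1st', 2 -> '2nd', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Made-up frequency row for one borough
row = pd.Series({"Coffee Shop": 0.10, "Park": 0.02, "Hotel": 0.03, "Bar": 0.01})

top3 = row.nlargest(3).index.tolist()
labels = [f"{ordinal(i + 1)} Most Common Venue" for i in range(3)]
print(dict(zip(labels, top3)))
```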


## Cluster Boroughs <a class="anchor" id="chapter9"></a>

Run _k_-means to group the boroughs into 2 clusters.

In [52]:
# install scikit-learn, which provides k-means
!conda install -c conda-forge scikit-learn --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [53]:
from sklearn.cluster import KMeans

In [54]:
# set number of clusters
kclusters = 2

toronto_grouped_clustering = toronto_grouped.drop(columns='Borough')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 1], dtype=int32)
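
With only four boroughs, k-means here partitions just four points. As a toy sketch of the same call on made-up, clearly separated frequency vectors (the array `X` and the explicit `n_init=10` are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: two coffee-heavy rows, two bar-heavy rows
X = np.array([
    [0.10, 0.01],
    [0.09, 0.02],
    [0.01, 0.10],
    [0.02, 0.09],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; only the grouping of rows carries meaning.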

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough.

In [55]:
# add clustering labels
boroughs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_, allow_duplicates=True)
toronto_merged = Toronto_data

# join the per-borough cluster labels and top venues onto Toronto_data, which holds latitude/longitude for each postal code
toronto_merged = toronto_merged.join(boroughs_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head(20) # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Greek Restaurant,Italian Restaurant,Brewery,Ice Cream Shop,Yoga Studio,Restaurant,Park,Pizza Place,Pub
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
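
Every postal code inherits the label of its borough, which is why rows of the same borough look identical in the table. The sketch below shows that `join` behaviour on three rows taken from the output; `postal`, `per_borough` and `labelled` are hypothetical names.

```python
import pandas as pd

postal = pd.DataFrame({
    "Postal Code": ["M5A", "M4E", "M6H"],
    "Borough": ["Downtown Toronto", "East Toronto", "West Toronto"],
})
per_borough = pd.DataFrame({
    "Borough": ["Downtown Toronto", "East Toronto", "West Toronto"],
    "Cluster Labels": [0, 0, 1],
})

# join() matches the left `Borough` column against the right index
labelled = postal.join(per_borough.set_index("Borough"), on="Borough")
print(labelled)
```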


Finally, let's visualize the resulting clusters

In [56]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters <a class="anchor" id="chapter10"></a>

Examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

#### Cluster 1 (label 0)

In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
1,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
2,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
3,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
4,East Toronto,0,Coffee Shop,Greek Restaurant,Italian Restaurant,Brewery,Ice Cream Shop,Yoga Studio,Restaurant,Park,Pizza Place,Pub
5,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
6,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
7,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
8,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store
10,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Bakery,Italian Restaurant,Park,Seafood Restaurant,Clothing Store


#### Cluster 2 (label 1)

In [58]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
11,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
14,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
22,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
25,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
28,West Toronto,1,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant,Bakery,Breakfast Spot,Yoga Studio,Park,Diner
