<h1>Neighbourhood Clustering of Toronto</h1> 
<br> 
<p>This notebook clusters the neighbourhoods of Toronto. The neighborhoods and their postal codes are available at <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a>. The latitude and longitude of each postal code are available at <a href="http://cocl.us/Geospatial_data"> http://cocl.us/Geospatial_data</a>. Using the latitude and longitude, a call to the Foursquare API is made to get a list of venues in each neighborhood. The neighborhoods are then clustered on the basis of their most common venues.</p>

<h3>1. Data Preparation</h3> 

<p>The table on the webpage is scraped with Beautiful Soup. Postal codes that have not been assigned to a borough are removed, and neighborhoods with the same postal code are kept in the same row of the dataframe.</p>

In [127]:
#Libraries installation and import
!pip install bs4 #Beautiful Soup for web scraping 
!pip install lxml #parser used by Beautiful Soup below
!pip install folium

import numpy as np
import pandas as pd

from pandas import json_normalize #top-level import; the pandas.io.json path is deprecated since pandas 1.0
import requests
from urllib.request import urlopen #to access the url of the webpage

from bs4 import BeautifulSoup# for web scraping

#Data Visualization libraries
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

#For clustering
from sklearn.cluster import KMeans





In [128]:
#opens the url and extracts the data into a Beautiful Soup object

url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html=urlopen(url)

soup=BeautifulSoup(html,'lxml')
type(soup)

#checks if the title is correct
title=soup.title
print(title)

<title>List of postal codes of Canada: M - Wikipedia</title>


<p>Next, we need to extract only the values in the table of postal codes. Looking for the table row 'tr' tag helps in doing that. There are 2 tables in the webpage. The last 4 rows belong to the 2nd table, so we have to remove them.</p>

In [135]:
rows=soup.find_all('tr') #searches for all table rows
rows=rows[:(len(rows)-4)]
print(rows[:2])

[<tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>, <tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>]


<p>Each row contains the table cells, which are extracted by searching for the 'td' tag. Since the results contain HTML tags, those have to be removed before forming a dataframe.</p>
<br>
<p>After the HTML tags are removed, Beautiful Soup returns a single string, which has to be split into rows, and then into the values in each row.</p>

In [136]:
cell=[]
for row in rows: 
    cell.append(row.find_all('td'))#searches for table cells

#converts the collected cells to text once, after the loop, and removes html tags
str_list=str(cell)
cleaned=BeautifulSoup(str_list,'lxml').get_text()


row_list=cleaned.split('],')#splits the cleaned text into rows on the row ending "]" followed by a comma
print(row_list[:10])

['[[', ' [M1A\n, Not assigned\n, \n', ' [M2A\n, Not assigned\n, \n', ' [M3A\n, North York\n, Parkwoods\n', ' [M4A\n, North York\n, Victoria Village\n', ' [M5A\n, Downtown Toronto\n, Regent Park / Harbourfront\n', ' [M6A\n, North York\n, Lawrence Manor / Lawrence Heights\n', " [M7A\n, Downtown Toronto\n, Queen's Park / Ontario Provincial Government\n", ' [M8A\n, Not assigned\n, \n', ' [M9A\n, Etobicoke\n, Islington Avenue\n']
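
<p>As a hedged alternative to converting the cells to a string and parsing it again, the cell text can be collected row by row, so that no splitting on "]," is needed. The sketch below swaps Beautiful Soup for Python's standard-library html.parser module and runs on a small inline sample of the table. The class name TableCellParser and the sample HTML are illustrative assumptions, not part of the original notebook.</p>

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a <td> is kept; header cells (<th>) are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # skip rows without <td> cells, such as the header row
                self.rows.append(self._row)
            self._row = None


# Small inline sample shaped like the rows printed above (assumption for illustration).
sample_html = """
<table>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park / Harbourfront
</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(sample_html)
parsed_rows = parser.rows
print(parsed_rows)
```

<p>Each row comes out as a list of already stripped strings, so the later clean-up of brackets and newline characters would not be required with this approach.</p>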


<p>The data has to be cleaned up further. The starting value in each row contains a '[', and each value is followed by a newline character and surrounded by white space, all of which have to be removed. Neighborhoods with the same postal code are separated by '/', which has to be replaced by ','.</p>

In [138]:
#Splitting each row into values

new_list=[]

for row in row_list: 
    row=row.replace('[','')#Removes leading [
    row=row.replace('\n','')#Removes newline character
    items=row.split(',')#splits each row into values 
    new_items=[]
    for item in items: 
        n=item.replace('/',',')
        n=n.strip()#removes whitespaces
        new_items.append(n)
    new_list.append(new_items)
    
print(new_list[:10])

[[''], ['M1A', 'Not assigned', ''], ['M2A', 'Not assigned', ''], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Regent Park , Harbourfront'], ['M6A', 'North York', 'Lawrence Manor , Lawrence Heights'], ['M7A', 'Downtown Toronto', "Queen's Park , Ontario Provincial Government"], ['M8A', 'Not assigned', ''], ['M9A', 'Etobicoke', 'Islington Avenue']]
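
<p>The clean-up can also be written as a small function that works on one row at a time. This is only a sketch under assumptions: the function name clean_row and the sample rows are hypothetical, and it joins neighborhoods as "A, B" (without a space before the comma), which differs slightly from the output above. It also shows how rows whose borough is 'Not assigned' could be filtered out at this stage.</p>

```python
def clean_row(cells):
    """Strip whitespace and turn an 'A / B' neighborhood list into 'A, B'."""
    postal_code, borough, neighborhood = (c.strip() for c in cells)
    neighborhood = ', '.join(part.strip() for part in neighborhood.split('/'))
    return [postal_code, borough, neighborhood]


# Hypothetical raw rows, shaped like the values scraped from the table.
raw_rows = [
    ['M1A\n', ' Not assigned\n', ' \n'],
    ['M5A\n', ' Downtown Toronto\n', ' Regent Park / Harbourfront\n'],
    ['M9A\n', ' Etobicoke\n', ' Islington Avenue\n'],
]

cleaned_rows = [clean_row(r) for r in raw_rows]

# Keep only the postal codes that have been assigned to a borough.
assigned_rows = [r for r in cleaned_rows if r[1] != 'Not assigned']
print(assigned_rows)
```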


<p>The values are now present in a list in row and column form, and a dataframe can be formed. </p>

In [139]:
df=pd.DataFrame(new_list)
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,,,,,,,,,,,...,,,,,,,,,,
1,M1A,Not assigned,,,,,,,,,...,,,,,,,,,,
2,M2A,Not assigned,,,,,,,,,...,,,,,,,,,,
3,M3A,North York,Parkwoods,,,,,,,,...,,,,,,,,,,
4,M4A,North York,Victoria Village,,,,,,,,...,,,,,,,,,,
5,M5A,Downtown Toronto,"Regent Park , Harbourfront",,,,,,,,...,,,,,,,,,,
6,M6A,North York,"Lawrence Manor , Lawrence Heights",,,,,,,,...,,,,,,,,,,
7,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",,,,,,,,...,,,,,,,,,,
8,M8A,Not assigned,,,,,,,,,...,,,,,,,,,,
9,M9A,Etobicoke,Islington Avenue,,,,,,,,...,,,,,,,,,,


In [146]:
new_df=df.filter([0,1,2],axis=1)#removes every column other than the first 3
new_df.head(10)

Unnamed: 0,0,1,2
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park , Harbourfront"
6,M6A,North York,"Lawrence Manor , Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
8,M8A,Not assigned,
9,M9A,Etobicoke,Islington Avenue


<p>The values have been inserted into the dataframe, but the column names are missing. The column names could also be scraped from the webpage, but since there are only three columns, entering them directly is more convenient and avoids many more lines of code.</p>

In [197]:
new_df.rename(columns={0:'Postal Code',1:'Borough',2:'Neighborhood'},inplace=True)
new_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park , Harbourfront"
6,M6A,North York,"Lawrence Manor , Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
8,M8A,Not assigned,
9,M9A,Etobicoke,Islington Avenue


<p>Next, the rows with missing values and the postal codes that have not been assigned to a borough have to be removed.</p>

In [198]:
df_final=new_df.dropna()#drops None values
df_final=df_final[~df_final['Borough'].str.contains('Not assigned')]
df_final=df_final[:-3]#removes the rows due to the 2nd table in the webpage

#This is the final dataframe for section 1
df_final.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park , Harbourfront"
6,M6A,North York,"Lawrence Manor , Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"


In [199]:
df_final.shape

(103, 3)

<h3>2. Getting Latitude and Longitude values for each Neighborhood</h3>
<br>
<p>Using the Foursquare API requires latitude and longitude values for each neighborhood. The geocoder library is not very reliable, so the latitude and longitude values are obtained from the csv file linked at the beginning of the notebook.</p>

In [150]:
lat_lng_df=pd.read_csv('http://cocl.us/Geospatial_data')
lat_lng_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<p>Next, we need to merge this dataframe with the one obtained in the previous section. The columns of the previous dataframe have the object datatype (used for strings), so the 'Postal Code' column read from the csv file is explicitly cast to object as well, which ensures that the merge keys have matching types.</p>

In [151]:
df_final.dtypes

Postal Code     object
Borough         object
Neighborhood    object
dtype: object

In [152]:
lat_lng_df['Postal Code']=lat_lng_df['Postal Code'].astype('object')
lat_lng_df.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

In [153]:
df_merged=df_final.merge(lat_lng_df,on=['Postal Code'])
df_merged.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill , Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
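
<p>By default, the pandas merge used above is an inner join: only postal codes present in both dataframes are kept. The sketch below illustrates that behaviour with plain Python dictionaries instead of pandas. The 'M0Z' row is a made-up entry added only to show that an unmatched postal code is dropped; the coordinates are the ones listed above.</p>

```python
# Left side: neighborhoods keyed by postal code ('M0Z' is a hypothetical, unmatched code).
neighborhoods = [
    {'Postal Code': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'Postal Code': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
    {'Postal Code': 'M0Z', 'Borough': 'Hypothetical', 'Neighborhood': 'No coordinates'},
]

# Right side: coordinates keyed by postal code.
coordinates = {
    'M3A': (43.753259, -79.329656),
    'M4A': (43.725882, -79.315572),
    'M1B': (43.806686, -79.194353),
}

merged = []
for row in neighborhoods:
    code = row['Postal Code']
    if code in coordinates:  # inner join: keep only codes present on both sides
        lat, lng = coordinates[code]
        merged.append({**row, 'Latitude': lat, 'Longitude': lng})

for row in merged:
    print(row)
```

<p>Comparing the number of rows before and after a merge, as done with the shape attribute in this notebook, is a simple way to check that no postal code was lost.</p>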


<p>Now, we will visualize the neighborhoods using the folium library. The latitude and longitude of Toronto have been entered directly here, but the geopy library can also be used to obtain them. Click on the circles to see the name of each neighbourhood. Please note that neighborhoods with the same postal code are represented by a single marker, because separate latitude and longitude values are not available for each of them.</p>

In [155]:
#Latitude and longitude of Toronto,Ontario
latitude=43.651070
longitude=-79.347015

map_toronto=folium.Map(location=[latitude,longitude],zoom_start=10)

for lat,lng,borough,neighborhood in zip(df_merged['Latitude'],df_merged['Longitude'],df_merged['Borough'],df_merged['Neighborhood']):
    label='{},{}'.format(neighborhood,borough)
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius=5,
    popup=label,
    color='green',
    fill=True,
    fill_color='#d4f06e',
    fill_opacity=0.6,
    parse_html=False).add_to(map_toronto)

map_toronto

In [156]:
#Final dataframe for section 2
df_merged.shape

(103, 5)

<h3>3. Clustering of neighborhoods</h3>
<br> 
<p>This section clusters the neighborhoods whose borough name contains the word Toronto, based on the most common venues in each of them. The Foursquare API is used to get the information on venues, so it is necessary to create an app using a Foursquare developer account to obtain the credentials.</p> 
<br>
<p>This section is similar to the IBM clustering neighborhoods in New York lab.</p>

In [157]:
#selects the neighborhoods in boroughs with the word Toronto in them

toronto_data=df_merged[df_merged['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [158]:
toronto_data.shape

(39, 5)

In [160]:
#Latitude and longitude of Toronto is used again,but the map is zoomed a bit more for clear visualization.
latitude=43.651070
longitude=-79.347015

map_only_toronto=folium.Map(location=[latitude,longitude],zoom_start=12)

for lat,lng,borough,neighborhood in zip(toronto_data['Latitude'],toronto_data['Longitude'],toronto_data['Borough'],toronto_data['Neighborhood']):
    label='{},{}'.format(neighborhood,borough)
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius=5,
    popup=label,
    color='blue',
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.6,
    parse_html=False).add_to(map_only_toronto)

map_only_toronto

In [194]:
#Enter your Foursquare credentials here (the strings below are placeholders)
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
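
<p>To avoid typing the credentials into the notebook, they could be read from environment variables instead. This is a sketch under assumptions: the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper load_foursquare_credentials are hypothetical choices, not something defined by Foursquare or by this notebook.</p>

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping, or raise a clear error."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(
            'Set the environment variable {} first'.format(missing)) from None


# Demonstration with a stand-in mapping; in a real session, call the
# function without arguments so that os.environ is used.
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'demo-id',
    'FOURSQUARE_CLIENT_SECRET': 'demo-secret',
}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + CLIENT_ID)
```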

In [162]:
#details of the first neighborhood
name=toronto_data.loc[0,'Neighborhood']
n_lat=toronto_data.loc[0,'Latitude']
n_long=toronto_data.loc[0,'Longitude']
name,n_lat,n_long

('Regent Park , Harbourfront', 43.6542599, -79.3606359)

In [163]:
#makes a call to the foursquare api for the first neighborhood
LIMIT=10
radius=500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    n_lat, 
    n_long, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=W40JMZ4ZLLADWEPV2HJ0LZ3LC2ZKCRMHMA5LM5SGTT1MRXNM&client_secret=0HELX1ZYMCFF2R3ME02YK2SKQTERSJZDGAZT2RFIABJRTTKC&v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=10'
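
<p>The same request URL can be assembled from a dictionary of parameters, which keeps each parameter readable and lets the standard library handle the encoding. The sketch below uses urllib.parse.urlencode; the function name build_explore_url and the demo credentials are assumptions for illustration. Note that urlencode percent-encodes the comma in the 'll' value, which decodes back to the same latitude and longitude pair.</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=10):
    """Build the venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; the coordinates are those of the first neighborhood.
demo_url = build_explore_url('demo-id', 'demo-secret', '20180605',
                             43.6542599, -79.3606359)

# Decode the query string again to check what the server would receive.
demo_query = parse_qs(urlsplit(demo_url).query)
print(demo_query)
```

<p>Alternatively, the requests library accepts such a dictionary directly through the params argument of requests.get, in which case no manual URL building is needed.</p>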

In [164]:
#Displays the data obtained
results=requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ea6ce12c546f3001c732f6d'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 48,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
 

<p>The Foursquare API returns JSON data. The venues that we require are present in the 'items' list, and the type of each venue is found under its 'categories' key.</p>

In [165]:
#This function extracts the category name for each venue 
def get_category(line):
    try: 
        category_list=line['categories']
    except KeyError: 
        category_list=line['venue.categories']
        
    if len(category_list)==0:
        return None
    else: 
        return category_list[0]['name']
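
<p>The same idea can be applied to the raw JSON before it is flattened. The sketch below reads the first category name from one entry of the 'items' list and returns None when a venue has no category. The function name first_category_name and the two sample entries are assumptions, reduced to the keys used in this notebook.</p>

```python
def first_category_name(item):
    """Return the first category name of an 'items' entry, or None if absent."""
    categories = item.get('venue', {}).get('categories') or []
    return categories[0].get('name') if categories else None


# Reduced sample entries; only the keys needed here are included.
sample_item = {'venue': {'name': 'Roselle Desserts',
                         'categories': [{'name': 'Bakery'}]}}
uncategorised_item = {'venue': {'name': 'Unnamed venue', 'categories': []}}

print(first_category_name(sample_item))
print(first_category_name(uncategorised_item))
```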

In [166]:


venues = results['response']['groups'][0]['items']#extracts the venues
    
nearby_venues = json_normalize(venues)#flattens json data

columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, columns]

#filters the category for each venue
nearby_venues['venue.categories'] = nearby_venues.apply(get_category, axis=1)

#Column names were like 'venue.name', so this cleans up the column names
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
3,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
4,Body Blitz Spa East,Spa,43.654735,-79.359874


In [167]:
nearby_venues.shape[0]

10

<p>Next, a function is defined which will extract the venue names and categories for each neighborhood in the dataframe obtained in section 2.</p>

In [168]:
def getVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        #Foursquare API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        #gets only the items section from the url results
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        #for each nearby venue, returns name, category, latitude and longitude
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    return(venues_list)
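
<p>The function above assumes that every request succeeds and that every venue has at least one category. A more defensive variant of the parsing step might look like the sketch below, which works on an already parsed response so that it can be tried without network access. The function name parse_explore_response and the reduced sample payload are assumptions; the key names follow the response shown earlier.</p>

```python
def parse_explore_response(payload, neighborhood, lat, lng):
    """Flatten one parsed 'explore' response into a list of venue tuples."""
    code = payload.get('meta', {}).get('code')
    if code != 200:
        raise ValueError('Foursquare request failed with code {}'.format(code))

    groups = payload.get('response', {}).get('groups') or []
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        rows.append((
            neighborhood, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0]['name'] if categories else None,  # tolerate no category
        ))
    return rows


# Reduced sample payload containing only the keys used above.
sample_payload = {
    'meta': {'code': 200},
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Roselle Desserts',
                   'location': {'lat': 43.653446723052674,
                                'lng': -79.3620167174383},
                   'categories': [{'name': 'Bakery'}]}},
    ]}]},
}

sample_rows = parse_explore_response(
    sample_payload, 'Regent Park , Harbourfront', 43.6542599, -79.3606359)
print(sample_rows)
```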

In [171]:
#gets the nearby venues and prints the name of each neighborhood
toronto_venues = getVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )


Regent Park , Harbourfront
Queen's Park , Ontario Provincial Government
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond , Adelaide , King
Dufferin , Dovercourt Village
Harbourfront East , Union Station , Toronto Islands
Little Portugal , Trinity
The Danforth West , Riverdale
Toronto Dominion Centre , Design Exchange
Brockton , Parkdale Village , Exhibition Place
India Bazaar , The Beaches West
Commerce Court , Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park , The Junction South
North Toronto West
The Annex , North Midtown , Yorkville
Parkdale , Roncesvalles
Davisville
University of Toronto , Harbord
Runnymede , Swansea
Moore Park , Summerhill East
Kensington Market , Chinatown , Grange Park
Summerhill West , Rathnelly , South Hill , Forest Hill SE , Deer Park
CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport
Rosedale
Stn A 

<p>The previous step returns a nested list, so a dataframe is formed from the list.</p>

In [172]:
t_venues=pd.DataFrame(row for venue in toronto_venues for row in venue)
t_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

t_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park , Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park , Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park , Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park , Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park , Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


In [173]:
#Shows number of venues returned by the API for each neighborhood
t_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,10,10,10,10,10,10
"Brockton , Parkdale Village , Exhibition Place",10,10,10,10,10,10
Business reply mail Processing CentrE,10,10,10,10,10,10
"CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport",10,10,10,10,10,10
Central Bay Street,10,10,10,10,10,10
Christie,10,10,10,10,10,10
Church and Wellesley,10,10,10,10,10,10
"Commerce Court , Victoria Hotel",10,10,10,10,10,10
Davisville,10,10,10,10,10,10
Davisville North,7,7,7,7,7,7


In [174]:
#prints number of different types of venues
print('There are {} unique categories.'.format(len(t_venues['Venue Category'].unique())))

There are 120 unique categories.


<p>One category of venue returned by the Foursquare API is itself called Neighborhood; places such as 'Upper Beaches' fall under it. The category is renamed to Areas to avoid a conflict with the column containing the neighborhood names.</p>

In [175]:
#shows indexes of rows containing the category Neighborhood
t_venues.index[t_venues['Venue Category']=='Neighborhood'].tolist()

[43, 83, 94, 172]

In [176]:
t_venues.iloc[43]

Neighborhood                The Beaches
Neighborhood Latitude           43.6764
Neighborhood Longitude          -79.293
Venue                     Upper Beaches
Venue Latitude                  43.6806
Venue Longitude                -79.2929
Venue Category             Neighborhood
Name: 43, dtype: object

<p>The Venue Category column contains categorical variables, so a one hot representation is required to get numerical values. The rows are then grouped by neighborhood and averaged, which gives the relative frequency of each venue type in each neighborhood.</p>

In [177]:
t_onehot = pd.get_dummies(t_venues[['Venue Category']], prefix="", prefix_sep="")


t_onehot.rename(columns={'Neighborhood':'Areas'},inplace=True)

#inserts Neighborhood column back to the one hot dataframe
t_onehot.insert(0,'Neighborhood',t_venues['Neighborhood'])

t_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,Auto Workshop,BBQ Joint,...,Tailor Shop,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [178]:
t_onehot.shape

(348, 121)

In [179]:
t_grouped = t_onehot.groupby('Neighborhood').mean().reset_index()
t_grouped

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,Auto Workshop,BBQ Joint,...,Tailor Shop,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0
1,"Brockton , Parkdale Village , Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower , King and Spadina , Railway Lands , ...",0.1,0.1,0.2,0.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
7,"Commerce Court , Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [180]:
t_grouped.shape

(39, 121)
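
<p>Taking the mean of one hot columns per neighborhood is the same as dividing the count of each category by the number of venues in that neighborhood. The sketch below shows this equivalence with the standard-library Counter on a few made-up venue rows; the venue counts are hypothetical and chosen only to keep the arithmetic easy to follow.</p>

```python
from collections import Counter

# Hypothetical (neighborhood, category) pairs, one per venue.
venues = [
    ('Regent Park , Harbourfront', 'Bakery'),
    ('Regent Park , Harbourfront', 'Coffee Shop'),
    ('Regent Park , Harbourfront', 'Coffee Shop'),
    ('Regent Park , Harbourfront', 'Spa'),
    ('The Beaches', 'Trail'),
    ('The Beaches', 'Pub'),
]

# Count the categories per neighborhood.
counts = {}
for neighborhood, category in venues:
    counts.setdefault(neighborhood, Counter())[category] += 1

# Divide each count by the neighborhood total to get relative frequencies.
frequencies = {
    neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for neighborhood, counter in counts.items()
}

print(frequencies['Regent Park , Harbourfront'])
```

<p>Because neighborhoods return different numbers of venues, working with frequencies rather than raw counts keeps the rows comparable when they are clustered.</p>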

<p>Next, a function is created which sorts the venue categories in each neighborhood from most to least common. This will help in interpreting the results of the analysis after clustering.</p>

In [181]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [182]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

#creates columns according to most common venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

#creates a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = t_grouped['Neighborhood']

for ind in np.arange(t_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(t_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Farmers Market,Concert Hall,Fountain,Restaurant,Museum,Liquor Store,Cocktail Bar,Vegetarian / Vegan Restaurant,Park,Cuban Restaurant
1,"Brockton , Parkdale Village , Exhibition Place",Coffee Shop,Bar,Pet Store,Furniture / Home Store,Gym,Café,Italian Restaurant,Restaurant,Bakery,Yoga Studio
2,Business reply mail Processing CentrE,Skate Park,Restaurant,Garden Center,Comic Shop,Pizza Place,Brewery,Farmers Market,Auto Workshop,Fast Food Restaurant,Burrito Place
3,"CN Tower , King and Spadina , Railway Lands , ...",Airport Lounge,Airport Terminal,Airport,Boutique,Harbor / Marina,Rental Car Location,Plane,Airport Food Court,Asian Restaurant,Fish & Chips Shop
4,Central Bay Street,Coffee Shop,Gastropub,Modern European Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Spa,Cosmetics Shop,Cuban Restaurant,Dance Studio,Department Store


<p>Now, the neighborhoods are clustered using k-means on the basis of their venue category frequencies. 5 clusters are used.</p>

In [183]:
kclusters = 5

t_grouped_clustering = t_grouped.drop('Neighborhood', axis=1)


kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(t_grouped_clustering)

# check cluster labels for each row
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 2, 0, 1, 0, 1, 1], dtype=int32)
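
<p>To make the clustering step less of a black box, the sketch below implements the basic k-means iteration (Lloyd's algorithm) in plain Python on four made-up frequency vectors. It is not scikit-learn's implementation, which adds smarter initialisation and several restarts; the function name kmeans, the toy data and the hand-picked starting centroids are all assumptions for illustration.</p>

```python
def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy frequency vectors over (Coffee Shop, Park, Restaurant); values are made up.
points = [
    [0.6, 0.0, 0.4],
    [0.5, 0.1, 0.4],
    [0.0, 0.9, 0.1],
    [0.1, 0.8, 0.1],
]

labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels)
```

<p>The first two toy neighborhoods, dominated by coffee shops and restaurants, end up in one cluster, and the two park-heavy ones in the other.</p>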

In [186]:
#adds clustering labels to each neighborhood
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

t_merged = toronto_data

#merges the sorted venues and cluster labels with neighborhood latitude and longitude
t_merged = t_merged.merge(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
t_merged.head() 

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,2,Breakfast Spot,Bakery,Spa,Park,Distribution Center,Restaurant,Historic Site,Gym / Fitness Center,Coffee Shop,Dessert Shop
1,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Park,Sushi Restaurant,Creperie,Italian Restaurant,Beer Bar,Mexican Restaurant,Distribution Center,Diner,Farmers Market
2,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,0,Café,Coffee Shop,Burrito Place,Comic Shop,Music Venue,Clothing Store,Theater,Plaza,Tea Room,Dog Run
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Coffee Shop,Gastropub,Food Truck,Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Creperie,Gym,Fountain
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Pub,Areas,Health Food Store,Yoga Studio,Dim Sum Restaurant,Donut Shop,Dog Run,Distribution Center,Diner


In [188]:
#Visualization of the clustered neighborhoods

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(t_merged['Latitude'], t_merged['Longitude'], t_merged['Neighborhood'], t_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
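
<p>The colour list above is indexed with cluster-1, so cluster 0 picks up the last colour through Python's negative indexing; every cluster still gets its own colour. A more direct mapping from label to colour can be sketched with the standard-library colorsys module, as below. This is not matplotlib's rainbow colormap, only an assumed stand-in that produces evenly spaced hues; the function name cluster_colors is hypothetical.</p>

```python
import colorsys


def cluster_colors(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for k in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_colors(5)

# The cluster label is used directly as the index into the palette.
color_for_label_2 = palette[2]
print(palette)
```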

<p>Now, we will look at the characteristics of the neighborhoods in each cluster.</p>

In [189]:
t_merged.loc[t_merged['Cluster Labels'] == 0, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,0,Café,Coffee Shop,Burrito Place,Comic Shop,Music Venue,Clothing Store,Theater,Plaza,Tea Room,Dog Run
7,Downtown Toronto,0,Café,Grocery Store,Candy Store,Coffee Shop,Italian Restaurant,Diner,Restaurant,Donut Shop,Dog Run,Distribution Center
8,Downtown Toronto,0,Café,Areas,Vegetarian / Vegan Restaurant,Concert Hall,Speakeasy,Restaurant,Plaza,Gym / Fitness Center,Hotel,Steakhouse
9,West Toronto,0,Bakery,Brewery,Bar,Grocery Store,Music Venue,Gym / Fitness Center,Café,Middle Eastern Restaurant,Bank,Farmers Market
13,Downtown Toronto,0,Café,Coffee Shop,Restaurant,Gym,Gym / Fitness Center,Beer Bar,Tea Room,Pub,Bakery,Diner
14,West Toronto,0,Coffee Shop,Bar,Pet Store,Furniture / Home Store,Gym,Café,Italian Restaurant,Restaurant,Bakery,Yoga Studio
16,Downtown Toronto,0,Café,Coffee Shop,Museum,Gym,Gym / Fitness Center,Tea Room,Restaurant,Pub,Bakery,Dog Run
22,West Toronto,0,Gastropub,Bar,Music Venue,Café,Italian Restaurant,Flea Market,Speakeasy,Park,Arts & Crafts Store,Furniture / Home Store
28,West Toronto,0,Coffee Shop,Sushi Restaurant,Pub,Café,Burrito Place,Fish & Chips Shop,Italian Restaurant,Food,Bookstore,Tea Room
30,Downtown Toronto,0,Café,Vietnamese Restaurant,Bakery,Organic Grocery,Mexican Restaurant,Dessert Shop,Arts & Crafts Store,Yoga Studio,Donut Shop,Dog Run


In [190]:
t_merged.loc[t_merged['Cluster Labels'] == 1, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Downtown Toronto,1,Farmers Market,Concert Hall,Fountain,Restaurant,Museum,Liquor Store,Cocktail Bar,Vegetarian / Vegan Restaurant,Park,Cuban Restaurant
10,Downtown Toronto,1,Skating Rink,Supermarket,Park,Performing Arts Venue,Plaza,Lake,Salad Place,Sporting Goods Shop,Hotel,Areas
11,West Toronto,1,Pizza Place,Ice Cream Shop,New American Restaurant,Cocktail Bar,Korean Restaurant,Brewery,Beer Store,Cuban Restaurant,Wine Bar,Asian Restaurant
12,East Toronto,1,Greek Restaurant,Ice Cream Shop,Yoga Studio,Cosmetics Shop,Fruit & Vegetable Store,Italian Restaurant,Brewery,Distribution Center,Eastern European Restaurant,Donut Shop
15,East Toronto,1,Fast Food Restaurant,Sushi Restaurant,Park,Liquor Store,Brewery,Italian Restaurant,Fish & Chips Shop,Ice Cream Shop,Pet Store,Gym
17,East Toronto,1,Gay Bar,Bakery,Italian Restaurant,Bookstore,Fish Market,Ice Cream Shop,Sandwich Place,Coffee Shop,Areas,Pet Store
18,Central Toronto,1,Park,Dim Sum Restaurant,Bus Line,Swim School,Yoga Studio,Fast Food Restaurant,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop
20,Central Toronto,1,Breakfast Spot,Food & Drink Shop,Park,Gym,Department Store,Sandwich Place,Hotel,Eastern European Restaurant,Donut Shop,Dog Run
21,Central Toronto,1,Trail,Mexican Restaurant,Sushi Restaurant,Jewelry Store,Yoga Studio,Distribution Center,Eastern European Restaurant,Donut Shop,Dog Run,Dim Sum Restaurant
23,Central Toronto,1,Yoga Studio,Spa,Mexican Restaurant,Diner,Dessert Shop,Clothing Store,Chinese Restaurant,Restaurant,Salon / Barbershop,Coffee Shop


In [191]:
t_merged.loc[t_merged['Cluster Labels'] == 2, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,2,Breakfast Spot,Bakery,Spa,Park,Distribution Center,Restaurant,Historic Site,Gym / Fitness Center,Coffee Shop,Dessert Shop
1,Downtown Toronto,2,Coffee Shop,Park,Sushi Restaurant,Creperie,Italian Restaurant,Beer Bar,Mexican Restaurant,Distribution Center,Diner,Farmers Market
3,Downtown Toronto,2,Coffee Shop,Gastropub,Food Truck,Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Creperie,Gym,Fountain
4,East Toronto,2,Trail,Pub,Areas,Health Food Store,Yoga Studio,Dim Sum Restaurant,Donut Shop,Dog Run,Distribution Center,Diner
6,Downtown Toronto,2,Coffee Shop,Gastropub,Modern European Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Spa,Cosmetics Shop,Cuban Restaurant,Dance Studio,Department Store
24,Central Toronto,2,Burger Joint,Middle Eastern Restaurant,Park,Vegetarian / Vegan Restaurant,Donut Shop,Indian Restaurant,American Restaurant,Coffee Shop,Café,BBQ Joint
31,Central Toronto,2,Coffee Shop,Supermarket,American Restaurant,Restaurant,Liquor Store,Sports Bar,Pub,Bank,Sushi Restaurant,Yoga Studio


In [192]:
t_merged.loc[t_merged['Cluster Labels'] == 3, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,3,Playground,Park,Trail,Yoga Studio,Farmers Market,Creperie,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop
33,Downtown Toronto,3,Park,Playground,Trail,Yoga Studio,Farmers Market,Creperie,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop


In [193]:
t_merged.loc[t_merged['Cluster Labels'] == 4, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,4,Garden,Music Venue,Yoga Studio,Farmers Market,Creperie,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop,Dim Sum Restaurant


<h3>Results of Analysis</h3> 
<br> 
<p>From the neighborhoods under each cluster, the following trends are visible.</p>
<ul>
    <li><p>In Cluster 0, the most common venues tend to be cafes or bakeries. Bars and pubs are also common.</p></li>
    <li><p>In Cluster 1, the most common venues tend to be restaurants or food joints.</p></li>
    <li><p>In Cluster 2, the most common venues tend to be coffee shops. Restaurants and markets are also common.</p></li>
    <li><p>In Cluster 3, playgrounds and parks are the most common, followed by yoga studios.</p></li>
    <li><p>Cluster 4 has only one neighborhood, which has mostly gardens, music venues and yoga studios, and very few coffee shops or restaurants.</p></li>
   </ul>
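
<p>Trends like these can also be tallied programmatically instead of being read off the tables. The sketch below counts how often each category appears among the three most common venues of the neighborhoods in a cluster, using the rows of Clusters 3 and 4 shown above; the variable names are assumptions for illustration.</p>

```python
from collections import Counter

# Three most common venues per neighborhood, copied from the cluster tables above.
top_venues_by_cluster = {
    3: [['Playground', 'Park', 'Trail'], ['Park', 'Playground', 'Trail']],
    4: [['Garden', 'Music Venue', 'Yoga Studio']],
}

# For each cluster, count how often each category occurs in those top-3 lists.
summary = {
    cluster: Counter(venue for row in rows for venue in row).most_common(3)
    for cluster, rows in top_venues_by_cluster.items()
}

for cluster, common in summary.items():
    print(cluster, common)
```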
    