# Segmenting and Clustering Neighborhoods in Toronto

### The project includes:
1. Scraping the Wikipedia page for the postal codes of Canada.
2. Processing and cleaning the data for clustering.
3. Clustering the neighborhoods.

The clustering is carried out with **K-Means**, and the clusters are plotted using the **Folium** library.

In [81]:
!pip install beautifulsoup4
!pip install lxml
!pip install geocoder

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
from IPython.display import display_html
    
# transforming a JSON response into a pandas dataframe
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

from bs4 import BeautifulSoup

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

pd.set_option('display.max_rows', None)

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Folium installed
Libraries imported.


## 1. Scraping the Wikipedia page for the table of postal codes of Canada
The BeautifulSoup library is used to scrape the table from Wikipedia. The page title is printed to confirm that the page was fetched successfully, and then the table of postal codes is displayed.

In [82]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(wiki_url)
soup = BeautifulSoup(source.text, 'lxml')
print(soup.title)

tab = str(soup.table)
display_html(tab, raw=True)


<title>List of postal codes of Canada: M - Wikipedia</title>


0,1,2,3,4,5,6,7,8
M1A Not assigned,M2A Not assigned,M3A North York (Parkwoods),M4A North York (Victoria Village),M5A Downtown Toronto (Regent Park / Harbourfront),M6A North York (Lawrence Manor / Lawrence Heights),M7A Queen's Park (Ontario Provincial Government),M8A Not assigned,M9A Etobicoke (Islington Avenue)
M1B Scarborough (Malvern / Rouge),M2B Not assigned,M3B North York (Don Mills) North,M4B East York (Parkview Hill / Woodbine Gardens),"M5B Downtown Toronto (Garden District, Ryerson)",M6B North York (Glencairn),M7B Not assigned,M8B Not assigned,M9B Etobicoke (West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale)
M1C Scarborough (Rouge Hill / Port Union / Highland Creek),M2C Not assigned,M3C North York (Don Mills) South (Flemingdon Park),M4C East York (Woodbine Heights),M5C Downtown Toronto (St. James Town),M6C York (Humewood-Cedarvale),M7C Not assigned,M8C Not assigned,M9C Etobicoke (Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood)
M1E Scarborough (Guildwood / Morningside / West Hill),M2E Not assigned,M3E Not assigned,M4E East Toronto (The Beaches),M5E Downtown Toronto (Berczy Park),M6E York (Caledonia-Fairbanks),M7E Not assigned,M8E Not assigned,M9E Not assigned
M1G Scarborough (Woburn),M2G Not assigned,M3G Not assigned,M4G East York (Leaside),M5G Downtown Toronto (Central Bay Street),M6G Downtown Toronto (Christie),M7G Not assigned,M8G Not assigned,M9G Not assigned
M1H Scarborough (Cedarbrae),M2H North York (Hillcrest Village),M3H North York (Bathurst Manor / Wilson Heights / Downsview North),M4H East York (Thorncliffe Park),M5H Downtown Toronto (Richmond / Adelaide / King),M6H West Toronto (Dufferin / Dovercourt Village),M7H Not assigned,M8H Not assigned,M9H Not assigned
M1J Scarborough (Scarborough Village),M2J North York (Fairview / Henry Farm / Oriole),M3J North York (Northwood Park / York University),M4J East York East Toronto (The Danforth East),M5J Downtown Toronto (Harbourfront East / Union Station / Toronto Islands),M6J West Toronto (Little Portugal / Trinity),M7J Not assigned,M8J Not assigned,M9J Not assigned
M1K Scarborough (Kennedy Park / Ionview / East Birchmount Park),M2K North York (Bayview Village),M3K North York (Downsview) East (CFB Toronto),M4K East Toronto (The Danforth West / Riverdale),M5K Downtown Toronto (Toronto Dominion Centre / Design Exchange),M6K West Toronto (Brockton / Parkdale Village / Exhibition Place),M7K Not assigned,M8K Not assigned,M9K Not assigned
M1L Scarborough (Golden Mile / Clairlea / Oakridge),M2L North York (York Mills / Silver Hills),M3L North York (Downsview) West,M4L East Toronto (India Bazaar / The Beaches West),M5L Downtown Toronto (Commerce Court / Victoria Hotel),M6L North York (North Park / Maple Leaf Park / Upwood Park),M7L Not assigned,M8L Not assigned,M9L North York (Humber Summit)
M1M Scarborough (Cliffside / Cliffcrest / Scarborough Village West),M2M North York (Willowdale / Newtonbrook),M3M North York (Downsview) Central,M4M East Toronto (Studio District),M5M North York (Bedford Park / Lawrence Manor East),M6M York (Del Ray / Mount Dennis / Keelsdale and Silverthorn),M7M Not assigned,M8M Not assigned,M9M North York (Humberlea / Emery)


### Now we check the structure of the table in HTML, so that we can extract the values under the relevant tags

In [83]:
#print(soup.table.prettify())

### Now we create a dictionary named cell, to store the values of Postal Code, Borough and Neighborhood

In [84]:
cell = {}

post_list = []
borough_list = []
neighborhood_list = []

t_rows = soup.table.tbody.find_all('tr')

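for row in t_rows:
    t_datas = row.find_all('td')
    for data in t_datas:
        temp = data.p.text

        # the first three characters are the postal code
        post_code = temp[:3]
        if "Not assigned" in temp:
            continue

        # everything before '(' is the borough, everything after is the neighborhood list
        rest = temp[3:].split('(')
        borough = rest[0]
        # drop the last character, strip the closing parenthesis, and turn '/' separators into commas
        neighborhood = rest[1][:-1].strip(')').replace('/', ',').strip(' ')

        post_list.append(post_code)
        borough_list.append(borough)
        neighborhood_list.append(neighborhood)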
            
cell['Postal Code'] = post_list
cell['Borough'] = borough_list
cell['Neighborhood'] = neighborhood_list
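
The slicing above assumes that each cell ends right after the closing parenthesis, which is why a cell such as `M3B North York (Don Mills) North` ends up as `Don Mills)North`. Below is a minimal sketch of a more tolerant parser. It assumes the cell text has the form `<code><borough>(<neighborhoods>)<optional suffix>`; the name `parse_cell`, the regular expression, and the choice to append the suffix to the neighborhood name are illustrative and not part of the original notebook.

```python
import re

# postal code, borough, parenthesised neighborhoods, optional trailing qualifier
CELL_PATTERN = re.compile(
    r"^(?P<code>[A-Z]\d[A-Z])\s*(?P<borough>[^(]+?)\s*"
    r"\((?P<hoods>[^)]*)\)\s*(?P<suffix>.*)$",
    re.DOTALL,
)

def parse_cell(text):
    """Return (postal_code, borough, neighborhood), or None for unassigned cells."""
    text = text.strip()
    if "Not assigned" in text:
        return None
    match = CELL_PATTERN.match(text)
    if match is None:
        return None
    # split on '/' and rejoin with commas, dropping empty pieces
    hoods = [part.strip() for part in match["hoods"].split("/")]
    neighborhood = ", ".join(part for part in hoods if part)
    suffix = match["suffix"].strip()
    if suffix:
        # e.g. "Don Mills" + "North" -> "Don Mills North"
        neighborhood = f"{neighborhood} {suffix}"
    return match["code"], match["borough"].strip(), neighborhood
```

Inside the loop this could replace the manual slicing: call `parse_cell(data.p.text)` and skip the cell when the result is `None`.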

### We now create a pandas DataFrame from cell

In [85]:
df = pd.DataFrame(cell)
df['Borough'] = df['Borough'].replace({'East YorkEast Toronto': 'East York/East Toronto', 
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga', 
                                       'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'EtobicokeNorthwest': 'Etobicoke Northwest', 
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business'})

df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern , Rouge"
7,M3B,North York,Don Mills)North
8,M4B,East York,"Parkview Hill , Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


**Checking the shape of the dataframe**

In [86]:
df.shape

(103, 3)

### Importing the csv file containing the latitudes and longitudes for various neighbourhoods in Canada

In [87]:
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
coord_df = pd.read_csv(url)
coord_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


### Getting the latitude and longitude of each neighborhood in Toronto using the geocoder package (this did not work, so the code is left commented out)

In [88]:
# import geocoder

# lat_list = []
# long_list = []
# for postal_code in post_list:    
    
#     # initialize your variable to None
#     lat_lng_coords = None
    
#     address = f'{postal_code}, Toronto, Ontario'
#     while(lat_lng_coords is None):
#         g = geocoder.google(address)
#         lat_lng_coords = g.latlng

#     latitude = lat_lng_coords[0]
#     longitude = lat_lng_coords[1]

#     lat_list.append(latitude)
#     long_list.append(longitude)
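
One plausible reason the loop above never finished is that `while(lat_lng_coords is None)` retries forever whenever the service keeps returning nothing. Here is a minimal sketch of a bounded retry. It assumes a caller-supplied `lookup` callable (hypothetical; in the notebook it could wrap whichever geocoding service is available) that returns a `(lat, lng)` pair or `None`.

```python
import time

def geocode_with_retry(lookup, address, max_attempts=3, delay_seconds=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(address)
        if coords is not None:
            return coords
        # wait before the next attempt, but not after the final one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None
```

When every attempt fails the function returns `None`, so the caller can fall back to another source such as the CSV file of coordinates used in this notebook.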

### Joining df and coord_df

In [89]:
toronto_df = pd.merge(df.set_index('Postal Code'), coord_df.set_index('Postal Code'), how='inner', on = 'Postal Code').reset_index()
# toronto_df = df.set_index('Postal Code').join(coord_df.set_index('Postal Code'), how='inner', on = 'Postal Code').reset_index()

toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
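
An inner join silently drops any postal code that is missing from either table. The sketch below shows one way this could be checked, using toy frames rather than the notebook's data (the `M9Z` row is made up). `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Nowhere"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a _merge column saying where each row came from
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

# postal codes without coordinates, and the rows an inner join would keep
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
```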


### Keeping only the boroughs whose name contains 'Toronto'

In [90]:
toronto_df = toronto_df[toronto_df['Borough'].str.contains('Toronto')].reset_index(drop= True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


## 2. Processing and cleaning the data for clustering. 

**Checking how many boroughs there are**

In [91]:
print(f"The dataframe has {len(toronto_df['Borough'].unique())} boroughs.")

The dataframe has 7 boroughs.


### Use the geopy library to get the latitude and longitude of Toronto.
To create an instance of the geocoder, we need to define a user_agent. We will name our agent **toronto_explorer**, as shown below.

In [92]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### Let's create a map of Toronto with neighborhoods superimposed on top.

In [93]:
# create map of Toronto using latitude and longitude values
toronto_map = folium.Map(location= [latitude, longitude], zoom_start=12)

# add markers to map
for borough, neighborhood, lat, long in zip(toronto_df['Borough'], toronto_df['Neighborhood'], toronto_df['Latitude'], toronto_df['Longitude']):
    
    label = folium.Popup(f"{neighborhood}, {borough}", parse_html=True)
    folium.CircleMarker(location=[lat, long], 
                        radius= 5, 
                        color= 'black', 
                        fill= True, 
                        fill_color= 'blue', 
                        popup= label).add_to(toronto_map)

toronto_map

### Foursquare credentials (Hidden)

In [118]:
# @hidden_cell

CLIENT_ID = 'HIDDEN'  # your Foursquare ID
CLIENT_SECRET = 'HIDDEN'  # your Foursquare Secret
VERSION = '20180605'  # Foursquare API version
LIMIT = 100  # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HIDDEN
CLIENT_SECRET:HIDDEN


### Test - Getting venue info for a single neighborhood

In [95]:
borough = toronto_df.loc[0, 'Borough']
neighborhood = toronto_df.loc[0, 'Neighborhood']
LAT = toronto_df.loc[0, 'Latitude']
LONG = toronto_df.loc[0, 'Longitude']

print(f'The latitude and longitude of ({neighborhood}), {borough} is {LAT}, {LONG}')

The latitude and longitude of (Regent Park , Harbourfront), Downtown Toronto is 43.6542599, -79.3606359


In [96]:
url = f"https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={LAT},{LONG}&radius=500&limit={LIMIT}"
results = requests.get(url).json()['response']['groups'][0]['items']

venues = json_normalize(results)
venues

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.crossStreet,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,...,venue.location.city,venue.location.state,venue.location.country,venue.location.formattedAddress,venue.categories,venue.photos.count,venue.photos.groups,venue.location.postalCode,venue.location.neighborhood,venue.venuePage.id
0,e-0-53b8466a498e83df908c3f21-0,0,"[{'summary': 'This spot is popular', 'type': '...",53b8466a498e83df908c3f21,Tandem Coffee,368 King St E,at Trinity St,43.653559,-79.361809,"[{'label': 'display', 'lat': 43.65355870959944...",...,Toronto,ON,Canada,"[368 King St E (at Trinity St), Toronto ON, Ca...","[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",0,[],,,
1,e-0-54ea41ad498e9a11e9e13308-1,0,"[{'summary': 'This spot is popular', 'type': '...",54ea41ad498e9a11e9e13308,Roselle Desserts,362 King St E,Trinity St,43.653447,-79.362017,"[{'label': 'display', 'lat': 43.65344672305267...",...,Toronto,ON,Canada,"[362 King St E (Trinity St), Toronto ON M5A 1K...","[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",0,[],M5A 1K9,,
2,e-0-574c229e498ebb5c6b257902-2,0,"[{'summary': 'This spot is popular', 'type': '...",574c229e498ebb5c6b257902,Cooper Koo Family YMCA,461 Cherry St,,43.653249,-79.358008,"[{'label': 'display', 'lat': 43.65324910177244...",...,Toronto,ON,Canada,"[461 Cherry St, Toronto ON M5A 0H7, Canada]","[{'id': '52e81612bcbc57f1066b7a37', 'name': 'D...",0,[],M5A 0H7,,
3,e-0-5612b1cc498e3dd742af0dc8-3,0,"[{'summary': 'This spot is popular', 'type': '...",5612b1cc498e3dd742af0dc8,Impact Kitchen,573 King St E,at St Lawrence St,43.656369,-79.35698,"[{'label': 'display', 'lat': 43.65636850543279...",...,Toronto,ON,Canada,"[573 King St E (at St Lawrence St), Toronto ON...","[{'id': '4bf58dd8d48988d1c4941735', 'name': 'R...",0,[],M5A 4L3,,
4,e-0-50760559e4b0e8c7babe2497-4,0,"[{'summary': 'This spot is popular', 'type': '...",50760559e4b0e8c7babe2497,Body Blitz Spa East,497 King Street East,btwn Sackville St and Sumach St,43.654735,-79.359874,"[{'label': 'display', 'lat': 43.65473505045365...",...,Toronto,ON,Canada,[497 King Street East (btwn Sackville St and S...,"[{'id': '4bf58dd8d48988d1ed941735', 'name': 'S...",0,[],M5A 1L9,,
5,e-0-51ccc048498ec7792efc955e-5,0,"[{'summary': 'This spot is popular', 'type': '...",51ccc048498ec7792efc955e,Corktown Common,,,43.655618,-79.356211,"[{'label': 'display', 'lat': 43.65561779974973...",...,,,Canada,[Canada],"[{'id': '4bf58dd8d48988d163941735', 'name': 'P...",0,[],,,
6,e-0-4e8b7fa1cc2112f67517660a-6,0,"[{'summary': 'This spot is popular', 'type': '...",4e8b7fa1cc2112f67517660a,The Extension Room,30 Eastern Ave,Sackville St.,43.653313,-79.359725,"[{'label': 'display', 'lat': 43.65331304337331...",...,Toronto,ON,Canada,"[30 Eastern Ave (Sackville St.), Toronto ON, C...","[{'id': '4bf58dd8d48988d175941735', 'name': 'G...",0,[],,,
7,e-0-4ad4c05ef964a520bff620e3-7,0,"[{'summary': 'This spot is popular', 'type': '...",4ad4c05ef964a520bff620e3,The Distillery Historic District,"btwn Front, Cherry, Gardiner & Parliament",,43.650244,-79.359323,"[{'label': 'display', 'lat': 43.65024435658077...",...,Toronto,ON,Canada,"[btwn Front, Cherry, Gardiner & Parliament, To...","[{'id': '4deefb944765f83613cdba6e', 'name': 'H...",0,[],M5A 3C4,,
8,e-0-4b0978e1f964a520cd1723e3-8,0,"[{'summary': 'This spot is popular', 'type': '...",4b0978e1f964a520cd1723e3,SOMA chocolatemaker,"55 Mill Street, Unit #48",The Distillery District,43.650622,-79.358127,"[{'label': 'display', 'lat': 43.65062222570758...",...,Toronto,ON,Canada,"[55 Mill Street, Unit #48 (The Distillery Dist...","[{'id': '52f2ab2ebcbc57f1066b8b31', 'name': 'C...",0,[],M5A 3C4,,
9,e-0-566e1294498e3f6629006bc3-9,0,"[{'summary': 'This spot is popular', 'type': '...",566e1294498e3f6629006bc3,Dominion Pub and Kitchen,500 Queen Street East,,43.656919,-79.358967,"[{'label': 'display', 'lat': 43.65691857501867...",...,Toronto,ON,Canada,"[500 Queen Street East, Toronto ON M5A 1T9, Ca...","[{'id': '4bf58dd8d48988d11b941735', 'name': 'P...",0,[],M5A 1T9,,


In [97]:
venues["venue.categories"][0]

[{'id': '4bf58dd8d48988d1e0931735',
  'name': 'Coffee Shop',
  'pluralName': 'Coffee Shops',
  'shortName': 'Coffee Shop',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
   'suffix': '.png'},
  'primary': True}]

In [98]:
url = f"https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={LAT},{LONG}&radius=500&limit={LIMIT}"
results = requests.get(url).json()['response']['groups'][0]['items']

venues = json_normalize(results)
venues['venue.categories'] = venues['venue.categories'].apply(lambda x: x[0]['name'])
venues = venues[['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
venues.columns = ['Venue_Name', 'Venue_Category', 'Venue_Latitude', 'Venue_Longitude']
venues.head()

Unnamed: 0,Venue_Name,Venue_Category,Venue_Latitude,Venue_Longitude
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Impact Kitchen,Restaurant,43.656369,-79.35698
4,Body Blitz Spa East,Spa,43.654735,-79.359874
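
The lambda `x[0]['name']` would raise an `IndexError` if a venue came back with an empty category list. Here is a hedged sketch of a safer accessor. The helper name `first_category_name` and the `'Unknown'` fallback are assumptions for illustration, not part of the Foursquare API.

```python
def first_category_name(categories, default="Unknown"):
    """Return the name of the first category dict, or default if there is none."""
    if not categories:
        return default
    return categories[0].get("name", default)
```

It could be passed straight to `apply`, for example `venues['venue.categories'].apply(first_category_name)`.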


### Getting venue info for all neighborhoods
We now explore up to 100 venues within a 500 m radius of each neighborhood.

In [99]:
def getNearbyVenues(boroughs, neighborhoods, LATs, LONGs):
    
    # empty frame with the final column layout; one block of rows is appended per neighborhood
    venues_df = pd.DataFrame(columns = ['Borough',
                                        'Neighborhood',
                                        'Neighborhood Latitude',
                                        'Neighborhood Longitude',
                                        'Venue',
                                        'Venue Category',
                                        'Venue Latitude',
                                        'Venue Longitude'])
    
    for borough, neighborhood, LAT, LONG in zip(boroughs, neighborhoods, LATs, LONGs):
        url = f"https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={LAT},{LONG}&radius=500&limit={LIMIT}"
        results = requests.get(url).json()['response']['groups'][0]['items']

        venues = json_normalize(results)
        venues['venue.categories'] = venues['venue.categories'].apply(lambda x: x[0]['name'])
        venues = venues[['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
        venues.columns = ['Venue', 'Venue Category', 'Venue Latitude', 'Venue Longitude']
        
        venues.insert(0,'Borough','')
        venues['Borough'] = borough
        
        venues.insert(1,'Neighborhood','')
        venues['Neighborhood'] = neighborhood
        
        venues.insert(2,'Neighborhood Latitude','')
        venues['Neighborhood Latitude'] = LAT
        
        venues.insert(3,'Neighborhood Longitude','')
        venues['Neighborhood Longitude'] = LONG
            
        venues_df = pd.concat([venues_df, venues])
    
    return venues_df.reset_index(drop=True)
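
Growing `venues_df` with `pd.concat` inside the loop copies the accumulated frame on every iteration. The sketch below collects one frame per neighborhood and concatenates once at the end. `fetch_items` is a hypothetical stand-in for the Foursquare request so that the example runs offline, and the item layout is reduced to the four fields the notebook keeps.

```python
import pandas as pd

def collect_venues(rows, fetch_items):
    """rows: iterable of (borough, neighborhood, lat, lng) tuples.
    fetch_items: callable (lat, lng) -> list of dicts with keys
    'name', 'category', 'lat', 'lng' (hypothetical, simplified layout)."""
    frames = []
    for borough, neighborhood, lat, lng in rows:
        items = fetch_items(lat, lng)
        if not items:
            continue  # nothing returned for this neighborhood
        frame = pd.DataFrame(items).rename(columns={
            "name": "Venue", "category": "Venue Category",
            "lat": "Venue Latitude", "lng": "Venue Longitude"})
        # prepend the neighborhood-level columns
        frame.insert(0, "Borough", borough)
        frame.insert(1, "Neighborhood", neighborhood)
        frame.insert(2, "Neighborhood Latitude", lat)
        frame.insert(3, "Neighborhood Longitude", lng)
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # a single concatenation at the end instead of one per iteration
    return pd.concat(frames, ignore_index=True)
```

Skipping neighborhoods with no results also avoids a failure when the response contains no items, since the column selection in `getNearbyVenues` expects at least one venue.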

In [100]:
toronto_venues = getNearbyVenues(toronto_df['Borough'], toronto_df["Neighborhood"], toronto_df["Latitude"], toronto_df["Longitude"])
toronto_venues.head()

Unnamed: 0,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
0,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,Roselle Desserts,Bakery,43.653447,-79.362017
2,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,Impact Kitchen,Restaurant,43.656369,-79.35698
4,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,Body Blitz Spa East,Spa,43.654735,-79.359874


In [101]:
toronto_venues.shape

(1607, 8)

**Let's check how many venues were returned for each neighborhood**

In [102]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
Berczy Park,58,58,58,58,58,58,58
"Brockton , Parkdale Village , Exhibition Place",24,24,24,24,24,24,24
"CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport",15,15,15,15,15,15,15
Central Bay Street,70,70,70,70,70,70,70
Christie,15,15,15,15,15,15,15
Church and Wellesley,79,79,79,79,79,79,79
"Commerce Court , Victoria Hotel",100,100,100,100,100,100,100
Davisville,34,34,34,34,34,34,34
Davisville North,10,10,10,10,10,10,10
"Dufferin , Dovercourt Village",14,14,14,14,14,14,14


**Let's find out how many unique categories can be curated from all the returned venues**

In [103]:
print(f'There are {len(toronto_venues["Venue Category"].unique())} unique categories.')

There are 236 unique categories.


In [104]:
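# one-hot encode the venue categories
toronto_onehot = pd.get_dummies(toronto_venues[["Venue Category"]], prefix="", prefix_sep="")
# 'Neighborhood' is itself one of the venue categories; drop it so it does not clash with the label column
toronto_onehot.drop('Neighborhood', axis=1, inplace=True)
# add the neighborhood name back as the first column
toronto_onehot.insert(loc=0, column='Neighborhood', value=toronto_venues['Neighborhood'])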
toronto_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park , Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [105]:
toronto_onehot.shape

(1607, 236)

**Next, let's group rows by neighborhood and take the mean of each one-hot column, which gives the share of a neighborhood's venues that fall in each category**

In [106]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
# toronto_grouped.iloc[0, :].iloc[1:]
toronto_grouped.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0
1,"Brockton , Parkdale Village , Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower , King and Spadina , Railway Lands , ...",0.066667,0.066667,0.066667,0.133333,0.2,0.066667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.014286
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
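
To make the grouping step concrete, here is a toy example with made-up venues (not Foursquare data). One-hot encoding turns each venue into a row of 0/1 indicators, and the per-neighborhood mean of those indicators is the share of venues in each category.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bakery", "Park", "Park"],
})

# one indicator column per category, one row per venue
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# the mean of the indicators is the fraction of venues in each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
```

Neighborhood A has four venues, two of them coffee shops, so its `Coffee Shop` value is 0.5. Neighborhood B has only parks, so its `Park` value is 1.0.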


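**We will now find the top 10 venue categories for each neighborhood (the columns are labelled "Most Visited", but the ranking is by how often a category appears among the returned venues)**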

In [107]:
def return_most_common_venues(row, num_top_venues):
    row_sorted = row.sort_values(ascending=False)
    return row_sorted.index.values[0:num_top_venues]

In [108]:
num_top_venues = 10
cols = ['Neighborhood']
indicators = ['st', 'nd', 'rd'] 

for i in range(num_top_venues):
    if i<3:
        cols.append(f'{i+1}{indicators[i]} Most Visited Category')
    else:
        cols.append(f'{i+1}th Most Visited Category')
    
top_neighborhood_categories = pd.DataFrame(columns=cols)
top_neighborhood_categories['Neighborhood'] = toronto_grouped['Neighborhood']

for row in range(toronto_grouped.shape[0]):
    top_neighborhood_categories.iloc[row, 1:] = return_most_common_venues(toronto_grouped.iloc[row, 1:], num_top_venues)
    
top_neighborhood_categories.head()

Unnamed: 0,Neighborhood,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Farmers Market,Seafood Restaurant,Beer Bar,Restaurant,Creperie,Department Store
1,"Brockton , Parkdale Village , Exhibition Place",Café,Breakfast Spot,Coffee Shop,Bakery,Convenience Store,Restaurant,Stadium,Furniture / Home Store,Nightclub,Climbing Gym
2,"CN Tower , King and Spadina , Railway Lands , ...",Airport Service,Airport Lounge,Airport,Bar,Sculpture Garden,Rental Car Location,Plane,Airport Food Court,Boat or Ferry,Harbor / Marina
3,Central Bay Street,Coffee Shop,Café,Sandwich Place,Restaurant,Salad Place,Spa,Department Store,Bubble Tea Shop,Japanese Restaurant,Italian Restaurant
4,Christie,Grocery Store,Café,Park,Baby Store,Restaurant,Nightclub,Italian Restaurant,Coffee Shop,Candy Store,Molecular Gastronomy Restaurant
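
The column labels above hard-code `st`, `nd` and `rd` for the first three positions and `th` for the rest. That is correct up to 10, but it would produce `21th` if `num_top_venues` were raised past 20. Below is a small sketch of a general ordinal helper; the name `ordinal` is illustrative.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# same labels as in the notebook, built for the top 10 categories
cols = ["Neighborhood"] + [f"{ordinal(i)} Most Visited Category" for i in range(1, 11)]
```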


## 3. Clustering the neighborhoods
**We will split the neighborhoods into 5 clusters using the K-means algorithm**

In [109]:
kclusters = 5
toronto_grouped_cluster = toronto_grouped.iloc[:, 1:]

kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10)
kmeans.fit(toronto_grouped_cluster)
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 2,
       1, 1, 1, 1, 3, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1], dtype=int32)
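
Because no `random_state` is set, the labels above can differ between runs, and the label array shows most neighborhoods falling into a single cluster. The sketch below uses synthetic data rather than the notebook's features. It fixes the seed and compares inertia across several values of k (the elbow heuristic), which is one way the choice of 5 clusters could be examined.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic stand-in for toronto_grouped_cluster: three loose blobs in 4 dimensions
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5]], dtype=float)
features = np.vstack([c + rng.normal(scale=0.5, size=(20, 4)) for c in centers])

# inertia (within-cluster sum of squares) for a range of k
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(features)
    inertias[k] = model.inertia_

# refit with the same seed to confirm that the labels are reproducible
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
```

Plotting `inertias` against k and looking for the point where the curve flattens gives a rough guide to a reasonable number of clusters.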

**We now add the cluster labels to top_neighborhood_categories and merge it with toronto_df, so that each neighborhood carries its coordinates, its cluster and its top categories**

In [110]:
top_neighborhood_categories.insert(loc= 0, column= 'Cluster Label', value= kmeans.labels_)

toronto_merged = toronto_df.join(top_neighborhood_categories.set_index('Neighborhood'), how='inner', on='Neighborhood')
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,1,Coffee Shop,Pub,Café,Bakery,Park,Theater,Dessert Shop,Spa,Shoe Store,Chocolate Shop
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Bookstore,Theater,Lingerie Store
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Café,Coffee Shop,Italian Restaurant,Restaurant,Cosmetics Shop,Clothing Store,Cocktail Bar,Beer Bar,Gym,Moroccan Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Pub,Trail,Museum,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Farmers Market,Seafood Restaurant,Beer Bar,Restaurant,Creperie,Department Store


**At last, we show the neighborhood clusters in a Folium map!**

In [111]:
toronto_cluster_map = folium.Map(location=[latitude, longitude], zoom_start=12)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i**7) for i in colors_array]

markers_colors = []
for lat, lon, neighborhood, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Label']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_cluster_map)
       
toronto_cluster_map
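
The palette code above builds `ys` only for its length, and `rainbow[cluster-1]` relies on Python's negative indexing to colour cluster 0. Here is a simpler sketch that yields one hex colour per cluster and indexes it by the label directly. It is an alternative, not the notebook's exact colours, since it omits the `i**7` darkening. The names `palette` and `cluster_color` are illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# one evenly spaced colour from the rainbow colormap per cluster
palette = [colors.to_hex(cm.rainbow(v)) for v in np.linspace(0, 1, kclusters)]

def cluster_color(cluster):
    """Map a cluster label (0 .. kclusters-1) to its hex colour."""
    return palette[int(cluster)]
```

In the marker loop, `color=cluster_color(cluster)` and `fill_color=cluster_color(cluster)` would then replace the `rainbow[cluster-1]` lookups.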

## And here is the top venue distribution of the 5 clusters

### Cluster 0

In [112]:
toronto_merged[toronto_merged['Cluster Label'] == 0]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
18,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Bus Line,Business Service,Swim School,Airport,Movie Theater,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant
21,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307,0,Park,Trail,Bus Line,Sushi Restaurant,Jewelry Store,Airport,Moroccan Restaurant,Mediterranean Restaurant,Men's Store,Metro Station


### Cluster 1

In [113]:
toronto_merged[toronto_merged['Cluster Label'] == 1]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,1,Coffee Shop,Pub,Café,Bakery,Park,Theater,Dessert Shop,Spa,Shoe Store,Chocolate Shop
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Bookstore,Theater,Lingerie Store
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Café,Coffee Shop,Italian Restaurant,Restaurant,Cosmetics Shop,Clothing Store,Cocktail Bar,Beer Bar,Gym,Moroccan Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Pub,Trail,Museum,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Farmers Market,Seafood Restaurant,Beer Bar,Restaurant,Creperie,Department Store
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Café,Sandwich Place,Restaurant,Salad Place,Spa,Department Store,Bubble Tea Shop,Japanese Restaurant,Italian Restaurant
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564,1,Grocery Store,Café,Park,Baby Store,Restaurant,Nightclub,Italian Restaurant,Coffee Shop,Candy Store,Molecular Gastronomy Restaurant
7,M5H,Downtown Toronto,"Richmond , Adelaide , King",43.650571,-79.384568,1,Coffee Shop,Café,Hotel,Clothing Store,Gym,Restaurant,Vegetarian / Vegan Restaurant,Bar,Thai Restaurant,Bakery
8,M6H,West Toronto,"Dufferin , Dovercourt Village",43.669005,-79.442259,1,Pharmacy,Bakery,Brewery,Café,Bank,Music Venue,Supermarket,Middle Eastern Restaurant,Grocery Store,Park
10,M5J,Downtown Toronto,"Harbourfront East , Union Station , Toronto Is...",43.640816,-79.381752,1,Coffee Shop,Aquarium,Café,Hotel,Restaurant,Scenic Lookout,Brewery,Fried Chicken Joint,Sports Bar,Park


### Cluster 2

In [114]:
toronto_merged[toronto_merged['Cluster Label'] == 2]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
29,M4T,Central Toronto,"Moore Park , Summerhill East",43.689574,-79.38316,2,Park,Airport,Museum,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop


### Cluster 3

In [115]:
toronto_merged[toronto_merged['Cluster Label'] == 3]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
9,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106,3,Park,Convenience Store,Intersection,Airport,Museum,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant
33,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,3,Park,Playground,Trail,Museum,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant


### Cluster 4

In [116]:
toronto_merged[toronto_merged['Cluster Label'] == 4]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Visited Category,2nd Most Visited Category,3rd Most Visited Category,4th Most Visited Category,5th Most Visited Category,6th Most Visited Category,7th Most Visited Category,8th Most Visited Category,9th Most Visited Category,10th Most Visited Category
19,M5N,Central Toronto,Roselawn,43.711695,-79.416936,4,Garden,Home Service,Ice Cream Shop,Modern European Restaurant,Museum,Movie Theater,Moroccan Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Airport
