<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto - Canada</font></h1>

## Introduction

This lab session is part of the **Coursera-IBM course "Applied Data Science Capstone"** and is the final task for Week 03.\
Students are asked to perform several activities that demonstrate the skills they have acquired.

Similarly to the lab session on segmentation and clustering in New York City, this peer-reviewed task requires:\
a) Data extraction from a Wikipedia web page and data preparation (wrangling, cleansing, etc.);\
b) Conversion of addresses into their equivalent latitude and longitude values;\
c) Use of the Foursquare API to explore neighborhoods in Toronto - Canada;\
d) Venue clustering with the _k_-means clustering algorithm;\
e) Visualization of the clusters with the Folium library.

All these activities will be divided into three parts.

## Summary:
**Part One =>** From Data Extraction (Wikipedia) to a "cleansed" Pandas Dataframe Building (3 columns)\
**Part Two =>** From "cleansed" Pandas Dataframe Building (3 columns) to "expanded" Pandas Dataframe (5 columns)\
**Part Three =>** From "expanded" Pandas Dataframe Building (5 columns) to Map display of Clustering.

**Part One => From Data Extraction (Wikipedia) to a "cleansed" Pandas Dataframe Building (3 columns)**

The instructions for this task are not clear about which method should be used for "scraping" the Wikipedia page data, nor about which methods students are not allowed to employ.\
So, after unsuccessfully trying to extract the data by examining the page's HTML code and by other routes, I first tried something very simple:
##### a) to copy all "Post Code - Borough - Neighbourhood" data (Ctrl-C);
##### b) to paste them into an Excel spreadsheet (Ctrl-V).
This path provided me with a file to be uploaded into this notebook, which means the pandas dataframe would be built not from a csv file but from an xlsx file.
The cells below keep that route commented out and read the table directly from Wikipedia with the pandas `read_html` method instead.

In [75]:
import pandas as pd
import numpy as np

In [2]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/bd/78/56a7c88a57d0d14945472535d0df9fb4bbad7d34ede658ec7961635c790e/lxml-4.6.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
     |████████████████████████████████| 5.5MB 7.7MB/s eta 0:00:01
Installing collected packages: lxml
Successfully installed lxml-4.6.2
Note: you may need to restart the kernel to use updated packages.


In [15]:
#!wget -O toronto_postal_codes.xlsx https://github.com/Alexandre-Neri/Alles/blob/main/toronto_postal_codes.xlsx

In [2]:
# df=pd.read_excel("toronto_postal_codes_mk2.xlsx")

In [3]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [4]:
display(df)

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

**Note:** 
As we can see above, extracting the 'Toronto Postal Codes' table from the Wikipedia page with the pandas `read_html` method gives us a list of DataFrames, one per table found on the page.
The information needed to perform this session's task is in its first position (df[0]), which is already a pandas DataFrame.
Therefore the only further step required is to select that first element of the list.
Let's check this out:

In [5]:
display(df[0])
type(df[0])

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


pandas.core.frame.DataFrame
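
Selecting `df[0]` works as long as the wanted table stays first on the page. As a hedged alternative, the sketch below picks the table by its column names instead of its position. The helper `pick_table` is hypothetical (it is not part of pandas), and the list `tables` is made-up data standing in for the output of `read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame in `tables` that has all `required_columns`.

    `pick_table` is a hypothetical helper written for this sketch.
    """
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError("no table with columns {}".format(required_columns))

# Toy stand-in for the list returned by pd.read_html (made-up data)
tables = [
    pd.DataFrame({'Code': ['x'], 'Note': ['navigation box']}),
    pd.DataFrame({'Postal Code': ['M3A'], 'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]
postal = pick_table(tables, ['Postal Code', 'Borough', 'Neighbourhood'])
print(type(postal).__name__, postal.shape)
```

If the page layout changes and no table matches, the helper raises an error instead of silently returning the wrong table.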

In [6]:
df = pd.DataFrame(df[0])
type(df)

pandas.core.frame.DataFrame

In [7]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
df.shape

(180, 3)

In [9]:
print('After extracting Wikipedia data and building a Pandas Dataframe, it displays {} rows and {} columns.'
      .format(df.shape[0], df.shape[1]))
print()

After extracting Wikipedia data and building a Pandas Dataframe, it displays 180 rows and 3 columns.



**Step 01:** as requested by the instructions, step 01 consists of eliminating all "Not assigned" tags in the "Borough" column. To accomplish this, it helps to know beforehand how many there are in this dataframe.

In [10]:
df['Borough'].value_counts().to_frame()

Unnamed: 0,Borough
Not assigned,77
North York,24
Downtown Toronto,19
Scarborough,17
Etobicoke,12
Central Toronto,9
West Toronto,6
York,5
East Toronto,5
East York,5


In [11]:
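# look the count up by its label, so the result does not depend on 'Not assigned' being the most frequent value
df['Borough'].value_counts()['Not assigned']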

77

In [12]:
print("As there are {} 'Not assigned' tags on `Borough` column, we expect new dataframe to display {} rows after eliminating this tag."
      .format(df['Borough'].value_counts()['Not assigned'], df.shape[0] - df['Borough'].value_counts()['Not assigned']))
print()

As there are 77 'Not assigned' tags on `Borough` column, we expect new dataframe to display 103 rows after eliminating this tag.



**Step 02:** Actual elimination of the "Not assigned" tags in the "Borough" column:
##### For the sake of clarity and variable preservation, whenever a dataframe is modified, it is assigned a new name.

In [13]:
# keep only rows with an assigned Borough; drop=True discards the old index instead of adding it as a column
new_df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
new_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [14]:
new_df.shape

(103, 3)
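
The row count expected in Step 01 and the filter applied in Step 02 can be tied together in one check. The following is a minimal sketch on made-up data (the frame `toy` is an assumption for illustration, not the Wikipedia table): it counts the 'Not assigned' rows with a Boolean mask, filters with the same mask, and asserts that the result has the expected length.

```python
import pandas as pd

# Made-up data with the same column names as the notebook's dataframe
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5Z', 'M8Z'],
    'Borough': ['Not assigned', 'North York', 'Not assigned', 'Etobicoke'],
})

not_assigned = toy['Borough'] == 'Not assigned'   # Boolean Series, one value per row
expected_rows = len(toy) - not_assigned.sum()      # rows that should survive the filter

filtered = toy[~not_assigned].reset_index(drop=True)
assert len(filtered) == expected_rows
print(filtered)
```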

**Step 03:** consists of checking whether there is any "Not assigned" tag in the "Neighbourhood" column. If so, it must be replaced by the corresponding "Borough" tag. Let's get started by checking if there is any:

In [15]:
new_df['Neighbourhood'][2]

'Regent Park, Harbourfront'

In [16]:
counter = 0
list_with = []   # this list will show all neighbourhood names that were updated to their borough names
index_with = []  # this list will show all index positions of neighbourhood names that were updated to their borough names
comp = new_df.shape[0]

for i in range(comp):
    #print(i, counter, new_df['Neighbourhood'][i])
    if new_df['Neighbourhood'][i] == 'Not assigned':
        counter = counter + 1
        # write through .loc so the update reaches new_df itself (chained indexing may write to a copy)
        new_df.loc[i, 'Neighbourhood'] = new_df.loc[i, 'Borough']
        list_with.append(new_df['Neighbourhood'][i])
        index_with.append(i)
                
if counter == 0:
    print("There is not a single 'Not assigned' tag on the column Neighbourhood.")
    print("Therefore, No Index Positions were updated to Borough names.")
else:
    print("There is/are {} 'Not assigned' tags on the column Neighbourhood.". format(counter))
    print("Therefore, The following Index Position(s) ({}), was/were updated to their Borough names({}).".format(index_with, list_with))
    
print()        

There is not a single 'Not assigned' tag on the column Neighbourhood.
Therefore, No Index Positions were updated to Borough names.
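
The same replacement can be written without an explicit loop. This is a sketch on made-up data (the frame `toy` and its values are assumptions for illustration): a Boolean mask selects the rows to update, and `.loc` copies the Borough name into them in one step.

```python
import pandas as pd

# Made-up rows; the second one has no neighbourhood assigned
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Mimico NW'],
})

mask = toy['Neighbourhood'] == 'Not assigned'
updated_index = toy.index[mask].tolist()          # index positions that will be changed
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print("{} row(s) updated at index {}".format(len(updated_index), updated_index))
```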



In [17]:
new_df.tail()

Unnamed: 0,Postal Code,Borough,Neighbourhood
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
102,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


**Step 04:** consists of checking whether there are multiple "Neighbourhood" names for the same "Postal Code". If so, all "Neighbourhood" names must be placed in a single row **(for that common "Postal Code")** and separated with commas.

In [18]:
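# note: sort_values returns a new dataframe; as the result is not assigned, new_df keeps its original order
new_df.sort_values(by = 'Postal Code', ascending = True).reset_index()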
new_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [19]:
new_df.shape

(103, 3)

In [20]:
same = False
how_many = 0
start = new_df['Postal Code'][0]
print(start)
stretch = (len(new_df['Postal Code'])-1)
print(stretch)

for j in range(stretch):   # this 'for' loop will sweep all dataframe rows
    if new_df['Postal Code'][j+1] == start:
        start_inside = new_df['Postal Code'][j+1]
        same = True
        pos = j
        new_df['Postal Code'][pos] = new_df['Postal Code'][j+1]
        start = new_df['Postal Code'][j+1]
        how_many = how_many + 1
        
    else:
        start = new_df['Postal Code'][j+1]
        
if same == True:
    print("There is/are {} repeated Postal Codes whose content must be placed on a single row.".format(how_many))
    #insert function here.
else:
    print("There is not a single repeated Postal Code whose content must be placed on a single row.")
print()

M3A
102
There is not a single repeated Postal Code whose content must be placed on a single row.
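
The loop above compares each Postal Code with the one in the row before it, so it would only catch repeats that sit in consecutive rows. As a hedged alternative that does not depend on row order, the sketch below uses `duplicated` to count repeats and `groupby` to join the neighbourhood names. The frame `toy` is made-up data for illustration.

```python
import pandas as pd

# Made-up rows in which 'M5A' appears twice, in non-adjacent rows
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M1B', 'M5A'],
    'Borough': ['Downtown Toronto', 'Scarborough', 'Downtown Toronto'],
    'Neighbourhood': ['Regent Park', 'Malvern', 'Harbourfront'],
})

# duplicated() flags every occurrence after the first one
repeated = toy['Postal Code'].duplicated().sum()

# one row per Postal Code, with its neighbourhood names joined by commas
combined = (toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
               .agg(', '.join)
               .reset_index())
print(repeated)
print(combined)
```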



In [21]:
print("After performing all requested activities, final data frame shape is {} rows and {} columns."
      .format(new_df.shape[0], new_df.shape[1]))

After performing all requested activities, final data frame shape is 103 rows and 3 columns.


**Part Two => From "cleansed" Pandas Dataframe Building (3 columns) to "expanded" Pandas Dataframe (5 columns)**

**Step 05:**\
In this step, we download a **.csv file** using the `!wget` command.\
This csv file provides geographic coordinates (latitude and longitude) for every Postal Code in Toronto.\
After the download, the csv data is loaded into a pandas dataframe that will be merged with the dataframe we built in **Part One**.\
The result will be the basis for the Foursquare API usage.

**Note:** The csv file name is easy to obtain by clicking on the link provided in the "My Submission" tips, as it opens a window for saving the file and displays its name.

In [22]:
!wget -O Geospatial_Coordinates.csv http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv

--2020-12-05 17:36:05--  http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Resolving cocl.us (cocl.us)... 169.63.96.176, 169.63.96.194
Connecting to cocl.us (cocl.us)|169.63.96.176|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv [following]
--2020-12-05 17:36:05--  https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Connecting to cocl.us (cocl.us)|169.63.96.176|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-05 17:36:06--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.29.197
Connecting to ibm.box.com (ibm.box.com)|107.152.29.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.cs

In [23]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [24]:
geo_df.shape

(103, 3)

In [25]:
# let's sort the dataframe in ascending order
# (sort_values returns a new dataframe; the result is not assigned, so geo_df itself is unchanged)
geo_df.sort_values(by = 'Postal Code', ascending = True).reset_index()
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Step 06:** \
Now we have all the information needed to perform the clustering. However, it is spread across two different dataframes.\
As these two dataframes share a common reference ('Postal Code'), we can use it to verify which Postal Codes from **new_df** are also present in **geo_df** and link all the corresponding geographical coordinates in a single, brand new dataframe, **fsq_df**.

In [26]:
# let's turn all postal codes into a list in order to create the basis for comparison.
A = new_df['Postal Code'].tolist()

# Now let's look in the second dataframe for all postal codes from the first dataframe. The output is a Boolean Series
B = geo_df['Postal Code'].isin(new_df['Postal Code'].tolist())

C = geo_df[B]

# Finally, let's merge the two dfs on their common column to create our basis for Foursquare usage
fsq_df = pd.merge(new_df, C, on='Postal Code')

fsq_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [27]:
fsq_df.shape

(103, 5)

In [28]:
print("After performing all requested activities, final data frame shape is {} rows and {} columns.".
      format(fsq_df.shape[0], fsq_df.shape[1]))

After performing all requested activities, final data frame shape is 103 rows and 5 columns.
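
The merge in Step 06 silently drops any Postal Code that has no coordinates. A possible way to make such gaps visible is sketched below on made-up data (the frames `left` and `right`, and the 'M9Z' row, are assumptions for illustration): a left merge keeps every row, `indicator=True` records where each row came from, and `validate='one_to_one'` raises an error if a Postal Code is repeated on either side.

```python
import pandas as pd

# Made-up frames; 'M9Z' has no coordinates on the right-hand side
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Toy Borough']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

merged = pd.merge(left, right, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# rows present only in the left frame are the ones without coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

In the notebook both dataframes have 103 rows and the merged one also has 103, so no Postal Code was lost there.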


**Part Three => From "expanded" Pandas Dataframe Building (5 columns) to Map display of Clustering.**

Let's import the remaining required libraries:

In [29]:
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print()
print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done


  current version: 4.9.1
  latest version: 4.9.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    attrs-20.3.0               |     pyhd3deb0d_0          41 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36he6145b8_1001         347 KB  conda-forge
    ca-certificates-2020.11.8  |       ha878542_0         145 KB  c

**Note:**\
Before anything else, it is worth taking a closer look at the distribution of the city's Boroughs to assess how many Postal Codes each one represents: 

In [44]:
fsq_df['Borough'].value_counts().to_frame()

Unnamed: 0,Borough
North York,24
Downtown Toronto,19
Scarborough,17
Etobicoke,12
Central Toronto,9
West Toronto,6
York,5
East York,5
East Toronto,5
Mississauga,1


A map of the City of Toronto helps with the clustering process, as it provides a better visualization of the city's Boroughs:

In [30]:
# from Google we get Toronto geographic coordinates

latToronto = 43.651070
lngToronto = -79.347015

In [31]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latToronto, lngToronto], zoom_start=10)

# add markers to map
for lat, lng, borough in zip(fsq_df['Latitude'], fsq_df['Longitude'], fsq_df['Borough']):
    label = '{}'.format(borough) # text shown in the popup when the user clicks a marker
    label = folium.Popup(label, parse_html=True)   #nature of label
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Step 07:** \
From initial observations of the Toronto city map, there seem to be **no obvious** grounds for clustering, as the Boroughs look evenly spread across the city. There is a concentration of points near Lake Ontario (Financial District), but that alone does not say much.\
Perhaps further research on which sort of venue is predominant in each Borough might shed some light on the subject. 

##### Define Foursquare Credentials and Version

In [45]:
CLIENT_ID = '113ZWQSUYLUGMKP51IRZVZRVDHJBACXCYTXIXGD5GM5PVO5B' # your Foursquare ID
CLIENT_SECRET = '5QXPGLDYT1SVGN3FVVLMZEV0EKJMYPMJNMJAMF2X3ZZEW0SP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 113ZWQSUYLUGMKP51IRZVZRVDHJBACXCYTXIXGD5GM5PVO5B
CLIENT_SECRET:5QXPGLDYT1SVGN3FVVLMZEV0EKJMYPMJNMJAMF2X3ZZEW0SP
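
Credentials typed into a cell end up in the saved notebook and in its printed output. One possible way to keep them out, sketched here under stated assumptions, is to read them from environment variables. The helper `load_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical choices for this sketch, not names required by Foursquare.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are
    assumptions made for this sketch, not names required by Foursquare.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('environment variable {} is not set'.format(missing))

# Demonstration with a plain dict standing in for os.environ (placeholder values)
client_id, client_secret = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'my-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-secret',
})
print('CLIENT_ID loaded:', bool(client_id))
```

In a real session the function would be called without arguments, so that it reads `os.environ`.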


##### Define a function to sweep all the city's Boroughs in order to retrieve the venues of interest in each Borough:

In [59]:
def exploreVenues(names, latitudes, longitudes, radius=500):
    
    got_venues=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name, lat, lng)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
                
        # return only relevant information for each nearby venue and build a list type object
        got_venues.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
            

    # flatten the list of per-location lists into a single list of rows
    near_venues = pd.DataFrame([item for venue_list in got_venues for item in venue_list])
    near_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(near_venues)
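
The function above assumes that every response contains `groups[0]` and that every venue has at least one category. A more defensive way to flatten one response is sketched below. The helper `flatten_explore_response` is hypothetical, and `fake_payload` is made-up data that only mirrors the key layout `exploreVenues` reads (response, groups, items, venue); it is not real Foursquare output.

```python
def flatten_explore_response(name, lat, lng, payload):
    """Turn one 'venues/explore' JSON payload into a list of row tuples.

    Hypothetical helper: the key layout mirrors what exploreVenues reads
    (response -> groups[0] -> items -> venue); venues without a category
    get None instead of raising IndexError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0]['items'] if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up payload with the same shape, for illustration only
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Toy Cafe', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}
rows = flatten_explore_response('Toy Borough', 43.65, -79.38, fake_payload)
empty = flatten_explore_response('Toy Borough', 43.65, -79.38, {'response': {}})
print(len(rows), len(empty))
```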

In [60]:
toronto_venues = exploreVenues(names = fsq_df['Borough'], latitudes = fsq_df['Latitude'], longitudes = fsq_df['Longitude'])
toronto_venues

Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,North York,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,North York,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,North York,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,North York,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
...,...,...,...,...,...,...,...
2136,Etobicoke,43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2137,Etobicoke,43.628841,-79.520999,Once Upon A Child,43.631075,-79.518290,Kids Store
2138,Etobicoke,43.628841,-79.520999,Value Village,43.631269,-79.518238,Thrift / Vintage Store
2139,Etobicoke,43.628841,-79.520999,Kingsway Boxing Club,43.627254,-79.526684,Gym


In [64]:
toronto_venues.groupby('Borough').count()

Unnamed: 0_level_0,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Central Toronto,104,104,104,104,104,104
Downtown Toronto,1248,1248,1248,1248,1248,1248
East Toronto,119,119,119,119,119,119
East York,79,79,79,79,79,79
Etobicoke,74,74,74,74,74,74
Mississauga,13,13,13,13,13,13
North York,241,241,241,241,241,241
Scarborough,90,90,90,90,90,90
West Toronto,153,153,153,153,153,153
York,20,20,20,20,20,20


In [85]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


**Step 08:**\
Now that we have more information about which venues are present in each borough of Toronto, we must transform this information (categorical values) into numerical values (absolute or relative ones) so that clustering is feasible.\
**Summarizing:**\
a) Spread the venue categories of each borough into columns,\
b) Calculate their relevance to it (frequency of occurrence),\
c) Select only a few venues (Top Five),\
d) Build a new dataframe which will serve as the basis for clustering,\
e) Finally, if possible, try to draw conclusions from the clustering.

In [67]:
# one hot encoding, that is, creating one column for each category found in 'toronto_venues'
# the output is binary, that is, for each column a row either belongs (1) or does not belong (0) to that category
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the Borough column back to the dataframe because the previous line only breaks the venue categories into columns
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move the Borough column to the first position.
# 'fixed_columns' does not comprise any data, it is only a list of column names
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1]) # establishes the column order

toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2136,Etobicoke,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2137,Etobicoke,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2138,Etobicoke,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2139,Etobicoke,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [68]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009615,...,0.0,0.0,0.0,0.0,0.009615,0.0,0.0,0.0,0.0,0.009615
1,Downtown Toronto,0.0,0.000801,0.000801,0.000801,0.000801,0.001603,0.001603,0.000801,0.012821,...,0.002404,0.0,0.011218,0.001603,0.004006,0.0,0.00641,0.0,0.0,0.005609
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02521,...,0.0,0.0,0.0,0.0,0.0,0.0,0.008403,0.0,0.0,0.016807
3,East York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,0.0,0.012658
4,Etobicoke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0
5,Mississauga,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North York,0.004149,0.0,0.004149,0.0,0.0,0.0,0.0,0.0,0.008299,...,0.0,0.0,0.0,0.004149,0.008299,0.0,0.0,0.0,0.016598,0.0
7,Scarborough,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011111,...,0.011111,0.0,0.0,0.0,0.011111,0.0,0.0,0.0,0.0,0.0
8,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.013072,0.0,0.013072,0.0,0.006536,0.0,0.0,0.013072
9,York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0
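
To see why the mean of one-hot columns is a frequency, it helps to run the same two steps on a tiny example. The sketch below uses made-up boroughs 'A' and 'B' (assumptions for illustration): each group mean is the share of that borough's venues falling in the category, so every row of the result sums to 1.

```python
import pandas as pd

# Made-up venues: borough A has 4 venues, borough B has 2
venues = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank', 'Café', 'Café'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Borough', venues['Borough'])

# mean of 0/1 values per borough = relative frequency of each category
grouped = onehot.groupby('Borough').mean().reset_index()
print(grouped)
```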


In [70]:
num_top_venues = 5

for hood in toronto_grouped['Borough']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == hood].T.reset_index()  # T=transpose
    temp.columns = ['venue','freq']  # just column naming
    temp = temp.iloc[1:]             # slicing the DF from pos 1 on...
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Toronto----
            venue  freq
0     Coffee Shop  0.08
1  Sandwich Place  0.07
2            Café  0.06
3            Park  0.06
4     Pizza Place  0.05


----Downtown Toronto----
                venue  freq
0         Coffee Shop  0.11
1                Café  0.05
2          Restaurant  0.04
3               Hotel  0.03
4  Seafood Restaurant  0.02


----East Toronto----
                venue  freq
0         Coffee Shop  0.07
1    Greek Restaurant  0.06
2  Italian Restaurant  0.04
3             Brewery  0.04
4              Bakery  0.03


----East York----
                 venue  freq
0                 Bank  0.05
1          Coffee Shop  0.05
2         Intersection  0.05
3  Sporting Goods Shop  0.04
4         Burger Joint  0.04


----Etobicoke----
                  venue  freq
0           Pizza Place  0.11
1           Coffee Shop  0.07
2        Sandwich Place  0.07
3              Pharmacy  0.05
4  Fast Food Restaurant  0.04


----Mississauga----
                      venue  f

In [73]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [86]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
borough_venues_sorted = pd.DataFrame(columns=columns)
borough_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    borough_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

borough_venues_sorted.head(10)

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Café,Park,Pizza Place
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant
2,East Toronto,Coffee Shop,Greek Restaurant,Brewery,Italian Restaurant,Restaurant
3,East York,Coffee Shop,Bank,Intersection,Burger Joint,Sandwich Place
4,Etobicoke,Pizza Place,Coffee Shop,Sandwich Place,Pharmacy,Grocery Store
5,Mississauga,Coffee Shop,Hotel,Middle Eastern Restaurant,Intersection,Gym
6,North York,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place
7,Scarborough,Bakery,Coffee Shop,Intersection,Bank,Fast Food Restaurant
8,West Toronto,Café,Bar,Coffee Shop,Italian Restaurant,Restaurant
9,York,Park,Brewery,Trail,Tennis Court,Bus Line


In [90]:
borough_venues_sorted.shape

(10, 7)
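
The ranking step can also be tried on a small frequency table. The sketch below uses made-up numbers (the frame `freq` is an assumption for illustration) and sorts each row in descending order to keep the first `n` category names. Requesting a stable sort with `kind='mergesort'` is a choice made for this sketch so that categories with equal frequency come out in a reproducible order.

```python
import pandas as pd

# Made-up frequency table: one row per borough, one column per category
freq = pd.DataFrame(
    {'Bank': [0.10, 0.00], 'Café': [0.30, 0.20], 'Park': [0.60, 0.80]},
    index=pd.Index(['A', 'B'], name='Borough'),
)

n = 2
top = {
    borough: row.sort_values(ascending=False, kind='mergesort').index[:n].tolist()
    for borough, row in freq.iterrows()
}
print(top)
```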

### Clustering

In [87]:
# set number of clusters
kclusters = 5

# let's get started by dropping the 'Borough' column as it displays categorical data not needed for this step.
# reminder => toronto_grouped is the DF built by calculating the mean frequency of venue occurrence per borough
toronto_grouped_clustering = toronto_grouped.drop('Borough', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:10] 
display(kmeans.labels_)
display(len(kmeans.labels_))

array([1, 1, 1, 2, 4, 3, 1, 2, 1, 0], dtype=int32)

10
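
What _k_-means does with these frequency rows can be illustrated on a toy matrix. In the sketch below, `profiles` holds made-up venue frequencies for four imaginary boroughs (an assumption for illustration): the first two are dominated by the first category and the last two by the third, so with two clusters each pair is expected to share a label. `n_init=10` is set explicitly as a choice for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency profiles: rows are boroughs, columns are venue categories
profiles = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.05, 0.15, 0.80],
    [0.10, 0.10, 0.80],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(profiles)
labels = model.labels_
print(labels)
```

The label numbers themselves carry no meaning; only which rows share a label does. With 10 boroughs and 5 clusters, as in this notebook, several clusters necessarily contain one or two boroughs only.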

In [88]:
# add clustering labels to the previous primitive clustering DF
borough_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = fsq_df # because we need geographic coordinates to draw a map

# join toronto_merged with borough_venues_sorted to add the cluster label and most common venues to each row
toronto_merged = toronto_merged.join(borough_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant


In [89]:
# create map
map_clusters_toronto = folium.Map(location=[latToronto, lngToronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_toronto)
       
map_clusters_toronto

#### CLUSTERING CONCLUSIONS-INSIGHTS:

**Cluster 0 :** \
Area with a high occurrence of Parks (20%), which is the highest occurrence of any venue in Toronto;\
These Parks are mainly located in the borough of York.\
**Cluster 1 :**\
Area with a high occurrence of Coffee Shops (40%) spread across 5 different boroughs\
(North York, Central Toronto, Downtown Toronto, East Toronto and West Toronto).\
**Cluster 2 :**\
Area with a moderate occurrence of Banks (5%) in East York and Bakeries (6%) in Scarborough.\
**Cluster 3 :**\
Area with a high occurrence of Coffee Shops (15%) and Hotels (15%), which combined are the highest occurrence in Toronto;\
A single borough, Mississauga, accounts for this figure, which could be explained by the proximity of Toronto Pearson International Airport.\
**Cluster 4 :**\
Area with a high occurrence of Pizza Places (11%), which is the highest occurrence of this kind of venue in Toronto;\
Etobicoke is the only borough in this cluster. 