# Week #3 Assignment
## Segmenting and Clustering Neighborhoods in Toronto
### Enrico Tomassoli - August 18th, 2020

### 1. Importing Data from Wikipedia
In this section "read_html" is used to get the table from the Wikipedia page. The data will be used to segment and cluster the neighbourhoods in Toronto.

In [316]:
import pandas as pd
import numpy as np

from pandas.io.html import read_html
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitable = read_html(url, attrs={'class':'wikitable'})
df = wikitable[0]
print('Table from Wikipedia is',df.shape[0],'rows x',df.shape[1],'columns\n')
df.to_csv('Wikitable_DataFrame.csv')
df.head(5)

Table from Wikipedia is 180 rows x 3 columns



Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [317]:
print('Number of rows where "Borough" is "Not assigned":',df['Borough'][df['Borough']=='Not assigned'].count())

Number of rows where "Borough" is "Not assigned": 77


### 2. Data wrangling and cleaning
In this section we remove from the data frame the rows whose "Borough" is "Not assigned". Then both columns, "Neighbourhood" and "Borough", are checked to be sure that no "Not assigned" records are left.

In [318]:
df_Tor = df[df['Borough']!='Not assigned'].copy() # A copy is created, so the original dataframe is not overwritten

# This is a check to see if there is any "Not assigned" value left in the dataframe.
# .any() is True if at least one value in the column equals "Not assigned".
print('"Not assigned" value in the "Borough" column =',(df_Tor['Borough']=='Not assigned').any())
print('"Not assigned" value in the "Neighbourhood" column =',(df_Tor['Neighbourhood']=='Not assigned').any())

"Not assigned" value in the "Borough" column = False
"Not assigned" value in the "Neighbourhood" column = False

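A check like this is usually written with a boolean mask. The sketch below uses a small made-up frame (the rows are illustrative, not the Wikipedia data) and assumes only pandas. Note that comparing the method object `Series.unique` itself, without calling it, to a string always evaluates to `False`, so that form cannot detect leftover values.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the check
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

cleaned = toy[toy['Borough'] != 'Not assigned'].copy()

# Comparing the method object (no parentheses) to a string is always False
method_comparison = (cleaned['Borough'].unique == 'Not assigned')

# A boolean mask reports whether any value in the column equals the string
has_unassigned_in_cleaned = (cleaned['Borough'] == 'Not assigned').any()
has_unassigned_in_raw = (toy['Borough'] == 'Not assigned').any()

print(method_comparison, has_unassigned_in_cleaned, has_unassigned_in_raw)
```

The mask-based check changes its answer when the data changes, which is what a sanity check needs to do.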

### 3. Dataframe info
After the DataFrame was cleaned by removing the records (rows) that contained "Not assigned" in the columns, the main info is shown below. Since some rows were removed, the index was reset. The DataFrame was also sorted by "Postal Code".

In [4]:
if df_Tor['Postal Code'].shape==df_Tor['Postal Code'].unique().shape:
    print('Info: Postal code might be used as index')
else:
    print('Info: Postal codes are not unique and cannot be used as index')

# The values were sorted by 'Postal Code' and the index reset for clarity.
df_Tor = df_Tor.sort_values(by=['Postal Code'])
df_Tor.reset_index(inplace=True,drop=True)

# This section just provides the main info of the cleaned dataframe
print('Total Rows in the DataFrame =',df_Tor.shape[0])
print('Total Columns in the DataFrame =',df_Tor.shape[1])
print('The Headers (columns) in the DataFrame are:',df_Tor.columns.to_list())

df_Tor.head(10)

Info: Postal code might be used as index
Total Rows in the DataFrame = 103
Total Columns in the DataFrame = 3
The Headers (columns) in the DataFrame are: ['Postal Code', 'Borough', 'Neighbourhood']


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


### 4. Import coordinates and connect to Postal Code
Since the Geocoder package takes a long time to run and does not return any results, the geographical data is imported from the csv file. The code used is reported below for clarity only.

In [5]:
# The geographical coordinates are imported from the csv file
geo = pd.read_csv('Geospatial_Coordinates.csv')
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The DataFrame for the neighbourhoods and the one for the geographical locations are now available. Since both are sorted by "Postal Code", we check whether the "Postal Code" values match in both DataFrames. If so, we can just add the coordinate columns to the neighbourhood DataFrame.

In [6]:
# This part is to check if the Postal Code in "df_Tor" and the one in "geo" match. If so, we can just add the needed columns.

if geo[geo['Postal Code'] == df_Tor['Postal Code']].shape[0] == geo.shape[0]:
    print('"Postal Code" info match in "df_Tor" and "geo". "Coordinates" columns can be simply added to the DataFrame.')
    df_Tor[['Latitude','Longitude']] = geo[['Latitude','Longitude']]
else:
    print('Columns cannot be added. Sort both DataFrames and check compatibility')
df_Tor.head(5)

"Postal Code" info match in "df_Tor" and "geo". "Coordinates" columns can be simply added to the DataFrame.


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
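The positional assignment above relies on both frames being sorted identically and sharing the same index. A key-based join avoids that assumption. The following is a minimal sketch with `DataFrame.merge` on two rows taken from the tables above, deliberately listed in a different order; the `validate` argument asks pandas to raise an error if a postal code appears more than once on either side.

```python
import pandas as pd

# Two rows from the tables above, in different orders on purpose
neigh = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Join on the key instead of relying on row position
merged = neigh.merge(coords, on='Postal Code', how='left', validate='one_to_one')

# Rows without coordinates would show up as NaN
missing = int(merged['Latitude'].isna().sum())
print(merged)
print('Rows without coordinates:', missing)
```

A left join keeps every neighbourhood row, so postal codes missing from the coordinates file stay visible instead of being silently misaligned.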


### 5. DataFrame filtering
Now that the DataFrame contains only the records that are not "Not assigned", and each "Postal Code" is associated with its geographical coordinates, we can filter it to keep the rows whose "Borough" contains the word "Toronto".

In [8]:
# A new DataFrame is created for the records whose "Borough" contains the word "Toronto"
Toronto = df_Tor[df_Tor['Borough'].str.contains('Toronto')]
print('Filter results:\nOut of',df_Tor.shape[0],'postal codes',Toronto.shape[0],'are in Toronto [',
      round(100*Toronto.shape[0]/df_Tor.shape[0]),'%]')
Toronto.reset_index(inplace=True,drop=True)
Toronto.head(10)

Filter results:
Out of 103 postal codes 39 are in Toronto [ 38 %]


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


### 6. Map visualization
Using "geopy" and "Folium", we show the map of Toronto with the info from the DataFrame.

First we need to get the coordinates for the city of Toronto:

In [9]:
from geopy.geocoders import Nominatim

address = 'Toronto, ON'
geolocator = Nominatim(user_agent="Tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto (ON) are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto (ON) are 43.6534817, -79.3839347.


Then, we show the info on the map using "Folium". A map of Toronto is created from the latitude and longitude values, and markers for the neighbourhoods are added.
__NOTE__: The info shown on the map contains ONLY the boroughs that have "Toronto" in their names.

In [10]:
import folium

# create the map of Toronto
map_tor = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto['Latitude'], Toronto['Longitude'],
                                           Toronto['Borough'], Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

### 7. Foursquare exploration (one neighbourhood)

Using Foursquare we can explore the neighbourhoods in the city of Toronto. We start by defining the credentials and the info needed to use the Foursquare API. We explore just one neighbourhood first.

In [11]:
CLIENT_ID = 'N5ZC0IQJAGXXHSEKQC3Y3NFVLLZDT3XP313D0O54XT2KIALI' # Foursquare ID
CLIENT_SECRET = 'Q52RXWLO3GPQCWBF2OTI3B3UP0RSUB0FOTKYCU51IPWPISAU' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
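Credentials are generally safer when kept outside the notebook. One possible pattern, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions chosen for this example, not names required by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical and used here only for illustration.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

# Example with a plain dict standing in for the real environment
example_id, example_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'})
print(example_id)
```

In a notebook, the returned values could then be assigned to `CLIENT_ID` and `CLIENT_SECRET` so the rest of the cells stay unchanged.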

Let's first investigate the 1st neighbourhood in the DataFrame

In [12]:
#Toronto.loc[0, 'Neighborhood']
neigh_=Toronto.loc[0,'Neighbourhood']
lat_=Toronto.loc[0,'Latitude']
lon_=Toronto.loc[0,'Longitude']
print('1st Neighbourhood is "{}" located at LAT = {} and LON = {}'.format(neigh_,round(lat_,3),round(lon_,3)))

1st Neighbourhood is "The Beaches" located at LAT = 43.676 and LON = -79.293


The URL is created to explore the area around the neighbourhood. We initially assumed a limit of 20 with a max radius of 200 meters, but the response file contained a warning stating that there were not many results. The limit was therefore set to __100__ and the radius to __500__ meters. After that, we send a request and receive the info in .json format.

In [13]:
limit = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat_,lon_,radius,limit) 

import requests # library to handle requests
results = requests.get(url).json() # This variable is a dictionary and all the info is saved here
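The query string can also be assembled from a dictionary, so that each parameter is named and URL-encoded. The sketch below uses only the standard library; `build_explore_url` is a hypothetical helper and the credentials are placeholders. It only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lon, radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),  # latitude,longitude as one value
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters such as the comma in "ll"
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials, coordinates of "The Beaches" from the DataFrame
example_url = build_explore_url('ID', 'SECRET', '20180605', 43.676357, -79.293031)
print(example_url)
```

With `requests`, the same dictionary can be passed through the `params=` argument of `requests.get`, which performs the encoding itself.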

In the following blocks we take a more in-depth look at what kind of info is saved in the "results" dictionary (it is a multi-level dictionary).

In [14]:
print('"Results"is a ',type(results))
print('The keys are',results.keys())
print('The sub-keys are',results['response'].keys())
print('The sub-keys are',results['response']['groups'][0].keys()) # These are the keys for the first group.
print('The sub-keys are',results['response']['groups'][0]['items'][0])

"Results"is a  <class 'dict'>
The keys are dict_keys(['meta', 'response'])
The sub-keys are dict_keys(['headerLocation', 'headerFullLocation', 'headerLocationGranularity', 'totalResults', 'suggestedBounds', 'groups'])
The sub-keys are dict_keys(['type', 'name', 'items'])
The sub-keys are {'reasons': {'count': 0, 'items': [{'summary': 'This spot is popular', 'type': 'general', 'reasonName': 'globalInteractionReason'}]}, 'venue': {'id': '4bd461bc77b29c74a07d9282', 'name': 'Glen Manor Ravine', 'location': {'address': 'Glen Manor', 'crossStreet': 'Queen St.', 'lat': 43.67682094413784, 'lng': -79.29394208780985, 'labeledLatLngs': [{'label': 'display', 'lat': 43.67682094413784, 'lng': -79.29394208780985}], 'distance': 89, 'cc': 'CA', 'city': 'Toronto', 'state': 'ON', 'country': 'Canada', 'formattedAddress': ['Glen Manor (Queen St.)', 'Toronto ON', 'Canada']}, 'categories': [{'id': '4bf58dd8d48988d159941735', 'name': 'Trail', 'pluralName': 'Trails', 'shortName': 'Trail', 'icon': {'prefix': 'h

We define the function to get the category type found in the .json file.

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
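The same extraction can be done directly on one item of the raw response, before flattening. The sketch below assumes the nesting seen in the output above, where `item['venue']['categories']` is a list of dictionaries with a `name` key. `first_category` is a hypothetical helper and the sample items are abbreviated.

```python
def first_category(item):
    """Return the name of the first category of a venue item, or None if there is none."""
    categories = item.get('venue', {}).get('categories', [])
    return categories[0]['name'] if categories else None

# Abbreviated item, following the structure printed above
sample_item = {
    'venue': {
        'name': 'Glen Manor Ravine',
        'categories': [{'name': 'Trail', 'pluralName': 'Trails', 'shortName': 'Trail'}],
    }
}

# Made-up item without categories, to show the fallback
no_category_item = {'venue': {'name': 'Unnamed place', 'categories': []}}

print(first_category(sample_item))
print(first_category(no_category_item))
```

Returning `None` for an empty list mirrors the behaviour of `get_category_type` and avoids an `IndexError` on venues without a category.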

In this section we get the venues for the selected neighbourhood.

In [16]:
venues = results['response']['groups'][0]['items']

# flatten JSON (pd.json_normalize is the public entry point in pandas 1.0 and later)
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('{} venues were returned by Foursquare (limit = {} and radius = {} meters).'.format(nearby_venues.shape[0],limit,radius))
nearby_venues.head()

4 venues were returned by Foursquare (limit = 100 and radius = 500 meters).


Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


### 8. Foursquare exploration (Toronto Area)
The same thing done for one neighbourhood is repeated for all the neighbourhoods. We define a function to do this automatically.

We define the function __getNearbyVenues__.

In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, venues_limit = 50): # Defaults: 500 meters as radius and 50 for the venue limit
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            venues_limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

After the function above is defined, we can get the venues for the Toronto neighbourhoods.

In [32]:
toronto_venues = getNearbyVenues(names=Toronto['Neighbourhood'],
                                   latitudes=Toronto['Latitude'],
                                   longitudes=Toronto['Longitude'])

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West, Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High 

In [59]:
print('Size of Toronto venues found =',toronto_venues.shape[0],'\nBelow an example of what was found:')
toronto_venues.head(10)

Size of Toronto venues found = 1187 
Below an example of what was found:


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
5,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
7,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
8,"The Danforth West, Riverdale",43.679557,-79.352188,Mezes,43.677962,-79.350196,Greek Restaurant
9,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop


In [61]:
print('The list of neighbourhoods with the related number of venues:\n----------------------------------------------------------------')
toronto_venues.groupby('Neighborhood').count()['Venue Category']

The list of neighbourhoods with the related number of venues:
----------------------------------------------------------------


Neighborhood
Berczy Park                                                                                                   50
Brockton, Parkdale Village, Exhibition Place                                                                  24
Business reply mail Processing Centre, South Central Letter Processing Plant Toronto                          17
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport    17
Central Bay Street                                                                                            50
Christie                                                                                                      16
Church and Wellesley                                                                                          50
Commerce Court, Victoria Hotel                                                                                50
Davisville                                                                         

### 9. Analyzing each neighbourhood
For the neighbourhoods that have "Toronto" in their "Borough" name, the venues of each neighbourhood were extracted. Now each neighbourhood is analyzed in more detail to extract the info used to cluster them later.

In [95]:
# one hot encoding. It generates a data frame with one column per "Venue Category", where 1 or 0 is assigned to each venue
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print('Toronto venue one hot shape =',toronto_onehot.shape)
toronto_onehot.head()

# NOTE: For each record in "toronto_venues" a row is created with one column per venue category (213 columns in this case).
# If the venue belongs to the category, the column has "1", otherwise it has "0". That's what "dummies" is for.

Toronto venue one hot shape = (1187, 213)


Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
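One thing to watch for: Foursquare has a venue category that is itself called "Neighborhood" (see "Upper Beaches" in the venue list above), so `get_dummies` can produce a column with the same name as the one added back afterwards. In that case the assignment overwrites the existing dummy column in place instead of appending a new last column, which would be consistent with "Yoga Studio" appearing first in the output above. The sketch below reproduces the collision on made-up rows and shows one possible way around it; the column name `Neighbourhood Name` is an assumption for this example.

```python
import pandas as pd

# Hypothetical rows; one venue has the category "Neighborhood"
toy_venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Rosedale'],
    'Venue Category': ['Trail', 'Neighborhood', 'Park'],
})

toy_onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="")

# The category name clashes with the label column that is about to be added
name_collides = 'Neighborhood' in toy_onehot.columns

# Using a distinct name and inserting at position 0 keeps every dummy column intact
toy_onehot.insert(0, 'Neighbourhood Name', toy_venues['Neighborhood'])

print(name_collides)
print(toy_onehot.columns.tolist())
```

With a distinct label column, the later `groupby` would use `'Neighbourhood Name'` as its key and the "Neighborhood" venue category would remain available as a feature.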


For each neighbourhood, we group the venues and compute the frequency of each category to see which types of venue are most popular.

In [96]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.058824,0.058824,0.058824,0.058824,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02


For each neighbourhood, we print the top 5 venues. Each "block" shows the name of the neighbourhood, the type of venue and the related frequency.

In [98]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("---- "+hood+" ----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Berczy Park ----
                venue  freq
0         Coffee Shop  0.08
1      Farmers Market  0.04
2                 Pub  0.04
3  Seafood Restaurant  0.04
4            Beer Bar  0.04


---- Brockton, Parkdale Village, Exhibition Place ----
            venue  freq
0       Nightclub  0.12
1            Café  0.12
2     Coffee Shop  0.08
3  Breakfast Spot  0.08
4             Gym  0.04


---- Business reply mail Processing Centre, South Central Letter Processing Plant Toronto ----
                venue  freq
0  Light Rail Station  0.12
1         Yoga Studio  0.06
2       Auto Workshop  0.06
3       Garden Center  0.06
4              Garden  0.06


---- CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport ----
                 venue  freq
0      Airport Service  0.18
1     Airport Terminal  0.12
2      Harbor / Marina  0.06
3              Airport  0.06
4  Rental Car Location  0.06


---- Central Bay Street ----
                venu

For each neighbourhood, we put the info in a DataFrame in order to show the top 5 venues. We can use the following function to sort them to have the top 5.

In [118]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

After creating the function above, we can build a DataFrame where all the neighbourhoods are listed with their top 5 venues. __Note:__ Only the top 5 venues are considered in this analysis for simplicity, but a higher number can be used too.

In [152]:
num_top_venues = 5 # In Toronto we just consider the top 5.

indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    #print('Step =',  ind)
    #print(return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues))
    #print('No of venues = ',return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues).shape)
    
neighborhoods_venues_sorted.head(10)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Pub,Restaurant,Cheese Shop
1,"Brockton, Parkdale Village, Exhibition Place",Nightclub,Café,Breakfast Spot,Coffee Shop,Climbing Gym
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Pizza Place,Skate Park,Brewery,Burrito Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Coffee Shop,Harbor / Marina,Boutique
4,Central Bay Street,Coffee Shop,Café,Sandwich Place,Bubble Tea Shop,Burger Joint
5,Christie,Grocery Store,Café,Park,Baby Store,Candy Store
6,Church and Wellesley,Japanese Restaurant,Gay Bar,Restaurant,Coffee Shop,Yoga Studio
7,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Hotel,Restaurant,Gym
8,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Gym,Café
9,Davisville North,Hotel,Park,Pizza Place,Breakfast Spot,Gym / Fitness Center


### 10. Clustering neighbourhoods using K-means

We can now cluster the neighbourhoods, since the DataFrame containing the top 5 venues for each neighbourhood is available. The clustering is based on the venue types. Since there are "only" 39 records in our dataframe, we assume 4 clusters for simplicity.

In [191]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 4
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # The first column, which does not contain venue type, is removed
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
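A quick way to see how balanced the clusters are is to count the labels. The sketch below copies the label array from the output above and uses `numpy.bincount`; it does not rerun the clustering.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0,
                   0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# bincount returns, for each label value, how many times it occurs
cluster_sizes = np.bincount(labels)

for cluster_id, size in enumerate(cluster_sizes):
    print('Cluster', cluster_id, 'contains', size, 'neighbourhoods')
```

With these labels almost all neighbourhoods fall into cluster 0, which is worth keeping in mind when reading the per-cluster summaries.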

After the clustering phase, all the info can be saved into the dataframe to consolidate all the results in one place.

In [192]:
# add clustering labels (guarded so the cell can be re-run without a "column already exists" error)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = Toronto

# join Toronto with neighborhoods_venues_sorted to add the cluster label and the top venues to each neighbourhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')
toronto_merged.head(10) # This DataFrame shows the recap of the results we found

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Pub,Health Food Store,Wine Bar,Dance Studio
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Restaurant
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Park,Sushi Restaurant,Sandwich Place,Brewery,Liquor Store
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,American Restaurant,Bakery,Brewery
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Bus Line,Swim School,Deli / Bodega,Dumpling Restaurant
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Park,Pizza Place,Breakfast Spot,Gym / Fitness Center
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,0,Clothing Store,Coffee Shop,Furniture / Home Store,Gym / Fitness Center,Health & Beauty Service
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Dessert Shop,Sandwich Place,Pizza Place,Gym,Café
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,1,Park,Playground,Summer Camp,Cupcake Shop,Donut Shop
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049,0,Pub,Coffee Shop,Bagel Shop,Sports Bar,Bank


In [193]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map with the results
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],
                                  toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 11. Result analysis
Once the clustering is performed, we look at each cluster to understand what it contains. To get a preliminary idea, the "1st Most Common Venue" of each neighbourhood is counted within each cluster and the three most frequent ones are shown. __Note:__ some clusters may have only one distinct "1st venue"; in that case the results show only that one and NOT all three.

As per the results below, considering the 3 most popular venues, the first cluster (no. 0) is mainly identified by Coffee Shops and Cafés. The second one (no. 1) is characterized by parks. The third and fourth ones (no. 2 and no. 3) each contain only one neighbourhood. Looking at the map as well, most of the neighbourhoods have similar characteristics and belong to the first cluster. Under the hypotheses of this analysis, the distribution of the venues in the main area of Toronto (i.e. the boroughs that contain "Toronto" in their names) is fairly uniform, and only a few spots fall outside the first cluster.

According to this analysis, new activities such as coffee shops or cafés could be opened in the areas that do not belong to the first cluster.

In [315]:
for i in range(0,4,1):
    cl = toronto_merged.loc[toronto_merged['Cluster Labels'] == i, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
    print('\nSummary of the most 3 popular venues for the cluster no.',i,
          ':\n-------------------------------------------------------------')
    print(cl.groupby('1st Most Common Venue').count().sort_values('Borough',ascending=False,axis=0)['Borough'][0:3],'\n')


Summary of the most 3 popular venues for the cluster no. 0 :
-------------------------------------------------------------
1st Most Common Venue
Coffee Shop    8
Café           7
Bakery         2
Name: Borough, dtype: int64 


Summary of the most 3 popular venues for the cluster no. 1 :
-------------------------------------------------------------
1st Most Common Venue
Park    2
Name: Borough, dtype: int64 


Summary of the most 3 popular venues for the cluster no. 2 :
-------------------------------------------------------------
1st Most Common Venue
Home Service    1
Name: Borough, dtype: int64 


Summary of the most 3 popular venues for the cluster no. 3 :
-------------------------------------------------------------
1st Most Common Venue
Jewelry Store    1
Name: Borough, dtype: int64 

