## INDEX

Please select the section of this workbook that you would like to look into:<br>

* [__Section 1:__ Postal codes and neighbourhoods retrieval](#cell1)<br>
* [__Section 2:__ Toronto Neighbourhood Coordinates Extraction](#cell2)<br>
* [__Section 3:__ Toronto Neighbourhoods Clustering](#cell3)<br>

<a id="cell1"></a>

# 1. Toronto postal codes and neighbourhoods retrieval


This section scrapes the postal codes and neighbourhoods of Toronto from the following <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia</a> page.


## 1.1. Libraries preparation

In [2]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
!conda install -c conda-forge bs4 --yes 
from bs4 import BeautifulSoup
!conda install -c conda-forge lxml --yes 
import lxml
!conda install -c conda-forge requests --yes 
import requests
import csv # library to handle csv files
print("Done!")

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Done!


## 1.2 Web-data scraping

In [3]:
#----------------
# Web data extraction
#----------------
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
#soup = BeautifulSoup(requests.get(url).content, "lxml")
#print(soup.prettify())


In [4]:
#----------------
# Reading table
#----------------
table=soup.find('table', class_="wikitable sortable")
#print(table.prettify())

#------------------------------------------------------------------------------
# Scanning through the table and extracting row info into a list (table_data)
#------------------------------------------------------------------------------
table_rows=table.find_all('tr')
table_data=[]
for tr in table_rows:
    td = tr.find_all('td')
    entry = [i.text for i in td]
    table_data.append(entry)
    #print(entry)

#------------------------------------------------
# Creating a df based on row info in table_data
#------------------------------------------------
table_df = pd.DataFrame(table_data, columns=["PostalCode", "Borough", "Neighbourhood"])
print(table_df.shape)
table_df.head()

(181, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,,,
1,M1A\n,Not assigned\n,\n
2,M2A\n,Not assigned\n,\n
3,M3A\n,North York\n,Parkwoods\n
4,M4A\n,North York\n,Victoria Village\n
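
The empty first row and the trailing `\n` characters in the output above both come from how the rows are read: the header row holds `<th>` cells rather than `<td>` cells, so it yields an empty list, and `.text` keeps the newline that closes each cell. A minimal sketch on a made-up HTML fragment (the `html` string and the `rows` name are illustrative, not part of the notebook) shows one way to avoid both:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of the Wikipedia table
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # header row: only <th> cells, nothing to keep
        continue
    # get_text(strip=True) drops the surrounding whitespace, including "\n"
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

With rows read this way, the later `\n` replacement and the `dropna` call for the header row would no longer be needed.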


## 1.3 Data manipulation and preparation

In this chapter, the data stored in the dataframe will be manipulated to meet the following criteria from the assignment:
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
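
The rules in the list above can be sketched on a small made-up dataframe. The rows below are illustrative and assume the page layout described in the assignment, in which a postal code can appear on several rows:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Ignore rows without an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# A missing neighbourhood takes the name of its borough
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# One row per postal code, neighbourhoods joined with a comma
clean = (
    clean.groupby(["PostalCode", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)

print(clean)
print(clean.shape)
```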

In [5]:
#----------------------------------
# DF cleaning and re-organisation
#----------------------------------
#Removing the \n 
table_df.replace(to_replace ='\n', value = '', regex = True, inplace=True)
table_df.head()

#Discarding not assigned boroughs (Not assigned / None)
table_df.drop(table_df.loc[ (table_df['Borough'] == 'Not assigned')].index, inplace=True)
table_df.dropna(inplace=True)

#Changing separators from "/" to "," (assigned back instead of a chained inplace call)
table_df['Neighbourhood'] = table_df['Neighbourhood'].replace(to_replace =' /', value = ',', regex = True)

#Updating Boroughs with missing Neighbourhoods
table_df['Neighbourhood']= np.where(table_df['Neighbourhood']== 'Not assigned', table_df['Borough'], table_df['Neighbourhood'])

#Resetting the index
table_df.reset_index(inplace=True, drop=True)
table_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


## 1.4 Results

This section shows the result of the previous steps.

In [6]:
table_df.shape

(103, 3)

<a id="cell2"></a>

# 2. Toronto Neighbourhood Coordinates Extraction

In this section, we will find the coordinates for each postal code in the previous dataframe. The intended tool for this is Geocoder. However, because this package proved unreliable, a .csv file with the coordinates is available as a fallback.

The methodology I initially planned relied on the geocoder. Due to the instability experienced while retrieving the coordinates, this method was finally discarded. It is kept here for reference:

```python
!conda install -c conda-forge geocoder --yes 
import geocoder # import geocoder

MAX_ATTEMPTS = 5 # retry limit (assumed value) so that a failing lookup cannot loop forever
lat_lon = []
for postal_code in table_df['PostalCode']:
    lat_lng_coords = None
    # Retry until coordinates are returned or the limit is reached
    for attempt in range(MAX_ATTEMPTS):
        g = geocoder.google('{}, Toronto, Ontario, Canada'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords: # None or an empty result means the lookup failed
            break
    if not lat_lng_coords:
        lat_lng_coords = (None, None) # keep the row so the list stays aligned with table_df
    print(postal_code, lat_lng_coords[0], lat_lng_coords[1])
    lat_lon.append((lat_lng_coords[0], lat_lng_coords[1]))

# One column per coordinate
table_df['Latitude'] = [lat for lat, lon in lat_lon]
table_df['Longitude'] = [lon for lat, lon in lat_lon]

table_df.head(20)
```

The next approach was to directly use the csv file and retrieve the coordinates from there. This was the result:

In [7]:
# Methodology via csv file
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
    
with open('Geospatial_Coordinates.csv') as csvfile:
    df_Geocoords = pd.read_csv(csvfile)

df_Geocoords.reset_index(inplace=True) 
df_Geocoords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df_Geocoords.set_index('PostalCode', inplace=True)

table_df.set_index('PostalCode',inplace=True)
joined_df=pd.merge(table_df, df_Geocoords, on='PostalCode')
table_df.reset_index(inplace=True)
joined_df.reset_index(inplace=True)

joined_df.head(11)

Unnamed: 0,PostalCode,Borough,Neighbourhood,index,Latitude,Longitude
0,M3A,North York,Parkwoods,25,43.753259,-79.329656
1,M4A,North York,Victoria Village,34,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",53,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",71,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",85,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,93,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",0,43.806686,-79.194353
7,M3B,North York,Don Mills,26,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",35,43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",54,43.657162,-79.378937
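
`pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without any warning. A small sketch with partly made-up values (the `M9Z` row and the `indicator=True` check are additions for illustration, not part of the notebook) shows one way to surface such rows:

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"],
                      "Borough": ["North York", "North York", "Hypothetical"]})
coords = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                       "Latitude": [43.753259, 43.725882],
                       "Longitude": [-79.329656, -79.315572]})

# A left join keeps every postal code; the _merge column records where each row was found
checked = pd.merge(codes, coords, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```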


<a id="cell3"></a>

# 3. Toronto Neighbourhoods Clustering

In this last section, I will analyse the different neighbourhoods in Toronto. Foursquare makes it possible to extract and explore the businesses in each neighbourhood, which we will then use to classify the neighbourhoods.<br>
*__Note:__ For the sake of simplicity, we will look at neighbourhoods grouped by postal code, as in the previous section, without expanding the list.*

## 3.1 Data selection

As suggested in the assignment, the analysis will be limited to the boroughs whose name contains the word "Toronto".

In [8]:
# Filter and reset the index in one step, so toronto_df is a new dataframe rather than a modified slice
toronto_df=joined_df[joined_df['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_df.shape )
toronto_df.head(8)

(39, 6)


Unnamed: 0,PostalCode,Borough,Neighbourhood,index,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",53,43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",85,43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",54,43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,55,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,37,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,56,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,57,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,75,43.669542,-79.422564


In [170]:
print('The total number of neighbourhoods that we will analyse is: {}'.format(len(toronto_df['Neighbourhood'].index)))

The total number of neighbourhoods that we will analyse is: 39


## 3.2 Libraries preparation

The first step is to prepare the libraries that we will need to perform the requested tasks:

In [187]:
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



## 3.3 Preliminary analysis and data visualization

Before starting with the core analysis, it is very important to understand where we stand: what kind of data we have and what it looks like. The following steps help us to do so:

In [10]:
address = 'Toronto, Ontario, Canada'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [11]:
# Create map of Toronto using latitude and longitude values
# 0.016 added to the latitude so that the city is fully centred on the map
map_toronto = folium.Map(location=[latitude+.016, longitude], zoom_start=12)

# Add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

As illustrated in the map above, the study focuses on the city of Toronto. This was also the reason for selecting only the boroughs whose name contains the word "Toronto".

## 3.4 Neighbourhood POI extraction from Foursquare

As the next step, we will fetch the venues of each neighbourhood from <a href="https://foursquare.com">Foursquare.com</a>. <br>
The maximum number of venues to retrieve, the search radius and other parameters are specified below:

In [128]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
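radius=500 # search radius around each neighbourhood, in metres
LIMIT=30 # maximum number of venues returned per request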

The following function will be used to retrieve the locations:

In [40]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
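
The request URL in `getNearbyVenues` is assembled with `str.format`. As an alternative sketch, the same query string can be built from a dictionary with the standard library, which takes care of escaping. The credential values below are placeholders, and `build_explore_url` is a hypothetical helper, not part of the notebook:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the venues/explore URL for one pair of coordinates."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes reserved characters, e.g. the comma in "ll"
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url("ID_PLACEHOLDER", "SECRET_PLACEHOLDER", "20180605",
                        43.65426, -79.360636, radius=500, limit=30)
print(url)
```

With `requests`, the same dictionary can also be passed directly through the `params` argument of `requests.get`.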

Now we are ready to proceed with the extraction and visualize the result!

In [160]:
toronto_venues_df=getNearbyVenues(names=toronto_df['Neighbourhood'], latitudes=toronto_df['Latitude'], longitudes=toronto_df['Longitude'], radius=radius)
toronto_venues_df.head(8)

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town,

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
5,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
6,"Regent Park, Harbourfront",43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
7,"Regent Park, Harbourfront",43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot


The number of venues extracted per neighbourhood is as follows:

In [159]:
toronto_venues_df[['Neighbourhood', 'Venue']].groupby('Neighbourhood').count().head(15)

Unnamed: 0_level_0,Venue
Neighbourhood,Unnamed: 1_level_1
Berczy Park,30
"Brockton, Parkdale Village, Exhibition Place",23
Business reply mail Processing CentrE,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17
Central Bay Street,30
Christie,18
Church and Wellesley,30
"Commerce Court, Victoria Hotel",30
Davisville,30
Davisville North,7


As can be observed, some areas are more densely occupied by businesses than others. After a quick look, one possible explanation is the amount of green space in each area, as in the case of "Forest Hill North & West".

## 3.4 One-hot encoding

One-hot encoding, according to <a href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f">Hackernoon</a>, is a process by which categorical variables are converted into a form that ML algorithms can use to do a better job in prediction. In other words, we will convert the category of each venue from its original categorical value into a set of binary variables, one per category.

In [163]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues_df['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head(8)

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Art Museum,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
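
To make the encoding concrete, here is a minimal sketch on three made-up venues (the data is illustrative). Each distinct category becomes its own column, and each row has a single non-zero entry marking the category of that venue:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B"],
    "Venue Category": ["Bakery", "Park", "Bakery"],
})

# dtype=int keeps the 0/1 representation regardless of the pandas version
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

print(onehot)
```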


Now that we have the categorical values converted into binary ones, a better way to visualize the previous table is by grouping the venues.

In [162]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head(8)

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Art Museum,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478
2,Business reply mail Processing CentrE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.033333
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
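
Because every row of the one-hot table holds a single 1, the mean of a column within a neighbourhood equals the share of that neighbourhood's venues that fall in that category. For instance, the value 0.033333 for Berczy Park corresponds to 1 venue out of the 30 retrieved. A minimal sketch with made-up counts:

```python
import pandas as pd

onehot = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Bakery":        [1, 0, 0, 0, 0, 1],
    "Coffee Shop":   [0, 1, 1, 1, 1, 0],
})

# Mean of 0/1 indicators per group = relative frequency of each category
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

Each row of the grouped table therefore sums to 1 across the category columns.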


Once the one-hot encoding is finalised, let's have a look at the 5 most frequent venue categories in each neighbourhood:

In [171]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("------- "+hood+" -------")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

------- Berczy Park -------
                venue  freq
0         Coffee Shop  0.10
1  Seafood Restaurant  0.07
2      Farmers Market  0.07
3              Bakery  0.07
4                Café  0.07


------- Brockton, Parkdale Village, Exhibition Place -------
               venue  freq
0               Café  0.13
1     Breakfast Spot  0.09
2        Coffee Shop  0.09
3        Yoga Studio  0.04
4  Convenience Store  0.04


------- Business reply mail Processing CentrE -------
                  venue  freq
0           Yoga Studio  0.06
1                   Spa  0.06
2         Garden Center  0.06
3    Light Rail Station  0.06
4  Fast Food Restaurant  0.06


------- CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport -------
              venue  freq
0   Airport Service  0.18
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3           Airport  0.06
4      Airport Gate  0.06


------- Central Bay Street -------
                venue  fr

## 3.5 Data preparation for clustering

Before preparing the clustering, the previous information needs to be put into a dataframe. We will include the top 10 venues in each neighbourhood only. The following function will help us to do so:

In [134]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [135]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix left, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(8)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Seafood Restaurant,Café,Beer Bar,Cocktail Bar,Farmers Market,Bakery,Park,Breakfast Spot,Bistro
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Yoga Studio,Bakery,Nightclub,Convenience Store,Climbing Gym,Performing Arts Venue,Pet Store
2,Business reply mail Processing CentrE,Yoga Studio,Spa,Gym / Fitness Center,Fast Food Restaurant,Farmers Market,Light Rail Station,Comic Shop,Park,Pizza Place,Butcher
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Airport,Harbor / Marina,Coffee Shop,Plane,Sculpture Garden,Boutique,Bar
4,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Yoga Studio,Sushi Restaurant,Sandwich Place,Ice Cream Shop,Modern European Restaurant,Bubble Tea Shop,Ramen Restaurant
5,Christie,Grocery Store,Café,Park,Athletics & Sports,Coffee Shop,Candy Store,Restaurant,Diner,Italian Restaurant,Baby Store
6,Church and Wellesley,Burger Joint,Mexican Restaurant,Japanese Restaurant,Pizza Place,Smoke Shop,Creperie,Beer Bar,Diner,Bubble Tea Shop,Men's Store
7,"Commerce Court, Victoria Hotel",Café,Coffee Shop,Restaurant,Japanese Restaurant,Gastropub,Deli / Bodega,Gym,Pub,Museum,Hotel
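
The column names above rely on a three-element suffix list, which is enough for ten columns. If `num_top_venues` were raised beyond 20, a label such as 21st would come out as "21th". A hypothetical helper (`ordinal` is not part of the notebook) that handles these cases could look like this:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns[1], columns[3], columns[10])
```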


## 3.6 Clustering neighbourhoods

Now everything is ready and in place to proceed with the clustering!<br> The number of clusters was chosen after trying several values. The value used here spreads the neighbourhoods across the clusters better than the alternatives, which will be helpful later on for the cluster analysis.

In [149]:
# set number of clusters
kclusters = 8

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([6, 6, 2, 2, 1, 6, 2, 1, 2, 2], dtype=int32)
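
The notebook does not show how the candidate values for the number of clusters were compared. One possible way to inspect how the rows spread across clusters is to count the labels for several values of k. The sketch below uses random made-up data in place of `toronto_grouped_clustering`, so the sizes it prints say nothing about Toronto:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((39, 20))  # stand-in for the 39-row frequency table

sizes_by_k = {}
for k in range(3, 9):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit(features).labels_
    # minlength keeps one entry per cluster even if the last label is unused
    sizes_by_k[k] = np.bincount(labels, minlength=k)
    print(k, sizes_by_k[k])
```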

Once the clusters have been calculated, they have to be assigned back to their neighbourhoods and included in the dataframe.

In [156]:
# add clustering labels
if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.drop(['Cluster Labels'], axis=1, inplace=True)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# join the sorted venues (with cluster labels) onto toronto_df, which holds latitude/longitude for each neighbourhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighbourhood,index,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",53,43.65426,-79.360636,6,Coffee Shop,Park,Bakery,Café,Breakfast Spot,Historic Site,Performing Arts Venue,Mexican Restaurant,Pub,Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",85,43.662301,-79.389494,6,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Italian Restaurant,Sandwich Place,Distribution Center,Juice Bar,Burger Joint,Burrito Place
2,M5B,Downtown Toronto,"Garden District, Ryerson",54,43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Sandwich Place,Plaza,Electronics Store,Shopping Mall,Hotel,Restaurant,Diner
3,M5C,Downtown Toronto,St. James Town,55,43.651494,-79.375418,1,Gastropub,Café,Coffee Shop,Farmers Market,BBQ Joint,Park,Japanese Restaurant,Italian Restaurant,Diner,Ice Cream Shop
4,M4E,East Toronto,The Beaches,37,43.676357,-79.293031,0,Health Food Store,Neighborhood,Pub,Trail,Coffee Shop,Colombian Restaurant,College Rec Center,Distribution Center,Discount Store,Diner


For a better understanding of the clusters, let's visualize them over the city of Toronto.

In [190]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
       
map_clusters

From a visual analysis, 3 clusters stand out: #1 (purple), #2 (blue) and #6 (yellow).
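
For reference, the colours come from sampling Matplotlib's `rainbow` colormap at `kclusters` evenly spaced points. Because the marker code indexes the list with `cluster-1`, label 1 takes the first colour and label 0 wraps around to the last one. A short sketch that prints the label-to-colour mapping (`cluster_colour` is an illustrative name):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 8
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Same indexing as the map: rainbow[cluster-1], so label 0 uses rainbow[-1]
cluster_colour = {label: rainbow[label - 1] for label in range(kclusters)}
for label, hex_code in cluster_colour.items():
    print(label, hex_code)
```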

In [139]:
#Uncomment to visualize specific clusters
#cluster_num=0 # <-- Indicate the cluster you want to visualize
#toronto_merged.loc[toronto_merged['Cluster Labels'] == cluster_num, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

## 3.7 Analyzing clusters

Now that we have obtained the different clusters, it is very important to understand the characteristics of each of them. To do so, we will repeat the one-hot encoding process, this time focusing on the clusters instead of the neighbourhoods.

As a first step, let's visualize the data that we have and the contents of each cluster.

In [151]:
print('Number of neighbourhoods analysed is ', len(toronto_grouped.index), '\n')
#Displaying the number of neighbourhoods contained in each cluster
print('The number of neighbourhoods per cluster is:\n')
toronto_merged[['Cluster Labels', 'Neighbourhood']].groupby('Cluster Labels').count().reset_index()

Number of neighbourhoods analysed is  39 

The number of neighbourhoods per cluster is:



Unnamed: 0,Cluster Labels,Neighbourhood
0,0,1
1,1,11
2,2,11
3,3,1
4,4,2
5,5,1
6,6,11
7,7,1


Now let's have a look into the contents of each cluster and obtain the most frequent venues. To do so, we first need to combine the initial venue dataframe with the result of the clusters.

In [152]:
#toronto_venues_df #has neighbourhood data and venues
#toronto_merged # has cluster data
toronto_clusters_merged = toronto_merged[['Neighbourhood','Cluster Labels']].join(toronto_venues_df.set_index('Neighbourhood'), on='Neighbourhood')
toronto_clusters_merged.head(12)

Unnamed: 0,Neighbourhood,Cluster Labels,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
0,"Regent Park, Harbourfront",6,43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


Now we are all set to start with the one-hot encoding.

In [153]:
# one hot encoding
toronto_clusters_onehot = pd.get_dummies(toronto_clusters_merged[['Venue Category']], prefix="", prefix_sep="")

# add cluster label column back to dataframe
toronto_clusters_onehot['Cluster Labels'] = toronto_clusters_merged['Cluster Labels'] 

# move cluster label column to the first column
fixed_columns = [toronto_clusters_onehot.columns[-1]] + list(toronto_clusters_onehot.columns[:-1])
toronto_clusters_onehot = toronto_clusters_onehot[fixed_columns]

toronto_clusters_onehot.head(12)
toronto_clusters_grouped = toronto_clusters_onehot.groupby('Cluster Labels').mean().reset_index()
toronto_clusters_grouped.head(12)

Unnamed: 0,Cluster Labels,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Art Museum,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.025974,0.019481,0.003247,...,0.006494,0.0,0.0,0.0,0.003247,0.019481,0.0,0.003247,0.0,0.003247
2,2,0.004,0.004,0.004,0.008,0.012,0.008,0.0,0.008,0.0,...,0.004,0.004,0.004,0.004,0.0,0.004,0.0,0.008,0.004,0.016
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0
5,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,0.0,0.0,0.0,0.0,0.0,0.0,0.003497,0.003497,0.0,...,0.01049,0.0,0.0,0.0,0.0,0.01049,0.003497,0.006993,0.006993,0.013986
7,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
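
To see why averaging the one-hot columns yields per-cluster frequencies, here is a minimal sketch on made-up data. The `toy` dataframe and its rows are hypothetical and only stand in for `toronto_clusters_merged`; the `dtype=float` argument is an assumption added so the indicators are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy data: hypothetical venues already labelled with a cluster (not the real Toronto data)
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 0, 1, 1],
    "Venue Category": ["Park", "Park", "Trail", "Pub", "Bakery", "Coffee Shop"],
})

# One column per venue category, 1.0 where the venue belongs to that category
toy_onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
toy_onehot.insert(0, "Cluster Labels", toy["Cluster Labels"])

# The mean of 0/1 indicators within a cluster is the share of that
# cluster's venues falling in each category
toy_grouped = toy_onehot.groupby("Cluster Labels").mean().reset_index()
print(toy_grouped)
```

Each row of the grouped table sums to 1 across the category columns, so a value such as 0.5 under `Park` means that half of the venues in that cluster are parks.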


Once the one-hot encoding is performed and the frequencies are averaged per cluster, we can calculate the top 5 venues per cluster, similar to the way we calculated the top 5 venues per neighbourhood at the end of section 3.4.

In [155]:
num_top_venues = 5

for cluster in toronto_clusters_grouped['Cluster Labels']:
    print("----- Cluster ",cluster,"-----")
    temp = toronto_clusters_grouped[toronto_clusters_grouped['Cluster Labels'] == cluster].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----- Cluster  0 -----
               venue  freq
0        Coffee Shop   0.2
1              Trail   0.2
2       Neighborhood   0.2
3  Health Food Store   0.2
4                Pub   0.2


----- Cluster  1 -----
                 venue  freq
0          Coffee Shop  0.11
1                 Café  0.10
2           Restaurant  0.05
3            Gastropub  0.04
4  Japanese Restaurant  0.04


----- Cluster  2 -----
                venue  freq
0    Greek Restaurant  0.04
1                Park  0.04
2  Italian Restaurant  0.03
3         Pizza Place  0.03
4         Coffee Shop  0.03


----- Cluster  3 -----
                       venue  freq
0                       Park  0.33
1                Swim School  0.33
2                   Bus Line  0.33
3          Korean Restaurant  0.00
4  Latin American Restaurant  0.00


----- Cluster  4 -----
                       venue  freq
0                       Park  0.43
1                 Playground  0.29
2                      Trail  0.29
3         Miscellaneous

Now we are in a better position to understand the characteristics of each cluster!
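
As an alternative to the transpose-and-sort loop above, the same ranking can be sketched with `Series.nlargest`. This is only an illustration: the `grouped` dataframe holds made-up values standing in for `toronto_clusters_grouped`, and `top_venues` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for toronto_clusters_grouped (values are made up)
grouped = pd.DataFrame({
    "Cluster Labels": [0, 1],
    "Coffee Shop": [0.20, 0.11],
    "Park": [0.40, 0.02],
    "Pub": [0.10, 0.05],
})

def top_venues(grouped_df, n=5, label_col="Cluster Labels"):
    """Return {cluster label: Series of the n highest-frequency venue categories}."""
    freqs = grouped_df.set_index(label_col)
    return {label: row.nlargest(n).round(2) for label, row in freqs.iterrows()}

top = top_venues(grouped, n=2)
for label, series in top.items():
    print(f"----- Cluster {label} -----")
    print(series)
```

Returning a dictionary keeps the rankings available for later use instead of only printing them. When several categories tie, `nlargest` keeps them in column order.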

## 3.8 Results and findings

Let's summarise our current findings for the 3 major clusters, #1, #2 and #6:
- __Cluster #1:__ This cluster has a very high presence of cafés and coffee shops, as well as other gastronomy-related businesses.
- __Cluster #2:__ This cluster is characterised by green areas (parks) as well as restaurants.
- __Cluster #6:__ Similar to #1 in the presence of cafés and coffee shops, but differs from it in the higher number of bakeries and bars.
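
One way to check which clusters are the major ones is to count the venues assigned to each label. The sketch below uses made-up rows; the `venues` dataframe is a hypothetical stand-in for `toronto_clusters_merged`, assuming one row per venue with a `Cluster Labels` column.

```python
import pandas as pd

# Hypothetical stand-in for toronto_clusters_merged: one row per venue (made-up rows)
venues = pd.DataFrame({
    "Cluster Labels": [1, 1, 1, 1, 2, 2, 2, 3],
    "Venue Category": ["Coffee Shop", "Bakery", "Restaurant", "Coffee Shop",
                       "Park", "Greek Restaurant", "Pizza Place", "Park"],
})

# Number of venues per cluster, largest first
cluster_sizes = venues["Cluster Labels"].value_counts()

# Share of all venues that fall in each cluster
cluster_share = cluster_sizes / cluster_sizes.sum()

print(pd.DataFrame({"venues": cluster_sizes, "share": cluster_share}))
```

Clusters with only a handful of venues produce coarse frequencies such as 0.2 or 0.33 in the top-venues listing, so their rankings should be read with more caution than those of the larger clusters.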