# Segmenting and Clustering Neighborhoods in Toronto

As the title suggests, the goal of this work is to segment the neighbourhoods of Toronto.
Before the segmentation itself, we scrape the neighbourhood data from a website and clean it.
Then we use Foursquare to get information about popular venues in each neighbourhood and, finally, cluster the neighbourhoods by their most common venues.

Content:
    1. Getting the data
    2. Data cleaning and obtaining the target data
    3. Exploring a neighbourhood
    4. Venues in all neighbourhoods
    5. Segmentation

# 1. Getting the data

In [39]:
#Importing the libraries needed to scrape information from the website
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup as bf
import requests

# Transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize  # in pandas >= 1.0 this is available as pandas.json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

Now that we have our libraries imported, let's get our data from the website. Although the assignment instructions suggest using a web scraping library such as BeautifulSoup, the same result can also be achieved with **Pandas**.
By reading the content of the website with **_pandas.read_html('source')_**, we can scrape all the tables from the source page, as shown below.

In [126]:
dfs_pd = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)
print("There are {} tables in the source website.".format(len(dfs_pd)))

There are 3 tables in the source website.


As we can see, this Wikipedia page contains several tables, so we retrieve the one we are interested in and name it df_pd.

In [99]:
df_pd = dfs_pd[0]
df_pd.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
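Selecting the table by position (`dfs_pd[0]`) depends on the page layout at the time of scraping. As a hedged alternative, the sketch below picks the first table whose columns include the expected names. `pick_table` is a hypothetical helper, and the small dataframes only stand in for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe containing all required columns (hypothetical helper)."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("No table contains columns: {}".format(sorted(required)))

# Toy stand-ins for the tables scraped from the page (the first one is made up).
toy_tables = [
    pd.DataFrame({'Code': ['A'], 'Region': ['X']}),
    pd.DataFrame({'Postcode': ['M3A'], 'Borough': ['North York'], 'Neighbourhood': ['Parkwoods']}),
]

selected = pick_table(toy_tables, ['Postcode', 'Borough', 'Neighbourhood'])
print(selected)
```

In the notebook, the same call could be applied to `dfs_pd` instead of `toy_tables`.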


### Employing _BeautifulSoup_

Getting all the content from the website.

In [8]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bf(source, 'lxml')

print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":890001695,"wgRevisionId":890001695,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

Finding the table of interest and printing its content.

In [9]:
table = soup.find('table', class_ = 'wikitable sortable')

#table_body = table.tbody.text

print(table)

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

By examining the results from the previous cell, it is clear that every row, including the one with the column names, is enclosed by <_tr_>.

Use the following line of code to get all the rows in the table.

In [10]:
table_rows = table.find_all('tr')
print(table_rows)


[<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>, <tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>, <tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>, <tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>, <tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>, <tr>
<td>M6A</td>
<td><a href="/wiki/North_York" ti

The titles are enclosed by <_th_> and the cells of each data row by <_td_>.

In [11]:
#Getting the titles

titles = []
for th in table.find_all('th'):
    title = th.text
    titles.append(title)

titles[2] = titles[2].split('\n')[0]
titles

['Postcode', 'Borough', 'Neighbourhood']

In [12]:
#Getting the rows

rows = []
for tr in table.find_all('tr'):
    td = tr.find_all('td')
    row = [i.text for i in td]
    rows.append(row)

rows = rows[1:len(rows)]
for row in rows:
    row[2]=row[2].split('\n')[0]

rows

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 
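To make the role of the <_tr_>, <_th_> and <_td_> tags concrete, here is a minimal sketch that extracts titles and rows from a tiny inline table. It deliberately uses the standard library's `html.parser` as a dependency-free stand-in, not BeautifulSoup as in the cells above; `TableParser` and `sample_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect header cells (<th>) and data rows (<td> inside <tr>) from one table."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.rows = []
        self._current_row = None   # cells of the <tr> currently being read
        self._cell_tag = None      # 'th' or 'td' while inside a cell
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._current_row = []
        elif tag in ('th', 'td'):
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is collected as well.
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and tag == self._cell_tag:
            text = ''.join(self._cell_text).strip()   # strip the trailing newline
            if tag == 'th':
                self.titles.append(text)
            else:
                self._current_row.append(text)
            self._cell_tag = None
        elif tag == 'tr':
            if self._current_row:                     # the header row has no <td> cells
                self.rows.append(self._current_row)
            self._current_row = None

sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
print(parser.titles)
print(parser.rows)
```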

Creating the DataFrame with the obtained lists: titles & rows.

In [13]:
df_soup = pd.DataFrame(data = rows, columns = titles)
df_soup.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


# 2. Cleaning the Data from the Website

As shown in the previous table, some values in the 'Borough' and/or 'Neighbourhood' columns are 'Not assigned'.
To solve this, we drop the rows whose 'Borough' is 'Not assigned' and, in the remaining rows, replace a 'Not assigned' 'Neighbourhood' with the corresponding value of 'Borough'.

In [18]:
#Dropping the rows with value 'Not assigned' in the 'Borough' column

df_soup.drop(df_soup[df_soup['Borough'] == 'Not assigned'].index, inplace = True, axis = 0)

Replace 'Not assigned' in 'Neighbourhood' with its corresponding borough.

In [19]:
#Define a function that checks whether 'Neighbourhood' == 'Not assigned' and then acts accordingly

def replace_not_assigned(row):
    
    if row['Neighbourhood'] == 'Not assigned':
        return row['Borough']
    else:
        return row['Neighbourhood']

In [20]:
#Use df.apply(function) to apply the function to the dataframe

df_soup['Neighbourhood'] = df_soup.apply(replace_not_assigned, axis=1)
df_soup.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [21]:
#Check whether the elimination of 'Not assigned' was executed successfully.
#Note: `in` applied to a Series tests the index labels, so we test against the values instead.

'Not assigned' in df_soup['Neighbourhood'].values

False
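The cleaning rule can also be written without a row-wise function. The following is a minimal sketch on a three-row toy dataframe, assuming the same column names as `df_soup`; `toy`, `cleaned` and `unassigned` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows whose borough is assigned (copy to avoid modifying a view).
cleaned = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is unassigned, fall back to the borough name.
unassigned = cleaned['Neighbourhood'] == 'Not assigned'
cleaned['Neighbourhood'] = cleaned['Neighbourhood'].mask(unassigned, cleaned['Borough'])

# Compare the values explicitly; `in` on a Series would test the index labels.
print((cleaned['Neighbourhood'] == 'Not assigned').any())
print(cleaned)
```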

--------------/
The same result can be achieved in this way as well:
 1. We loop over the rows of the dataframe with df.iterrows() and check the value of the 'Neighbourhood' column.
 2. If the value is 'Not assigned', we take the value of the 'Borough' column of the same row; otherwise we keep the value of the 'Neighbourhood' column.
 3. We append each value to a list, which becomes the new 'Neighbourhood' column.

In [177]:
#new_neigh = []
#for neigh in df_soup[['Borough','Neighbourhood']].iterrows():
#    if neigh[1][1] == 'Not assigned':
#        new_neigh.append(neigh[1][0])
#    else:
#        new_neigh.append(neigh[1][1])

Check if the previous loop worked well:

In [178]:
#'Not assigned' in new_neigh

Redefine the 'Neighbourhood' column.

In [179]:
#df_soup['Neighbourhood']=new_neigh
#df_soup

--------------\

Combining the neighbourhoods that share the same postcode and borough into a single row.

In [22]:
df = df_soup.groupby(['Postcode','Borough'], as_index = False, sort = False).agg(', '.join)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [23]:
print("The shape of the final dataframe is:", df.shape)

The shape of the final dataframe is: (103, 3)
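As a small illustration of the grouping step, the sketch below applies the same `groupby(...).agg(', '.join)` pattern to three toy rows; `toy` and `combined` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# sort=False keeps the groups in order of first appearance;
# as_index=False keeps 'Postcode' and 'Borough' as ordinary columns.
combined = toy.groupby(['Postcode', 'Borough'], as_index=False, sort=False).agg(', '.join)
print(combined)
```

Each group's neighbourhood names are joined with `', '`, so the two M5A rows collapse into one.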


Besides the columns already in df, the target dataframe also contains the geographical coordinates of each neighbourhood.
These values can be obtained with _geocoder_, but since that is a much slower process, we take the easy way and read them from a prepared csv file.

Getting geographical coordinates of the neighbourhoods.

In [24]:
path = 'http://cocl.us/Geospatial_data'
df_coords = pd.read_csv(path)

In [25]:
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We join the rows of df_coords to those of df on 'Postcode'. To do this, we first rename 'Postal Code' to 'Postcode' and set it as the index of df_coords, then call df.join().

In [26]:
df_coords.rename(columns={'Postal Code':'Postcode'}, inplace=True)

df = df.join(df_coords.set_index('Postcode'), on = 'Postcode')
df.sort_values(by='Postcode', inplace=True)

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
12,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
22,M1G,Scarborough,Woburn,43.770992,-79.216917
26,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
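An equivalent way to attach the coordinates is `DataFrame.merge`, which avoids renaming and re-indexing. This is a hedged sketch on toy data, not the method used above; the postcode `'M9Z'` and the borough `'Example'` are made up to show how unmatched rows can be detected.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],   # 'M9Z' is a made-up code without coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Example'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every neighbourhood row; validate raises if a key is duplicated.
merged = neighbourhoods.merge(
    coords,
    how='left',
    left_on='Postcode',
    right_on='Postal Code',
    validate='one_to_one',
).drop(columns='Postal Code')

# Rows without a match end up with NaN coordinates.
missing = merged[merged['Latitude'].isna()]
print(merged)
print(missing['Postcode'].tolist())
```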


# 3. Exploring a Neighbourhood

#### In this section we explore one neighbourhood of Toronto.
    The tasks to be completed are:

    1. Create a folium map of Toronto with popups for all of its neighbourhoods.
    2. Select a neighbourhood and use Foursquare to obtain information about its venues, i.e. the popular venues within a certain radius of the neighbourhood.

Getting the geographical coordinates of Toronto.

In [30]:
from geopy.geocoders import Nominatim

address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Creating a map of Toronto with its neighbourhoods superimposed on top.

In [28]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  51.24 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  37.37 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  41.38 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  49.44 MB/s
Folium installed and imported!


In [33]:
map_tr = folium.Map(location = [latitude, longitude], zoom_start = 11)

for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tr)  
    
map_tr

#### Define Foursquare Credentials and Version.

In [81]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


#### Getting the top 100 venues that are in Parkwoods within a radius of 500 meters.

In [35]:
#The neighbourhood explored here is Parkwoods (index label 0 of df)
df.loc[0,'Neighbourhood']

'Parkwoods'

In [36]:
neigh_lat = df.loc[0, 'Latitude']
neigh_lng = df.loc[0, 'Longitude']
neigh_name = df.loc[0, 'Neighbourhood']

print('The latitude and longitude of {} is {}, {}'.format(neigh_name, neigh_lat, neigh_lng))

The latitude and longitude of Parkwoods is 43.7532586, -79.3296565


In [37]:
# Creating the URL
limit = 100

radius=500

url='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    neigh_lat,
    neigh_lng,
    VERSION,radius,limit)

We send a _GET_ request to the url and parse the response as JSON, which Python represents as a nested combination of dictionaries and lists.

In [38]:
results = requests.get(url).json()
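The sketch below illustrates both ideas without touching the network: building the query string from a dictionary with the standard library, and walking a parsed JSON document. The credentials are placeholders and `sample_text` is a made-up payload that only imitates the nesting used in this notebook.

```python
import json
from urllib.parse import urlencode

# Placeholder credentials; real values are needed for an actual request.
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '43.7532586,-79.3296565',
    'radius': 500,
    'limit': 100,
}
# urlencode escapes special characters (the comma in 'll' becomes %2C).
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(url)

# Made-up payload with the same nesting: dict -> dict -> list -> dict -> list.
sample_text = '{"response": {"groups": [{"items": [{"venue": {"name": "Brookbanks Park"}}]}]}}'
sample = json.loads(sample_text)

items = sample['response']['groups'][0]['items']
print(type(sample).__name__, type(sample['response']['groups']).__name__)
print(items[0]['venue']['name'])
```

With `requests`, the same dictionary can be passed as `requests.get(base_url, params=params)`, which performs the encoding itself.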

Take a look at the obtained results.

In [42]:
venues = results['response']['groups'][0]['items']

#Flatten the results
nearby_venues = json_normalize(venues)

nearby_venues.head()

Unnamed: 0,reasons.count,reasons.items,referralId,venue.categories,venue.id,venue.location.address,venue.location.cc,venue.location.city,venue.location.country,venue.location.distance,venue.location.formattedAddress,venue.location.labeledLatLngs,venue.location.lat,venue.location.lng,venue.location.state,venue.name,venue.photos.count,venue.photos.groups
0,0,"[{'summary': 'This spot is popular', 'reasonNa...",e-0-4e8d9dcdd5fbbbb6b3003c7b-0,"[{'name': 'Park', 'shortName': 'Park', 'plural...",4e8d9dcdd5fbbbb6b3003c7b,Toronto,CA,Toronto,Canada,245,"[Toronto, Toronto ON, Canada]","[{'label': 'display', 'lat': 43.75197604605557...",43.751976,-79.33214,ON,Brookbanks Park,0,[]
1,0,"[{'summary': 'This spot is popular', 'reasonNa...",e-0-4e6696b6d16433b9ffff47c3-1,"[{'name': 'Fast Food Restaurant', 'shortName':...",4e6696b6d16433b9ffff47c3,,CA,,Canada,298,[Canada],"[{'label': 'display', 'lat': 43.75438666345904...",43.754387,-79.333021,,KFC,0,[]
2,0,"[{'summary': 'This spot is popular', 'reasonNa...",e-0-4cb11e2075ebb60cd1c4caad-2,"[{'name': 'Food & Drink Shop', 'shortName': 'F...",4cb11e2075ebb60cd1c4caad,29 Valley Woods Road,CA,Toronto,Canada,312,"[29 Valley Woods Road, Toronto ON, Canada]","[{'label': 'display', 'lat': 43.75197441585782...",43.751974,-79.333114,ON,Variety Store,0,[]


In [43]:
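# Function that extracts the category name of a venue
def get_category_type(row):
    # The column is named 'categories' or 'venue.categories' depending on how the results were flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # A venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']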

In [44]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns, selecting only those of our interest
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,KFC,Fast Food Restaurant,43.754387,-79.333021
2,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [45]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.
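To show what the flattening step does, here is a minimal sketch on two made-up items that imitate the structure of `results['response']['groups'][0]['items']`. It assumes pandas 1.0 or later, where the function is exposed as `pd.json_normalize`; the venue `'Unnamed Spot'` is invented to cover the empty-category case.

```python
import pandas as pd

# Made-up items imitating the nesting of the Foursquare response.
items = [
    {'venue': {'name': 'Brookbanks Park',
               'categories': [{'name': 'Park'}],
               'location': {'lat': 43.751976, 'lng': -79.33214}}},
    {'venue': {'name': 'Unnamed Spot',
               'categories': [],
               'location': {'lat': 43.75, 'lng': -79.33}}},
]

# Nested dictionaries become dotted column names; lists are kept as they are.
flat = pd.json_normalize(items)
print(sorted(flat.columns))

# Take the first category name when one exists, otherwise leave the value missing.
flat['category'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)
print(flat[['venue.name', 'category']])
```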


# 4. Venues in all neighbourhoods

In this section we repeat the same process for all the neighbourhoods in Toronto.

#### The function that repeats the same process as before for all the neighbourhoods.

In [47]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT=10
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [48]:

tr_venues = getNearbyVenues(names = df['Neighbourhood'],
                                   latitudes = df['Latitude'],
                                   longitudes = df['Longitude'])

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [49]:
tr_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


Not all the neighbourhoods have 10 venues, as shown in the next cell.

In [33]:
tr_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",10,10,10,10,10,10
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",2,2,2,2,2,2
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",8,8,8,8,8,8
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Downsview North, Wilson Heights",10,10,10,10,10,10
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",10,10,10,10,10,10
Berczy Park,10,10,10,10,10,10
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [52]:
print('There are {} neighbourhoods in our original dataframe. '
      'The number of neighbourhoods with at least one venue returned by Foursquare is {}.'
      .format(df.shape[0], tr_venues['Neighbourhood'].unique().shape[0]))

There are 103 neighbourhoods in our original dataframe. The number of neighbourhoods with at least one venue returned by Foursquare is 101.


This means that, according to Foursquare, there are two neighbourhoods for which no venues were returned.
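To see which neighbourhoods returned no venues, the names in the original dataframe can be compared with those present in the venues dataframe. A minimal sketch on made-up data follows; the names are placeholders, not results from Foursquare.

```python
import pandas as pd

# Placeholder names standing in for df['Neighbourhood'].
all_neighbourhoods = pd.Series(
    ['Neighbourhood A', 'Neighbourhood B', 'Neighbourhood C'], name='Neighbourhood')

# Placeholder rows standing in for tr_venues.
toy_venues = pd.DataFrame({
    'Neighbourhood': ['Neighbourhood A', 'Neighbourhood A', 'Neighbourhood B'],
    'Venue': ['Venue 1', 'Venue 2', 'Venue 3'],
})

# Keep the names that never occur in the venues table.
has_venue = all_neighbourhoods.isin(toy_venues['Neighbourhood'])
without_venues = all_neighbourhoods[~has_venue]
print(without_venues.tolist())
```

In the notebook, the same expression could be applied to `df['Neighbourhood']` and `tr_venues['Neighbourhood']`.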

In [53]:
print('There are {} unique categories.'.format(len(tr_venues['Venue Category'].unique())))

There are 179 unique categories.


# 5. Neighbourhood Segmentation

The criterion used to segment the neighbourhoods is the mix of venue types found in each of them. The first step is therefore to turn the categorical values into numerical ones (one-hot encoding).

In [54]:
tr_onehot = pd.get_dummies(tr_venues[['Venue Category']], prefix="", prefix_sep="")
tr_onehot['Neighbourhood'] = tr_venues['Neighbourhood']

#Reordering the columns
fixed_columns = [tr_onehot.columns[-1]] + list(tr_onehot.columns[:-1])
tr_onehot = tr_onehot[fixed_columns]

tr_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,...,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Yoga Studio
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
tr_onehot.shape

(695, 181)

Next, let's group the rows by neighbourhood and take the mean of each one-hot column, which gives the relative frequency of each category in the neighbourhood.

In [58]:
tr_grouped = tr_onehot.groupby('Neighbourhood').mean().reset_index()
tr_grouped.head(5)

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,...,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's re-examine the shape.

In [59]:
tr_grouped.shape

(101, 180)
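The two steps above (one-hot encoding, then averaging per neighbourhood) can be checked by hand on a toy example. The neighbourhood names `'A'` and `'B'` and the venue rows are made up for illustration.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Bar', 'Pharmacy', 'Bar', 'Bar'],
})

# One column per category, with 1 marking the category of each venue.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', toy_venues['Neighbourhood'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)

# Each row of frequencies sums to 1.
print(grouped.drop(columns='Neighbourhood').sum(axis=1).tolist())
```

Neighbourhood A has four venues, two of which are parks, so its 'Park' frequency is 0.5.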

Now we proceed to cluster the neighbourhoods by the mean of the frequency of occurrence of each category.

In [60]:
# set number of clusters
kclusters = 5

tr_grouped_clustering = tr_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tr_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 3, 2, 2, 0, 0, 0, 2, 0], dtype=int32)
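For intuition about what the clustering step does, here is a bare-bones sketch of Lloyd's algorithm in NumPy on toy frequency vectors. It is not scikit-learn's implementation: `KMeans` uses a smarter initialisation (`init='k-means++'` by default) and further refinements. `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, initial_centroids, n_iter=10):
    """Plain Lloyd iterations: assign to the nearest centroid, then recompute means."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:          # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy frequency vectors (rows: neighbourhoods, columns: venue categories).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.4, 0.6],
])

# Start from rows 0 and 2 so the example is deterministic.
labels, centroids = lloyd_kmeans(toy, initial_centroids=toy[[0, 2]])
print(labels)
print(centroids)
```

The first two rows are dominated by the first category and end up together; the last two share the other categories and form the second cluster.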

We create a function that sorts the venue categories of a row in descending order of frequency and returns the most common ones.

In [62]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:] #taking all the columns of the row except the one of 'Neighbourhood'
    row_categories_sorted = row_categories.sort_values(ascending=False) #sorting the columns
    
    return row_categories_sorted.index.values[0:num_top_venues]
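A quick check of this logic on a made-up row is sketched below. One point worth noting is that categories with zero frequency can still appear in the result when a neighbourhood has fewer distinct categories than `num_top_venues`; the second expression shows one possible way to exclude them. The row values are invented for illustration.

```python
import pandas as pd

# Made-up row of a grouped dataframe: a name followed by category frequencies.
row = pd.Series({'Neighbourhood': 'A', 'Bar': 0.3, 'Park': 0.5, 'Pharmacy': 0.2, 'Zoo': 0.0})

frequencies = row.iloc[1:].astype(float)   # drop the 'Neighbourhood' entry

# Most frequent first, then keep the first two labels.
top = frequencies.sort_values(ascending=False).index[:2].tolist()
print(top)

# Optional variant: ignore categories that never occur in the neighbourhood.
nonzero_top = frequencies[frequencies > 0].sort_values(ascending=False).index[:10].tolist()
print(nonzero_top)
```

Categories with equal frequency are ordered arbitrarily unless a stable sort is requested, for example `sort_values(ascending=False, kind='mergesort')`.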

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [63]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = tr_grouped['Neighbourhood']

for ind in np.arange(tr_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tr_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Greek Restaurant,Café,Coffee Shop,Speakeasy,Steakhouse,Asian Restaurant,Plaza,Hotel,Concert Hall,Vegetarian / Vegan Restaurant
1,Agincourt,Lounge,Sandwich Place,Breakfast Spot,Skating Rink,Comic Shop,Department Store,Drugstore,College Stadium,Dog Run,Discount Store
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Coffee Shop,Playground,Park,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pharmacy,Pizza Place,Video Store,Coffee Shop,Beer Store,Sandwich Place,Fried Chicken Joint,Fast Food Restaurant,Yoga Studio
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Athletics & Sports,Pool,Pub,Sandwich Place,Skating Rink,Coffee Shop,Gym,Furniture / Home Store
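The suffix logic used for the column names works for the first ten ranks, but it would label rank 21 as "21th". If more columns were ever needed, a small helper along these lines could be used; `ordinal` is a hypothetical function, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st' (hypothetical helper)."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'                      # 11th, 12th, 13th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```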


In [64]:
# add clustering labels
# insert label column called 'Cluster Labels', with the values of kmeans.labels_, at column position 0

neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

tr_merged = df

# join the sorted venues (with cluster labels) to df, which holds latitude/longitude for each neighbourhood
tr_merged = tr_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

tr_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,Fast Food Restaurant,Yoga Studio,Electronics Store,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega
12,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,2.0,Bar,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,2.0,Rental Car Location,Intersection,Spa,Electronics Store,Medical Center,Mexican Restaurant,Breakfast Spot,Pizza Place,Construction & Landscaping,Convenience Store
22,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Korean Restaurant,Pharmacy,Golf Course,Cuban Restaurant,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
26,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,2.0,Hakka Restaurant,Fried Chicken Joint,Bank,Bakery,Lounge,Thai Restaurant,Athletics & Sports,Caribbean Restaurant,Cosmetics Shop,Concert Hall


Two neighbourhoods have no venues, and if we check the unique values in the 'Cluster Labels' column, we can see that some rows have the value 'nan'.

In [66]:
print(tr_merged.shape[0])
print(tr_merged['Cluster Labels'].unique())

103
[  1.   2.   0.   3.  nan   4.]


Dropping the rows that have no venues.

In [68]:
#Dropping the rows of interest
tr_merged.dropna(axis=0, inplace=True)

#Check the unique values in 'Cluster Labels'
tr_merged['Cluster Labels'].unique()

array([ 1.,  2.,  0.,  3.,  4.])

In [69]:
#Cast the 'Cluster Labels' column into integers
tr_merged['Cluster Labels'] = tr_merged['Cluster Labels'].astype(int)

tr_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Fast Food Restaurant,Yoga Studio,Electronics Store,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega
12,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,2,Bar,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,2,Rental Car Location,Intersection,Spa,Electronics Store,Medical Center,Mexican Restaurant,Breakfast Spot,Pizza Place,Construction & Landscaping,Convenience Store
22,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Korean Restaurant,Pharmacy,Golf Course,Cuban Restaurant,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
26,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,2,Hakka Restaurant,Fried Chicken Joint,Bank,Bakery,Lounge,Thai Restaurant,Athletics & Sports,Caribbean Restaurant,Cosmetics Shop,Concert Hall
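Dropping the unlabelled rows and casting the labels to integers can be made a little more targeted by restricting `dropna` to the label column, so that a missing value elsewhere does not remove a row. This is a hedged sketch on toy data, not the cell used above; the column `'Note'` is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Cluster Labels': [1.0, np.nan, 0.0],
    'Note': [None, 'has no label', 'ok'],   # made-up column with an unrelated missing value
})

# Only a missing cluster label removes a row; row 'A' survives despite its empty note.
labelled = toy.dropna(subset=['Cluster Labels']).copy()

# Without NaN left in the column, the float labels can be cast to integers.
labelled['Cluster Labels'] = labelled['Cluster Labels'].astype(int)
print(labelled)
```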


Finally, let's visualize the resulting clusters on a map.

In [70]:
import folium  # map rendering library (harmless if already imported in an earlier cell)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighbourhood, colored by its cluster label
for lat, lon, poi, cluster in zip(tr_merged['Latitude'], tr_merged['Longitude'], tr_merged['Neighbourhood'], tr_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
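The marker colors come from a palette with one entry per cluster. A minimal sketch of how such a palette can be built, assuming `kclusters = 5` to match the five labels seen earlier (the variable `palette` is a name chosen here for illustration):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # assumed: matches the five cluster labels (0 to 4) shown earlier

# Sample the rainbow colormap at kclusters evenly spaced points in [0, 1];
# each sample is an RGBA tuple.
rgba = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA tuple to a hex string that folium accepts as a color.
palette = [colors.rgb2hex(c) for c in rgba]
print(palette)
```

Because the samples are evenly spaced along the colormap, every cluster gets a distinct color, which keeps neighbouring clusters easy to tell apart on the map.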

### Examining the clusters

The headings below number the clusters from 1, while the 'Cluster Labels' column counts from 0, so Cluster 1 corresponds to label 0, Cluster 2 to label 1, and so on.

#### Cluster 1

In [71]:
tr_merged.loc[tr_merged['Cluster Labels'] == 0, tr_merged.columns[[2] + list(range(5, tr_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Woburn,0,Coffee Shop,Korean Restaurant,Pharmacy,Golf Course,Cuban Restaurant,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
38,"East Birchmount Park, Ionview, Kennedy Park",0,Discount Store,Coffee Shop,Hobby Shop,Department Store,Bus Station,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Diner
58,"Birch Cliff, Cliffside West",0,Skating Rink,College Stadium,General Entertainment,Café,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop
39,Bayview Village,0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Drugstore,Dog Run,Discount Store
59,Willowdale South,0,Café,Grocery Store,Pet Store,Coffee Shop,Plaza,Movie Theater,Ramen Restaurant,Steakhouse,Indonesian Restaurant,Cuban Restaurant
7,Don Mills North,0,Gym / Fitness Center,Basketball Court,Caribbean Restaurant,Baseball Field,Café,Japanese Restaurant,Yoga Studio,Department Store,Eastern European Restaurant,Drugstore
13,"Flemingdon Park, Don Mills South",0,Japanese Restaurant,Coffee Shop,Gym,Clothing Store,General Entertainment,Bike Shop,Beer Store,Discount Store,Restaurant,Italian Restaurant
28,"Bathurst Manor, Downsview North, Wilson Heights",0,Coffee Shop,Restaurant,Bridal Shop,Fast Food Restaurant,Diner,Bank,Fried Chicken Joint,Deli / Bodega,Sushi Restaurant,Dance Studio
34,"Northwood Park, York University",0,Coffee Shop,Massage Studio,Metro Station,Bar,Furniture / Home Store,Miscellaneous Shop,Deli / Bodega,Dog Run,Discount Store,Diner
1,Victoria Village,0,Coffee Shop,Portuguese Restaurant,Hockey Arena,Intersection,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop


#### Cluster 2

In [74]:
tr_merged.loc[tr_merged['Cluster Labels'] == 1, tr_merged.columns[[2] + list(range(5, tr_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,"Rouge, Malvern",1,Fast Food Restaurant,Yoga Studio,Electronics Store,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega
56,"Del Ray, Keelesdale, Mount Dennis, Silverthorn",1,Fast Food Restaurant,Sandwich Place,Yoga Studio,Dance Studio,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store


#### Cluster 3

In [78]:
tr_merged.loc[tr_merged['Cluster Labels'] == 2, tr_merged.columns[[2] + list(range(5, tr_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,"Highland Creek, Rouge Hill, Port Union",2,Bar,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
18,"Guildwood, Morningside, West Hill",2,Rental Car Location,Intersection,Spa,Electronics Store,Medical Center,Mexican Restaurant,Breakfast Spot,Pizza Place,Construction & Landscaping,Convenience Store
26,Cedarbrae,2,Hakka Restaurant,Fried Chicken Joint,Bank,Bakery,Lounge,Thai Restaurant,Athletics & Sports,Caribbean Restaurant,Cosmetics Shop,Concert Hall
32,Scarborough Village,2,Women's Store,Playground,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega
44,"Clairlea, Golden Mile, Oakridge",2,Bakery,Soccer Field,Intersection,Bus Station,Bus Line,Metro Station,Fast Food Restaurant,Park,Furniture / Home Store,Curling Ice
51,"Cliffcrest, Cliffside, Scarborough Village West",2,Motel,American Restaurant,Yoga Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
65,"Dorset Park, Scarborough Town Centre, Wexford ...",2,Indian Restaurant,Light Rail Station,Pet Store,Latin American Restaurant,Vietnamese Restaurant,Chinese Restaurant,Dog Run,Discount Store,Diner,Dessert Shop
71,"Maryvale, Wexford",2,Breakfast Spot,Bakery,Middle Eastern Restaurant,Smoke Shop,Yoga Studio,Deli / Bodega,Drugstore,Dog Run,Discount Store,Diner
78,Agincourt,2,Lounge,Sandwich Place,Breakfast Spot,Skating Rink,Comic Shop,Department Store,Drugstore,College Stadium,Dog Run,Discount Store
82,"Clarks Corners, Sullivan, Tam O'Shanter",2,Pizza Place,Thai Restaurant,Chinese Restaurant,Noodle House,Fast Food Restaurant,Fried Chicken Joint,Italian Restaurant,Cuban Restaurant,Diner,Dessert Shop


#### Cluster 4

In [79]:
tr_merged.loc[tr_merged['Cluster Labels'] == 3, tr_merged.columns[[2] + list(range(5, tr_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
85,"Agincourt North, L'Amoreaux East, Milliken, St...",3,Coffee Shop,Playground,Park,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
66,York Mills West,3,Park,Bank,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop
0,Parkwoods,3,Fast Food Restaurant,Park,Food & Drink Shop,Yoga Studio,Curling Ice,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
40,"CFB Toronto, Downsview East",3,Bus Stop,Airport,Park,Yoga Studio,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
19,The Beaches,3,Coffee Shop,Park,Health Food Store,Neighborhood,Pub,Cuban Restaurant,Dog Run,Discount Store,Diner,Dessert Shop
35,East Toronto,3,Park,Coffee Shop,Convenience Store,Rental Car Location,Curling Ice,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop
61,Lawrence Park,3,Park,Swim School,Bus Line,Yoga Studio,Curling Ice,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
91,Rosedale,3,Park,Playground,Trail,Yoga Studio,Cuban Restaurant,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
21,Caledonia-Fairbanks,3,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market,Yoga Studio,Dance Studio,Dog Run,Discount Store,Diner
49,"Downsview, North Park, Upwood Park",3,Park,Basketball Court,Construction & Landscaping,Bakery,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Drugstore,Dog Run,Discount Store


#### Cluster 5

In [80]:
tr_merged.loc[tr_merged['Cluster Labels'] == 4, tr_merged.columns[[2] + list(range(5, tr_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
101,"Humber Bay, King's Mill Park, Kingsway Park So...",4,Baseball Field,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
57,"Emery, Humberlea",4,Baseball Field,Yoga Studio,Dance Studio,Eastern European Restaurant,Drugstore,Dog Run,Discount Store,Diner,Dessert Shop,Department Store
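
Instead of inspecting each cluster in a separate cell, the same information can be summarized per label. A sketch on a handful of rows copied from the tables above (the `sample` frame holds only this subset and two of the columns, so the counts describe the sample, not the full clusters):

```python
import pandas as pd

# A few rows taken from the cluster tables above.
sample = pd.DataFrame({
    'Neighbourhood': ['Woburn', 'Victoria Village', 'Bayview Village',
                      'Rosedale', 'Lawrence Park', 'The Beaches',
                      'Emery, Humberlea'],
    'Cluster Labels': [0, 0, 0, 3, 3, 3, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Chinese Restaurant',
                              'Park', 'Park', 'Coffee Shop',
                              'Baseball Field'],
})

grouped = sample.groupby('Cluster Labels')

# Number of neighbourhoods per cluster label.
sizes = grouped.size()

# Most frequent top venue within each cluster label.
top_venue = grouped['1st Most Common Venue'].agg(lambda s: s.value_counts().idxmax())

print(sizes)
print(top_venue)
```

In this sample, parks dominate label 3 and coffee shops lead label 0. Running the same aggregation on `tr_merged` would show whether that impression holds for the full clusters.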
