<h1 align=center><font size = 5>Segmenting and Clustering Neighbourhoods in Toronto</font></h1>
<h1 align=center><font size = 2>Ilan Benchetrit</font></h1>

# I - Scraping data

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [87]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')


Now let's import Beautiful Soup and its dependencies to scrape the Wikipedia page

In [88]:
!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup

!conda install -c anaconda lxml --yes
!conda install -c anaconda html5lib --yes
!conda install -c anaconda requests --yes
import requests

print('BeautifulSoup and its dependencies imported')


#### Load the HTML page and scrape it with BeautifulSoup

In [89]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


In [90]:
with open("toronto_data.html") as html_file:
    wikipage = BeautifulSoup(html_file,'lxml')

body = wikipage.find('tbody')

#print(body.prettify())

Then, we extract the useful data from the HTML page.
<br>In the following code, we assume that:
- if a cell has a borough but a "Not assigned" neighbourhood, then the neighbourhood will be the same as the borough.
- if a cell has a neighbourhood but a "Not assigned" borough, then the borough will be the same as the neighbourhood (M7A for instance). **EDIT: this code was written before knowing that M7A is populated in the location file.**
- as it is not requested, we get rid of the cardinal specifications.

In [91]:
data = []
for cell in body.find_all('p'):
    #we first extract the postal code
    pcode = cell.b.text
    
    #then we extract the borough and the neighbourhood
    try :
        try :
            borough = cell.i.text #in this case, the borough is not assigned, so it is formatted in italics
            neighbourhood = borough
        except :
            borough = cell.span.text.split('(')[0] 
            neighbourhood = cell.span.text.split('(')[1] #we split borough from neighbourhoods
            neighbourhood = neighbourhood.split(')')[0] #we get rid of cardinal specifications
            neighbourhood = neighbourhood.replace(' /',',')
    except : #this case is for postal code without borough like M7A
        borough = 'Not assigned'
        neighbourhood = cell.span.text
        neighbourhood = neighbourhood.split(')')[0] #we get rid of cardinal specifications
        neighbourhood = neighbourhood.replace(' /',',')
    
    #we append the row built in this iteration to the data list
    l = [pcode, borough, neighbourhood]
    data.append(l)

#print(data)
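
The assumptions listed before the cell above can also be written as a small function that works on plain strings, independently of the HTML traversal. This is only a sketch: `split_cell` is a hypothetical helper that is not part of this notebook, and the input format `Borough(Neighbourhood / Neighbourhood)` is an assumption inferred from the `split('(')` and `split(')')` calls above.

```python
def split_cell(text):
    """Split 'Borough(Hood A / Hood B)' into (borough, neighbourhood).

    Hypothetical helper: the input format is an assumption based on the
    split('(') / split(')') logic used in the scraping cell.
    """
    text = text.strip()
    if text == 'Not assigned':
        # nothing is known for this postal code
        return 'Not assigned', 'Not assigned'
    if '(' not in text:
        # only one of the two fields is present: use it for both
        name = text.replace(' /', ',')
        return name, name
    borough, _, rest = text.partition('(')
    # keep what is inside the first pair of parentheses only
    neighbourhood = rest.split(')')[0].replace(' /', ',')
    return borough.strip(), neighbourhood.strip()


print(split_cell('North York(Parkwoods)'))
print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('Not assigned'))
```

Keeping the rule in one function makes it easy to check it against a few known cells before running it over the whole page.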

Transform the data into a *pandas* dataframe

In [92]:
# define the dataframe columns
column_names = ['Postal Code', 'Borough', 'Neighbourhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)

#df

In [93]:
# build the dataframe directly from the scraped rows
# (DataFrame.append was removed in pandas 2.0, so we avoid the row-by-row loop)
df = pd.DataFrame(data, columns=column_names)

#df

**Here we add a Borough (Downtown Toronto) for the M7A Postal Code**

In [94]:
df.loc[df['Postal Code'] == 'M7A','Borough'] = 'Downtown Toronto'

Now we delete the rows for which Borough is not assigned


In [95]:
indexNames = df[ df['Borough'] == 'Not assigned' ].index
df.drop(indexNames , inplace=True)
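
The same filter can be expressed as a boolean mask, which returns a new dataframe instead of modifying `df` in place. A small sketch on a toy frame (the three rows are for illustration only and `toy` and `assigned` are hypothetical names):

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# keep only the rows whose borough is assigned; the original index is preserved
assigned = toy[toy['Borough'] != 'Not assigned']
print(assigned)
print(assigned.shape)
```

As with `drop`, the surviving rows keep their original index values, which is why the index of the dataframe shown next has gaps.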

In [96]:
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


As requested, here is the shape of the final dataframe with clean data

In [97]:
df.shape

(103, 3)

# II - Scraping locations

We use the CSV file to retrieve the coordinates of each postal code

First, we download the location data

In [98]:
!wget -q -O 'toronto_localisation.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

loc_data = pd.read_csv("toronto_localisation.csv")

Data downloaded!


Now we merge the two dataframes on their common column 'Postal Code'

In [99]:
clean_df = pd.merge(left=df, right=loc_data, how='right', left_on='Postal Code', right_on='Postal Code')
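
Because `how='right'` keeps every row of the coordinates file, a postal code that is missing from `df` would show up with an empty `Borough` and `Neighbourhood`. One way to check for such rows is the `indicator` option of `pd.merge`. The sketch below uses two made-up frames; the `M9Z` row and its coordinates are invented for the example.

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                     'Borough': ['North York', 'North York']})
# 'M9Z' is an invented postal code that has no match on the left side
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Latitude': [43.753259, 43.725882, 43.0],
                      'Longitude': [-79.329656, -79.315572, -79.0]})

# indicator=True adds a '_merge' column telling where each row came from
checked = pd.merge(left, right, how='right', on='Postal Code', indicator=True)
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['Postal Code', '_merge']])
```

An empty `unmatched` frame would mean that both sources cover exactly the same postal codes.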

In [100]:
cdf = clean_df
cdf

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


# III - Explore venues in Etobicoke

Let's create a map of Toronto with neighbourhoods superimposed on top.

In [101]:
address = 'Toronto, CA'

from geopy.exc import GeocoderTimedOut

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(cdf['Latitude'], cdf['Longitude'], cdf['Borough'], cdf['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's slice the original dataframe and create a new dataframe of the Etobicoke data.

In [102]:
etobicoke_data = cdf[cdf['Borough'] == 'Etobicoke'].reset_index(drop=True)
etobicoke_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
1,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724
2,M9C,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201
3,M9P,Etobicoke,Westmount,43.696319,-79.532242
4,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724


Now let's create a function to explore all the neighbourhoods in Etobicoke

My Foursquare credentials are intentionally not shown here; replace the placeholders in the cell below with your own.

In [103]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare secret
VERSION = '20200327' # Foursquare API version

In [104]:
def getNearbyVenues(names, latitudes, longitudes, radius=500,LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

etobicoke_venues= getNearbyVenues(names=etobicoke_data['Neighbourhood'],
                                   latitudes=etobicoke_data['Latitude'],
                                   longitudes=etobicoke_data['Longitude']
                                  )

Islington Avenue
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Westmount
Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens
New Toronto, Mimico South, Humber Bay Shores
South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens
Alderwood, Long Branch
The Kingsway, Montgomery Road, Old Mill North
Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Park South East
Mimico NW, The Queensway West, South of Bloor, Kingsway Park South West, Royal York South West
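
Eleven neighbourhoods are queried, but a neighbourhood for which the API returns no items contributes no rows to the resulting dataframe. The sketch below isolates the extraction step and runs it on hand-written payloads. The payloads are hypothetical and only mimic the keys accessed in `getNearbyVenues` (`response`, `groups`, `items`, `venue`, `location`, `categories`); `extract_venues` is an illustrative helper, not part of this notebook.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style payload into rows, tolerating missing keys."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     categories[0].get('name')))
    return rows


# hypothetical payloads shaped like the keys used in getNearbyVenues
with_items = {'response': {'groups': [{'items': [
    {'venue': {'name': 'LCBO',
               'location': {'lat': 43.642099, 'lng': -79.576592},
               'categories': [{'name': 'Liquor Store'}]}}]}]}}
empty = {'response': {'groups': [{'items': []}]}}

print(extract_venues(with_items, 'Eringate', 43.643515, -79.577201))
print(extract_venues(empty, 'Islington Avenue', 43.667856, -79.532242))
```

Using `.get` with defaults means a venue without a category yields `None` instead of raising an `IndexError`, at the cost of having to handle missing values later.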


Let's check the size of the resulting dataframe

In [105]:
print(etobicoke_venues.shape)
etobicoke_venues.head()

(72, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,LCBO,43.642099,-79.576592,Liquor Store
1,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Starbucks,43.641312,-79.576924,Coffee Shop
2,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,The Beer Store,43.641313,-79.576925,Beer Store
3,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Shoppers Drug Mart,43.641312,-79.576924,Cosmetics Shop
4,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Pizza Hut,43.641845,-79.576556,Pizza Place


Let's check how many venues were returned for each neighborhood

In [106]:
etobicoke_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Alderwood, Long Branch",9,9,9,9,9,9
"Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood",9,9,9,9,9,9
"Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens",4,4,4,4,4,4
"Mimico NW, The Queensway West, South of Bloor, Kingsway Park South West, Royal York South West",15,15,15,15,15,15
"New Toronto, Mimico South, Humber Bay Shores",14,14,14,14,14,14
"Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Park South East",3,3,3,3,3,3
"South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens",10,10,10,10,10,10
"The Kingsway, Montgomery Road, Old Mill North",2,2,2,2,2,2
Westmount,6,6,6,6,6,6


Let's find out how many unique categories can be curated from all the returned venues

In [107]:
print('There are {} unique categories.'.format(len(etobicoke_venues['Venue Category'].unique())))

There are 41 unique categories.


# IV - Analyse each neighbourhood

In [108]:
# one hot encoding
etobicoke_onehot = pd.get_dummies(etobicoke_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
etobicoke_onehot['Neighbourhood'] = etobicoke_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [etobicoke_onehot.columns[-1]] + list(etobicoke_onehot.columns[:-1])
etobicoke_onehot = etobicoke_onehot[fixed_columns]

etobicoke_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Bakery,Baseball Field,Beer Store,Burger Joint,Burrito Place,Business Service,Café,Chinese Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Cosmetics Shop,Discount Store,Fast Food Restaurant,Flower Shop,Fried Chicken Joint,Grocery Store,Gym,Hardware Store,Home Service,Intersection,Japanese Restaurant,Kids Store,Liquor Store,Mobile Phone Shop,Park,Pet Store,Pharmacy,Pizza Place,Pool,Pub,Rental Service,Restaurant,River,Sandwich Place,Shopping Plaza,Skating Rink,Supplement Shop,Tanning Salon,Wings Joint
0,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighbourhood, taking the mean of the frequency of occurrence of each category


In [109]:
etobicoke_grouped = etobicoke_onehot.groupby('Neighbourhood').mean().reset_index()
etobicoke_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Bakery,Baseball Field,Beer Store,Burger Joint,Burrito Place,Business Service,Café,Chinese Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Cosmetics Shop,Discount Store,Fast Food Restaurant,Flower Shop,Fried Chicken Joint,Grocery Store,Gym,Hardware Store,Home Service,Intersection,Japanese Restaurant,Kids Store,Liquor Store,Mobile Phone Shop,Park,Pet Store,Pharmacy,Pizza Place,Pool,Pub,Rental Service,Restaurant,River,Sandwich Place,Shopping Plaza,Skating Rink,Supplement Shop,Tanning Salon,Wings Joint
0,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.222222,0.111111,0.111111,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.0
1,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0
2,"Kingsview Village, St. Phillips, Martin Grove ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
3,"Mimico NW, The Queensway West, South of Bloor,...",0.0,0.066667,0.0,0.0,0.066667,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.066667,0.066667,0.066667,0.0,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.066667,0.066667,0.066667
4,"New Toronto, Mimico South, Humber Bay Shores",0.071429,0.071429,0.0,0.0,0.0,0.0,0.142857,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.071429,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.071429,0.071429,0.071429,0.0,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Old Mill South, King's Mill Park, Sunnylea, Hu...",0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"South Steeles, Silverstone, Humbergate, Jamest...",0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.0,0.1,0.2,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
7,"The Kingsway, Montgomery Road, Old Mill North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
8,Westmount,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0
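
Taking the mean of the one-hot columns within a group is the same as computing each category's share of that group's venues. A minimal sketch in plain Python, on a made-up list of categories for a single neighbourhood:

```python
from collections import Counter

# made-up venue categories for one neighbourhood (illustrative only)
categories = ['Pizza Place', 'Pizza Place', 'Gym', 'Pub']

counts = Counter(categories)
total = len(categories)

# share of each category = mean of its one-hot column within the group
frequencies = {cat: n / total for cat, n in counts.items()}
print(frequencies)
```

This is why each row of the grouped dataframe sums to 1, and why a neighbourhood with few venues shows large values such as 0.5 or 0.333333.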


Let's print each neighbourhood along with the top 5 most common venues

In [110]:
num_top_venues = 5

for hood in etobicoke_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = etobicoke_grouped[etobicoke_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Alderwood, Long Branch----
            venue  freq
0     Pizza Place  0.22
1             Gym  0.11
2    Skating Rink  0.11
3  Sandwich Place  0.11
4     Coffee Shop  0.11


----Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood----
            venue  freq
0    Liquor Store  0.11
1      Beer Store  0.11
2            Park  0.11
3       Pet Store  0.11
4  Shopping Plaza  0.11


----Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens----
                 venue  freq
0    Mobile Phone Shop  0.25
1                 Park  0.25
2       Sandwich Place  0.25
3          Pizza Place  0.25
4  American Restaurant  0.00


----Mimico NW, The Queensway West, South of Bloor, Kingsway Park South West, Royal York South West----
               venue  freq
0        Wings Joint  0.07
1  Convenience Store  0.07
2         Kids Store  0.07
3             Bakery  0.07
4     Hardware Store  0.07


----New Toronto, Mimico South, Humber Bay Shores----
                 venue  freq
0 

Let's put that into a *pandas* dataframe


In [111]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = etobicoke_grouped['Neighbourhood']

for ind in np.arange(etobicoke_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(etobicoke_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Alderwood, Long Branch",Pizza Place,Pharmacy,Gym,Skating Rink,Sandwich Place,Pub,Pool,Coffee Shop,Chinese Restaurant,Fast Food Restaurant
1,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",Pizza Place,Park,Shopping Plaza,Beer Store,Cosmetics Shop,Café,Liquor Store,Coffee Shop,Pet Store,Flower Shop
2,"Kingsview Village, St. Phillips, Martin Grove ...",Pizza Place,Sandwich Place,Mobile Phone Shop,Park,Coffee Shop,Flower Shop,Fast Food Restaurant,Discount Store,Cosmetics Shop,Convenience Store
3,"Mimico NW, The Queensway West, South of Bloor,...",Wings Joint,Convenience Store,Gym,Hardware Store,Tanning Salon,Flower Shop,Fast Food Restaurant,Kids Store,Discount Store,Grocery Store
4,"New Toronto, Mimico South, Humber Bay Shores",Business Service,American Restaurant,Gym,Fast Food Restaurant,Liquor Store,Pet Store,Fried Chicken Joint,Pizza Place,Café,Pharmacy
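
The `try`/`except` used to build the column names yields `1st`, `2nd`, `3rd` and then falls back to `th`. That is correct for ten columns, but it would produce `21th` for larger values. A possible general helper, shown as a sketch (`ordinal` is a hypothetical name that is not used elsewhere in this notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i))
                               for i in range(1, 11)]
print(columns[1], columns[4], columns[10])
```

Note also that many categories share the same frequency within a neighbourhood, so the order of tied venues in the table depends on how the sort breaks ties.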


# V - Cluster Neighbourhoods

Run *k*-means to cluster the neighbourhoods into 5 clusters.

In [112]:
# set number of clusters
kclusters = 5

etobicoke_grouped_clustering = etobicoke_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(etobicoke_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 4, 0, 0, 2, 0, 1, 3], dtype=int32)
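
For intuition about what `KMeans` does with the frequency rows, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not scikit-learn's implementation: it simply initialises the centroids with the first `k` points, and the four 2-D points are invented for the example. `simple_kmeans` and `squared_distance` are hypothetical names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_kmeans(points, k, max_iter=100):
    """Tiny Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # assignment step: nearest centroid for every point
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# invented points forming two obvious groups
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.2), (1.0, 0.9)]
labels, centroids = simple_kmeans(points, k=2)
print(labels)
```

In this notebook only nine neighbourhoods are split into five clusters, so several clusters necessarily contain a single neighbourhood, as the label array above shows.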

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [113]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

etobicoke_merged = etobicoke_data
etobicoke_merged = etobicoke_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

etobicoke_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,,,,,,,,,,,
1,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724,,,,,,,,,,,
2,M9C,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,0.0,Pizza Place,Park,Shopping Plaza,Beer Store,Cosmetics Shop,Café,Liquor Store,Coffee Shop,Pet Store,Flower Shop
3,M9P,Etobicoke,Westmount,43.696319,-79.532242,3.0,Pizza Place,Coffee Shop,Intersection,Sandwich Place,Discount Store,Chinese Restaurant,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Cosmetics Shop
4,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724,4.0,Pizza Place,Sandwich Place,Mobile Phone Shop,Park,Coffee Shop,Flower Shop,Fast Food Restaurant,Discount Store,Cosmetics Shop,Convenience Store


Let's drop the neighbourhoods that have no cluster label (no venues were returned for them) and change Cluster Labels from floats to integers

In [114]:
etobicoke_merged = etobicoke_merged.dropna()
etobicoke_merged['Cluster Labels'] = etobicoke_merged['Cluster Labels'].astype(int)
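
`dropna()` removes every row that has a missing value, which here means the neighbourhoods without a cluster label. If those rows should be kept, the nullable integer dtype of pandas is one option. A sketch on a made-up column, assuming a pandas version that supports the `'Int64'` extension dtype:

```python
import pandas as pd

# made-up labels: the first neighbourhood has no cluster
labels = pd.Series([None, 0.0, 3.0, 4.0], name='Cluster Labels')

# 'Int64' (capital I) is the nullable integer dtype: missing values stay missing
nullable = labels.astype('Int64')
print(nullable)
print(nullable.isna().sum())
```

Rows with a missing label would then still need to be skipped or given a neutral colour when drawing the map.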

Finally, let's visualize the resulting clusters

In [115]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(etobicoke_merged['Latitude'], etobicoke_merged['Longitude'], etobicoke_merged['Neighbourhood'], etobicoke_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# VI - Examine Clusters

**Cluster 1 :**
<br>This cluster represents the night life of Etobicoke, with lots of beer stores, liquor stores and fast food restaurants.

In [116]:
etobicoke_merged.loc[etobicoke_merged['Cluster Labels'] == 0, etobicoke_merged.columns[[1] + list(range(5, etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Etobicoke,0,Pizza Place,Park,Shopping Plaza,Beer Store,Cosmetics Shop,Café,Liquor Store,Coffee Shop,Pet Store,Flower Shop
5,Etobicoke,0,Business Service,American Restaurant,Gym,Fast Food Restaurant,Liquor Store,Pet Store,Fried Chicken Joint,Pizza Place,Café,Pharmacy
6,Etobicoke,0,Grocery Store,Pizza Place,Fast Food Restaurant,Japanese Restaurant,Discount Store,Pharmacy,Fried Chicken Joint,Sandwich Place,Beer Store,Burrito Place
10,Etobicoke,0,Wings Joint,Convenience Store,Gym,Hardware Store,Tanning Salon,Flower Shop,Fast Food Restaurant,Kids Store,Discount Store,Grocery Store


**Cluster 2 :**
<br>Mainly focused on outdoor spaces, with River and Park as the most common venues. That said, it is hard to draw conclusions from a one-member cluster...

In [117]:
etobicoke_merged.loc[etobicoke_merged['Cluster Labels'] == 1, etobicoke_merged.columns[[1] + list(range(5, etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Etobicoke,1,River,Park,Wings Joint,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Discount Store,Cosmetics Shop,Convenience Store


**Cluster 3 :**
<br>Mainly focused on home services, construction and landscaping, along with a baseball field. That said, it is hard to draw conclusions from a one-member cluster...

In [118]:
etobicoke_merged.loc[etobicoke_merged['Cluster Labels'] == 2, etobicoke_merged.columns[[1] + list(range(5, etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Etobicoke,2,Home Service,Construction & Landscaping,Baseball Field,Grocery Store,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Discount Store,Cosmetics Shop,Convenience Store


**Cluster 4 :**
<br>This cluster definitely represents the place to go to grab a slice of pizza: Pizza Place is the most common venue in both of its neighbourhoods.

In [119]:
etobicoke_merged.loc[etobicoke_merged['Cluster Labels'] == 3, etobicoke_merged.columns[[1] + list(range(5, etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Etobicoke,3,Pizza Place,Coffee Shop,Intersection,Sandwich Place,Discount Store,Chinese Restaurant,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Cosmetics Shop
7,Etobicoke,3,Pizza Place,Pharmacy,Gym,Skating Rink,Sandwich Place,Pub,Pool,Coffee Shop,Chinese Restaurant,Fast Food Restaurant


**Cluster 5 :**
<br>An even mix of pizza place, sandwich place, mobile phone shop and park. That said, it is hard to draw conclusions from a one-member cluster...

In [120]:
etobicoke_merged.loc[etobicoke_merged['Cluster Labels'] == 4, etobicoke_merged.columns[[1] + list(range(5, etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Etobicoke,4,Pizza Place,Sandwich Place,Mobile Phone Shop,Park,Coffee Shop,Flower Shop,Fast Food Restaurant,Discount Store,Cosmetics Shop,Convenience Store


Thank you for reading through all of this.