# Assignment week 3 - Neighbourhoods of Toronto


### Aim
Explore, segment and cluster the neighbourhoods of Toronto using data from Wikipedia and FourSquare.


## 1. Get and clean the data about Toronto neighbourhoods from Wikipedia



In [2]:
# first install the wikipedia package (a Python wrapper around the Wikipedia API)
!conda install -c conda-forge wikipedia --yes 

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/Elisabeth/anaconda3

  added / updated specs:
    - wikipedia


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    wikipedia-1.4.0            |             py_2          13 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          13 KB

The following NEW packages will be INSTALLED:

  wikipedia          conda-forge/noarch::wikipedia-1.4.0-py_2



Downloading and Extracting Packages
wikipedia-1.4.0      | 13 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [1]:
### Import libraries needed for whole project ###

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import wikipedia

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# library to handle requests
import requests 
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 

print('Libraries imported.')

Libraries imported.


#### 1.1. Load data from wikipedia

In [4]:
# Get the html source
html = wikipedia.page("List_of_postal_codes_of_Canada:_M").html().encode("UTF-8")

# parse the first table on the page into a pandas DataFrame
df = pd.read_html(html)[0]
df.rename(columns={'Postcode': 'PostalCode', 'District': 'Borough'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### 1.2 Clean data

In [5]:
# exclude rows whose borough is not assigned (.copy() so that later assignments modify df2 itself)
df2 = df[df.Borough != 'Not assigned'].copy()
df2.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [6]:
# number of valid Toronto postcodes
postcode=df2.PostalCode.unique()
len(postcode)

103

In [7]:
# find not assigned neighbourhoods
df2[df2.Neighbourhood=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
7,M7A,Queen's Park,Not assigned


In [8]:
# assign Borough name to not assigned neighbourhoods
df2.loc[(df2.Neighbourhood == 'Not assigned'),'Neighbourhood']= df2['Borough']
df2


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Downtown Toronto,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [9]:
# combine neighbourhoods that share a postal code into one comma-separated string
df_neighbourhoods=df2.groupby("PostalCode")['Neighbourhood'].apply(lambda Neighbourhood: ','.join(Neighbourhood))

df_neighbourhoods

PostalCode
M1B                                        Rouge,Malvern
M1C                 Highland Creek,Rouge Hill,Port Union
M1E                      Guildwood,Morningside,West Hill
M1G                                               Woburn
M1H                                            Cedarbrae
M1J                                  Scarborough Village
M1K            East Birchmount Park,Ionview,Kennedy Park
M1L                        Clairlea,Golden Mile,Oakridge
M1M        Cliffcrest,Cliffside,Scarborough Village West
M1N                           Birch Cliff,Cliffside West
M1P    Dorset Park,Scarborough Town Centre,Wexford He...
M1R                                     Maryvale,Wexford
M1S                                            Agincourt
M1T                Clarks Corners,Sullivan,Tam O'Shanter
M1V    Agincourt North,L'Amoreaux East,Milliken,Steel...
M1W                                      L'Amoreaux West
M1X                                          Upper Rouge
M2H                 

In [10]:
# make dataframe with unique postcode and borough
df_borough=df2[['PostalCode','Borough']].drop_duplicates()

In [11]:
# join both dataframes on PostalCode
df_final=df_borough.join(df_neighbourhoods,'PostalCode')
df_final.reset_index(drop=True,inplace=True)
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [12]:
# save dataframe as .csv file
df_final.to_csv('df_final.csv')

In [13]:
# check that every postcode is included: the row count should equal the 103 unique postcodes found above
print('Toronto has',df_final.shape[0],'unique postcodes.')

Toronto has 103 unique postcodes.
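
The cleaning steps of section 1 can be tried on a tiny table first. The following is a minimal sketch, assuming a toy dataframe `raw` with the same three columns; the names `raw`, `clean` and `toy_final` are only used for illustration, and grouping on both `PostalCode` and `Borough` is one possible alternative to building two dataframes and joining them.

```python
import pandas as pd

# Toy rows in the same shape as the Wikipedia table (a small hand-picked subset).
raw = pd.DataFrame({
    'PostalCode':    ['M1A', 'M6A', 'M6A', 'M7A'],
    'Borough':       ['Not assigned', 'North York', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Lawrence Heights', 'Lawrence Manor', 'Not assigned'],
})

# 1. Drop rows without a borough; .copy() avoids SettingWithCopyWarning later on.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Where the neighbourhood is missing, fall back to the borough name.
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# 3. One row per postal code, with its neighbourhoods joined by commas.
toy_final = (clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
                  .agg(','.join)
                  .reset_index())

print(toy_final)
```

On the toy data this leaves two rows: M6A with both Lawrence neighbourhoods combined, and M7A with the borough name filled in.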


## 2. Getting latitude and longitude data

(All required libraries were imported above as part of section 1.)

In [None]:
# install the geopy library
!conda install -c conda-forge geopy --yes

Collecting package metadata (repodata.json): - 

#### 2.1 Using the geopy library

In [None]:
#from geopy.geocoders import Nominatim

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
    #g = geocoder.google('{}, Toronto, Ontario'.format(df_final.PostalCode[0]))
    #lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

This geocoding attempt did not return any results, so the coordinates are read from the provided .csv file instead (see below).

#### 2.2 Using the provided .csv file because the geopy library is slow and unreliable

In [15]:
lat_lng_coords = pd.read_csv('Geospatial_Coordinates.csv')
lat_lng_coords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
lat_lng_coords.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### 2.3 Join longitude and latitude data with neighbourhood/borough data

In [16]:
df_final.set_index('PostalCode',inplace=True )
df_final_lat_lng=df_final.join(lat_lng_coords.set_index('PostalCode'))
df_final_lat_lng.reset_index(drop=False,inplace=True)
df_final_lat_lng.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
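
A join like the one above silently leaves `NaN` coordinates for any postal code that is missing from the coordinate file. The sketch below shows one way to make such gaps visible, using `DataFrame.merge` with `indicator=True` and `validate='one_to_one'`. The frames `left` and `right` are toy data, and the postal code `M9Z` is a made-up value that exists only to produce an unmatched row.

```python
import pandas as pd

# Toy stand-ins for df_final and lat_lng_coords; 'M9Z' is a hypothetical code
# with no coordinates, included to demonstrate the check.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough':    ['North York', 'North York', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude':   [43.753259, 43.725882],
    'Longitude':  [-79.329656, -79.315572],
})

# validate raises if a postal code occurs more than once on either side;
# indicator adds a '_merge' column that says where each row came from.
merged = left.merge(right, on='PostalCode', how='left',
                    validate='one_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```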




## 3. Cluster analysis of Toronto Neighbourhoods

#### 3.1 Create a map of Toronto with Neighbourhoods superimposed on top.

In [17]:
# get latitude and longitude of toronto
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer", timeout=3)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_final_lat_lng['Latitude'], df_final_lat_lng['Longitude'], df_final_lat_lng['Borough'], df_final_lat_lng['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood,borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)

map_toronto

#### 3.2 Focus analysis on inner Toronto neighbourhoods

For the rest of the analysis I focus on inner Toronto, so I select only the boroughs whose name contains the word 'Toronto' (East Toronto, West Toronto, Central Toronto and Downtown Toronto).


In [19]:
# new dataframe containing only the data from inner Toronto neighbourhoods
Toronto_inner = df_final_lat_lng[df_final_lat_lng['Borough'].str.contains('Toronto')].reset_index(drop=True)
Toronto_inner.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M9A,Downtown Toronto,Queen's Park,43.667856,-79.532242
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


To centre the map of inner Toronto, I use the median latitude and longitude of the inner Toronto dataset.

In [20]:
latitude_in = np.median(Toronto_inner['Latitude'])
longitude_in = np.median(Toronto_inner['Longitude'])

Visualize inner Toronto with its neighbourhoods

In [21]:
# create map of Toronto using latitude and longitude values
map_toronto_in = folium.Map(location = [latitude_in, longitude_in], zoom_start = 12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(Toronto_inner['Latitude'], Toronto_inner['Longitude'], Toronto_inner['Borough'], Toronto_inner['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood,borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto_in)

map_toronto_in

#### 3.3 Explore Neighbourhoods using FourSquare API

##### Define FourSquare Credentials and Version

In [54]:
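CLIENT_ID = 'your-foursquare-id'          # replace with your Foursquare ID
CLIENT_SECRET = 'your-foursquare-secret'  # replace with your Foursquare Secret
VERSION = '20191210'

# confirm that the credentials are set without echoing the secret itself
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + '*' * len(CLIENT_SECRET))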
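
Credentials typed into a notebook cell are easy to publish by accident. One common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are my own choice and not something FourSquare prescribes; `load_foursquare_credentials` and `mask` are hypothetical helpers.

```python
import os

def load_foursquare_credentials(environ=os.environ):
    """Return (client_id, client_secret), or raise if either variable is missing."""
    try:
        return environ['FOURSQUARE_CLIENT_ID'], environ['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demonstration with a plain dict instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'abcd1234', 'FOURSQUARE_CLIENT_SECRET': 'wxyz5678'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In the notebook itself the call would be `load_foursquare_credentials()` without an argument, so that the values come from the shell that started Jupyter.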

##### Explore one example neighbourhood

In [23]:
# take first neighbourhood in dataframe as example
Toronto_inner.loc[0]

PostalCode                    M5A
Borough          Downtown Toronto
Neighbourhood        Harbourfront
Latitude                  43.6543
Longitude                -79.3606
Name: 0, dtype: object

##### Get the venues within a 500 m radius of the example neighbourhood

In [24]:
latitude = Toronto_inner.loc[0, 'Latitude']
longitude = Toronto_inner.loc[0, 'Longitude']
radius = 500
LIMIT = 100

## define URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=43.6542599,-79.3606359&v=20191210&radius=500&limit=100'
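
Instead of formatting the query string by hand, the parameters can be collected in a dictionary and encoded by the standard library. This is a minimal sketch with the same endpoint and parameter names as the cell above; `build_explore_url` is a hypothetical helper, and the `'ID'` and `'SECRET'` values are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version,
                      radius=500, limit=100):
    """Assemble the venues/explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters (the comma in 'll' becomes %2C).
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

demo_url = build_explore_url('ID', 'SECRET', 43.6542599, -79.3606359, '20191210')
print(demo_url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string at all.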

In [25]:
# request results
results = requests.get(url).json() # radius etc set before

In [26]:
# get venue data
venues = results['response']['groups'][0]['items']

In [31]:
# save results as dataframe
nearby = json_normalize(venues) 

# keep name, categories, every location field (this already includes lat/lng) and the id;
# lat and lng are not listed separately so that no column is selected twice
location_columns = [col for col in nearby.columns if col.startswith('venue.location.')]
filtered_columns = ['venue.name', 'venue.categories'] + location_columns + ['venue.id']
nearby_filtered = nearby.loc[:, filtered_columns].copy()


# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# replace each row's category list with the name of its first category
nearby_filtered['venue.categories'] = nearby_filtered.apply(get_category_type, axis=1)

# clean columns: keep only the last part of each dotted column name
nearby_filtered.columns = [col.split('.')[-1] for col in nearby_filtered.columns]

#nearby_filtered.head(10)

In [33]:
# How many venues are in a distance of 500 m from the center of the example neighbourhood

print('In a 500 m radius from', Toronto_inner.loc[0, 'Neighbourhood'], 'you can find', nearby.shape[0], 'venues rated on FourSquare.')

In a 500 m radius from Harbourfront you can find 50 venues rated on FourSquare.


#### Write function to repeat analysis for all neighbourhoods in inner Toronto

In [34]:
def getNearbyVenues(PostalCode, neighbourhood_names, latitudes, longitudes,  radius=500):
    
    venues_list=[]
    for postcode, name, lat, lng in zip(PostalCode, neighbourhood_names, latitudes, longitudes):
        #print(name)
        
        # create API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, lng, 
            VERSION, 
            radius, 
            LIMIT)
    
        # get request
        results = requests.get(url).json()['response']['groups'][0]['items']
    
        # get venue data
        venues_list.append([(
        postcode,    
        name,
        lat, lng,
        v['venue']['name'], 
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name'])    
        for v in results])
    
    # make dataframe with all the important data
    nearby = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby.columns= ['Postal Code','Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude',
                     'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return(nearby)
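
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back without any category. The sketch below shows a more defensive way to flatten one response, assuming the same JSON layout as in the cells above. `venue_rows` is a hypothetical helper and `sample_items` is hand-written test data; the second venue in it is invented to exercise the empty-category case.

```python
def venue_rows(postcode, name, lat, lng, items):
    """Turn the `items` list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        # A venue may come without categories; record None instead of failing.
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((postcode, name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written items in the shape of results['response']['groups'][0]['items'].
sample_items = [
    {'venue': {'name': 'Roselle Desserts',
               'location': {'lat': 43.653447, 'lng': -79.362017},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorised place (made up)',
               'location': {'lat': 43.654, 'lng': -79.361},
               'categories': []}},
]

rows = venue_rows('M5A', 'Harbourfront', 43.65426, -79.360636, sample_items)
for row in rows:
    print(row)
```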
    

#### Get data of all inner Toronto neighbourhoods

Use PostalCode as neighbourhood identifier

In [35]:
Toronto_inner_venues =  getNearbyVenues(PostalCode=Toronto_inner['PostalCode'],
                                    neighbourhood_names=Toronto_inner['Neighbourhood'], 
                                    latitudes=Toronto_inner['Latitude'],
                                    longitudes= Toronto_inner['Longitude'],
                                    radius=500)
Toronto_inner_venues.head(10)

Unnamed: 0,Postal Code,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,M5A,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
5,M5A,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
6,M5A,Harbourfront,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
7,M5A,Harbourfront,43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
8,M5A,Harbourfront,43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
9,M5A,Harbourfront,43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


In [36]:
# how many venues were found for inner Toronto?
print(Toronto_inner_venues.shape[0], 'venues were found on FourSquare in inner Toronto')

# how many venues were found per neighbourhood?
Toronto_inner_venues.groupby('Neighbourhood').count()


1685 venues were found on FourSquare in inner Toronto


Neighbourhood,Postal Code,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55,55
"Brockton,Exhibition Place,Parkdale Village",22,22,22,22,22,22,22
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16,16
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",15,15,15,15,15,15,15
"Cabbagetown,St. James Town",43,43,43,43,43,43,43
Central Bay Street,84,84,84,84,84,84,84
"Chinatown,Grange Park,Kensington Market",94,94,94,94,94,94,94
Christie,17,17,17,17,17,17,17
Church and Wellesley,83,83,83,83,83,83,83


Note: For some neighbourhoods the number of venues is exactly 100, which equals the LIMIT set earlier, so these neighbourhoods may well have more than 100 venues within the radius.
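
To see which neighbourhoods are affected, the venue counts can be compared with the limit. This is a small sketch, assuming the counts are available as a dictionary; `neighbourhoods_at_limit` is a hypothetical helper and `sample_counts` holds four values copied from the table above.

```python
def neighbourhoods_at_limit(venue_counts, limit=100):
    """Return the names whose venue count reached the request limit."""
    return sorted(name for name, n in venue_counts.items() if n >= limit)

# Counts copied from the table above. In the notebook they could be obtained with
# Toronto_inner_venues.groupby('Neighbourhood').size().to_dict()
sample_counts = {
    'Adelaide,King,Richmond': 100,
    'Berczy Park': 55,
    'Central Bay Street': 84,
    'Chinatown,Grange Park,Kensington Market': 94,
}

capped = neighbourhoods_at_limit(sample_counts)
print('Possibly truncated:', capped)
```

For these neighbourhoods the category frequencies are computed from a truncated sample, which is worth keeping in mind when reading the cluster results.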

In [37]:
# How many unique categories are there?
print('There are',len(Toronto_inner_venues['Venue Category'].unique()) , 'unique venue categories.')

There are 232 unique venue categories.


### Analyzing each neighbourhood

In [38]:
# encode venue categories with dummies (one-hot encoding for further analysis)
toronto_onehot = pd.get_dummies(Toronto_inner_venues[['Venue Category']], prefix="",prefix_sep="")

# add neighbourhood data so we know in which neighbourhood which category is present and move it to the first column
toronto_onehot['Neighbourhood'] =  Toronto_inner_venues['Neighbourhood'] 
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head(10)


Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# How many categories do we have (number of columns-1)?
toronto_onehot.shape

(1685, 233)

In [40]:
# Group the data by neighbourhood and calculate mean occurrence of each category per neighbourhood
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.066667,0.066667,0.066667,0.133333,0.133333,0.133333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,...,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905,0.0,0.011905
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010638,0.0,0.0,0.0,0.031915,0.0,0.042553,0.010638,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.012048,0.012048
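
Each value in this table is the share of a neighbourhood's venues that fall into a category, so every row sums to 1. The toy example below illustrates this, and also shows that `pd.crosstab` with `normalize='index'` gives the same numbers in one step. The dataframe `toy_venues` and its neighbourhood names `A` and `B` are made up for this sketch.

```python
import pandas as pd

# Made-up venues for two neighbourhoods.
toy_venues = pd.DataFrame({
    'Neighbourhood':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# Route used in the notebook: one-hot encode, then average per neighbourhood.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', toy_venues['Neighbourhood'])
freq_onehot = onehot.groupby('Neighbourhood').mean()

# Shortcut: a contingency table normalised so that each row sums to 1.
freq_crosstab = pd.crosstab(toy_venues['Neighbourhood'],
                            toy_venues['Venue Category'],
                            normalize='index')

print(freq_onehot)
print(freq_crosstab)
```

In neighbourhood `A`, two of four venues are coffee shops, so its `Coffee Shop` value is 0.5 in both tables.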


### What are the top 5 most common venue categories per neighbourhood?

In [41]:
num_top_venues = 5

for neighbourhood in toronto_grouped['Neighbourhood']:
    # header
    print("----"+neighbourhood+"----") 
    
    # generate temporary dataframe with neighbourhood, venue type and frequency data
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == neighbourhood].T.reset_index()
    temp.columns = ['venue','freq']   
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    
    # sort values and print top 5
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
    
    
    

----Adelaide,King,Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2  Thai Restaurant  0.04
3       Steakhouse  0.04
4       Restaurant  0.03


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.04
2          Steakhouse  0.04
3  Seafood Restaurant  0.04
4         Cheese Shop  0.04


----Brockton,Exhibition Place,Parkdale Village----
            venue  freq
0  Breakfast Spot  0.09
1            Café  0.09
2     Coffee Shop  0.09
3             Gym  0.05
4          Bakery  0.05


----Business Reply Mail Processing Centre 969 Eastern----
              venue  freq
0           Brewery  0.06
1     Burrito Place  0.06
2  Recording Studio  0.06
3            Garden  0.06
4     Garden Center  0.06


----CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara----
                 venue  freq
0       Airport Lounge  0.13
1      Airport Service  0.13
2     Airport Term

In [42]:
# Sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [43]:
# put data into dataframe for further analysis

num_top_venues = 10

# indicators for column names
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,Salad Place,Burger Joint,Bar,Bakery,Sushi Restaurant,Asian Restaurant
1,Berczy Park,Coffee Shop,Steakhouse,Cocktail Bar,Café,Cheese Shop,Beer Bar,Bakery,Seafood Restaurant,Farmers Market,Bistro
2,"Brockton,Exhibition Place,Parkdale Village",Breakfast Spot,Café,Coffee Shop,Yoga Studio,Gym,Restaurant,Pet Store,Performing Arts Venue,Italian Restaurant,Intersection
3,Business Reply Mail Processing Centre 969 Eastern,Burrito Place,Smoke Shop,Fast Food Restaurant,Farmers Market,Spa,Pizza Place,Light Rail Station,Recording Studio,Garden,Garden Center
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Lounge,Airport Service,Airport Terminal,Boat or Ferry,Rental Car Location,Airport,Airport Food Court,Airport Gate,Boutique,Harbor / Marina
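
The column names above rely on a three-element suffix list, which is enough for ten columns but would produce labels such as '21th' for a longer ranking. A small helper that handles the general case could look like the sketch below; `ordinal` and `demo_columns` are names introduced only for this illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2 or 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column layout as in the cell above, built with the helper.
demo_columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(demo_columns)
```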


### Clustering Neighbourhoods
Use k-means clustering to group the neighbourhoods into 5 clusters according to the mean frequency of each venue category. The table of the 10 most common venues is only used afterwards to describe the clusters.

In [44]:
# define number of clusters
kclusters = 5

# drop the column that identifies the neighbourhood
toronto_grouped_clus = toronto_grouped.drop(columns=['Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clus)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
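
The first ten labels are all 0, which hints that the clusters are very unequal in size. The number of clusters was fixed at 5 here; one common way to compare several values of k is the silhouette score. The sketch below runs on synthetic data (`toy_X`, three well-separated blobs) and is not what was done in this notebook. In the notebook the same loop could be applied to `toronto_grouped_clus`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs of 20 points each.
rng = np.random.RandomState(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_X = np.vstack([c + 0.2 * rng.randn(20, 2) for c in centres])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print('silhouette scores:', {k: round(s, 3) for k, s in scores.items()})
print('best k:', best_k)
```

A higher score means points sit closer to their own cluster than to the next one. `np.bincount(kmeans.labels_)` is a quick complementary check that shows how many members each cluster has.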

Create a new dataframe that includes the cluster label as well as the top 10 venues for each neighbourhood

In [45]:
# add cluster labels 
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = Toronto_inner

# merge grouped data with general data to add latitude/longitude for each neighbourhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on= 'Neighbourhood')

# remove row with no venue data
toronto_merged = toronto_merged.dropna(subset=['1st Most Common Venue'])

# change cluster label to integer not float
toronto_merged['Cluster Labels']=toronto_merged['Cluster Labels'].astype(int)

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Dessert Shop
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Cosmetics Shop,Middle Eastern Restaurant,Fast Food Restaurant,Café,Bakery,Diner,Italian Restaurant,Ramen Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Restaurant,Clothing Store,Hotel,Italian Restaurant,Beer Bar,American Restaurant,Cosmetics Shop,Breakfast Spot
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Other Great Outdoors,Trail,Pub,Neighborhood,Health Food Store,Doner Restaurant,Dog Run,Donut Shop,Dance Studio,Discount Store
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Steakhouse,Cocktail Bar,Café,Cheese Shop,Beer Bar,Bakery,Seafood Restaurant,Farmers Market,Bistro


#### Visualize the resulting clusters

In [46]:
# create map
map_clusters = folium.Map(location=[latitude_in, longitude_in], zoom_start=12)

import matplotlib.colors as colors

# set color scheme for clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
color_array = plt.cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in color_array]

# add markers to the map
markers_color=[]
for lat, lng, hood, cluster in zip(toronto_merged['Latitude'],toronto_merged['Longitude'],toronto_merged['Neighbourhood'],toronto_merged['Cluster Labels']):
    label = folium.Popup(str(hood) + ': Cluster Label ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.6
        ).add_to(map_clusters)
    
map_clusters


Note: There appears to be one big cluster plus a few neighbourhoods that do not resemble the rest. Let's explore them a bit more.

### Examine Clusters

#### Cluster 1

In [47]:
cluster1= toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5,toronto_merged.shape[1]))]]
print('Number of cluster members', cluster1.shape[0])

# group by 1st most common venue and display the 3 most frequent ones
cluster1_group=cluster1.groupby('1st Most Common Venue').count().sort_values(by='Cluster Labels', ascending=False)
cluster1_group.head(3)


Number of cluster members 33


1st Most Common Venue,Borough,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
Coffee Shop,12,12,12,12,12,12,12,12,12,12,12
Café,5,5,5,5,5,5,5,5,5,5,5
Sandwich Place,2,2,2,2,2,2,2,2,2,2,2


Since Coffee Shop and Café are the two most frequent top venues in this cluster, let's call cluster 1 'Coffee'.

#### Cluster 2

In [48]:
cluster2= toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5,toronto_merged.shape[1]))]]
print('Number of cluster members', cluster2.shape[0])
# group by 1st most common venue and display the 3 most frequent ones
cluster2_group=cluster2.groupby('1st Most Common Venue').count().sort_values(by='Cluster Labels', ascending=False)
cluster2_group.head(3)

Number of cluster members 2


1st Most Common Venue,Borough,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
Park,2,2,2,2,2,2,2,2,2,2,2


Cluster 2 has only two members, and both have 'Park' as their most common venue, so let's call cluster 2 'Park'.

#### Cluster 3

In [49]:
cluster3= toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5,toronto_merged.shape[1]))]]
print('Number of cluster members', cluster3.shape[0])

cluster3

Number of cluster members 1


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,2,Park,Gym / Fitness Center,Swim School,Bus Line,Yoga Studio,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


Cluster 3 has only one member. Its most common venue is Park, but that label is already taken by cluster 2. The 2nd, 3rd and 5th most common venues (Gym / Fitness Center, Swim School and Yoga Studio) can be summarised as sport venues, so cluster 3 is called 'Sport'.

#### Cluster 4

In [50]:
cluster4= toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5,toronto_merged.shape[1]))]]
print('Number of cluster members', cluster4.shape[0])

cluster4

Number of cluster members 1


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,3,Health & Beauty Service,Garden,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


Cluster 4 has only one member; let's call it 'Health&Beauty' after its most common venue.

#### Cluster 5

In [51]:
cluster5= toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5,toronto_merged.shape[1]))]]
print('Number of cluster members', cluster5.shape[0])

cluster5

Number of cluster members 1


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,4,Restaurant,Playground,Intersection,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


Cluster 5 has only one member; let's call it 'Restaurant' after its most common venue.

## Summary

Using k-means clustering to group inner Toronto neighbourhoods by their venues on FourSquare revealed one large cluster of neighbourhoods whose most common venues are mainly coffee shops and cafés. This cluster contains 33 of the 38 neighbourhoods with venues on FourSquare (one neighbourhood in inner Toronto had no venues on FourSquare at all). The other 4 clusters are very small (1-2 members). On the map, the neighbourhoods in these small clusters lie further north and further away from downtown Toronto. They may therefore be more residential areas, which could explain the different mix of venues. For further analysis it would be interesting to subcluster the large cluster, either based on the 2nd to 10th most common venues or on data that excludes coffee shops and cafés.
