# Toronto Postcode/Neighbourhood Project

In this project, we explore the neighbourhoods in the city of Toronto using segmentation and clustering. Each postcode has an associated borough and neighbourhood(s). First, we parse the borough and neighbourhood data for each postcode from a table on Wikipedia. We then add coordinates for each postcode and create a map of Toronto with markers for each postcode. Finally, we use the FourSquare API to get information about venues around each postcode in Toronto and use this to cluster the postcodes (and their associated boroughs and neighbourhoods) according to this information. 


### PART 1

#### Parse the table data from the Wikipedia page and create a new dataframe

In [1]:
# Import the dependencies we need
import pandas as pd
from bs4 import BeautifulSoup
import requests

# We will use BeautifulSoup to parse the data
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find('table', class_='wikitable sortable')

# Make a list of the headers
headers = table.tbody.tr.text.split('\n')
headers = headers[1:4]

# Make a list of all the row data
rowdata = []
for record in table.findAll('tr'):
    for data in record.findAll('td'):
        rowdata.append(data.text.replace('\n',''))
        
# Make a nested list, where each inner list is the data for one row
finalrow = []
i = 0
for row in rowdata:
    finalrow.append(rowdata[i:i+3])
    i+=3

# Create a dataframe using the row data and column headers
df = pd.DataFrame(finalrow, columns=headers)

#### Clean up the dataframe

In [2]:
# Remove any rows with no data
df.dropna(inplace=True)

# Remove any rows where the Borough column has a value of 'Not assigned'
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)

# Replace any values of 'Not assigned' in the Neighbourhood column with the value provided in the Borough column
df.Neighbourhood.replace("Not assigned", df.Borough, inplace=True)

# Remove any duplicate entries
df.drop_duplicates(inplace=True)

# Groups together all the neighbourhood values for each postcode
df = df.groupby('Postcode').agg({'Borough':'first', 'Neighbourhood': ', '.join}).reset_index()

print(df.shape)
df

(103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


-------------

### PART 2

#### Add coordinates for each postcode to the dataframe

In [3]:
# Create a dataframe from the coordinates .csv file
coord_data = pd.read_csv('Geospatial_Coordinates.csv')

# Rename the column header to 'Postcode' so that it matches the header in the first dataframe
coord_data.rename(columns={"Postal Code":"Postcode"}, inplace=True)
coord_data.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [4]:
# Merge the two existing dataframes using the Postcode column to create a new dataframe
tor_data = pd.merge(df, coord_data, how='left', on=['Postcode'])

# Check to see if there are any missing values after joining dataframes
tor_data.isna().sum()

Postcode         0
Borough          0
Neighbourhood    0
Latitude         0
Longitude        0
dtype: int64

In [5]:
tor_data

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437


_______


### PART 3

#### Analyse dataset and cluster neighbourhoods

In [6]:
# Import dependencies
import json
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
from geopy.geocoders import Nominatim
import folium
from sklearn.cluster import KMeans

##### Make a map with markers for each postcode

In [7]:
# Get coordinates for Toronto
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [8]:
# Create a map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers for each postcode
for lat, lng, postcode, borough in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Postcode'], tor_data['Borough']):
    label = postcode + '\n' + borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='indigo',
        fill=True,
        fill_color='indigo',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

# Display the map
map_toronto

##### Define FourSquare credentials and version

In [9]:
CLIENT_ID = 'X1MLFRW2LIGOLQFBIW22NAEDBHBW0MNDL21JPNSR4B1014QW' # your Foursquare ID
CLIENT_SECRET = 'E3MR0RSHMI3OW5OFFD0525WUIFXIVGZRUOI32EVTSH2RYITY' # your Foursquare Secret
VERSION = '20201902' # Foursquare API version

##### Create function to produce a dataframe with all the venues for each postcode

In [10]:
def getNearbyVenues(postcodes, boroughs, neighbourhoods, latitudes, longitudes, radius=500):
    
    venues_list = []
    for postcode, borough, neighbourhood, lat, lng in zip(postcodes, boroughs, neighbourhoods, latitudes, longitudes):
        print(postcode)
        
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        
        # Create the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues_list.append([(postcode,
                             borough,
                             neighbourhood,
                             lat, 
                             lng,
                             v['venue']['name'], 
                             v['venue']['location']['lat'], 
                             v['venue']['location']['lng'], 
                             v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                             'Borough', 
                             'Neighbourhood', 
                             'Postcode Latitude', 
                             'Postcode Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
                  
    return(nearby_venues)

In [11]:
# Limiting the number of venues returned by the API to 100
limit = 100

tor_venues = getNearbyVenues(tor_data['Postcode'], tor_data['Borough'], tor_data['Neighbourhood'], tor_data['Latitude'], tor_data['Longitude'])

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


In [12]:
# Check the size of the resulting dataframe
print(tor_venues.shape)

tor_venues.head(10)

(2228, 9)


Unnamed: 0,Postcode,Borough,Neighbourhood,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa
5,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant
6,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Enterprise Rent-A-Car,43.764076,-79.193406,Rental Car Location
7,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Wood Floor Polishing Inc,43.7665,-79.185207,Moving Target
8,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Woburn Medical Centre,43.766631,-79.192286,Medical Center
9,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Lawrence Ave E & Kingston Rd,43.767704,-79.18949,Intersection


In [13]:
# Number of venues returned for each postcode
tor_venues.groupby('Postcode').count()

Unnamed: 0_level_0,Borough,Neighbourhood,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M1B,1,1,1,1,1,1,1,1
M1C,1,1,1,1,1,1,1,1
M1E,9,9,9,9,9,9,9,9
M1G,3,3,3,3,3,3,3,3
M1H,8,8,8,8,8,8,8,8
...,...,...,...,...,...,...,...,...
M9N,1,1,1,1,1,1,1,1
M9P,8,8,8,8,8,8,8,8
M9R,4,4,4,4,4,4,4,4
M9V,9,9,9,9,9,9,9,9


In [14]:
print('There are {} unique categories.'.format(len(tor_venues['Venue Category'].unique())))

There are 272 unique categories.


##### Analyse each postcode

In [15]:
tor_onehot = pd.get_dummies(tor_venues[['Venue Category']], prefix="", prefix_sep="")

tor_onehot['Postcode'] = tor_venues['Postcode']
tor_onehot['Borough'] = tor_venues['Borough']
tor_onehot['Neighbourhood'] = tor_venues['Neighbourhood']

fixed_columns = [tor_onehot.columns[-3]] + [tor_onehot.columns[-2]] + [tor_onehot.columns[-1]] + list(tor_onehot.columns[:-3])
tor_onehot = tor_onehot[fixed_columns]

print(tor_onehot.shape)
tor_onehot.head()

(2228, 275)


Unnamed: 0,Postcode,Borough,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,Scarborough,"Rouge, Malvern",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,Scarborough,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,Scarborough,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### Rows are grouped by postcode and the mean of the frequency of occurrence of each category is shown

In [16]:
tor_grouped = tor_onehot.groupby('Postcode').mean().reset_index()
tor_grouped

Unnamed: 0,Postcode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### I printed the top 5 most common venues for each postcode

In [17]:
num_top_venues = 5

for hood in tor_grouped['Postcode']:
    print("----"+hood+"----")
    temp = tor_grouped[tor_grouped['Postcode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                             venue  freq
0             Fast Food Restaurant   1.0
1               Mexican Restaurant   0.0
2              Monument / Landmark   0.0
3  Molecular Gastronomy Restaurant   0.0
4       Modern European Restaurant   0.0


----M1C----
                             venue  freq
0                              Bar   1.0
1                Accessories Store   0.0
2        Middle Eastern Restaurant   0.0
3              Monument / Landmark   0.0
4  Molecular Gastronomy Restaurant   0.0


----M1E----
                 venue  freq
0          Pizza Place  0.11
1  Rental Car Location  0.11
2       Breakfast Spot  0.11
3                  Spa  0.11
4       Medical Center  0.11


----M1G----
                             venue  freq
0                      Coffee Shop  0.67
1                Korean Restaurant  0.33
2                Accessories Store  0.00
3              Monument / Landmark  0.00
4  Molecular Gastronomy Restaurant  0.00


----M1H----
                  ve

### Make a dataframe showing the most common type of venues for each postcode

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [19]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Postcode']

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
postcode_venues_sorted = pd.DataFrame(columns=columns)

postcode_venues_sorted['Postcode'] = tor_grouped['Postcode']
# postcode_venues_sorted['Borough'] = tor_grouped['Borough']
# postcode_venues_sorted['Neighbourhood'] = tor_grouped['Neighbourhood']

for ind in np.arange(tor_grouped.shape[0]):
    postcode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tor_grouped.iloc[ind, 1:], num_top_venues)

postcode_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dessert Shop
1,M1C,Bar,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
2,M1E,Moving Target,Breakfast Spot,Electronics Store,Pizza Place,Spa,Rental Car Location,Mexican Restaurant,Intersection,Medical Center,Yoga Studio
3,M1G,Coffee Shop,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Electronics Store
4,M1H,Hakka Restaurant,Bakery,Caribbean Restaurant,Gas Station,Thai Restaurant,Athletics & Sports,Fried Chicken Joint,Bank,Dumpling Restaurant,Drugstore


### Clustering

In [20]:
k_clusters = 5

tor_grouped_clustering = tor_grouped.drop('Postcode', axis=1)

kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(tor_grouped_clustering)

kmeans.labels_[0:10] 

array([1, 2, 2, 2, 2, 3, 2, 2, 2, 2], dtype=int32)

In [21]:
# add clustering labels
postcode_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [22]:
tor_merged = tor_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
tor_merged = tor_merged.join(postcode_venues_sorted.set_index('Postcode'), on='Postcode')

#tor_merged = tor_merged.astype({'Cluster Labels':int})

tor_merged.head() # check the last columns!
# tor_merged.isna().sum()

tor_merged = tor_merged.dropna(0)
tor_merged['Cluster Labels'] = tor_merged['Cluster Labels'].astype(int)
tor_merged.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Fast Food Restaurant,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dessert Shop
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,2,Bar,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,2,Moving Target,Breakfast Spot,Electronics Store,Pizza Place,Spa,Rental Car Location,Mexican Restaurant,Intersection,Medical Center,Yoga Studio
3,M1G,Scarborough,Woburn,43.770992,-79.216917,2,Coffee Shop,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Electronics Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,2,Hakka Restaurant,Bakery,Caribbean Restaurant,Gas Station,Thai Restaurant,Athletics & Sports,Fried Chicken Joint,Bank,Dumpling Restaurant,Drugstore
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,3,Playground,Convenience Store,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,2,Discount Store,Department Store,Coffee Shop,Bus Station,Convenience Store,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,2,Bakery,Bus Line,Ice Cream Shop,Metro Station,Bus Station,Soccer Field,Park,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,2,Motel,American Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dessert Shop,Dumpling Restaurant
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,2,College Stadium,Skating Rink,General Entertainment,Café,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant


In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_merged['Latitude'], tor_merged['Longitude'], tor_merged['Postcode'], tor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Breakdown of the most common venues in each cluster

##### Cluster 1

In [24]:
c1 = tor_merged.loc[tor_merged['Cluster Labels']==0, tor_merged.columns[[0] + list(range(6, tor_merged.shape[1]))]]
c1

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M1V,Park,Playground,Yoga Studio,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
23,M2P,Park,Bank,Bar,Convenience Store,Yoga Studio,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
25,M3A,Food & Drink Shop,Park,Construction & Landscaping,Yoga Studio,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
30,M3K,Airport,Park,Snack Place,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
31,M3L,Grocery Store,Park,Bank,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
40,M4J,Park,Convenience Store,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
44,M4N,Park,Swim School,Bus Line,Yoga Studio,Donut Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore
50,M4W,Park,Trail,Playground,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
64,M5P,Park,Trail,Sushi Restaurant,Jewelry Store,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
74,M6E,Park,Market,Pool,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant


##### Cluster 2

In [25]:
c2 = tor_merged.loc[tor_merged['Cluster Labels']==1, tor_merged.columns[[0] + list(range(6, tor_merged.shape[1]))]]
c2

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dessert Shop
80,M6M,Fast Food Restaurant,Restaurant,Sandwich Place,Discount Store,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Dog Run,Donut Shop


##### Cluster 3

In [36]:
c3 = tor_merged.loc[tor_merged['Cluster Labels']==2, tor_merged.columns[[0] + list(range(6, tor_merged.shape[1]))]]
c3.head(50)

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M1C,Bar,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
2,M1E,Moving Target,Breakfast Spot,Electronics Store,Pizza Place,Spa,Rental Car Location,Mexican Restaurant,Intersection,Medical Center,Yoga Studio
3,M1G,Coffee Shop,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Electronics Store
4,M1H,Hakka Restaurant,Bakery,Caribbean Restaurant,Gas Station,Thai Restaurant,Athletics & Sports,Fried Chicken Joint,Bank,Dumpling Restaurant,Drugstore
6,M1K,Discount Store,Department Store,Coffee Shop,Bus Station,Convenience Store,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
7,M1L,Bakery,Bus Line,Ice Cream Shop,Metro Station,Bus Station,Soccer Field,Park,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
8,M1M,Motel,American Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dessert Shop,Dumpling Restaurant
9,M1N,College Stadium,Skating Rink,General Entertainment,Café,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
10,M1P,Indian Restaurant,Vietnamese Restaurant,Brewery,Chinese Restaurant,Pet Store,Comic Shop,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Colombian Restaurant
11,M1R,Breakfast Spot,Shopping Mall,Auto Garage,Middle Eastern Restaurant,Sandwich Place,Bakery,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Diner


##### Cluster 4

In [27]:
c4 = tor_merged.loc[tor_merged['Cluster Labels']==3, tor_merged.columns[[0] + list(range(6, tor_merged.shape[1]))]]
c4

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,M1J,Playground,Convenience Store,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
98,M9N,Convenience Store,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant


##### Cluster 5

In [28]:
c5 = tor_merged.loc[tor_merged['Cluster Labels']==4, tor_merged.columns[[0] + list(range(6, tor_merged.shape[1]))]]
c5.head(50)

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,M8Y,Baseball Field,Locksmith,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
97,M9M,Baseball Field,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant


#### To observe the frequency each category appears in each cluster to understand how each cluster can be labeled, I created the following function:
(this does not take into account how frequently each category is)

In [29]:
def cat_freq(cluster_name):
    columns = cluster_name.columns[1:]
    cluster_venues = cluster_name[columns]
    categories = {}
    
    for a in range(0, cluster_venues.shape[0]):
        for b in range(0, cluster_venues.shape[1]):
            if cluster_venues.iloc[a,b] in categories:
                categories[cluster_venues.iloc[a,b]]+=1
            else:
                categories[cluster_venues.iloc[a,b]]=1
    
    cat_sorted = {k: v for k, v in sorted(categories.items(), key=lambda item: item[1], reverse=True)}
    cat_df = pd.DataFrame.from_dict(cat_sorted, orient='index')
    cat_df.columns=['Frequency']
    return(cat_df.reset_index().rename(columns={'index':'Venue Category'}))

In [30]:
cat_freq(c1)

Unnamed: 0,Venue Category,Frequency
0,Park,13
1,Dog Run,12
2,Doner Restaurant,12
3,Donut Shop,12
4,Discount Store,11
5,Drugstore,10
6,Diner,9
7,Yoga Studio,8
8,Dumpling Restaurant,5
9,Dim Sum Restaurant,4


In [31]:
cat_freq(c2)

Unnamed: 0,Venue Category,Frequency
0,Fast Food Restaurant,2
1,Dim Sum Restaurant,2
2,Diner,2
3,Discount Store,2
4,Dog Run,2
5,Doner Restaurant,2
6,Donut Shop,2
7,Yoga Studio,2
8,Drugstore,1
9,Dessert Shop,1


In [32]:
cat_freq(c3)

Unnamed: 0,Venue Category,Frequency
0,Coffee Shop,46
1,Café,30
2,Restaurant,26
3,Dog Run,22
4,Discount Store,21
...,...,...
167,Garden Center,1
168,Hardware Store,1
169,Kids Store,1
170,Shopping Plaza,1


In [33]:
cat_freq(c4)

Unnamed: 0,Venue Category,Frequency
0,Convenience Store,2
1,Yoga Studio,2
2,Dumpling Restaurant,2
3,Diner,2
4,Discount Store,2
5,Dog Run,2
6,Doner Restaurant,2
7,Donut Shop,2
8,Drugstore,2
9,Playground,1


In [34]:
cat_freq(c5)

Unnamed: 0,Venue Category,Frequency
0,Baseball Field,2
1,Yoga Studio,2
2,Dumpling Restaurant,2
3,Discount Store,2
4,Dog Run,2
5,Doner Restaurant,2
6,Donut Shop,2
7,Drugstore,2
8,Locksmith,1
9,Eastern European Restaurant,1


In [35]:
# # To join all the dataframes together:
# total_category = pd.merge(pd.merge(cat_freq(c1),cat_freq(c2),on='Venue Category', how='outer'), cat_freq(c3),on='Venue Category', how='outer')
# total_category = pd.merge(pd.merge(total_category,cat_freq(c4), on='Venue Category', how='outer'), cat_freq(c5),on='Venue Category', how='outer')
# total_category.fillna(0, inplace=True)
# total_category.iloc[:,1:] = total_category.iloc[:,1:].astype(int)
# total_category.columns=['Venue Category','C1 Freq', 'C2 Freq', 'C3 Freq', 'C4 Freq', 'C5 Freq']
# total_category

-----

### Conclusions

##### Analysing the frequency of each venue category in each cluster, in combination with the top 10 categories in each cluster, allows us to observe the main feature of each cluster.


Cluster 1 -- parks feature prominently in the top two most common categories

Cluster 2 -- fast food restaurants are the most common category in this cluster

Cluster 3 -- more difficult to see a clear trend in the most common categories, but coffee shops and cafes are the most frequently occurring categories in the top 10

Cluster 4 -- convenience store, yoga studio and dumpling restaurant all feature hihgly in the most common categories

Cluster 5 -- baseball fields are the most common category in this cluster 