# Segmentation and clustering of select Boroughs in Toronto<br>

> ## IBM Data Science Specialization - Coursera

This notebook contains code to scrape Wikipedia for the Postcodes, Boroughs and Neighbourhoods of Toronto.<br>

This data, in HTML format, is extracted into a Pandas DataFrame for segmentation and clustering.<br>

Each cell below contains notes for each step in the process!<br>

Enjoy!

In [1]:
# Importing relevant libraries.
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests

### Collecting the data from Wikipedia.

In [2]:
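# Assigning the URL and collecting the raw HTML data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_doc = requests.get(url=url)
# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup = BeautifulSoup(html_doc.content, 'html.parser')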

### Extracting the Table data into Pandas Dataframe

In [3]:
# Selects the first table on the page
table = soup.table
# Selects the row data from the table data
table_rows = table.find_all('tr')
# Lists all cell data for each row
row = []
for tr in table_rows:
    td = tr.find_all('td')
    row.append([i.text for i in td])
# Selects headings from the raw HTML data
headers = table.tr.text.split()
# Builds the dataframe, using the selected headings and extracted row data
# (row[0] comes from the header row, which has no <td> cells, so it is skipped)
df_toronto = pd.DataFrame(data=row[1:], columns=headers)
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
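
The trailing `\n` visible in the `Neighbourhood` column comes from the cell text itself. As a minimal sketch (the inline HTML below is a hand-written stand-in for the Wikipedia table, not the real page), the same extraction can strip whitespace while reading each cell and skip rows that carry no `<td>` cells:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the scraped table; the real page is much larger.
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header names come from the <th> cells of the table.
headers = [th.get_text(strip=True) for th in soup.table.find_all("th")]

# Data rows: strip whitespace per cell and skip rows without <td> cells.
rows = []
for tr in soup.table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(headers)
print(rows)
```

Stripping at extraction time means no later clean-up step is needed for the trailing newline.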


### Wrangle data

In [4]:
# Removes the rows without usable data in the 'Borough' column and resets the index
index_places = df_toronto[df_toronto['Borough']=='Not assigned'].index
df_toronto.drop(index_places, inplace=True)
df_toronto.reset_index(drop=True, inplace=True)

# Removes the trailing newline character from the "Neighbourhood" column
df_toronto["Neighbourhood"] = df_toronto["Neighbourhood"].str.rstrip("\n")

# Replaces any unusable data in the 'Neighbourhood' column with the Borough name
not_assigned = df_toronto['Neighbourhood'] == 'Not assigned'
df_toronto.loc[not_assigned, 'Neighbourhood'] = df_toronto.loc[not_assigned, 'Borough']

# Consolidates Neighbourhoods into a single row where their Postcodes and Boroughs are the same
df_cleaned = df_toronto.groupby("Postcode").agg({'Borough': 'first',
                                             'Neighbourhood': ', '.join}).reset_index()
df_cleaned.head(30), df_cleaned.shape


(   Postcode      Borough                                      Neighbourhood
 0       M1B  Scarborough                                     Rouge, Malvern
 1       M1C  Scarborough             Highland Creek, Rouge Hill, Port Union
 2       M1E  Scarborough                  Guildwood, Morningside, West Hill
 3       M1G  Scarborough                                             Woburn
 4       M1H  Scarborough                                          Cedarbrae
 5       M1J  Scarborough                                Scarborough Village
 6       M1K  Scarborough        East Birchmount Park, Ionview, Kennedy Park
 7       M1L  Scarborough                    Clairlea, Golden Mile, Oakridge
 8       M1M  Scarborough    Cliffcrest, Cliffside, Scarborough Village West
 9       M1N  Scarborough                        Birch Cliff, Cliffside West
 10      M1P  Scarborough  Dorset Park, Scarborough Town Centre, Wexford ...
 11      M1R  Scarborough                                  Maryvale, Wexford
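
The three wrangling steps above can be checked on a toy table. This is only a sketch: the rows are hand-made, and `Z9Z`, `Q0Q` and `Some Borough` are made-up values used to exercise the "Not assigned" cases.

```python
import pandas as pd

# Hand-made rows in the same shape as the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G", "Z9Z", "Q0Q"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough",
                "Some Borough", "Not assigned"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn",
                      "Not assigned", "Not assigned"],
})

# 1. Drop rows that have no borough at all.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where only the neighbourhood is missing, fall back to the borough name.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

# 3. One row per postcode, neighbourhoods joined with ", ".
merged = toy.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(merged)
```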

### Adding the longitude and latitude coordinates for each Neighbourhood 

In [5]:
# Import the geocoding as well as geopy libraries
import geocoder
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

#### Attempt 1 - Retrieving coordinates (the code below did not work for me)

In [6]:
# initialize your variable to None
lat_lng_coords = None

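# loop until you get the coordinates
# NOTE: the only way out of this loop is a successful lookup. The query below
# formats the whole 'Postcode' column instead of a single postcode and no API
# key is supplied, so apparently no coordinates ever came back and the cell
# had to be stopped by hand (see the KeyboardInterrupt below).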
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(df_toronto['Postcode']))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

KeyboardInterrupt: 
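
One way to keep a lookup like this from hanging is to cap the number of attempts. The sketch below is library-agnostic: `geocode_with_retry` and `fake_lookup` are hypothetical names, and `fake_lookup` only stands in for a real geocoder call so the example runs offline.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, wait_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords:  # None and [] both count as "no answer"
            return coords
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    return None


# Stand-in for a real geocoder: knows a single postcode (value from the CSV used later).
def fake_lookup(query):
    known = {"M1B, Toronto, Ontario": [43.806686, -79.194353]}
    return known.get(query)


found = geocode_with_retry(fake_lookup, "M1B, Toronto, Ontario")
missing = geocode_with_retry(fake_lookup, "Z9Z, Toronto, Ontario")
print(found, missing)
```

Each postcode would be passed to the helper one at a time, and a `None` result can then be handled explicitly instead of looping forever.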

#### Attempt 2 - Using the provided geocodes CSV file to store coordinates

In [7]:
# Opening the provided CSV file into a dataframe
lonlat = pd.read_csv('Geospatial_Coordinates.csv')
lonlat.rename(columns={'Postal Code' : 'Postcode'}, inplace=True)
lonlat.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# Combine the dataframes so that each postcode gets its corresponding coordinates
df_complete = pd.merge(df_cleaned, lonlat, on=['Postcode'])
df_complete.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
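
`pd.merge` performs an inner join by default, so a postcode missing from either dataframe is dropped without any message. A small sketch of how this could be checked (the `Z9Z` row is made up to force a mismatch):

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "Z9Z"],  # Z9Z is a made-up postcode
    "Borough": ["Scarborough", "Scarborough", "Made-up"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Default how="inner": the unmatched postcode disappears.
inner = pd.merge(left, right, on="Postcode")

# how="left" with indicator=True keeps every left row and flags its match status.
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()

print(len(inner), unmatched)
```

An empty `unmatched` list would confirm that every cleaned postcode received coordinates.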


In [9]:
# Counting the total number of postcodes and Boroughs
print('The dataframe has {} postcodes and {} boroughs.'.format(
        len(df_complete['Postcode'].unique()),
        len(df_complete['Borough'].unique())
    )
)
df_complete['Borough'].unique()

The dataframe has 103 postcodes and 10 boroughs.


array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       'Mississauga', 'Etobicoke'], dtype=object)

In [11]:
# Finding the geo coordinates for Toronto
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [12]:
# Mapping all the locations
import folium


map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, Neighbourhood, Borough, Postcode in zip(df_complete['Latitude'], df_complete['Longitude'], df_complete['Neighbourhood'], df_complete['Borough'], df_complete['Postcode']):
    label = '{}, {}, {}'.format(Neighbourhood, Borough, Postcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Filtering Boroughs which contain the word 'Toronto'

In [13]:
# Filtering for Boroughs with the word Toronto
df_complete = df_complete[df_complete['Borough'].str.contains('Toronto')]
df_complete.reset_index(drop=True, inplace=True)
df_complete.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
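
To see which boroughs survive the filter, the same `str.contains` test can be applied to the list of ten borough names printed earlier. A short sketch (`regex=False` is an addition here so that the pattern is treated as plain text):

```python
import pandas as pd

# Borough names as printed by df_complete['Borough'].unique() above.
boroughs = pd.Series([
    "Scarborough", "North York", "East York", "East Toronto",
    "Central Toronto", "Downtown Toronto", "York", "West Toronto",
    "Mississauga", "Etobicoke",
])

mask = boroughs.str.contains("Toronto", regex=False)
kept = boroughs[mask].tolist()
print(kept)
```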


In [14]:
# Mapping the filtered data
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, Neighbourhood, Borough, Postcode in zip(df_complete['Latitude'], df_complete['Longitude'], df_complete['Neighbourhood'], df_complete['Borough'], df_complete['Postcode']):
    label = '{}, {}, {}'.format(Neighbourhood, Borough, Postcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Using the Foursquare API to explore the first Postcode in our dataframe

In [15]:
# Defining Foursquare credentials and version
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
radius = 600
LIMIT=100
print('Your credentials have been captured!')

Your credentials have been captured!


In [16]:
print(df_complete.loc[0, 'Neighbourhood'],', ', df_complete.loc[0, 'Borough'],', ',df_complete.loc[0, 'Postcode'],'\n')
post = df_complete.loc[0, 'Postcode']
post_lat = df_complete.loc[0, 'Latitude']
post_long = df_complete.loc[0, 'Longitude']
print('Latitude and longitude values of {} are {}, {}.\n'.format(post, post_lat, post_long))

The Beaches ,  East Toronto ,  M4E 

Latitude and longitude values of M4E are 43.67635739999999, -79.2930312.



In [17]:
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID, CLIENT_SECRET, VERSION, post_lat, post_long, radius, LIMIT)
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e584920c94979001bf02a7a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 19,
  'suggestedBounds': {'ne': {'lat': 43.6817574054, 'lng': -79.28557885738863},
   'sw': {'lat': 43.67095739459999, 'lng': -79.30048354261137}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'dis
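
The request URL above is assembled with one long `format` string. As an alternative sketch, the query string can be built from a dictionary so that each parameter is named and URL-encoded. `build_explore_url` is a hypothetical helper, and `"ID"` and `"SECRET"` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore endpoint URL with an encoded query string."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{base}?{urlencode(params)}"


url = build_explore_url("ID", "SECRET", "20180605", 43.676357, -79.293031, 600, 100)
print(url)
```

With `requests`, the same dictionary can be passed directly as `requests.get(base, params=params)`, which performs the encoding itself.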

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened Foursquare response names this column 'venue.categories'
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [19]:
import json # library to handle JSON files
venues = results['response']['groups'][0]['items']

# flatten JSON into a dataframe; pd.json_normalize (pandas 1.0 and later)
# replaces the deprecated pandas.io.json.json_normalize import
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,Glen Stewart Ravine,Other Great Outdoors,43.6763,-79.294784
2,Tori's Bakeshop,Vegetarian / Vegan Restaurant,43.672114,-79.290331
3,Beaches Bake Shop,Bakery,43.680363,-79.289692
4,The Beech Tree,Gastropub,43.680493,-79.288846
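
The flattening step can also be sketched without pandas, which makes the handling of venues with no category explicit. `flatten_venue` is a hypothetical helper and the sample items are hand-written: the first mirrors the start of the response shown above, the second is made up to exercise the empty-category case.

```python
def flatten_venue(item):
    """Pick name, first category and coordinates out of one response item."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    return {
        "name": venue["name"],
        "category": categories[0]["name"] if categories else None,
        "lat": venue["location"]["lat"],
        "lng": venue["location"]["lng"],
    }


sample_items = [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.67682094413784, "lng": -79.29394208780985},
               "categories": [{"name": "Trail"}]}},
    # Made-up entry with no categories.
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": []}},
]

rows = [flatten_venue(item) for item in sample_items]
print(rows)
```

Indexing `categories[0]` directly on the second item would raise an `IndexError`, which is why the empty list is checked first.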


## Exploring select Boroughs in Toronto

In [20]:
# Function that fetches nearby venues for each postcode and stitches the Postcode, Neighbourhood, Latitude and Longitude onto every venue row
def getNearbyVenues(names, hoods, latitudes, longitudes, radius=600):
    
    venues_list=[]
    for name, hood, lat, lng in zip(names, hoods, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            hood,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                             'Neighbourhood',
                             'Postcode Latitude',
                             'Postcode Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    
    return nearby_venues

In [21]:
# Running function with parameters for building the dataframe containing venue info
Tor_venues = getNearbyVenues(names=df_complete['Postcode'],
                             hoods=df_complete['Neighbourhood'],
                             latitudes=df_complete['Latitude'],
                             longitudes=df_complete['Longitude'])

M4E
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6G
M6H
M6J
M6K
M6P
M6R
M6S
M7A
M7Y


In [22]:
# Checking the resultant data
print(Tor_venues.shape)
Tor_venues.head()

(2147, 8)


Unnamed: 0,Postcode,Neighbourhood,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,The Beaches,43.676357,-79.293031,Glen Stewart Ravine,43.6763,-79.294784,Other Great Outdoors
2,M4E,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant
3,M4E,The Beaches,43.676357,-79.293031,Beaches Bake Shop,43.680363,-79.289692,Bakery
4,M4E,The Beaches,43.676357,-79.293031,The Beech Tree,43.680493,-79.288846,Gastropub


In [23]:
# Number of venues per Postcode
Tor_venues.groupby(['Postcode', 'Neighbourhood']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Postcode,Neighbourhood,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
M4E,The Beaches,19,19,19,19,19,19
M4K,"The Danforth West, Riverdale",66,66,66,66,66,66
M4L,"The Beaches West, India Bazaar",33,33,33,33,33,33
M4M,Studio District,68,68,68,68,68,68
M4N,Lawrence Park,4,4,4,4,4,4
M4P,Davisville North,12,12,12,12,12,12
M4R,North Toronto West,29,29,29,29,29,29
M4S,Davisville,44,44,44,44,44,44
M4T,"Moore Park, Summerhill East",4,4,4,4,4,4
M4V,"Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West",54,54,54,54,54,54


In [24]:
print('There are {} unique categories.'.format(len(Tor_venues['Venue Category'].unique())))

There are 251 unique categories.


### Analyze Each Neighborhood

In [25]:
# one hot encoding
tor_onehot = pd.get_dummies(Tor_venues[['Venue Category']], prefix="", prefix_sep="")

# add postcode and Neighbourhood columns back to dataframe
tor_onehot['Postcode'] = Tor_venues['Postcode']
tor_onehot['Neighbourhood'] = Tor_venues['Neighbourhood']

# move the Postcode and Neighbourhood columns to the front
fixed_columns = [tor_onehot.columns[-2]] +[tor_onehot.columns[-1]]+ list(tor_onehot.columns[:-2])
tor_onehot = tor_onehot[fixed_columns]

#Checking dataframe size
print(tor_onehot.shape)
tor_onehot.head()

(2147, 253)


Unnamed: 0,Postcode,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Tram Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M4E,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M4E,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4E,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,M4E,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4E,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by postcode and neighbourhood, taking the mean of the frequency of occurrence of each category

In [26]:
tor_grouped = tor_onehot.groupby(['Postcode', 'Neighbourhood']).mean().reset_index()
tor_grouped

Unnamed: 0,Postcode,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Tram Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M4E,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,"The Danforth West, Riverdale",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152
2,M4L,"The Beaches West, India Bazaar",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,Studio District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044118,...,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.014706,0.0,0.014706
4,M4N,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,North Toronto West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483
7,M4S,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022727
8,M4T,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018519,0.0,0.0,0.018519


### Print each neighborhood along with the top 5 most common venues

In [27]:
num_top_venues = 5

for hood in tor_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = tor_grouped[tor_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----The Beaches----
         venue  freq
0       Bakery  0.11
1          Pub  0.11
2    Gastropub  0.05
3        Trail  0.05
4  Cheese Shop  0.05


----The Danforth West, Riverdale----
                venue  freq
0    Greek Restaurant  0.14
1         Coffee Shop  0.06
2                 Pub  0.06
3  Italian Restaurant  0.05
4                Café  0.05


----The Beaches West, India Bazaar----
            venue  freq
0  Sandwich Place  0.09
1             Gym  0.06
2            Park  0.06
3     Pizza Place  0.06
4            Café  0.06


----Studio District----
                venue  freq
0                Café  0.07
1                 Bar  0.06
2              Bakery  0.04
3         Coffee Shop  0.04
4  Italian Restaurant  0.04


----Lawrence Park----
                  venue  freq
0           Swim School  0.25
1                  Park  0.25
2  Gym / Fitness Center  0.25
3              Bus Line  0.25
4                Museum  0.00


----Davisville North----
               venue  freq
0     Sand

### Building a dataframe displaying the top 10 venues per postcode

In [28]:
# A function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third have no entry in `indicators` and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe

neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)

neighbourhoods_venues_sorted['Neighbourhood'] = tor_grouped['Neighbourhood']

for ind in np.arange(tor_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tor_grouped.iloc[ind, 1:], num_top_venues)

neighbourhoods_venues_sorted['Postcode'] = tor_grouped['Postcode']
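
The `indicators` list with an `except` fallback works for the ten columns used here, but it would label a 21st column as "21th". If more columns were ever needed, a small helper along these lines could generate the suffixes (`ordinal` is a hypothetical name, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], columns[2], columns[3], columns[10])
```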

In [30]:
# move the Postcode column to the first column
fixed_columns = [neighbourhoods_venues_sorted.columns[-1]] + list(neighbourhoods_venues_sorted.columns[:-1])
neighbourhoods_venues_sorted = neighbourhoods_venues_sorted[fixed_columns]
neighbourhoods_venues_sorted.head()

Unnamed: 0,Postcode,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,The Beaches,Pub,Bakery,Health Food Store,Mexican Restaurant,Cheese Shop,Indian Restaurant,Ice Cream Shop,Ramen Restaurant,Breakfast Spot,French Restaurant
1,M4K,"The Danforth West, Riverdale",Greek Restaurant,Pub,Coffee Shop,Italian Restaurant,Café,Bookstore,Grocery Store,Ice Cream Shop,Restaurant,Fruit & Vegetable Store
2,M4L,"The Beaches West, India Bazaar",Sandwich Place,Café,Gym,Pizza Place,Park,Burrito Place,Liquor Store,Brewery,Restaurant,Farmers Market
3,M4M,Studio District,Café,Bar,American Restaurant,Diner,Bakery,Italian Restaurant,Coffee Shop,Sandwich Place,Brewery,Vietnamese Restaurant
4,M4N,Lawrence Park,Gym / Fitness Center,Park,Swim School,Bus Line,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
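
For The Beaches, the printed top-5 list shows Bakery before Pub while this table shows Pub before Bakery. Both are shown with a frequency of 0.11, so the order of tied categories depends on the sort. A sketch of one way to make ties deterministic is to break them alphabetically (`top_categories` is a hypothetical helper, and the frequencies are the rounded values from the printed output, not the exact ones):

```python
import pandas as pd


def top_categories(freq_row, n):
    """Return the n most frequent categories, ties broken alphabetically."""
    table = freq_row.rename_axis("category").reset_index(name="freq")
    table = table.sort_values(["freq", "category"], ascending=[False, True])
    return table["category"].head(n).tolist()


# Rounded frequencies for The Beaches, taken from the printed output.
beaches = pd.Series({
    "Pub": 0.11, "Bakery": 0.11, "Trail": 0.05,
    "Gastropub": 0.05, "Cheese Shop": 0.05, "Museum": 0.0,
})

top3 = top_categories(beaches, 3)
print(top3)
```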


### Clustering Postcode data


In [31]:
from sklearn.cluster import KMeans

# set the number of clusters to 6
kclusters = 6


tor_grouped_clustering = tor_grouped.drop(columns=['Postcode', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:39]


array([3, 3, 3, 3, 0, 3, 3, 3, 4, 5, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       3, 2, 3, 3, 3, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 5, 3])
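
Before examining the clusters one by one, it helps to tally how many postcodes landed in each. The sketch below simply counts the label array printed above, copied by hand:

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [3, 3, 3, 3, 0, 3, 3, 3, 4, 5, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
          3, 2, 3, 3, 3, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 5, 3]

sizes = Counter(labels)
singletons = sorted(label for label, count in sizes.items() if count == 1)

print(sorted(sizes.items()))  # [(0, 1), (1, 1), (2, 1), (3, 20), (4, 1), (5, 15)]
print(singletons)             # [0, 1, 2, 4]
```

Two clusters hold 35 of the 39 postcodes and the other four contain a single postcode each, which is worth keeping in mind when reading the per-cluster tables.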

In [32]:
# Adding cluster values to corresponding postcodes
neighbourhoods_venues_sorted.drop('Neighbourhood', axis=1, inplace= True)
neighbourhoods_venues_sorted.insert(0, "Cluster Labels", kmeans.labels_)
df_clustered = pd.merge(df_complete,neighbourhoods_venues_sorted, on="Postcode")
df_clustered.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Pub,Bakery,Health Food Store,Mexican Restaurant,Cheese Shop,Indian Restaurant,Ice Cream Shop,Ramen Restaurant,Breakfast Spot,French Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,3,Greek Restaurant,Pub,Coffee Shop,Italian Restaurant,Café,Bookstore,Grocery Store,Ice Cream Shop,Restaurant,Fruit & Vegetable Store
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,3,Sandwich Place,Café,Gym,Pizza Place,Park,Burrito Place,Liquor Store,Brewery,Restaurant,Farmers Market
3,M4M,East Toronto,Studio District,43.659526,-79.340923,3,Café,Bar,American Restaurant,Diner,Bakery,Italian Restaurant,Coffee Shop,Sandwich Place,Brewery,Vietnamese Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Gym / Fitness Center,Park,Swim School,Bus Line,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [33]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one colour per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Neighbourhood'], df_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

#### Cluster 1

In [34]:
df_clustered.loc[df_clustered['Cluster Labels'] == 0, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,0,Gym / Fitness Center,Park,Swim School,Bus Line,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


#### Cluster 2

In [35]:
df_clustered.loc[df_clustered['Cluster Labels'] == 1, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Downtown Toronto,1,Park,Playground,Trail,Donut Shop,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Dumpling Restaurant


#### Cluster 3

In [36]:
df_clustered.loc[df_clustered['Cluster Labels'] == 2, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,2,Sushi Restaurant,Park,Jewelry Store,Trail,Donut Shop,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Yoga Studio


#### Cluster 4

In [37]:
df_clustered.loc[df_clustered['Cluster Labels'] == 3, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,3,Pub,Bakery,Health Food Store,Mexican Restaurant,Cheese Shop,Indian Restaurant,Ice Cream Shop,Ramen Restaurant,Breakfast Spot,French Restaurant
1,East Toronto,3,Greek Restaurant,Pub,Coffee Shop,Italian Restaurant,Café,Bookstore,Grocery Store,Ice Cream Shop,Restaurant,Fruit & Vegetable Store
2,East Toronto,3,Sandwich Place,Café,Gym,Pizza Place,Park,Burrito Place,Liquor Store,Brewery,Restaurant,Farmers Market
3,East Toronto,3,Café,Bar,American Restaurant,Diner,Bakery,Italian Restaurant,Coffee Shop,Sandwich Place,Brewery,Vietnamese Restaurant
5,Central Toronto,3,Pharmacy,Hotel,Brewery,Café,Sushi Restaurant,Sandwich Place,Food & Drink Shop,Breakfast Spot,Gym,Park
6,Central Toronto,3,Clothing Store,Sporting Goods Shop,Café,Coffee Shop,Yoga Studio,Gym / Fitness Center,Salon / Barbershop,Restaurant,Rental Car Location,Pizza Place
7,Central Toronto,3,Pizza Place,Sandwich Place,Café,Italian Restaurant,Dessert Shop,Gym,Indian Restaurant,Coffee Shop,Sushi Restaurant,Yoga Studio
11,Downtown Toronto,3,Park,Coffee Shop,Restaurant,Pizza Place,Gastropub,Italian Restaurant,Bakery,Japanese Restaurant,Pub,Café
22,Central Toronto,3,Playground,Music Venue,Garden,Business Service,Home Service,Spa,Dumpling Restaurant,Donut Shop,Doner Restaurant,Eastern European Restaurant
24,Central Toronto,3,Sandwich Place,Café,Park,Pub,Coffee Shop,Liquor Store,Historic Site,Modern European Restaurant,Jewish Restaurant,Indian Restaurant


#### Cluster 5

In [38]:
df_clustered.loc[df_clustered['Cluster Labels'] == 4, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Central Toronto,4,Park,Gym,Tennis Court,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 6

In [39]:
df_clustered.loc[df_clustered['Cluster Labels'] == 5, df_clustered.columns[[1] + list(range(5, df_clustered.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Central Toronto,5,Coffee Shop,Italian Restaurant,Restaurant,Sushi Restaurant,Grocery Store,Pizza Place,Bagel Shop,Pub,Thai Restaurant,Gym
12,Downtown Toronto,5,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Burger Joint,Café,Juice Bar,Gastropub,Grocery Store
13,Downtown Toronto,5,Coffee Shop,Bakery,Pub,Park,Theater,Mexican Restaurant,Breakfast Spot,Café,Beer Store,Shoe Store
14,Downtown Toronto,5,Coffee Shop,Clothing Store,Restaurant,Bubble Tea Shop,Burger Joint,Japanese Restaurant,Bookstore,Pizza Place,Italian Restaurant,Sushi Restaurant
15,Downtown Toronto,5,Coffee Shop,Café,Restaurant,Hotel,Bakery,Breakfast Spot,Cosmetics Shop,Italian Restaurant,Seafood Restaurant,Gym
16,Downtown Toronto,5,Coffee Shop,Restaurant,Café,Seafood Restaurant,Hotel,Japanese Restaurant,Cocktail Bar,Creperie,Breakfast Spot,Sporting Goods Shop
17,Downtown Toronto,5,Coffee Shop,Japanese Restaurant,Italian Restaurant,Sushi Restaurant,Sandwich Place,Bubble Tea Shop,Burger Joint,Café,Bar,Tea Room
18,Downtown Toronto,5,Coffee Shop,Café,Sushi Restaurant,Thai Restaurant,Restaurant,Bar,Steakhouse,Cosmetics Shop,Gym,Theater
19,Downtown Toronto,5,Coffee Shop,Hotel,Café,Italian Restaurant,Bar,Park,Aquarium,Scenic Lookout,Brewery,Sandwich Place
20,Downtown Toronto,5,Coffee Shop,Hotel,Café,Restaurant,American Restaurant,Gastropub,Bar,Seafood Restaurant,Italian Restaurant,Gym
