# Week 3 Peer Graded Assignment:
# Segmenting and Clustering Neighborhoods in Toronto
### Aaron Armour

We start off by importing all of the modules which we will use in this notebook.

In [1]:
import requests
import pandas as pd
import numpy as np
# We will use the BeautifulSoup module to help extract data from the html of a Wikipedia page
from bs4 import BeautifulSoup

# Install geocoder (comment out the line below if it is already installed)
!pip install geocoder
import geocoder
from time import sleep

from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

# Install folium (comment out the line below if it is already installed)
!pip install folium
import folium

print('Packages successfully installed and imported!')

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Collecting folium
  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl

## Part 1 - scraping Wikipedia page to build a dataframe with postal code, borough and neighborhood

We make a request for the Wikipedia webpage, and then make an alteration to clean one of the data items so that it will get properly processed in a later step.

In [2]:
# URL of Wikipedia page with the table of data we will use
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

response = requests.get(url)
webdata = response.content

# Making replacements in the raw html to fix up the neighborhood data in the row for postal code M5V
webdata = webdata.replace(b'\n<pre>', b'')
webdata = webdata.replace(b'</pre>\n', b'')


We create a BeautifulSoup object and use the object's 'find' method to obtain the table we are interested in from amongst the raw html.

In [3]:
# Name the parser explicitly so that bs4 does not warn about guessing one
soup = BeautifulSoup(webdata, 'html.parser')
table = soup.find('tbody')


With our BeautifulSoup object we can find the rows of the table and process these as described in the assignment instructions. We then display the first five items in our list of data.

In [4]:
# This function processes the data in a row.
# Returns: a tuple of data - (postal_code, borough, neighborhoods)
def process_row(row):
    items = [item.contents for item in row.find_all('td')]
    assert len(items) == 3  # Expect 3 items, some might just be a '\n'
    assert len(items[0]) == len(items[1]) == len(items[2]) == 1 # Each should just be one item
    
    return (items[0][0].rstrip(), items[1][0].rstrip(), ', '.join(items[2][0].rstrip().split(' / ')))

data = []
for i, row in enumerate(table.children):
    if i == 0:
        # Skip the first row which has the table headings
        continue
        
    if row.name == 'tr':  # Just process the rows of the table which have <tr> tags
        postalCode, borough, neighborhood = process_row(row)
        if borough != 'Not assigned':
            # Only process rows which have a valid Borough assigned (i.e. all those which aren't "Not assigned")
            if neighborhood == 'Not assigned':
                # If a Borough has been assigned, but not a Neighborhood then the Neighborhood is the same as the Borough
                neighborhood = borough
                        
            data.append((postalCode, borough, neighborhood))

data[:5]

[('M3A', 'North York', 'Parkwoods'),
 ('M4A', 'North York', 'Victoria Village'),
 ('M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'),
 ('M6A', 'North York', 'Lawrence Manor, Lawrence Heights'),
 ('M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government")]
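
The filtering rules applied in the loop above can be checked in isolation on a few hand-written rows. The sketch below is a minimal illustration, assuming the rows have already been parsed into (postal code, borough, neighborhood) tuples; the function name `clean_rows` and the sample rows are made up for this example and are not part of the notebook.

```python
def clean_rows(rows):
    """Apply the assignment's rules to (postal_code, borough, neighborhood) tuples."""
    cleaned = []
    for postal_code, borough, neighborhood in rows:
        if borough == 'Not assigned':
            # Rows without a borough are dropped entirely
            continue
        if neighborhood == 'Not assigned':
            # A missing neighborhood falls back to the borough name
            neighborhood = borough
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned

# Hypothetical sample rows, not taken from the Wikipedia table
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M9Z', 'Example Borough', 'Not assigned'),
]
cleaned_sample = clean_rows(sample_rows)
print(cleaned_sample)
```

Keeping the rules in a small function like this makes it easy to re-test them if the layout of the Wikipedia table changes.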

Now we create a Pandas DataFrame, df, from the list of data we created above, and display the first five rows of df.

In [5]:
df = pd.DataFrame(data, columns = ['PostalCode', 'Borough', 'Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


We use the shape attribute to find out the number of rows and columns in our DataFrame df.

In [6]:
print('The DataFrame df has {} rows (and {} columns).'.format(*df.shape))

The DataFrame df has 103 rows (and 3 columns).


## Part 2 - obtaining geographic coordinates for the neighborhoods

From the example code given in the assignment instructions, we create a function to assist with obtaining the latitude and longitude values for a postal code in Toronto. (We also add arguments to avoid being stuck in the while loop if the geocoder.google calls fail to return a non-None value.)

In [7]:
# A function to assist with getting coordinates for the postal codes
def get_lat_long(postal_code, max_attempts=50, pause_time=0.05):
    # initialize your variable to None
    lat_lng_coords = None
    attempt = 0
    
    # loop until you get the coordinates
    while lat_lng_coords is None and attempt < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        sleep(pause_time)
        attempt += 1
        
    if lat_lng_coords is not None:
        return (lat_lng_coords[0], lat_lng_coords[1])
    else:
        return None
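
The bounded-retry idea in `get_lat_long` can be exercised without any network access. The following is a minimal sketch, assuming a stand-in callable in place of the geocoder call; `retry_until_value` and the fake responses are hypothetical names and values introduced only for illustration.

```python
from time import sleep

def retry_until_value(fetch, max_attempts=5, pause_time=0.0):
    """Call fetch() until it returns a non-None value or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result, attempt
        sleep(pause_time)  # pause only after a failed attempt
    return None, max_attempts

# A made-up stand-in for a geocoding call: fails twice, then succeeds
responses = iter([None, None, (43.65, -79.38)])
coords, attempts_used = retry_until_value(lambda: next(responses))
print(coords, attempts_used)

# A stand-in that never succeeds shows that the loop still terminates
never_coords, never_attempts = retry_until_value(lambda: None, max_attempts=3)
print(never_coords, never_attempts)
```

Returning the number of attempts alongside the result is a design choice of this sketch; it makes it easier to tell a slow success from an outright failure.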


Let's test this function out on a particular postal code.

In [8]:
latlong = get_lat_long('M5G')

latlong is None

True

It seems that the geocoder approach is not working for us. So we must fall back to using the csv file with the geospatial data.

Rather than downloading a local copy, we can supply the URL for the geospatial data directly to the Pandas 'read_csv' function to create the DataFrame, geo_df. Let's examine the first few rows of geo_df.

In [9]:
geo_df = pd.read_csv('https://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, we wish to merge the DataFrame geo_df with our DataFrame df created in the previous section. To do this, we need to rename the column "Postal Code" (note the space) in geo_df to "PostalCode" so that the column name matches that in the DataFrame df. We display the first few rows of df to see that these operations have had the desired effect of adding each postal code's latitude and longitude into this DataFrame.

In [10]:
geo_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df = df.merge(geo_df, on='PostalCode')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [11]:
print('The DataFrame df has {} rows (and {} columns).'.format(*df.shape))

The DataFrame df has 103 rows (and 5 columns).
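
The unchanged row count suggests that the merge did not drop any postal codes. A more explicit check is possible with the `validate` and `indicator` arguments of `DataFrame.merge`. The sketch below uses small made-up frames (`left`, `right`) rather than df and geo_df, and deliberately leaves one postal code without coordinates.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
# Made-up coordinates; 'M5A' is deliberately missing
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.75, 43.73],
                      'Longitude': [-79.33, -79.32]})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
checked = left.merge(right, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

With a left merge, rows that have no coordinates are kept (with missing values) instead of silently disappearing, so they can be listed and handled.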


## Part 3 - clustering and analysis of neighborhoods in Toronto

#### Please note: this part of the assignment talks about clustering neighborhoods in Toronto, and performing the analysis at the level of *individual* neighborhoods would be closer to what we did in the lab. We will instead *group together the neighborhoods that share a postal code*, since the csv file of geolocation data and the DataFrames we constructed in the previous two parts suggest that this is what is intended for this part of the assignment.

In this part of the assignment we will follow the same process as used in this week's lab.

Set up variables for Foursquare credentials, and the API version.

In [12]:
# TO DO: Remove these values before pushing to github!
CLIENT_ID = '<My_Foursquare_ID>' # your Foursquare ID
CLIENT_SECRET = '<My_Foursquare_client_secret>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


Define a function (as used in this week's lab) to simplify the process of creating a DataFrame with nearby venues from queries to Foursquare's API.

In [13]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    print('Querying Foursquare for venues near to:')
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('\t-{}'.format(name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
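
The function above indexes straight into the JSON response, so a neighborhood with no results or a venue without a category would raise an error. The sketch below shows one more defensive way to do the same two steps, assuming the response has the nested shape accessed above. The helper names `build_explore_params` and `extract_venues`, and the hand-written payload, are hypothetical; no request is sent.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Query parameters matching the URL assembled in getNearbyVenues."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }

def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of a response-shaped dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item['venue']
        # Fall back to None when a venue has no category listed
        categories = venue.get('categories') or [{'name': None}]
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       categories[0]['name']))
    return venues

# Hand-written payload that only mimics the fields accessed above
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorised Spot',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

params = build_explore_params('ID', 'SECRET', '20180605', 43.65, -79.36)
venues = extract_venues(fake_payload)
empty = extract_venues({'response': {}})
print(params['ll'], venues, empty)
```

A dictionary like `params` can be passed to `requests.get(url, params=params)`, which takes care of the URL encoding instead of formatting the query string by hand.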

As allowed for in the assignment instructions, we will restrict our data set to just those Boroughs containing "Toronto" in their name, so as to reduce the number of calls to the Foursquare API. Let's see the first few rows of the filtered DataFrame.

In [14]:
filtered_df = df[df['Borough'].apply(lambda x: 'Toronto' in x)]

filtered_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


Now we use the 'getNearbyVenues' function to create a DataFrame with information about the neighborhoods in filtered_df.

In [15]:
# Query Foursquare for venues near the coordinates of each postal code in filtered_df
toronto_venues = getNearbyVenues(names=filtered_df['Neighborhood'],
                                 latitudes=filtered_df['Latitude'],
                                 longitudes=filtered_df['Longitude'])


Querying Foursquare for venues near to:
	-Regent Park, Harbourfront
	-Queen's Park, Ontario Provincial Government
	-Garden District, Ryerson
	-St. James Town
	-The Beaches
	-Berczy Park
	-Central Bay Street
	-Christie
	-Richmond, Adelaide, King
	-Dufferin, Dovercourt Village
	-Harbourfront East, Union Station, Toronto Islands
	-Little Portugal, Trinity
	-The Danforth West, Riverdale
	-Toronto Dominion Centre, Design Exchange
	-Brockton, Parkdale Village, Exhibition Place
	-India Bazaar, The Beaches West
	-Commerce Court, Victoria Hotel
	-Studio District
	-Lawrence Park
	-Roselawn
	-Davisville North
	-Forest Hill North & West
	-High Park, The Junction South
	-North Toronto West
	-The Annex, North Midtown, Yorkville
	-Parkdale, Roncesvalles
	-Davisville
	-University of Toronto, Harbord
	-Runnymede, Swansea
	-Moore Park, Summerhill East
	-Kensington Market, Chinatown, Grange Park
	-Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
	-CN Tower, King and Spadina, Railway Land

Let's see how many venues were returned for each neighborhood.

In [16]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
Business reply mail Processing CentrE,17,17,17,17,17,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,62,62,62,62,62,62
Christie,18,18,18,18,18,18
Church and Wellesley,71,71,71,71,71,71
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,36,36,36,36,36,36
Davisville North,7,7,7,7,7,7


Now, we will construct a DataFrame containing the one hot encoding of each of the venue categories.

In [17]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We group the rows by neighborhood and calculate the mean of each one hot column, which gives the frequency of each venue category amongst the venues returned for that neighborhood.

In [18]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.066667,0.066667,0.066667,0.133333,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.028169,0.0,0.0,0.0,0.0,0.0,0.014085,0.0,0.0,...,0.014085,0.014085,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
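
To see why the mean of the one hot columns is a frequency, it helps to run the same two steps on a tiny example. The sketch below uses made-up venues for two neighborhoods, 'A' and 'B'; the names `toy_venues`, `toy_onehot` and `toy_grouped` are introduced only for this illustration.

```python
import pandas as pd

# Made-up venues for two neighborhoods
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby('Neighborhood').mean()
print(toy_grouped)
```

Each row of the grouped table sums to 1, so a neighborhood with only a handful of venues necessarily has a few very large entries. This becomes relevant when interpreting the clusters later on.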


Let's find the top 10 most common venues in each of the neighborhoods.

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd'] + ['th'] * (num_top_venues - 3)

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Restaurant,Bakery,Beer Bar,Cocktail Bar,Seafood Restaurant,Farmers Market,Cheese Shop,Italian Restaurant,Café
1,"Brockton, Parkdale Village, Exhibition Place",Café,Nightclub,Coffee Shop,Breakfast Spot,Bakery,Convenience Store,Performing Arts Venue,Pet Store,Climbing Gym,Restaurant
2,Business reply mail Processing CentrE,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Park,Pizza Place,Restaurant,Burrito Place,Brewery,Farmers Market
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Coffee Shop,Sculpture Garden,Airport
4,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bubble Tea Shop,Burger Joint,Japanese Restaurant,Salad Place,Ice Cream Shop,Fried Chicken Joint
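
Note that sorting a whole row always yields ten names, even when a neighborhood has fewer than ten categories with a non-zero frequency; the remaining slots are then filled with zero-frequency categories. The sketch below is a variant, not the approach used in this notebook, which drops the zeros before ranking. The frame `toy_freq` and the helper `top_categories` are hypothetical.

```python
import pandas as pd

# Made-up frequency table in the same layout as toronto_grouped
toy_freq = pd.DataFrame(
    {'Bakery': [0.25, 0.0], 'Coffee Shop': [0.5, 0.0], 'Park': [0.25, 1.0]},
    index=pd.Index(['A', 'B'], name='Neighborhood'),
)

def top_categories(freq_row, n):
    """Return up to n category names by frequency, ignoring zero entries."""
    nonzero = freq_row[freq_row > 0]
    return nonzero.sort_values(ascending=False).index.tolist()[:n]

top_two = {name: top_categories(row, 2) for name, row in toy_freq.iterrows()}
print(top_two)
```

Here neighborhood 'B' gets a single entry instead of being padded with categories that it does not contain.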


We will now use the k-means algorithm to cluster the neighborhoods. Since the number of neighborhoods is similar to the number of neighborhoods in New York we analysed in the lab, let's also try to group these neighborhoods into 5 clusters.

In [20]:
# set number of clusters
kclusters = 5

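# Use the columns keyword; passing the axis positionally is not accepted by recent pandas versions
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')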

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
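
The first ten labels only hint at how the neighborhoods are distributed. A quick way to see the size of every cluster is to count the labels. The sketch below uses a hypothetical label array, `toy_labels`, rather than the real `kmeans.labels_`.

```python
import numpy as np

# Hypothetical labels for eight items assigned to three clusters
toy_labels = np.array([0, 0, 0, 2, 0, 1, 0, 0])

# minlength ensures that every cluster gets an entry, even an empty one
sizes = np.bincount(toy_labels, minlength=3)
print(sizes)
```

In this notebook the equivalent call would be `np.bincount(kmeans.labels_, minlength=kclusters)`.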

Let's create a new DataFrame which includes the cluster label assigned by running the k-means algorithm, and the top 10 most common venues for each neighborhood.

In [21]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = filtered_df

# join the cluster labels and top venues onto filtered_df, which holds the latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Diner,Sushi Restaurant,Gym,Park,Mexican Restaurant,Juice Bar,Italian Restaurant,Hobby Shop,Fried Chicken Joint
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Café,Restaurant,Bubble Tea Shop,Japanese Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Tea Room,Ramen Restaurant
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Hotel,Gastropub,American Restaurant,Cocktail Bar,Italian Restaurant,Seafood Restaurant,Cosmetics Shop,Department Store
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Health Food Store,Trail,Pub,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


We will now visualise these clusters of neighborhoods on a map of Toronto.

In [22]:
# Latitude and Longitude for Toronto, found with a Google search
latitude = 43.6532
longitude = -79.3832

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now we can examine the clusters and see if we can determine the discriminating venue categories which make up a cluster.

##### Cluster 1:

In [23]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,"Regent Park, Harbourfront",0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0,Coffee Shop,Diner,Sushi Restaurant,Gym,Park,Mexican Restaurant,Juice Bar,Italian Restaurant,Hobby Shop,Fried Chicken Joint
9,Downtown Toronto,"Garden District, Ryerson",0,Clothing Store,Coffee Shop,Café,Restaurant,Bubble Tea Shop,Japanese Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Tea Room,Ramen Restaurant
15,Downtown Toronto,St. James Town,0,Coffee Shop,Café,Hotel,Gastropub,American Restaurant,Cocktail Bar,Italian Restaurant,Seafood Restaurant,Cosmetics Shop,Department Store
20,Downtown Toronto,Berczy Park,0,Coffee Shop,Restaurant,Bakery,Beer Bar,Cocktail Bar,Seafood Restaurant,Farmers Market,Cheese Shop,Italian Restaurant,Café
24,Downtown Toronto,Central Bay Street,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bubble Tea Shop,Burger Joint,Japanese Restaurant,Salad Place,Ice Cream Shop,Fried Chicken Joint
25,Downtown Toronto,Christie,0,Grocery Store,Café,Park,Gas Station,Coffee Shop,Diner,Baby Store,Restaurant,Italian Restaurant,Athletics & Sports
30,Downtown Toronto,"Richmond, Adelaide, King",0,Coffee Shop,Café,Restaurant,Gym,Clothing Store,American Restaurant,Hotel,Deli / Bodega,Thai Restaurant,Salad Place
31,West Toronto,"Dufferin, Dovercourt Village",0,Pharmacy,Bakery,Supermarket,Brazilian Restaurant,Café,Recording Studio,Bar,Bank,Middle Eastern Restaurant,Brewery
36,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",0,Coffee Shop,Aquarium,Restaurant,Café,Hotel,Italian Restaurant,Brewery,Scenic Lookout,Sporting Goods Shop,Fried Chicken Joint


##### Cluster 2:

In [24]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
83,Central Toronto,"Moore Park, Summerhill East",1,Park,Playground,Summer Camp,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
91,Downtown Toronto,Rosedale,1,Park,Trail,Playground,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


##### Cluster 3:

In [25]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,Central Toronto,Roselawn,2,Pool,Garden,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


##### Cluster 4:

In [26]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,Central Toronto,Forest Hill North & West,3,Jewelry Store,Trail,Bus Line,Sushi Restaurant,Women's Store,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


##### Cluster 5:

In [27]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,East Toronto,The Beaches,4,Health Food Store,Trail,Pub,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


The results of the clustering are a little disappointing: most of the neighborhoods have been assigned to Cluster 1, with Clusters 2-5 each made up of only one or two neighborhoods. The neighborhoods in Cluster 1 do tend to have many restaurants and cafes amongst their most common venues, although there are differences between them as well.

If we look back at the counts in the toronto_venues DataFrame grouped by neighborhood, we see that the neighborhoods which are in Clusters 2-5 each have at most 4 different venues. For each of these neighborhoods, every venue category that is present therefore accounts for a very large proportion of that neighborhood's venues. I believe this is what has caused the clustering results we have obtained: the majority of the neighborhoods have a larger number of venues, so each venue category makes up a more moderate proportion of the total. In this sense the neighborhoods in Cluster 1 do all belong together.

However, it might be interesting to find a finer clustering of the neighborhoods in Cluster 1. One might expect to obtain such a clustering by reducing the dataset, dropping all neighborhoods for which fewer than, say, 15 or 20 venues were returned.
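
Dropping the sparse neighborhoods could be done before the one hot encoding step. The sketch below shows one possible way on a made-up venue table; `toy_venues`, `min_venues` and the threshold of 4 are assumptions chosen for the example, and whether this actually improves the clustering has not been tested here.

```python
import pandas as pd

# Made-up venue table: 'A' has 5 venues, 'B' has 2, 'C' has 4
toy_venues = pd.DataFrame({
    'Neighborhood': ['A'] * 5 + ['B'] * 2 + ['C'] * 4,
    'Venue': ['venue {}'.format(i) for i in range(11)],
})

min_venues = 4  # hypothetical threshold; the text suggests 15 or 20 for the real data

# transform('count') repeats each neighborhood's venue count on every one of its rows
venue_counts = toy_venues.groupby('Neighborhood')['Venue'].transform('count')
dense_venues = toy_venues[venue_counts >= min_venues]
kept = sorted(dense_venues['Neighborhood'].unique())
print(kept)
```

The filtered table could then be passed through the same one hot encoding, grouping and k-means steps as before.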