# Applied Data Science Capstone - Toronto Clustering Assignment

# Question 1

### Start off with importing some libraries to help scrape the Wikipedia article containing the postal codes

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

### Assign the url to a variable, request it, and scrape the entity "table" from the webpage using Beautiful Soup

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

page = urllib.request.urlopen(url)

soup = BeautifulSoup(page, "lxml")

wiki_table = soup.find_all("table")

### Using a simple for loop to scrape each column from the table, assign them to lists and then build a Pandas DataFrame using those lists

In [3]:
#Create three empty lists for the three columns within the table:
A=[]
B=[]
C=[]

#Iterate through the table, looking precisely for the 'tr' and 'td' tags within the webpage:
for row in soup.find_all('tr'):
    cells=row.find_all('td')
    #Making sure what we find is precisely 3 columns long, then appending each to our empty lists above:
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))

#Creating our DataFrame using these lists which have now been populated:
df=pd.DataFrame(A,columns=['PostalCode'])
df['Borough']=B
df['Neighborhood']=C

#Display the head of the data to visually check our DataFrame has been populated:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
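
The row-and-cell loop above can also be tried offline. Below is a minimal sketch, assuming a small hand-written HTML fragment in place of the live Wikipedia page (the fragment and the `RowCollector` class are illustrative and not part of the original notebook). It uses only the standard library's `html.parser`, keeps the rows that have exactly three `td` cells, and strips the trailing newline from each cell as it goes.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the Wikipedia table (assumption: same 3-column layout).
SAMPLE_HTML = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <tr> that has exactly three <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished three-cell rows
        self._cells = None  # cells of the row being parsed, or None
        self._text = None   # text pieces of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Join the pieces and strip whitespace, including the trailing '\n'.
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 3:
                self.rows.append(tuple(self._cells))
            self._cells = None


collector = RowCollector()
collector.feed(SAMPLE_HTML)
rows = collector.rows
print(rows)
```

The header row is skipped because it contains `th` cells rather than `td` cells, which mirrors the `len(cells)==3` check in the notebook.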


### Now we have the DataFrame populated, it is time to clean things up a little

In [4]:
#Not viewable in Jupyter Notebook, but the table contained '\n' after every entry in every row of the table
#(Denoting a 'new line' character). The below expression simply replaces these with nothing, essentially removing them:
df = df.replace(r'\n','', regex=True)

#Showing the shape of the DataFrame before any data is omitted:
print(df.shape)

#Displaying the head of the table once more to check the '\n' has been removed from the rows:
df.head()

(180, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### The rows which are not populated, or labelled 'Not assigned', are of no use to us. We can remove these to make our analysis easier

In [5]:
#The below simply removes any rows from the DataFrame which contain 'Not assigned' in the 'Borough' column:
df = df[df.Borough != 'Not assigned']

#Reset the index after the above operation (reset_index returns a new DataFrame, so the result must be assigned back):
df = df.reset_index(drop=True)

#Show the shape of the data once more, following omission of the rows:
print(df.shape)

#Visually inspect the head of the data once more:
df.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
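
A small sketch of the filtering step on made-up rows, showing that `reset_index` returns a new DataFrame rather than modifying the existing one in place. The three rows below are illustrative only.

```python
import pandas as pd

# Three illustrative rows in the same layout as the scraped table.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Boolean-mask filtering keeps the original index labels (here 1 and 2).
filtered = sample[sample["Borough"] != "Not assigned"]

# reset_index returns a new DataFrame; without assignment nothing changes.
filtered.reset_index(drop=True)
index_without_assignment = list(filtered.index)

# Assigning the result renumbers the rows from 0.
filtered = filtered.reset_index(drop=True)
index_with_assignment = list(filtered.index)

print(index_without_assignment, index_with_assignment)
```

A clean 0-based index matters later in the notebook, where rows are looked up by position with `.loc[0, ...]`.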


# Question 2

### We have a .csv file containing the latitude/longitude of the postal codes, so we can read this into a new DataFrame below

In [6]:
#Defining the URL which contains the CSV file:
lat_long_URL = "https://cocl.us/Geospatial_data"

#Simply reading the CSV file and assigning to the DataFrame df2, then displaying the top 5 rows:
df2 = pd.read_csv(lat_long_URL)
print(df2.shape)
df2.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### We now want to merge the two DataFrames together, so all the data is contained in one place

In [7]:
#Defining a new DataFrame, toronto_df, and specifying the way in which we want to merge the two existing DataFrames:
toronto_df = df.merge(right=df2, how='left', left_on='PostalCode', right_on='Postal Code')
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M7A,43.662301,-79.389494


In [8]:
#Removing the "Postal Code" column from the middle of the DataFrame following the merge:
toronto_df.drop(columns="Postal Code", inplace=True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
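
The merge-then-drop steps above can be condensed. The following is a sketch under the assumption that the key column is renamed before merging; the two tiny frames are illustrative, with coordinates copied from the tables above. `validate` and `indicator` are standard `DataFrame.merge` options that help confirm every postal code found exactly one match.

```python
import pandas as pd

# Illustrative rows; the coordinates are copied from the tables above.
codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# Renaming the key first means the merge yields a single key column,
# so no column has to be dropped afterwards.
coords = coords.rename(columns={"Postal Code": "PostalCode"})

merged = codes.merge(
    coords,
    how="left",
    on="PostalCode",
    validate="one_to_one",  # raise if a postal code is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.drop(columns="_merge"))
print("rows without coordinates:", unmatched)
```

A non-zero `unmatched` count would flag postal codes that are missing from the coordinates file before they reach the map as `NaN` locations.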


# Question 3

### We will now install the libraries we require to plot geographical data within this notebook

In [9]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

print("Libraries Installed")

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries Installed


In [10]:
#Importing the necessary libraries for further analysis:
from geopy.geocoders import Nominatim
import folium
from sklearn.cluster import KMeans

In [11]:
#Toronto's Latitude and Longitude values:

latitude = 43.65323
longitude = -79.38318

#Creating our stock map of Toronto, before anything is populated:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

map_toronto

### Adding markers onto our map of Toronto at the coordinates of each postal code

In [12]:
for lat, lng, borough, neighborhood, pc in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood'], toronto_df['PostalCode']):
    label = '{}, {}, {}'.format(neighborhood, borough, pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

### Let's look at one Borough in particular, North York

In [13]:
#Creating a DataFrame which contains only the Neighborhoods found in North York:
north_york_df = toronto_df[toronto_df['Borough'] == 'North York'].reset_index(drop=True)

print(north_york_df.shape)
north_york_df.head()

(24, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


### Visualising this on a map

In [14]:
#North York's latitude & longitude values:

ny_lat = 43.76154
ny_long = -79.41108

ny_map = folium.Map(location=[ny_lat, ny_long], zoom_start=11)

#Iterate through and add in markers, as before:
for lat, lng, borough, neighborhood, pc in zip(north_york_df['Latitude'], north_york_df['Longitude'], north_york_df['Borough'], north_york_df['Neighborhood'], north_york_df['PostalCode']):
    label = '{}, {}, {}'.format(neighborhood, borough, pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(ny_map)
    
ny_map

### We can now use the Foursquare API to start collecting data on venues within the North York area

In [15]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own secret; real credentials should not be published in a notebook
VERSION = '20180605'

In [16]:
#Collecting the name and latitude/longitude values of the first Neighborhood in North York:

neighborhood_latitude = north_york_df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = north_york_df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = north_york_df.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


In [17]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions use pandas.io.json)

### Executing a call to Foursquare for the venue data, then parsing the JSON response into a Python dictionary

In [18]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 1000 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
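
The request URL above is assembled with string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`, which escapes any special characters in the values. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; no request is sent.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with a properly encoded query string."""
    query = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the 'll' value readable instead of '%2C'.
    return EXPLORE_ENDPOINT + "?" + urlencode(query, safe=",")


# Placeholder credentials; this only builds the string.
example_url = build_explore_url("ID", "SECRET", "20180605", 43.7532586, -79.3296565, 1000, 100)
print(example_url)
```

Since the notebook already imports `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=query)` should give an equivalent result.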

In [19]:
#Function to collect the category type from the returned data:

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the flattened DataFrame uses the prefixed column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Transforming data to be worked with in a Pandas DataFrame

In [20]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

nearby_venues.head()

29 venues were returned by Foursquare.


Unnamed: 0,name,categories,lat,lng
0,Allwyn's Bakery,Caribbean Restaurant,43.75984,-79.324719
1,Brookbanks Park,Park,43.751976,-79.33214
2,Tim Hortons,Café,43.760668,-79.326368
3,A&W,Fast Food Restaurant,43.760643,-79.326865
4,Bruno's valu-mart,Grocery Store,43.746143,-79.32463


### This seems to work, now to apply this to all Neighborhoods found within the North York area

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
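
The function above indexes straight into the response, so a venue with an empty `categories` list or a response without `groups` would raise an error partway through the loop. The sketch below shows one defensive way to flatten the same structure. The helper `extract_venue_rows` and the sample payload are made up for illustration and only mimic the fields the notebook reads.

```python
def extract_venue_rows(payload, neighborhood, lat, lng):
    """Flatten an explore-style response into row tuples, tolerating gaps."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # A venue without categories gets None instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows


# Made-up payload for illustration: one normal venue and one with no category.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Brookbanks Park",
                   "location": {"lat": 43.751976, "lng": -79.33214},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Unnamed Spot",
                   "location": {"lat": 43.75, "lng": -79.33},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(sample_payload, "Parkwoods", 43.753259, -79.329656)
empty = extract_venue_rows({"response": {}}, "Parkwoods", 43.753259, -79.329656)
print(rows, empty)
```

Returning an empty list for a neighborhood with no results keeps the later list comprehension that builds the DataFrame working unchanged.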

In [22]:
north_york_venues = getNearbyVenues(names=north_york_df['Neighborhood'],
                                   latitudes=north_york_df['Latitude'],
                                   longitudes=north_york_df['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


In [23]:
# Inspecting the shape and structure of our new table:

print(north_york_venues.shape)
north_york_venues.head()

(244, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [24]:
# Checking how many venues were returned for each Neighborhood:

count_ny_venues = north_york_venues.groupby('Neighborhood').count()
count_ny_venues[["Venue"]]

Neighborhood,Venue
"Bathurst Manor, Wilson Heights, Downsview North",21
Bayview Village,4
"Bedford Park, Lawrence Manor East",23
Don Mills,27
Downsview,18
"Fairview, Henry Farm, Oriole",64
Glencairn,4
Hillcrest Village,5
Humber Summit,1
"Humberlea, Emery",2


In [25]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 106 unique categories.


### Analysing each Neighborhood

In [26]:
# one hot encoding

north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe

north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

# move neighborhood column to the first column

fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.shape

(244, 107)

In [27]:
ny_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()

ny_grouped.shape

(20, 107)
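
To make the two cells above concrete, here is a small sketch on made-up venues. It shows that averaging the one-hot columns per Neighborhood gives each category's share of that Neighborhood's venues, and that `pd.crosstab` with `normalize="index"` produces the same table in one call. The data is illustrative only.

```python
import pandas as pd

# Made-up venues: neighborhood A has two parks and a gym, B has one bank.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Gym", "Bank"],
})

# Route used in the notebook: one-hot encode, then average per neighborhood.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean()

# Equivalent route: a row-normalised contingency table.
shares = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

print(grouped)
print(shares)
```

Each row sums to 1, so the values k-means sees later are proportions rather than raw counts, which keeps a Neighborhood with 64 venues comparable to one with 4.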

### A simple function to fetch the most common venues

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Creating a new DataFrame which displays the most common venues

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pet Store,Bridal Shop,Diner,Restaurant,Pizza Place,Sandwich Place,Shopping Mall,Ice Cream Shop
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Event Space,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
2,"Bedford Park, Lawrence Manor East",Sandwich Place,Sushi Restaurant,Restaurant,Italian Restaurant,Coffee Shop,Indian Restaurant,Pharmacy,Pizza Place,Butcher,Liquor Store
3,Don Mills,Gym,Japanese Restaurant,Asian Restaurant,Coffee Shop,Restaurant,Beer Store,Gym / Fitness Center,Bike Shop,Italian Restaurant,Discount Store
4,Downsview,Grocery Store,Park,Gym / Fitness Center,Korean Restaurant,Discount Store,Shopping Mall,Snack Place,Food Truck,Baseball Field,Hotel
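
The column labels above rely on a `try`/`except` around a three-item suffix list, which is fine for ten columns but would label a 21st column "21th". A small sketch of a more general suffix helper follows; the function name `ordinal` is an assumption for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```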


# Clustering our neighborhoods

In [30]:
# set number of clusters - chosen arbitrarily here; in a fuller analysis k should be selected beforehand (e.g. with the elbow method):
kclusters = 5

ny_grouped_clustering = ny_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = north_york_df

# merge north_york_df with neighborhoods_venues_sorted to add latitude/longitude for each neighborhood
north_york_merged = north_york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

north_york_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Park,Food & Drink Shop,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Hockey Arena,Portuguese Restaurant,Intersection,Financial or Legal Service,Coffee Shop,Women's Store,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Gift Shop,Event Space,Miscellaneous Shop,Coffee Shop,Vietnamese Restaurant,Arts & Crafts Store
3,M3B,North York,Don Mills,43.745906,-79.352188,1,Gym,Japanese Restaurant,Asian Restaurant,Coffee Shop,Restaurant,Beer Store,Gym / Fitness Center,Bike Shop,Italian Restaurant,Discount Store
4,M6B,North York,Glencairn,43.709577,-79.445073,1,Playground,Pub,Bakery,Japanese Restaurant,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega


## Let's see our clusters on the map, colour coded to distinguish them from each other

##### Each marker can be clicked to reveal the name of the Neighborhood and which cluster it belongs to

In [31]:
# create map
map_clusters = folium.Map(location=[ny_lat, ny_long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'], north_york_merged['Neighborhood'], north_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Viewing the different clusters and the venues they contain can provide us an insight into why they were clustered in such a way

In [32]:
# Cluster 1:

north_york_merged.loc[north_york_merged['Cluster Labels'] == 0, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0,Park,Food & Drink Shop,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
14,North York,0,Construction & Landscaping,Park,Bakery,Women's Store,Event Space,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
22,North York,0,Park,Convenience Store,Women's Store,Coffee Shop,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner


In [33]:
# Cluster 2:

north_york_merged.loc[north_york_merged['Cluster Labels'] == 1, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,1,Hockey Arena,Portuguese Restaurant,Intersection,Financial or Legal Service,Coffee Shop,Women's Store,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop
2,North York,1,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Gift Shop,Event Space,Miscellaneous Shop,Coffee Shop,Vietnamese Restaurant,Arts & Crafts Store
3,North York,1,Gym,Japanese Restaurant,Asian Restaurant,Coffee Shop,Restaurant,Beer Store,Gym / Fitness Center,Bike Shop,Italian Restaurant,Discount Store
4,North York,1,Playground,Pub,Bakery,Japanese Restaurant,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
5,North York,1,Gym,Japanese Restaurant,Asian Restaurant,Coffee Shop,Restaurant,Beer Store,Gym / Fitness Center,Bike Shop,Italian Restaurant,Discount Store
6,North York,1,Golf Course,Pool,Athletics & Sports,Mediterranean Restaurant,Dog Run,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop
7,North York,1,Bank,Coffee Shop,Pet Store,Bridal Shop,Diner,Restaurant,Pizza Place,Sandwich Place,Shopping Mall,Ice Cream Shop
8,North York,1,Clothing Store,Coffee Shop,Women's Store,Fast Food Restaurant,Japanese Restaurant,Shoe Store,Toy / Game Store,Tea Room,Bank,Restaurant
9,North York,1,Coffee Shop,Bar,Massage Studio,Metro Station,Caribbean Restaurant,Vietnamese Restaurant,Asian Restaurant,Athletics & Sports,Deli / Bodega,Department Store
10,North York,1,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Event Space,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store


In [34]:
# Cluster 3:

north_york_merged.loc[north_york_merged['Cluster Labels'] == 2, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,North York,2,Shopping Mall,Women's Store,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant


In [35]:
# Cluster 4:

north_york_merged.loc[north_york_merged['Cluster Labels'] == 3, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,North York,3,Piano Bar,Women's Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner


In [36]:
# Cluster 5:

north_york_merged.loc[north_york_merged['Cluster Labels'] == 4, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,North York,4,Paper / Office Supplies Store,Baseball Field,Women's Store,Event Space,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant


### Observations:

##### Clusters 3, 4 & 5 each contain a single Neighborhood. This suggests that five clusters is too many for this data: rather than finding five meaningful groups, the algorithm appears to have isolated three unusual Neighborhoods in clusters of their own. This may be solved by reducing the number of clusters following an analysis of the most appropriate value for K.

##### Cluster 1 consists of three areas which are geographically separate, which is interesting. In the table, Park is the first or second most common venue in all three, and shops rank above meeting places (cafes and restaurants, for example). These places may be less populated than other parts of the Borough, containing more parks / open spaces as well as some shops.

##### Cluster 2 is by far the biggest of the clusters, containing 18 of the 24 postal-code areas in the Borough. These places appear to be grouped based on being more highly populated: many more types of restaurants, banks, as well as different types of gyms and fitness centres.

##### As noted above, the final three clusters contain one Neighborhood each. Upon closer analysis of each, it appears likely that the algorithm would have clustered these together if 3 had been the number chosen for K.
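
One way to carry out the analysis of K suggested above is to compare silhouette scores across candidate values. The following is a minimal sketch on synthetic data, not on the notebook's venue table: the three blobs, the range of K and the use of `silhouette_score` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the frequency table: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Mean silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

To try this on the real data, `X` would be replaced with `ny_grouped_clustering`. With only 20 rows and around a hundred sparse columns, the scores may turn out fairly flat, so the result is better treated as a guide than as a definitive answer.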