# Capstone Project

This notebook is mainly used for the Coursera capstone project of the IBM Data Science Certificate.

Importing required modules

In [155]:
import pandas as pd
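# use fully qualified option names; short names can be ambiguous in newer pandas versions
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)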

import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen
from geopy import Nominatim
import folium
import requests


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

## Part 1: Scraping and Parsing Data from Wikipedia into a Pandas DF

Initializing a data frame with three columns: PostalCode, Borough, and Neighborhood

In [156]:
df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])
#df = df.append({'PostalCode':1, 'Borough':2, 'Neighborhood':3}, ignore_index=True)
#df.loc[0] = [1,2,3]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


Passing the web page into an instance of Beautiful Soup for parsing.

In [157]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_page = urlopen(url)
soup = BeautifulSoup(html_page, 'html.parser')

Parsing the table into the data frame

In [158]:
index = 0
for row in soup.table.find_all('tr')[1:]:
    # a list of three elements: one element for each column: ['PostalCode', 'Borough', 'Neighborhood']
    columns = [c.get_text().strip() for c in row.find_all('td')]
    if len(columns) != 3 or columns[1] == 'Not assigned':
        continue # Skip malformed rows and rows whose borough is Not assigned
    elif columns[2] == 'Not assigned':
        # If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
        df.loc[index] = [columns[0], columns[1], columns[1]]
        
    else:
        df.loc[index] = columns
    
    index += 1
        
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Let's check the shape of this data frame. As shown in the cell output below, the data frame has 103 rows and 3 columns.

In [159]:
df.shape

(103, 3)

Also, we can see below that the number of unique postal codes is equal to the number of rows.

In [160]:
len(df.PostalCode.unique())

103

Let's make sure that neither the Borough nor the Neighborhood column contains the value 'Not assigned'.

In [161]:
# `in` on a Series tests the index labels, so check against the underlying values
'Not assigned' in df.Borough.values

False

In [162]:
'Not assigned' in df.Neighborhood.values

False
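
One caveat about membership checks like the ones above: applied directly to a pandas Series, the `in` operator looks at the index labels rather than the stored values. The sketch below uses a small made-up Series (not the notebook's data) to show the difference and two ways to test the values themselves.

```python
import pandas as pd

# Toy Series for illustration only; the default index is 0, 1, 2
s = pd.Series(['North York', 'Not assigned', 'Etobicoke'])

# `in` on the Series itself checks the index labels (0, 1, 2)
in_index = 'Not assigned' in s

# Checking the underlying values instead
in_values = 'Not assigned' in s.values
via_comparison = bool((s == 'Not assigned').any())

print(in_index, in_values, via_comparison)
```

Either of the last two forms can be used when the intent is to look for a value in a column.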

## Part 2: Adding Coordinates (Latitudes and Longitudes) of Each Neighborhood

A CSV file with geospatial data is used to add the coordinates. Let's first import this file into a data frame.

In [163]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df.rename({'Postal Code': 'PostalCode'}, axis=1, inplace=True)
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Let's make sure that the CSV file has all the geospatial data needed for our data frame. 

In [164]:
set(geo_df.PostalCode) == set(df.PostalCode)

True

Now, we can merge both data frames using the pandas merge method.

In [165]:
df_canada = df.merge(geo_df, on='PostalCode')
df_canada.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
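
As a cross-check on the merge above, here is a minimal sketch on toy data that uses the `how`, `validate`, and `indicator` parameters of `DataFrame.merge` to surface postal codes with no coordinates. The postal code `M9Z` and the frame names `left` and `right` are made up for illustration.

```python
import pandas as pd

# Toy frames; 'M9Z' is a hypothetical code with no matching coordinates
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; validate raises if a key repeats on either side
merged = left.merge(right, on='PostalCode', how='left',
                    validate='one_to_one', indicator=True)

# Rows marked 'left_only' had no match in `right`
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

With the default inner join, unmatched rows would simply disappear, so a left join with an indicator can make missing data easier to spot.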


## Part 3: Clustering

### 1. Extracting Toronto Data Frame from df_canada

__Note__ that the coordinates in df_canada belong to the postal codes, not the neighborhoods. Therefore, we first need to extract the neighborhoods in Toronto, find their coordinates, and then explore them using the Foursquare API.

We will be working on clustering the neighborhoods of Toronto boroughs: ['Downtown Toronto', 'Central Toronto', 'East Toronto', 'West Toronto']

In [166]:
toronto_boroughs =  {col for col in df_canada.Borough if "Toronto" in col}
toronto = df_canada[df_canada['Borough'].isin(toronto_boroughs)][['Borough', 'Neighborhood']].reset_index(drop=True)
toronto.head(10)

Unnamed: 0,Borough,Neighborhood
0,Downtown Toronto,"Regent Park, Harbourfront"
1,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
2,Downtown Toronto,"Garden District, Ryerson"
3,Downtown Toronto,St. James Town
4,East Toronto,The Beaches
5,Downtown Toronto,Berczy Park
6,Downtown Toronto,Central Bay Street
7,Downtown Toronto,Christie
8,Downtown Toronto,"Richmond, Adelaide, King"
9,West Toronto,"Dufferin, Dovercourt Village"


We can see that the Neighborhood column might have more than one neighborhood separated by a comma. We need to have only one
neighborhood name per cell. Let's first create a __list__ of all neighborhoods in Toronto.

In [167]:
neighborhoods = [[col_0, Neighborhood.strip()] for col_0, col_1 in toronto[['Borough', 'Neighborhood']].values
                 for Neighborhood in col_1.split(',')]
neighborhoods[:10]

[['Downtown Toronto', 'Regent Park'],
 ['Downtown Toronto', 'Harbourfront'],
 ['Downtown Toronto', "Queen's Park"],
 ['Downtown Toronto', 'Ontario Provincial Government'],
 ['Downtown Toronto', 'Garden District'],
 ['Downtown Toronto', 'Ryerson'],
 ['Downtown Toronto', 'St. James Town'],
 ['East Toronto', 'The Beaches'],
 ['Downtown Toronto', 'Berczy Park'],
 ['Downtown Toronto', 'Central Bay Street']]

In [168]:
print(f"There are a total of {len(neighborhoods)} neighborhoods in the four Toronto boroughs")

There are a total of 78 neighborhoods in the four Toronto boroughs
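
The same one-neighborhood-per-row split can also be done inside pandas. The sketch below is an alternative to the list comprehension, shown on a two-row toy frame (the frame name `toy` is an assumption); it relies on `Series.str.split` and `DataFrame.explode`.

```python
import pandas as pd

# Toy frame mirroring the Borough/Neighborhood layout used above
toy = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Regent Park, Harbourfront', 'The Beaches'],
})

# Split each cell into a list, then give every list element its own row
exploded = (
    toy.assign(Neighborhood=toy['Neighborhood'].str.split(','))
       .explode('Neighborhood')
)

# Remove the whitespace left over after the commas and rebuild the index
exploded['Neighborhood'] = exploded['Neighborhood'].str.strip()
exploded = exploded.reset_index(drop=True)
print(exploded)
```

This keeps the result as a data frame, which avoids rebuilding one from a list afterwards.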


### 2. Adding Latitudes and Longitudes of Each Neighborhood

Finding latitude and longitude of Downtown Toronto:

In [169]:
address = "Downtown Toronto"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print("Latitude: ", toronto_latitude)
print("Longitude: ", toronto_longitude)

Latitude:  43.6541737
Longitude:  -79.38081164513409


Building the Toronto neighborhoods data frame:

In [170]:
toronto = pd.DataFrame(columns=['Borough','Neighborhood', 'Latitude', 'Longitude'])
toronto

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Now let's add the geospatial data for each neighborhood to the above data frame.

In [171]:
index = 0
for borough, neighborhood in neighborhoods:
    # geocode() returns None when no match is found for the query
    location = geolocator.geocode("Toronto, " + neighborhood)
    if location is not None:
        toronto.loc[index] = [borough, neighborhood, location.latitude, location.longitude]
    else:
        # keep the row, but mark the coordinates as missing
        toronto.loc[index] = [borough, neighborhood, np.nan, np.nan]
    index += 1
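
The lookup logic in the loop above can be isolated in a small helper, which makes the "no result" case explicit and testable without network access. This is only a sketch: `lookup_coordinates` and `fake_geocode` are hypothetical names, and the fake geocoder stands in for `geolocator.geocode`.

```python
import math
from types import SimpleNamespace


def lookup_coordinates(geocode, query):
    """Return (latitude, longitude), or (nan, nan) when geocode finds nothing."""
    location = geocode(query)
    if location is None:
        return (math.nan, math.nan)
    return (location.latitude, location.longitude)


def fake_geocode(query):
    # Hypothetical stand-in for geolocator.geocode; returns None for unknown queries
    known = {
        'Toronto, Regent Park': SimpleNamespace(latitude=43.660706,
                                                longitude=-79.360457),
    }
    return known.get(query)


found = lookup_coordinates(fake_geocode, 'Toronto, Regent Park')
missing = lookup_coordinates(fake_geocode, 'Toronto, Stn A PO Boxes')
print(found, missing)
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`. For many consecutive requests, geopy also provides `geopy.extra.rate_limiter.RateLimiter`, which can wrap the geocode function to space out the calls.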

In [172]:
toronto.head(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,Regent Park,43.660706,-79.360457
1,Downtown Toronto,Harbourfront,43.64008,-79.38015
2,Downtown Toronto,Queen's Park,43.659659,-79.39034
3,Downtown Toronto,Ontario Provincial Government,,
4,Downtown Toronto,Garden District,43.6565,-79.377114
5,Downtown Toronto,Ryerson,43.658469,-79.378993
6,Downtown Toronto,St. James Town,43.669403,-79.372704
7,East Toronto,The Beaches,43.671024,-79.296712
8,Downtown Toronto,Berczy Park,43.647984,-79.375396
9,Downtown Toronto,Central Bay Street,43.659756,-79.385393


Let's look at the neighborhoods for which we couldn't find geospatial data. There are only five:

In [173]:
toronto[toronto.Latitude.isna()]

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
3,Downtown Toronto,Ontario Provincial Government,,
32,East Toronto,Studio District,,
70,Downtown Toronto,Stn A PO Boxes,,
76,East Toronto,Business reply mail Processing Centre,,
77,East Toronto,South Central Letter Processing Plant Toronto,,


Some of the names above do not appear to be names of neighborhoods, so let's drop them.

In [174]:
toronto.dropna(inplace=True)
toronto.reset_index(drop=True, inplace=True)
toronto.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,Regent Park,43.660706,-79.360457
1,Downtown Toronto,Harbourfront,43.64008,-79.38015
2,Downtown Toronto,Queen's Park,43.659659,-79.39034
3,Downtown Toronto,Garden District,43.6565,-79.377114
4,Downtown Toronto,Ryerson,43.658469,-79.378993


Let's see how many unique neighborhoods there are.

In [175]:
len(toronto.Neighborhood.unique())

71

In [176]:
toronto.shape

(73, 4)

There are 71 unique neighborhoods, which means that we have two duplicates (73 - 71). Let's check the names of these neighborhoods.

In [177]:
toronto.Neighborhood.value_counts()[:5]

St. James Town             2
Lawrence Park              2
Toronto Dominion Centre    1
South Hill                 1
Harbourfront East          1
Name: Neighborhood, dtype: int64

In [178]:
toronto[toronto.Neighborhood.isin(['St. James Town', 'Lawrence Park'])]

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
5,Downtown Toronto,St. James Town,43.669403,-79.372704
31,Central Toronto,Lawrence Park,43.729199,-79.403252
39,Central Toronto,Lawrence Park,43.729199,-79.403252
68,Downtown Toronto,St. James Town,43.669403,-79.372704


Now, let's drop the duplicates using the pandas drop_duplicates method.

In [179]:
toronto.drop_duplicates(inplace=True)
toronto.shape

(71, 4)
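
For reference, here is a small sketch of how duplicates can be inspected before they are removed, using `DataFrame.duplicated` with `keep=False` so that every copy is shown. The toy frame reuses two of the rows above; the variable names are assumptions.

```python
import pandas as pd

# Toy frame with one neighborhood repeated
toy = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'Central Toronto', 'Downtown Toronto'],
    'Neighborhood': ['St. James Town', 'Lawrence Park', 'St. James Town'],
    'Latitude': [43.669403, 43.729199, 43.669403],
    'Longitude': [-79.372704, -79.403252, -79.372704],
})

# keep=False marks every member of a duplicate group, not just the later copies
dupes = toy[toy.duplicated(keep=False)]

# Drop rows that repeat a neighborhood name, keeping the first occurrence
deduped = toy.drop_duplicates(subset=['Neighborhood']).reset_index(drop=True)
print(len(dupes), deduped.shape)
```

Called without `subset`, `drop_duplicates` only removes rows that match in every column, which is what happens in the cell above.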

Now we have a total of 71 neighborhoods in the four Toronto boroughs. Let's show them on a Folium map.

In [270]:
# create a map of Toronto centered on the Downtown Toronto coordinates
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# add one marker per neighborhood to the map
for lat, lng, neighborhood, borough in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood'], toronto['Borough']):
    label = folium.Popup(borough+', '+neighborhood, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

### 3. Explore Neighborhoods in Toronto

Importing the Foursquare credentials from a local file.

In [181]:
with open('cred.csv') as file:
    # strip whitespace so a trailing newline does not end up in the last value
    cred = [c.strip() for c in file.read().split(',')]

CLIENT_ID = cred[0]
CLIENT_SECRET = cred[1]
ACCESS_TOKEN = cred[2]
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

The following function was created in one of the course labs for exploring New York neighborhoods. We can use it to explore Toronto neighborhoods.

In [182]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
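
The function above indexes straight into the JSON response, so a venue without categories or a response without groups would raise an error. The sketch below shows one possible way to parse the same structure more defensively. `extract_venues` is a hypothetical helper, and the payload is a hand-written stand-in that follows the keys used above rather than a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into a list of venue tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Some venues may carry no category; record None instead of failing
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hand-written payload for illustration only
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Sumach Espresso',
               'location': {'lat': 43.658135, 'lng': -79.359515},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.66, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venues(payload, 'Regent Park', 43.660706, -79.360457)
empty = extract_venues({}, 'Regent Park', 43.660706, -79.360457)
print(rows, empty)
```

The result of `requests.get(url).json()` could be passed as `payload`, so a neighborhood with no venues yields an empty list instead of an exception.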

Creating Toronto Neighborhoods venues data frame:

In [183]:
toronto_venues = getNearbyVenues(names=toronto['Neighborhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

In [184]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park,43.660706,-79.360457,Regent Park Aquatic Centre,43.6606,-79.361392,Pool
1,Regent Park,43.660706,-79.360457,Sumach Espresso,43.658135,-79.359515,Coffee Shop
2,Regent Park,43.660706,-79.360457,Daniels Spectrum,43.660137,-79.361808,Performing Arts Venue
3,Regent Park,43.660706,-79.360457,Thai To Go,43.663418,-79.36071,Thai Restaurant
4,Regent Park,43.660706,-79.360457,Paintbox Bistro,43.66005,-79.362855,Restaurant


In [185]:
toronto_venues.shape

(3556, 7)

Let's check how many venues were returned for each neighborhood.

In [186]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,100,100,100,100,100,100
Bathurst Quay,25,25,25,25,25,25
Berczy Park,100,100,100,100,100,100
Brockton,19,19,19,19,19,19
CN Tower,61,61,61,61,61,61
Cabbagetown,50,50,50,50,50,50
Central Bay Street,63,63,63,63,63,63
Chinatown,70,70,70,70,70,70
Christie,57,57,57,57,57,57
Church and Wellesley,75,75,75,75,75,75


#### Let's find out how many unique categories can be curated from all the returned venues

In [187]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 294 unique categories.


### 4. Analyze Each Neighborhood

First, we need to encode the category of the venues.

In [188]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = ['Neighborhood'] + [c for c in toronto_onehot.columns if c !='Neighborhood']
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [189]:
toronto_onehot.shape

(3556, 294)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [190]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
1,Bathurst Quay,0.0,0.0,0.0,0.04,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.105263,0.0,0.0,0.0,0.0,0.0,0.0
4,CN Tower,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.016393
5,Cabbagetown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.015873
7,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.042857,0.014286,0.0,0.028571,0.0,0.014286,0.0,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,...,0.0,0.0,0.017544,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.0,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.013333,0.0,0.0,0.0,0.0,0.0,0.026667


Let's confirm the new size of the data frame, which should have as many rows as the number of the neighborhoods (71).

In [191]:
toronto_grouped.shape

(71, 294)
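
To make the encoding and averaging steps concrete, here is a minimal sketch on a made-up venues table. The neighborhoods `A` and `B` and their venues are illustrative and not taken from the Foursquare results.

```python
import pandas as pd

# Toy venues table: four venues in A, two in B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bar', 'Park', 'Park'],
})

# One column per category, with 1 marking the category of each venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# insert() raises ValueError if a column called 'Neighborhood' already exists
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In this toy case, half of the venues in `A` are coffee shops, so its `Coffee Shop` value is 0.5. `DataFrame.insert` is used instead of plain column assignment because it refuses to overwrite an existing column, which guards against a venue category that happens to share the name `Neighborhood`.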

Now, let's write a function to sort the venues in descending order.

In [192]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
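
As a quick worked example, the function can be applied to a single hand-made row. The frequencies below are invented and chosen without ties, so the ordering is unambiguous.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Toy row: a neighborhood name followed by category frequencies
row = pd.Series({'Neighborhood': 'A', 'Bar': 0.25, 'Coffee Shop': 0.5, 'Park': 0.1})

top_two = list(return_most_common_venues(row, 2))
print(top_two)
```

When several categories share the same frequency (for example many zeros), their relative order depends on the sort, so the tail of a top-10 list for a sparse neighborhood should not be over-interpreted.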

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [193]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Café,Gym,Restaurant,Hotel,Italian Restaurant,Japanese Restaurant,Clothing Store,Asian Restaurant,Gastropub
1,Bathurst Quay,Coffee Shop,Café,Park,Harbor / Marina,Bank,Gym,Sushi Restaurant,Garden,Ramen Restaurant,Japanese Restaurant
2,Berczy Park,Coffee Shop,Restaurant,Hotel,Italian Restaurant,Café,Japanese Restaurant,Gastropub,Gym,Seafood Restaurant,Bakery
3,Brockton,Bar,Vietnamese Restaurant,Park,Bakery,Coffee Shop,Sake Bar,Café,Gastropub,French Restaurant,Portuguese Restaurant
4,CN Tower,Hotel,Coffee Shop,Pizza Place,Baseball Stadium,Italian Restaurant,Concert Hall,Gym,Aquarium,Restaurant,Scenic Lookout
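
The column labels above are built with a short suffix list plus a fallback, which is enough for ten columns. If more columns were ever needed, a small helper along these lines could generate the suffixes; `ordinal` is a hypothetical name, not something from the course lab.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 11th, 12th, and 13th are exceptions to the last-digit rule
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


num_top_venues = 10
columns = ['Neighborhood'] + [
    f'{ordinal(i)} Most Common Venue' for i in range(1, num_top_venues + 1)
]
print(columns)
```

For ten venues this produces the same labels as the cell above.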


### 5. Clustering the Neighborhoods

Run k-means to cluster the neighborhoods into 7 clusters.

In [251]:
# set number of clusters
k = 7

# pass the column by keyword; the positional axis argument was removed in pandas 2.0
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([6, 6, 6, 4, 6, 4, 6, 4, 4, 6])
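
The notebook fixes k at 7 without showing how that value was chosen. One common way to compare candidate values is the silhouette score. The sketch below illustrates the idea on synthetic, well-separated points rather than on `toronto_grouped_clustering`, so the data and the resulting scores say nothing about the Toronto clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs (purely illustrative)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several values of k and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score indicates tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same loop could be run on the neighborhood frequency matrix to see whether 7 is a reasonable choice.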

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [252]:
# add clustering labels
#neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_

toronto_merged = toronto.copy()
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [253]:
toronto_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,Downtown Toronto,Regent Park,43.660706,-79.360457,Coffee Shop,Thai Restaurant,Pet Store,Grocery Store,Sushi Restaurant,Beer Store,Fast Food Restaurant,Restaurant,Auto Dealership,Pub,6
1,Downtown Toronto,Harbourfront,43.64008,-79.38015,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Pizza Place,Sushi Restaurant,Steakhouse,Sports Bar,Sporting Goods Shop,6
2,Downtown Toronto,Queen's Park,43.659659,-79.39034,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Thai Restaurant,Restaurant,French Restaurant,Japanese Restaurant,Vegetarian / Vegan Restaurant,6
3,Downtown Toronto,Garden District,43.6565,-79.377114,Clothing Store,Coffee Shop,Restaurant,Lingerie Store,Japanese Restaurant,Theater,Movie Theater,Electronics Store,Bookstore,Fast Food Restaurant,4
4,Downtown Toronto,Ryerson,43.658469,-79.378993,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Diner,Italian Restaurant,Spa,Hotel,Bubble Tea Shop,Sandwich Place,6


Finally, let's visualize the resulting clusters.

In [271]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters: K=7
#x = np.arange(k)
#colors_array = cm.gist_rainbow(np.linspace(0, 1, 7))
#rainbow = [colors.rgb2hex(i) for i in colors_array]
color_map = {0:'yellow', 1:'cyan', 2:'green', 3:'orange', 4:'blue', 5: 'purple', 6:'red'}

# add markers to the map
markers_colors = []
for lat, lon, borough, neighborhood, cluster in toronto_merged[['Latitude', 'Longitude', 'Borough', 'Neighborhood', 'Cluster Labels']].values:
    label = folium.Popup(f'{borough}, {neighborhood}, Cluster: {cluster}', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=None,
        fill=True,
        fill_color=color_map[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
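
The color map above is written out by hand for seven clusters. If k changes, the colors can instead be generated from a Matplotlib colormap, similar to the commented-out lines in the cell. This sketch assumes the `rainbow` colormap; any other colormap would work the same way.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 7  # number of clusters, matching the value used above

# Sample k evenly spaced colors from the colormap and convert them to hex strings
rgba_values = cm.rainbow(np.linspace(0, 1, k))
hex_colors = [colors.rgb2hex(rgba) for rgba in rgba_values]

# Map each cluster label to its own color
color_map = dict(enumerate(hex_colors))
print(color_map)
```

The resulting dictionary has the same shape as the hand-written one, so `color_map[cluster]` continues to work unchanged.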

### 6. Clusters Analysis

Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. 

#### Cluster 0

In [260]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
33,Davisville North,Italian Restaurant,Sushi Restaurant,Coffee Shop,Indian Restaurant,Pub,Park,Convenience Store,Deli / Bodega,Irish Pub,Mexican Restaurant
36,High Park,Convenience Store,Pet Store,Mexican Restaurant,Gym,Pub,Pizza Place,Pool,Creperie,Fish Market,Fish & Chips Shop
45,Davisville,Italian Restaurant,Sushi Restaurant,Coffee Shop,Indian Restaurant,Pub,Park,Convenience Store,Deli / Bodega,Irish Pub,Mexican Restaurant
50,Moore Park,Playground,Trail,Gym,Dog Run,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room
56,Rathnelly,Mexican Restaurant,Italian Restaurant,Park,French Restaurant,Café,Shoe Repair,BBQ Joint,Coffee Shop,Liquor Store,Pub


#### Cluster 1

In [261]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
66,Island airport,Airport,Airport Terminal,Yoga Studio,Farmers Market,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space


#### Cluster 2

In [262]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Toronto Islands,Café,Harbor / Marina,Park,Music Venue,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Yoga Studio
25,Parkdale Village,Park,Gym / Fitness Center,Light Rail Station,Lake,Building,Bus Stop,Beach,American Restaurant,Trail,Gas Station
49,Swansea,Park,Pilates Studio,Dance Studio,Yoga Studio,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space


#### Cluster 3

In [264]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
58,Forest Hill SE,Playground,Park,Bank,Flea Market,Fish Market,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Donut Shop
67,Rosedale,Playground,Bike Trail,Park,Falafel Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space


#### Cluster 4

In [266]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Garden District,Clothing Store,Coffee Shop,Restaurant,Lingerie Store,Japanese Restaurant,Theater,Movie Theater,Electronics Store,Bookstore,Fast Food Restaurant
6,The Beaches,Beach,Bar,Park,Breakfast Spot,Thai Restaurant,Japanese Restaurant,Liquor Store,BBQ Joint,Bakery,Tea Room
9,Christie,Korean Restaurant,Coffee Shop,Grocery Store,Ice Cream Shop,Indian Restaurant,Cocktail Bar,Café,Sandwich Place,Dessert Shop,Mexican Restaurant
10,Richmond,Coffee Shop,Cosmetics Shop,Arts & Crafts Store,Café,Hotel,Fast Food Restaurant,Sandwich Place,Beer Bar,Speakeasy,Bistro
13,Dufferin,Bar,Bakery,Coffee Shop,Café,Vietnamese Restaurant,Sandwich Place,Mexican Restaurant,Beer Store,Cocktail Bar,Restaurant
14,Dovercourt Village,Café,Pizza Place,Coffee Shop,Restaurant,Park,Bar,Brazilian Restaurant,Fast Food Restaurant,Farmers Market,Filipino Restaurant
18,Little Portugal,Bar,Café,Coffee Shop,Korean Restaurant,Restaurant,Cocktail Bar,Bakery,Park,Athletics & Sports,Wine Bar
20,The Danforth West,Coffee Shop,Pharmacy,Bus Line,Grocery Store,Pizza Place,Construction & Landscaping,Bakery,Metro Station,Café,Mexican Restaurant
21,Riverdale,Vietnamese Restaurant,Bakery,Chinese Restaurant,Fast Food Restaurant,Bar,Grocery Store,Light Rail Station,Asian Restaurant,Gym / Fitness Center,Baseball Field
24,Brockton,Bar,Vietnamese Restaurant,Park,Bakery,Coffee Shop,Sake Bar,Café,Gastropub,French Restaurant,Portuguese Restaurant


#### Cluster 5

In [267]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
35,Forest Hill Road Park,Trail,Skating Rink,Park,Gym / Fitness Center,Ethiopian Restaurant,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room


#### Cluster 6

In [269]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]-1))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Regent Park,Coffee Shop,Thai Restaurant,Pet Store,Grocery Store,Sushi Restaurant,Beer Store,Fast Food Restaurant,Restaurant,Auto Dealership,Pub
1,Harbourfront,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Pizza Place,Sushi Restaurant,Steakhouse,Sports Bar,Sporting Goods Shop
2,Queen's Park,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Thai Restaurant,Restaurant,French Restaurant,Japanese Restaurant,Vegetarian / Vegan Restaurant
4,Ryerson,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Diner,Italian Restaurant,Spa,Hotel,Bubble Tea Shop,Sandwich Place
5,St. James Town,Coffee Shop,Café,Pizza Place,Grocery Store,Diner,Library,Bistro,Restaurant,Sandwich Place,Bar
7,Berczy Park,Coffee Shop,Restaurant,Hotel,Italian Restaurant,Café,Japanese Restaurant,Gastropub,Gym,Seafood Restaurant,Bakery
8,Central Bay Street,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Bookstore,Miscellaneous Shop,Japanese Restaurant,Sushi Restaurant,Portuguese Restaurant
11,Adelaide,Coffee Shop,Café,Gym,Restaurant,Hotel,Italian Restaurant,Japanese Restaurant,Clothing Store,Asian Restaurant,Gastropub
12,King,Coffee Shop,Restaurant,Hotel,Gastropub,Gym,Café,Seafood Restaurant,Japanese Restaurant,Italian Restaurant,American Restaurant
15,Harbourfront East,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Pizza Place,Sushi Restaurant,Steakhouse,Sports Bar,Sporting Goods Shop
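
Reading seven separate tables can be slow, so a compact per-cluster summary may help. The sketch below works on a toy frame with the same column names as `toronto_merged`; the rows are invented. It reports the size of each cluster and its most frequent top venue.

```python
import pandas as pd

# Toy stand-in for toronto_merged with only the columns needed here
merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D', 'E'],
    'Cluster Labels': [6, 6, 6, 4, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Hotel', 'Bar', 'Bar'],
})

# Count the neighborhoods and take the most frequent top venue in each cluster
summary = merged.groupby('Cluster Labels').agg(
    n_neighborhoods=('Neighborhood', 'count'),
    top_venue=('1st Most Common Venue', lambda s: s.value_counts().idxmax()),
)
print(summary)
```

Applied to `toronto_merged`, a table like this would show at a glance which clusters are dominated by coffee shops and which contain only one or two neighborhoods.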
