<h1 style="text-align:center">Applied Data Science Capstone Project</h1>

### Module 3 Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

* The aim of this project is to scrape the Wikipedia page for the postal codes, boroughs and neighborhoods of Toronto, Canada. 
* Data is then cleaned and processed for the clustering. 
* The clustering is executed with K Means and the clusters are plotted using the Folium Library.

All the 3 tasks of web scraping, cleaning and clustering are implemented in the same notebook for the ease of evaluation.

### Part 1 instructions & requirements 
1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11  in the above table. If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough.
5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [34]:
# installing & importing the required libraries

!pip install beautifulsoup4

import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner

from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


In [None]:
# scraping the wikipedia page 

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
toronto_table = pd.read_html(url,header=0,flavor='html5lib')[0]
print(toronto_table.head())

1. Define dataframe 

In [49]:
column_names = ['PostalCode','Borough', 'Neighborhood']
neighborhoods = pd.DataFrame(columns=column_names)

In [50]:
clmn = toronto_table.columns.tolist()
# Iterating the list of column names
for i in clmn:
    PostalCode = i[:3] # parse the zipcode prefix
    length=len(i)
    Rtlen = length - 3 # string length after zipcode prefix  
    BandN = str(i[-Rtlen:])
    l1 = str(i[-Rtlen:]).split('(', 1)  #split the borough from neighborhood
    Borough = l1[0]
    if len(l1)>1 :
        length=len(l1[1])-1
        l2=str(l1[1]).split(')',1) #remove final parenthisis from neighborhood string
        Neighborhood = l2[0]
        #print( PostalCode + " " + Borough + " " + Neighborhood)     
    else:
        Neighborhood=Borough
    df2 = (pd.DataFrame.from_records([{'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighborhood}]))
    #print(df2)
    neighborhoods = pd.concat([neighborhoods,df2])
print(neighborhoods.head())

  PostalCode           Borough                Neighborhood
0        M1A      Not assigned                Not assigned
0        M2A      Not assigned                Not assigned
0        M3A        North York                   Parkwoods
0        M4A        North York            Victoria Village
0        M5A  Downtown Toronto  Regent Park / Harbourfront


In [51]:
# Convert each column of dataframe to it's own row and parse it into 3 columns
colcount=0
rowcount = 0
#for index, row in Toronto_table.iterrows(): 
for rowcount in range(len(toronto_table)):
    for colcount in range(len(toronto_table.columns)):
        df1=toronto_table.iloc[rowcount:rowcount+1,colcount:colcount+1] 
        i=str(df1.iloc[0,0])
        #print(i)
        PostalCode = i[:3] # parse the zipcode prefix
        #print(PostalCode)
        length=len(i)
        Rtlen = length - 3 # string length after zipcode prefix  
        BandN = str(i[-Rtlen:])
        l1 = str(i[-Rtlen:]).split('(', 1)  #split the borough from neighborhood
        Borough = l1[0]
        if len(l1)>1 :
            length=len(l1[1])-1
            l2=str(l1[1]).split(')',1) #remove final parenthisis from neighborhood strig
            Neighborhood = l2[0]
            #print( PostalCode + " " + Borough + " " + Neighborhood)     
        else:
            Neighborhood=Borough
        df2 = (pd.DataFrame.from_records([{'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighborhood}]))
        #print(df2)
        neighborhoods = pd.concat([neighborhoods,df2])
neighborhoods.shape

(180, 3)

2. Process only the cells with borough assigned & 3. One borough can have multiple neighborhoods

In [52]:
# Extracting Table that has an assigned borough
assigned = neighborhoods[neighborhoods['Borough'] != 'Not assigned']
# remove trailing parenthesis from Neighborhood

# Extracting Table that has not assigned borough
not_assigned = neighborhoods[neighborhoods['Borough'] == 'Not assigned']

assigned.groupby(['PostalCode','Borough'])['Neighborhood'].apply(','.join).reset_index()
assigned = assigned[['PostalCode','Borough','Neighborhood']].drop_duplicates()

print(neighborhoods[neighborhoods.PostalCode == 'M5A'] )

  PostalCode           Borough                Neighborhood
0        M5A  Downtown Toronto  Regent Park / Harbourfront


5. Print the number of rows of your dataframe

In [53]:
assigned.shape

(103, 3)

### Part 2 instructions & requirements

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

We will use the Geocoder Python package: https://geocoder.readthedocs.io/index.html

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the dataframe

In [54]:
# installing additional libraries
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print("Geopy installed")

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Geopy installed


In [57]:
# create a dataframe for latitude & longitude info

df_data = pd.read_csv(r"Geospatial_Coordinates.csv")
print(df_data.head())

  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476


In [61]:
# define a new dataframe
column_names = ['PostalCode','Borough', 'Neighborhood', 'Latitude', 'Longitude']
df_data.columns = ['PostalCode', 'Latitude', 'Longitude']

In [62]:
# merging with inner join because records without latitude / Longitude are of no use to the below analyses
merged_neighborhood = neighborhoods.merge(df_data, how='inner', left_on=['PostalCode'], right_on=['PostalCode']) 
merged_neighborhood.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PostalCode    103 non-null    object 
 1   Borough       103 non-null    object 
 2   Neighborhood  103 non-null    object 
 3   Latitude      103 non-null    float64
 4   Longitude     103 non-null    float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB


In [63]:
print(merged_neighborhood[merged_neighborhood['PostalCode'] == 'M5A'])


  PostalCode           Borough                Neighborhood  Latitude  \
2        M5A  Downtown Toronto  Regent Park / Harbourfront  43.65426   

   Longitude  
2 -79.360636  


In [64]:
# Display row counts and column counts in merged dataframe
merged_neighborhood.shape

(103, 5)

### Part 3 instructions & requirements 
Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 

Just make sure to add enough Markdown cells to explain what you decided to do and to report any observations you make & to generate maps to visualize your neighborhoods and how they cluster together. 

In [66]:
# install additional libraries 

import json # library to handle JSON files
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes  # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from sklearn.cluster import KMeans

##### Cluster Analysis, working only with boroughs containing the word 'Toronto'

In [65]:
# subset dataframe to only ones with Toronto in borough column
toronto_neighborhood = merged_neighborhood[merged_neighborhood['Borough'].str.contains("Toronto", case=False)]
print(toronto_neighborhood.head())

   PostalCode           Borough                Neighborhood   Latitude  \
2         M5A  Downtown Toronto  Regent Park / Harbourfront  43.654260   
9         M5B  Downtown Toronto    Garden District, Ryerson  43.657162   
15        M5C  Downtown Toronto              St. James Town  43.651494   
19        M4E      East Toronto                 The Beaches  43.676357   
20        M5E  Downtown Toronto                 Berczy Park  43.644771   

    Longitude  
2  -79.360636  
9  -79.378937  
15 -79.375418  
19 -79.293031  
20 -79.373306  


In [70]:
# get the latitude and longitude of toronto to draw a map 

address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create the map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_neighborhood['Latitude'], toronto_neighborhood['Longitude'], toronto_neighborhood['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

print("Folium is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.")
map_toronto

Folium is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.


Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.    



In [71]:
# define Foursquare Credentials and Version
VERSION = '20210319' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 1000
CLIENT_ID = 'XUAJQ5DWX0MO0U55Q11RWZK5HNHYIEKZDUK1N0EYM5441MBA' ## Foursquare ID removed; to run, enter your own FourSquare Client ID
CLIENT_SECRET = 'RI1I0XBLONHSSQRCJUXAXT5VSHNLP4WAIDYUHWRHAS4T1ZGH' ## Foursquare Client Secret removed; to run, enter your own FourSquare Client Secret

In [74]:
#Let's explore a single neighborhood in our dataframe.
#Get the neighborhood's name.
print(toronto_neighborhood.head())

   PostalCode           Borough                Neighborhood   Latitude  \
2         M5A  Downtown Toronto  Regent Park / Harbourfront  43.654260   
9         M5B  Downtown Toronto    Garden District, Ryerson  43.657162   
15        M5C  Downtown Toronto              St. James Town  43.651494   
19        M4E      East Toronto                 The Beaches  43.676357   
20        M5E  Downtown Toronto                 Berczy Park  43.644771   

    Longitude  
2  -79.360636  
9  -79.378937  
15 -79.375418  
19 -79.293031  
20 -79.373306  


In [75]:
#Now, let's get the top 100 venues that are in the neighborhood of The Beaches, East Toronto witin 500 meters.
#First, create the GET request URL. Name your URL url.
neighborhood_latitude = 43.676357
neighborhood_longitude = -79.293031
radius = 500
LIMIT = 100

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        lat, 
        lng, 
        VERSION, 
        radius, 
        LIMIT)

In [None]:
#Send the GET request and examine the resutls
results = requests.get(url).json()
print(results)

In [None]:
#From the Foursquare lab, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

venues = results['response']['groups'][0]['items']
venues

In [78]:
#Now we are ready to clean the json and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  """


Unnamed: 0,name,categories,lat,lng
0,Rorschach Brewing Co.,Brewery,43.663483,-79.319824
1,Leslieville Farmers Market,Farmers Market,43.664901,-79.319784
2,The Sidekick,Comic Shop,43.664484,-79.325162
3,Chino Locos,Burrito Place,43.664653,-79.325584
4,Queen Margherita Pizza,Pizza Place,43.664685,-79.324164


In [79]:
# how many venues are returned ? 
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))


16 venues were returned by Foursquare.


In [82]:
# Analyze Each Neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

toronto_neighborhood.head

<bound method NDFrame.head of     PostalCode                                            Borough  \
2          M5A                                   Downtown Toronto   
9          M5B                                   Downtown Toronto   
15         M5C                                   Downtown Toronto   
19         M4E                                       East Toronto   
20         M5E                                   Downtown Toronto   
24         M5G                                   Downtown Toronto   
25         M6G                                   Downtown Toronto   
30         M5H                                   Downtown Toronto   
31         M6H                                       West Toronto   
35         M4J                              East YorkEast Toronto   
36         M5J                                   Downtown Toronto   
37         M6J                                       West Toronto   
41         M4K                                       East Toronto   
42  

In [90]:
toronto_venues = getNearbyVenues(names=merged_neighborhood['Neighborhood'],
                                   latitudes=merged_neighborhood['Latitude'],
                                   longitudes=merged_neighborhood['Longitude']
                                  )
print(toronto_venues.head())

       Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0         Parkwoods              43.753259              -79.329656   
1         Parkwoods              43.753259              -79.329656   
2  Victoria Village              43.725882              -79.315572   
3  Victoria Village              43.725882              -79.315572   
4  Victoria Village              43.725882              -79.315572   

                    Venue  Venue Latitude  Venue Longitude  \
0         Brookbanks Park       43.751976       -79.332140   
1           Variety Store       43.751974       -79.333114   
2  Victoria Village Arena       43.723481       -79.315635   
3               Portugril       43.725819       -79.312785   
4             Tim Hortons       43.725517       -79.313103   

          Venue Category  
0                   Park  
1      Food & Drink Shop  
2           Hockey Arena  
3  Portuguese Restaurant  
4            Coffee Shop  


In [84]:
print(toronto_venues.shape)

(2127, 7)


In [85]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [86]:
#Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
toronto_venues.groupby('Neighborhood').count()
print('There are {} unique venue categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 271 unique venue categories.


In [88]:
#print each neighborhood along with the top 5 most common venues

num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0             Breakfast Spot  0.25
1                     Lounge  0.25
2  Latin American Restaurant  0.25
3               Skating Rink  0.25
4  Middle Eastern Restaurant  0.00


----Alderwood / Long Branch----
            venue  freq
0     Pizza Place  0.25
1     Coffee Shop  0.12
2    Skating Rink  0.12
3  Sandwich Place  0.12
4             Pub  0.12


----Bathurst Manor / Wilson Heights / Downsview North----
                       venue  freq
0                       Bank  0.09
1                Coffee Shop  0.09
2                   Pharmacy  0.04
3                      Diner  0.04
4  Middle Eastern Restaurant  0.04


----Bayview Village----
                 venue  freq
0                 Café  0.25
1   Chinese Restaurant  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4        Metro Station  0.00


----Bedford Park / Lawrence Manor East----
                venue  freq
0      Sandwich Place  0.09
1  Italian Restaurant  

In [91]:
# sort the venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 10 venues for each neighborhood.

# loops through grouped venues 
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0             Breakfast Spot  0.25
1                     Lounge  0.25
2  Latin American Restaurant  0.25
3               Skating Rink  0.25
4  Middle Eastern Restaurant  0.00


----Alderwood / Long Branch----
            venue  freq
0     Pizza Place  0.25
1     Coffee Shop  0.12
2    Skating Rink  0.12
3  Sandwich Place  0.12
4             Pub  0.12


----Bathurst Manor / Wilson Heights / Downsview North----
                       venue  freq
0                       Bank  0.09
1                Coffee Shop  0.09
2                   Pharmacy  0.04
3                      Diner  0.04
4  Middle Eastern Restaurant  0.04


----Bayview Village----
                 venue  freq
0                 Café  0.25
1   Chinese Restaurant  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4        Metro Station  0.00


----Bedford Park / Lawrence Manor East----
                venue  freq
0      Sandwich Place  0.09
1  Italian Restaurant  

In [93]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.head())

                                        Neighborhood 1st Most Common Venue  \
0                                          Agincourt                Lounge   
1                            Alderwood / Long Branch           Pizza Place   
2  Bathurst Manor / Wilson Heights / Downsview North                  Bank   
3                                    Bayview Village                  Café   
4                 Bedford Park / Lawrence Manor East    Italian Restaurant   

       2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue  \
0  Latin American Restaurant        Breakfast Spot          Skating Rink   
1                   Pharmacy                   Pub                   Gym   
2                Coffee Shop   Fried Chicken Joint    Frozen Yogurt Shop   
3        Japanese Restaurant                  Bank    Chinese Restaurant   
4                Coffee Shop        Sandwich Place          Liquor Store   

  5th Most Common Venue 6th Most Common Venue        7th Most Common Venue

Cluster Neighborhoods

In [95]:
#k-means to cluster the neighborhood into 5 clusters.

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [96]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = merged_neighborhood

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Women's Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Portuguese Restaurant,Hockey Arena,French Restaurant,Intersection,Coffee Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Electronics Store
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,0.0,Coffee Shop,Pub,Bakery,Park,Restaurant,Breakfast Spot,Café,Theater,Gym / Fitness Center,Farmers Market
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,0.0,Coffee Shop,Diner,Sushi Restaurant,Park,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant


In [97]:
#create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Agincourt,Lounge,Latin American Restaurant,Breakfast Spot,Skating Rink,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dim Sum Restaurant,Ethiopian Restaurant
1,0,Alderwood / Long Branch,Pizza Place,Pharmacy,Pub,Gym,Sandwich Place,Coffee Shop,Skating Rink,Dessert Shop,Dim Sum Restaurant,Diner
2,0,Bathurst Manor / Wilson Heights / Downsview North,Bank,Coffee Shop,Fried Chicken Joint,Frozen Yogurt Shop,Restaurant,Supermarket,Sushi Restaurant,Middle Eastern Restaurant,Ice Cream Shop,Deli / Bodega
3,0,Bayview Village,Café,Japanese Restaurant,Bank,Chinese Restaurant,Women's Store,Dim Sum Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant
4,0,Bedford Park / Lawrence Manor East,Italian Restaurant,Coffee Shop,Sandwich Place,Liquor Store,Juice Bar,Indian Restaurant,Restaurant,Sushi Restaurant,Café,Pub
...,...,...,...,...,...,...,...,...,...,...,...,...
90,0,Willowdale,Ramen Restaurant,Pizza Place,Coffee Shop,Café,Sushi Restaurant,Sandwich Place,Restaurant,Grocery Store,Electronics Store,Discount Store
91,1,Willowdale / Newtonbrook,Park,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
92,0,Woburn,Coffee Shop,Soccer Field,Korean BBQ Restaurant,Women's Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
93,0,Woodbine Heights,Spa,Athletics & Sports,Curling Ice,Bus Stop,Skating Rink,Beer Store,Intersection,Park,General Entertainment,Department Store


In [98]:
# add clustering labels
neighborhoods_venues_sorted = neighborhoods_venues_sorted.drop('Cluster Labels', 1)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = merged_neighborhood

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Women's Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Portuguese Restaurant,Hockey Arena,French Restaurant,Intersection,Coffee Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Electronics Store
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.654260,-79.360636,0.0,Coffee Shop,Pub,Bakery,Park,Restaurant,Breakfast Spot,Café,Theater,Gym / Fitness Center,Farmers Market
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,0.0,Coffee Shop,Diner,Sushi Restaurant,Park,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North,43.653654,-79.506944,0.0,River,Pool,Smoke Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,0.0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Gay Bar,Men's Store,Fast Food Restaurant,Mediterranean Restaurant,Yoga Studio,Café
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L,43.662744,-79.321558,0.0,Gym / Fitness Center,Auto Workshop,Comic Shop,Pizza Place,Restaurant,Butcher,Burrito Place,Brewery,Skate Park,Smoke Shop
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...,43.636258,-79.498509,4.0,Baseball Field,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Women's Store,Farm


In [99]:
# visualize the resulting clusters

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=rainbow[cluster-1],
        fill=True,
        #fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine Clusters

In [101]:
#examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster
#Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Pizza Place,Portuguese Restaurant,Hockey Arena,French Restaurant,Intersection,Coffee Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Electronics Store
2,Downtown Toronto,0.0,Coffee Shop,Pub,Bakery,Park,Restaurant,Breakfast Spot,Café,Theater,Gym / Fitness Center,Farmers Market
3,North York,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store
4,Queen's Park,0.0,Coffee Shop,Diner,Sushi Restaurant,Park,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant
6,Scarborough,0.0,Fast Food Restaurant,Women's Store,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
97,Downtown Toronto,0.0,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Gym,Salad Place,Asian Restaurant,Seafood Restaurant,Steakhouse
98,Etobicoke,0.0,River,Pool,Smoke Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
99,Downtown Toronto,0.0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Gay Bar,Men's Store,Fast Food Restaurant,Mediterranean Restaurant,Yoga Studio,Café
100,East TorontoBusiness reply mail Processing Cen...,0.0,Gym / Fitness Center,Auto Workshop,Comic Shop,Pizza Place,Restaurant,Butcher,Burrito Place,Brewery,Skate Park,Smoke Shop


In [102]:
#Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1.0,Park,Food & Drink Shop,Women's Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
52,North York,1.0,Park,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
61,Central Toronto,1.0,Bus Line,Park,Swim School,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Women's Store
64,York,1.0,Park,Jewelry Store,Women's Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
66,North York,1.0,Convenience Store,Park,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Women's Store
68,Central Toronto,1.0,Jewelry Store,Trail,Park,Sushi Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
83,Central Toronto,1.0,Lawyer,Trail,Park,Summer Camp,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
85,Scarborough,1.0,Playground,Intersection,Park,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
91,Downtown Toronto,1.0,Park,Playground,Trail,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center


In [103]:
#Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Scarborough,2.0,Bar,Women's Store,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Farm
94,EtobicokeNorthwest,2.0,Bar,Rental Car Location,Drugstore,Truck Stop,Women's Store,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant


In [104]:
#Cluster 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Scarborough,3.0,Playground,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore


In [105]:
#Cluster 5
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,4.0,Baseball Field,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Women's Store,Farm
101,Etobicoke,4.0,Baseball Field,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Women's Store,Farm
