# Segmenting and clustering the neighborhoods in Toronto

This notebook is for the Coursera course project: Segmenting and clustering the neighborhoods in Toronto.

The data is scraped from Wikipedia; the link is below:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The project proceeds in three steps:
- Step 1 - Obtain the city data and transform it into a pandas data frame;
- Step 2 - Get the latitude and longitude coordinates of each neighborhood so that the Foursquare location data can be used;
- Step 3 - Explore and cluster the neighborhoods in Toronto.

## 1. Obtain the Toronto city data from Wikipedia and transform it into a dataframe
The requirements for the dataframe:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
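
The filtering and fill-in rules in the list above can be sketched on a tiny hand-made frame. The three rows below are illustrative stand-ins rather than scraped data, and the variable names `raw` and `clean` are assumptions introduced only for this example.

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values only).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postalcode", "Borough", "Neighborhood"],
)

# Keep only the rows that have an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# An unassigned neighborhood takes the name of its borough.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

clean = clean.reset_index(drop=True)
print(clean)
```

Working on a `.copy()` of the filtered frame keeps the later assignment from touching a view of `raw`.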

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Open the wiki page; 'html_source_code' holds the HTTP response that contains the page's HTML.
html_source_code = urlopen(url)

# Use Beautiful Soup to parse the HTML and store the result in 'soup'.
soup = BeautifulSoup(html_source_code, 'html.parser')

We now have the source data scraped from Wikipedia, which contains everything on the web page. We only need the 'PostalCode', 'Borough', and 'Neighborhood' of Toronto. Printing and inspecting the above 'soup' shows that this information is stored in a 'table' element (the printed output is omitted here because it takes up a lot of space in the notebook).

In [3]:
# Extract the table data.
table = soup.find('table')

# Build the dataframe with three columns, i.e. 'Postalcode', 'Borough', and 'Neighborhood'.
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

# Use a loop to visit the table rows ('tr') and table data ('td') of each row, and store the data into the dataframe.
for tr_cell in table.find_all('tr'): 
    row_data=[] 
    for td_cell in tr_cell.find_all('td'): 
        row_data.append(td_cell.text.strip()) 
    if len(row_data)==3: 
        df.loc[len(df)] = row_data

print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now that we have the data, we need to process it according to the aforementioned requirements.

In [4]:
# Drop the rows with 'Not assigned' for column 'Borough'.
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace = True)

print(df.shape)
df.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [5]:
# For 'Neighborhood' values that are 'Not assigned', replace them with the 'Borough' value.
condition = df['Neighborhood'] == 'Not assigned'
df.loc[condition, 'Neighborhood'] = df.loc[condition, 'Borough']

# Reset the index as the previous dataframe still shows the original indexes.
df.reset_index(drop = True, inplace = True) 

# See the dataframe size and first 5 rows.
print(df.shape) 
df.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [6]:
# Combine the 'Neighborhood' under the same postal code.
# Check the 'duplicated' postal codes.
df.groupby('Postalcode').count()

Unnamed: 0_level_0,Borough,Neighborhood
Postalcode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,1,1
M1C,1,1
M1E,1,1
M1G,1,1
M1H,1,1
...,...,...
M9N,1,1
M9P,1,1
M9R,1,1
M9V,1,1


The above results show that the latest data on Wikipedia (as of January 2021, when this notebook was generated) contains no duplicated postal codes in the 'Postalcode' column. This was also validated by checking the first dataframe, i.e. before the 'Not assigned' rows were dropped. So no work is needed to combine rows here.
The final dataframe is shown below.

In [7]:
print('Size of the dataframe:',df.shape)
df.head(12)

Size of the dataframe: (103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
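
Had the table still listed a postal code more than once, the rows could be detected and collapsed along the following lines. This is a minimal sketch on a made-up three-row frame (`toy`), using the M5A example from the requirements; it is not part of the notebook's actual run.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Postalcode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# True if any postal code appears more than once.
has_duplicates = toy["Postalcode"].duplicated().any()

# Collapse to one row per postal code, joining the neighborhoods with a comma.
# sort=False keeps the postal codes in their order of first appearance.
combined = (
    toy.groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(has_duplicates)
print(combined)
```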


***This completes the first part of the project.***

## 2. Get the latitude and the longitude coordinates of each neighborhood
Use the Geocoder Python package to get the latitude and longitude coordinates of each neighborhood.
https://geocoder.readthedocs.io/index.html.

The issue with this package is that a call for coordinates sometimes returns no result (the return value will be 'None'), so you may need to repeat the call a few times to get the wanted response. The course instructions suggest handling this by retrying the call in a loop.

In [8]:
import geocoder

# Define a function to get the coordinates.
def get_coords(postal_code, max_attempts=10):
    # Retry a limited number of times; g.latlng is None when a call fails.
    for _ in range(max_attempts):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        if g.latlng is not None:
            return g.latlng

    # Give up instead of looping forever when the service never answers.
    return None

In [None]:
# Get the latitude and longitude coordinates.
coords = []

for code in df['Postalcode']:
    coords.append(get_coords(code))

print(coords)

I tried several times to get the neighborhoods' coordinates through the geocoder package and the code above, but without success: every attempt ran for a long time without a response. So I decided to use the backup option and take the coordinates directly from the data provided here:
http://cocl.us/Geospatial_data 

In [9]:
filename = r'C:\Users\JX\Desktop\Geospatial_Coordinates.csv' # I have downloaded and saved the data locally.
df_coords = pd.read_csv(filename)
print(df_coords.shape)
df_coords.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Next, we merge the two dataframes on the 'Postalcode' column.

In [10]:
df_coords.columns=['Postalcode','Latitude','Longitude'] # Notice it's 'Postal Code' in the df_coords, so unify it.

df = pd.merge(df, df_coords[['Postalcode','Latitude', 'Longitude']], on='Postalcode')

df

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
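
As an alternative to renaming the columns before merging, `DataFrame.merge` can take differently named keys and report rows that found no match. The sketch below uses two small hand-made frames; the postal code `M9Z` is a hypothetical value added only to show an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    "Postalcode": ["M3A", "M4A", "M9Z"],  # 'M9Z' is hypothetical
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    how="left",                # keep every row of the left frame
    left_on="Postalcode",
    right_on="Postal Code",
    validate="one_to_one",     # raise if a key is repeated on either side
    indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
).drop(columns="Postal Code")

# Postal codes for which no coordinates were found.
missing = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(missing)
```

A left join with `indicator=True` makes silently dropped rows visible, which the default inner join would hide.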


***This completes the second part of the project.***

## 3. Explore and cluster the neighborhoods in Toronto




### 3.1 Import the libraries that will be used

In [11]:
import numpy as np
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### 3.2 Explore the neighborhoods using the Foursquare API

**Define the Foursquare Credentials and Version.**

In [12]:
CLIENT_ID = 'YOUR_CLIENT_ID' # Foursquare ID (use your own)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # Foursquare Secret (use your own)
VERSION = '20210131' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

**Get the top 100 venues that are in the neighborhood within a radius of 500 meters.**

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius = 500):
    venues_list=[]
    LIMIT = 100
    for name, lat, lng in zip(names, latitudes, longitudes):       
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
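
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, which raises an error if a response has no groups or a venue has no category. A more defensive version of the flattening step might look like the sketch below. The helper name `extract_venues` and the `sample` payload are assumptions: the payload is hand-written and only mimics the keys accessed in `getNearbyVenues`, not a real API response.

```python
def extract_venues(payload, neighborhood):
    """Flatten one explore-style payload into (neighborhood, name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload for illustration only.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Brookbanks Park",
                           "location": {"lat": 43.751976, "lng": -79.33214},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "Uncategorised Place",
                           "location": {"lat": 43.75, "lng": -79.33},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(sample, "Parkwoods")
empty = extract_venues({"response": {}}, "Parkwoods")
print(rows)
print(empty)
```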

In [14]:
neighborhood_names = df['Neighborhood']
neighborhood_latitudes = df['Latitude'] 
neighborhood_longitudes = df['Longitude'] 

toronto_venues = getNearbyVenues(neighborhood_names, neighborhood_latitudes, neighborhood_longitudes)

toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


**Check the size and first 10 rows of the dataframe.**

In [16]:
print('Dataframe size:',toronto_venues.shape)
toronto_venues.head(10)

Dataframe size: (2115, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
5,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
6,Victoria Village,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
7,Victoria Village,43.725882,-79.315572,Pizza Nova,43.725824,-79.31286,Pizza Place
8,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
9,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery


**Check how many venues were returned for each neighborhood.**

In [17]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",26,26,26,26,26,26
...,...,...,...,...,...,...
"Willowdale, Willowdale East",34,34,34,34,34,34
"Willowdale, Willowdale West",6,6,6,6,6,6
Woburn,3,3,3,3,3,3
Woodbine Heights,6,6,6,6,6,6


**Check how many unique categories can be curated from all the returned venues.**

In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 262 unique categories.


### 3.3 Analyze Each Neighborhood

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print('Dataframe size:',toronto_onehot.shape)
toronto_onehot.head(10)

Dataframe size: (2115, 262)


Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
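
In the output above the first column is 'Yoga Studio' rather than 'Neighborhood', and the frame has 262 columns although 262 categories plus a 'Neighborhood' column would give 263. One plausible explanation, not verified against the scraped data, is that one of the venue categories is itself named 'Neighborhood', so the assignment overwrote that dummy column instead of appending a new one, and the "move the last column to the front" step then moved a different column. The toy frame below reproduces the effect under that assumption and moves the column by name instead of by position.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    # Assumed for illustration: a venue category literally named 'Neighborhood'.
    "Venue Category": ["Yoga Studio", "Neighborhood", "Park"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
columns_before = onehot.shape[1]

# This overwrites the existing 'Neighborhood' dummy column rather than adding one.
onehot["Neighborhood"] = venues["Neighborhood"]
columns_after = onehot.shape[1]

# Select the column by name so the result does not depend on its position.
ordered = ["Neighborhood"] + [c for c in onehot.columns if c != "Neighborhood"]
onehot = onehot[ordered]

print(columns_before, columns_after)
print(onehot.columns.tolist())
```

Moving the column by name fixes the ordering but still loses the overwritten dummy column; giving the key column a name that cannot collide with a category would avoid that as well.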


**Group rows by neighborhood, taking the mean of the frequency of occurrence of each category, and check the dataframe size.**

In [20]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

print('Dataframe size:', toronto_grouped.shape)
toronto_grouped

Dataframe size: (95, 262)


Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.038462
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.000000
91,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
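
Why the mean of one-hot columns can be read as a frequency is easiest to see on a small example: each venue contributes a 1 in exactly one category column, so the per-neighborhood mean is the share of venues in that category. The data below is made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Bank", "Gym", "Park", "Gym"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = relative frequency of each category per neighborhood.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Here two of the four venues in neighborhood A are parks, so its 'Park' entry is 0.5, and every row of `freq` sums to 1.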


**Print out each neighborhood along with the top 5 most common venues.**

In [21]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3         Chinese Restaurant  0.25
4              Metro Station  0.00


----Alderwood, Long Branch----
                venue  freq
0         Pizza Place  0.22
1                 Gym  0.11
2  Athletics & Sports  0.11
3        Dance Studio  0.11
4         Coffee Shop  0.11


----Bathurst Manor, Wilson Heights, Downsview North----
          venue  freq
0   Coffee Shop  0.10
1          Bank  0.10
2      Pharmacy  0.05
3  Intersection  0.05
4   Bridal Shop  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4          Yoga Studio  0.00


----Bedford Park, Lawrence Manor East----
                     venue  freq
0           Sandwich Place  0.08
1       Italian Restaurant  0.08
2              Coffee Shop  0.08
3    

**Put the results into a _pandas_ dataframe.**

We do it in two steps:
- Define a function to sort the venues in descending order.
- Create the new dataframe and display the top 5 venues for each neighborhood.

In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
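
The same selection can also be written with `Series.nlargest`, which returns the highest values in descending order and keeps the first occurrence when values tie. This is an alternative sketch on a made-up row, not the function used in the rest of the notebook; with the default `sort_values`, tied frequencies are not guaranteed to come out in the same order every time.

```python
import pandas as pd

# Toy row shaped like one row of toronto_grouped (values are made up).
row = pd.Series({
    "Neighborhood": "Toy Hood",
    "Coffee Shop": 0.40,
    "Park": 0.30,
    "Bank": 0.20,
    "Gym": 0.10,
})

# Drop the name, make the frequencies numeric, and take the three largest.
frequencies = row.iloc[1:].astype(float)
top3 = frequencies.nlargest(3).index.tolist()
print(top3)  # ['Coffee Shop', 'Park', 'Bank']
```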

In [23]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print('Dataframe size:',neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head(10)

Dataframe size: (95, 6)


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Chinese Restaurant,Latin American Restaurant,Lounge,Breakfast Spot,Dog Run
1,"Alderwood, Long Branch",Pizza Place,Gym,Athletics & Sports,Pharmacy,Pub
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Shopping Mall,Frozen Yogurt Shop,Ice Cream Shop
3,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Deli / Bodega
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Pharmacy,Juice Bar
5,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Restaurant,Bakery
6,"Birch Cliff, Cliffside West",College Stadium,Skating Rink,Café,General Entertainment,Women's Store
7,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Breakfast Spot,Coffee Shop,Gym
8,"Business reply mail Processing Centre, South C...",Light Rail Station,Park,Garden,Brewery,Spa
9,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Plane
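
The column labels above rely on a three-element suffix list plus a fallback to 'th', which is fine for five columns but would yield labels such as '21th' for larger counts. A small helper that handles the general case could look like this; the name `ordinal` is an assumption introduced for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) are irregular.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 6)]
print(columns)
```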


### 3.4 Use K-means to cluster the neighborhoods in 5 groups

In [24]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
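
The first ten labels are all 1, which hints that one cluster may dominate. A quick way to check how balanced the clusters are is to count the labels. The sketch below uses a hypothetical label list in place of `kmeans.labels_`; on the real result one could pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# Hypothetical label vector standing in for kmeans.labels_.
labels = [1, 1, 1, 0, 1, 1, 2, 1, 0, 1, 3, 1, 4, 1, 1]

sizes = Counter(labels)
total = len(labels)
for cluster, count in sorted(sizes.items()):
    print(f"cluster {cluster}: {count} neighborhoods ({count / total:.0%})")

# Share of the biggest cluster and clusters that contain a single neighborhood.
largest_share = max(sizes.values()) / total
singletons = sorted(c for c, n in sizes.items() if n == 1)
print(largest_share, singletons)
```

A very large cluster next to several single-member clusters usually suggests trying a different number of clusters or a different feature scaling.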

**Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.**

In [25]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_venues
# join the venue-level data with the cluster label and top venues of each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

print('Size of the dataframe:', toronto_merged.shape)
toronto_merged.head()

Size of the dataframe: (2115, 13)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
1,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant


**Visualize the resulting clusters**

In [26]:
# Get Toronto's coordinates.
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="Toronto")
location = geolocator.geocode(address)
latitude_toronto = location.latitude
longitude_toronto = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_toronto, longitude_toronto))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [27]:
map_clusters = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(toronto_merged['Neighborhood Latitude'], toronto_merged['Neighborhood Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
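
Because `toronto_merged` has one row per venue, the loop above draws a marker for every venue at its neighborhood's coordinates, so many identical markers are stacked on top of each other. One possible refinement is to reduce the frame to one row per neighborhood before plotting. The sketch below shows this on a three-row toy frame with values taken from the tables above; `merged` and `per_neighborhood` are names assumed for the example.

```python
import pandas as pd

merged = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Neighborhood Latitude": [43.753259, 43.753259, 43.725882],
    "Neighborhood Longitude": [-79.329656, -79.329656, -79.315572],
    "Venue": ["Brookbanks Park", "649 Variety", "Portugril"],
    "Cluster Labels": [0, 0, 1],
})

# Keep only the columns the markers need, then one row per neighborhood.
marker_cols = ["Neighborhood", "Neighborhood Latitude",
               "Neighborhood Longitude", "Cluster Labels"]
per_neighborhood = (
    merged[marker_cols]
    .drop_duplicates(subset="Neighborhood")
    .reset_index(drop=True)
)
print(per_neighborhood)
```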

### 3.5 Examine Clusters

Examine each cluster and determine the discriminating venue categories that distinguish each cluster.

**Cluster 1**

In [28]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,43.753259,-79.33214,Park,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
1,43.753259,-79.331942,Convenience Store,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
2,43.753259,-79.333114,Food & Drink Shop,0,Park,Food & Drink Shop,Convenience Store,Discount Store,Department Store
412,43.689026,-79.4563,Park,0,Park,Women's Store,College Gym,Deli / Bodega,Electronics Store
413,43.689026,-79.456333,Women's Store,0,Park,Women's Store,College Gym,Deli / Bodega,Electronics Store
414,43.689026,-79.448924,Park,0,Park,Women's Store,College Gym,Deli / Bodega,Electronics Store
689,43.744734,-79.239336,Playground,0,Playground,Convenience Store,Women's Store,Distribution Center,Department Store
690,43.744734,-79.24465,Convenience Store,0,Playground,Convenience Store,Women's Store,Distribution Center,Department Store
759,43.685347,-79.335007,Park,0,Park,Metro Station,Convenience Store,Discount Store,Department Store
760,43.685347,-79.335007,Convenience Store,0,Park,Metro Station,Convenience Store,Discount Store,Department Store


**Cluster 2**

In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
3,43.725882,-79.315635,Hockey Arena,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
4,43.725882,-79.312785,Portuguese Restaurant,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
5,43.725882,-79.313103,Coffee Shop,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
6,43.725882,-79.313620,Intersection,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
7,43.725882,-79.312860,Pizza Place,1,Coffee Shop,Intersection,Pizza Place,Hockey Arena,Portuguese Restaurant
...,...,...,...,...,...,...,...,...,...
2110,43.628841,-79.518041,Fast Food Restaurant,1,Gym,Fast Food Restaurant,Bakery,Burger Joint,Tanning Salon
2111,43.628841,-79.518617,Grocery Store,1,Gym,Fast Food Restaurant,Bakery,Burger Joint,Tanning Salon
2112,43.628841,-79.519006,Tanning Salon,1,Gym,Fast Food Restaurant,Bakery,Burger Joint,Tanning Salon
2113,43.628841,-79.518290,Kids Store,1,Gym,Fast Food Restaurant,Bakery,Burger Joint,Tanning Salon


**Cluster 3**

In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1229,43.756303,-79.568913,Restaurant,2,Restaurant,Curling Ice,Eastern European Restaurant,Dumpling Restaurant,Drugstore


**Cluster 4**

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
101,43.806686,-79.199056,Fast Food Restaurant,3,Fast Food Restaurant,Women's Store,Dance Studio,Eastern European Restaurant,Dumpling Restaurant


**Cluster 5**

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1302,43.724766,-79.532854,Baseball Field,4,Baseball Field,Furniture / Home Store,Distribution Center,Department Store,Dessert Shop
1303,43.724766,-79.529923,Furniture / Home Store,4,Baseball Field,Furniture / Home Store,Distribution Center,Department Store,Dessert Shop
2101,43.636258,-79.496266,Baseball Field,4,Baseball Field,Women's Store,Deli / Bodega,Electronics Store,Eastern European Restaurant
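
The per-cluster listings above can be condensed into a short summary, for example the cluster size and the most frequent "1st Most Common Venue" in each cluster. This is a sketch on a made-up frame with the same column names; the neighborhood names `N1` to `N6` are placeholders.

```python
import pandas as pd

sorted_venues = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 1],
    "Neighborhood": ["N1", "N2", "N3", "N4", "N5", "N6"],
    "1st Most Common Venue": ["Park", "Park", "Playground",
                              "Coffee Shop", "Coffee Shop", "Gym"],
})

# Most frequent top venue within each cluster.
summary = (
    sorted_venues.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(lambda s: s.value_counts().idxmax())
    .rename("Dominant 1st Venue")
)

# Number of neighborhoods per cluster.
sizes = sorted_venues.groupby("Cluster Labels").size()

print(summary)
print(sizes)
```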


**This completes the third part of the project.**

Thank you for any comments and suggestions on this work.