# Capstone Project - Segmenting and Clustering Toronto Neighbourhoods
Owner: Jothika Sundaram

## Introduction
<p>This notebook is part of my Peer-Graded Capstone Project for the IBM Professional Data Science Certificate Course. In this project I will be examining Toronto neighbourhoods based on location and vicinity qualities. The information for these neighborhoods will be gathered from <a href = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>this wikipedia table</a> that will be scraped to obtain postal codes and their corresponding boroughs.<br>
Next, I will be using the Foursquare API to access location data of the various Boroughs, Neighbourhoods and Venues in Toronto. This will be used to cluster the various postal code areas based on their surrounding venues.<br>
Finally, I will be visualizing this data using Folium, an interactive map rendering library which will differentiate the various clusters and then analyse the cluster features.</p>

## Structure

<div>

1. <a href="#item1"> Webscraping - acquire and clean table of postal code data </a><br>
    
2. <a href="#item2"> Map of Toronto - render a map of Toronto using folium to visualize these postal code areas </a><br>
    
3. <a href="#item3"> Exploring a Borough - using Foursquare api</a><br>
    
4. <a href="#item4"> Exploring all neighbourhoods within a Borough </a><br>
    
5. <a href="#item5"> Clustering Neighbourhoods - cluster using kmeans and visualize them using folium </a><br>
    
6. <a href="#item6"> Discussion - examine these clusters and their features </a>
        
</div>

### Note: If you are unable to view the maps on my github, please use <a href = 'https://nbviewer.jupyter.org/github/Jo-Sundaram/Coursera_Capstone/blob/master/Coursera%20Capstone%20-%20Segmenting%20and%20Clustering.ipynb'> this link to a notebook viewer that will render the maps. </a>

In [181]:
import pandas as pd
import numpy as np
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors


<h1> <a id="item1"></a>1. Webscraping</h1>

#### Scrape the wiki page for the required table of postal codes

In [22]:
# Scrape the wiki link for the required table
postal_codes = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [23]:
# convert to data frame
df=postal_codes[0]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### We cannot use any entries that is not assigned a Borough. We must replace 'Not assigned' in the 'Borough' column with NaN and drop those values

In [24]:
df['Borough'].replace('Not assigned',np.nan,inplace = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [25]:
# make dataframe consisting of all entries where 'Borough' does not have NaN values
df = df[df['Borough'].notna()]

In [26]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


#### If there are any *Neighborhoods* that are not assigned a name, they will use the same name as their Borough

In [27]:
df['Neighborhood'].replace(np.nan,df['Borough'],inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


In [28]:
# Reset index
df.reset_index(drop = True,inplace = True)

#### Lets look at the shape of the dataframe after cleaning:

In [29]:
df.shape

(103, 3)

#### We need to import a csv that contains the geographical coordinates of each postal code

In [30]:
postal_coords = pd.read_csv('https://cocl.us/Geospatial_data')

In [31]:
postal_coords.head()
postal_coords.sort_values(by='Postal Code',inplace = True)


In [32]:
postal_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [33]:
df.sort_values(by =['Postal Code'],inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


#### I created a new dataframe to merge the latitude and longitude columns

In [34]:
geo_df = pd.DataFrame(df)

In [35]:
geo_df = geo_df.merge(postal_coords)

In [36]:
geo_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


<h2> <a id="item2"></a>2. Map of Toronto</h2>

Import Folium to render the map

In [41]:
# !conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
print('Folium imported')

Folium imported


In [135]:
geo_df['Borough'].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       'Mississauga', 'Etobicoke'], dtype=object)

#### I used the latitude and longitude of the first row entry as the location for the city of Toronto

In [39]:
latitude = geo_df.iloc[0,3]
longitude = geo_df.iloc[0,4]

#### And now we can render a map of Toronto marking all the neighbourhoods of the postal codes with blue circle markers

In [43]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, postal, borough, neighborhood in zip(geo_df['Latitude'], geo_df['Longitude'], geo_df['Postal Code'],geo_df['Borough'], geo_df['Neighborhood']):
    label = '{}, {},{}'.format(postal,neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h2 id="item3"><a id="item3"></a> 3. Exploring a Borough </h2>

#### Define Foursquare Credentials

In [44]:
CLIENT_ID = 'NA0Q3HAIZG5G1XDGDI5ZCGQPBSNEP4SGYF5B0ULTZBDZHIOV' # your Foursquare ID
CLIENT_SECRET = 'HNVACZRJK5KNIEWX0IBRFZI0U3AG5NH1MUSWVUXBDWBFTU1J' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: NA0Q3HAIZG5G1XDGDI5ZCGQPBSNEP4SGYF5B0ULTZBDZHIOV
CLIENT_SECRET:HNVACZRJK5KNIEWX0IBRFZI0U3AG5NH1MUSWVUXBDWBFTU1J


#### Lets focus on the first borough in the dataframe: Scarborough

In [48]:
scar_lat = geo_df.loc[0,'Latitude']
scar_lon = geo_df.loc[0,'Longitude']
scar_name = geo_df.loc[0,'Borough']

print('Latitude and longitude values of {} are {}, {}.'.format(scar_name, 
                                                               scar_lat, 
                                                               scar_lon))

Latitude and longitude values of Scarborough are 43.806686299999996, -79.19435340000001.


#### Find the top 100 venues located in this area within 3 km using the Foursquare api

In [83]:
LIMIT = 100
radius = 3000
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    scar_lat, 
    scar_lon,
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=NA0Q3HAIZG5G1XDGDI5ZCGQPBSNEP4SGYF5B0ULTZBDZHIOV&client_secret=HNVACZRJK5KNIEWX0IBRFZI0U3AG5NH1MUSWVUXBDWBFTU1J&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=3000&limit=100'

In [84]:
results = requests.get(url).json()

#### We need to extract some information from the results obtained by the Frouquare api. This function will extract the catrgory of each venue

In [85]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Filter the results to create a dataframe

In [86]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Toronto Pan Am Sports Centre,Athletics & Sports,43.790623,-79.193869
1,African Rainforest Pavilion,Zoo Exhibit,43.817725,-79.183433
2,Polar Bear Exhibit,Zoo,43.823372,-79.185145
3,Toronto Zoo,Zoo,43.820582,-79.181551
4,Orangutan Exhibit,Zoo Exhibit,43.818413,-79.182548


We can see that 4/5 of the top venues are Zoo exihibts. That may give us an idea about how the rest of the data will be clustered.

In [87]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

73 venues were returned by Foursquare.


<h2 id = "item4"><a id="item4"></a>4. Exploring all neighborhoods within a Borough</h2>

#### Now lets find venues located in all the neighborhoods within Scarborough

In [95]:
# This function will extract information of all the neighborhoods and venues located within a specified radius
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lon in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lon, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lon, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [96]:
scar_data = geo_df[geo_df['Borough'] == 'Scarborough'].reset_index(drop=True)
scar_venues = getNearbyVenues(names=scar_data['Neighborhood'],
                                   latitudes=scar_data['Latitude'],
                                   longitudes=scar_data['Longitude']
                                  )

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge


#### Now we can see the venues located in the many neighborhoods and postal code areas within Scarborough

In [99]:
scar_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


#### We can see that some neighborhoods and postal code areas have multiple venues, so lets group them and see how many unique venue categories there are

In [100]:
scar_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Birch Cliff, Cliffside West",4,4,4,4,4,4
Cedarbrae,9,9,9,9,9,9
"Clarks Corners, Tam O'Shanter, Sullivan",14,14,14,14,14,14
"Cliffside, Cliffcrest, Scarborough Village West",2,2,2,2,2,2
"Dorset Park, Wexford Heights, Scarborough Town Centre",6,6,6,6,6,6
"Golden Mile, Clairlea, Oakridge",10,10,10,10,10,10
"Guildwood, Morningside, West Hill",7,7,7,7,7,7
"Kennedy Park, Ionview, East Birchmount Park",6,6,6,6,6,6
"Malvern, Rouge",1,1,1,1,1,1


In [101]:
print('There are {} uniques categories.'.format(len(scar_venues['Venue Category'].unique())))

There are 55 uniques categories.


## Analyze the neighborhoods
#### Use one hot encoding to convert the categories to numerical values which will make it easier to analyse

In [108]:
# one hot encoding
scar_onehot = pd.get_dummies(scar_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scar_onehot['Neighborhood'] = scar_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scar_onehot.columns[-1]] + list(scar_onehot.columns[:-1])
scar_onehot = scar_onehot[fixed_columns]

scar_onehot.head()


Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Café,...,Playground,Rental Car Location,Sandwich Place,Skating Rink,Smoke Shop,Soccer Field,Spa,Thai Restaurant,Thrift / Vintage Store,Vietnamese Restaurant
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Lets group rows by neighborhood and take the mean frequency of occurence of each category

In [105]:
scar_grouped = scar_onehot.groupby('Neighborhood').mean().reset_index()
scar_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Café,...,Playground,Rental Car Location,Sandwich Place,Skating Rink,Smoke Shop,Soccer Field,Spa,Thai Restaurant,Thrift / Vintage Store,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
1,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.0,0.111111,0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0
3,"Clarks Corners, Tam O'Shanter, Sullivan",0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0
4,"Cliffside, Cliffcrest, Scarborough Village West",0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [106]:
scar_grouped.shape

(16, 56)

#### Lets look at the top 5 venues in each neighborhood

In [109]:
num_top_venues = 5

for hood in scar_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scar_grouped[scar_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0  Latin American Restaurant   0.2
1                     Lounge   0.2
2             Breakfast Spot   0.2
3               Skating Rink   0.2
4             Clothing Store   0.2


----Birch Cliff, Cliffside West----
                   venue  freq
0  General Entertainment  0.25
1           Skating Rink  0.25
2                   Café  0.25
3        College Stadium  0.25
4    American Restaurant  0.00


----Cedarbrae----
                 venue  freq
0               Bakery  0.11
1                 Bank  0.11
2               Lounge  0.11
3      Thai Restaurant  0.11
4  Fried Chicken Joint  0.11


----Clarks Corners, Tam O'Shanter, Sullivan----
                venue  freq
0         Pizza Place  0.14
1            Pharmacy  0.14
2  Chinese Restaurant  0.07
3         Gas Station  0.07
4        Noodle House  0.07


----Cliffside, Cliffcrest, Scarborough Village West----
                 venue  freq
0  American Restaurant   0.5
1                Mot

#### Lets store this data in a dataframe. First we need to sort the venues of each neighborhood in descending order of frequency

In [110]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Lets look at the top 10 venues in each neighborhood

In [145]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scar_grouped['Neighborhood']

for ind in np.arange(scar_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scar_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store
1,"Birch Cliff, Cliffside West",College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
2,Cedarbrae,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Hakka Restaurant,Fried Chicken Joint,Caribbean Restaurant,Vietnamese Restaurant
3,"Clarks Corners, Tam O'Shanter, Sullivan",Pharmacy,Pizza Place,Chinese Restaurant,Fast Food Restaurant,Convenience Store,Noodle House,Italian Restaurant,Coffee Shop,Fried Chicken Joint,Bank
4,"Cliffside, Cliffcrest, Scarborough Village West",American Restaurant,Motel,Hobby Shop,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


<h2 id="item5"><a id="item5"></a>5. Clustering Neighborhoods</h2>

#### Run *k*-means to make 5 clusters

In [146]:
from sklearn.cluster import KMeans
kclusters = 5
scar_grouped_clustering = scar_grouped.drop('Neighborhood', 1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scar_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 2], dtype=int32)

Create a new dataframe that includes the cluster labels

In [156]:
neighborhoods_venues_sorted.drop('Cluster Labels',1,inplace = True)  # drop column if it already exists
# add clustering labels column
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_) 


scar_merged = scar_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
scar_merged = scar_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
scar_merged.dropna(inplace = True)
scar_merged.head() 


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2.0,Fast Food Restaurant,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Electronics Store,Discount Store
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0.0,Moving Target,Bar,Vietnamese Restaurant,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Bank,Intersection,Breakfast Spot,Rental Car Location,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,Convenience Store,Gas Station
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Coffee Shop,Korean Restaurant,Vietnamese Restaurant,Convenience Store,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Hakka Restaurant,Fried Chicken Joint,Caribbean Restaurant,Vietnamese Restaurant


In [325]:
# neighborhoods_venues_sorted.reset_index(inplace = True) # reset index if needed

In [152]:
neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Agincourt,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store
1,1,"Birch Cliff, Cliffside West",College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
2,1,Cedarbrae,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Hakka Restaurant,Fried Chicken Joint,Caribbean Restaurant,Vietnamese Restaurant
3,1,"Clarks Corners, Tam O'Shanter, Sullivan",Pharmacy,Pizza Place,Chinese Restaurant,Fast Food Restaurant,Convenience Store,Noodle House,Italian Restaurant,Coffee Shop,Fried Chicken Joint,Bank
4,1,"Cliffside, Cliffcrest, Scarborough Village West",American Restaurant,Motel,Hobby Shop,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
5,1,"Dorset Park, Wexford Heights, Scarborough Town...",Indian Restaurant,Vietnamese Restaurant,Thrift / Vintage Store,Chinese Restaurant,Pet Store,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint
6,1,"Golden Mile, Clairlea, Oakridge",Bus Line,Bakery,Intersection,Metro Station,Park,Bus Station,Ice Cream Shop,Soccer Field,Fried Chicken Joint,Convenience Store
7,1,"Guildwood, Morningside, West Hill",Bank,Intersection,Breakfast Spot,Rental Car Location,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,Convenience Store,Gas Station
8,1,"Kennedy Park, Ionview, East Birchmount Park",Coffee Shop,Convenience Store,Bus Station,Discount Store,Department Store,Hobby Shop,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store
9,2,"Malvern, Rouge",Fast Food Restaurant,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Electronics Store,Discount Store


#### Now lets visualize the clustering

In [153]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scar_merged['Latitude'], scar_merged['Longitude'], scar_merged['Neighborhood'], neighborhoods_venues_sorted['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


<h2 id="item6"><a id="item6"></a>6. Discussion</h2> 

#### Now that we have clustered the neighborhoods, lets examine their features

In [174]:
scar_merged.loc[scar_merged['Cluster Labels'] == 0]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0.0,Moving Target,Bar,Vietnamese Restaurant,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


This cluster only has one neighborhood. We can assume it is clustered by its 1st common venue, Moving Target.

In [175]:
scar_merged.loc[scar_merged['Cluster Labels'] == 1]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Bank,Intersection,Breakfast Spot,Rental Car Location,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,Convenience Store,Gas Station
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Hakka Restaurant,Fried Chicken Joint,Caribbean Restaurant,Vietnamese Restaurant
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029,1.0,Coffee Shop,Convenience Store,Bus Station,Discount Store,Department Store,Hobby Shop,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577,1.0,Bus Line,Bakery,Intersection,Metro Station,Park,Bus Station,Ice Cream Shop,Soccer Field,Fried Chicken Joint,Convenience Store
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476,1.0,American Restaurant,Motel,Hobby Shop,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,1.0,College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
10,M1P,Scarborough,"Dorset Park, Wexford Heights, Scarborough Town...",43.75741,-79.273304,1.0,Indian Restaurant,Vietnamese Restaurant,Thrift / Vintage Store,Chinese Restaurant,Pet Store,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint
11,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849,1.0,Bakery,Smoke Shop,Breakfast Spot,Middle Eastern Restaurant,Vietnamese Restaurant,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint
12,M1S,Scarborough,Agincourt,43.7942,-79.262029,1.0,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,General Entertainment,Gas Station,Furniture / Home Store
13,M1T,Scarborough,"Clarks Corners, Tam O'Shanter, Sullivan",43.781638,-79.304302,1.0,Pharmacy,Pizza Place,Chinese Restaurant,Fast Food Restaurant,Convenience Store,Noodle House,Italian Restaurant,Coffee Shop,Fried Chicken Joint,Bank


This cluster is very diverse in its common venues, so it is difficult to determine the feature that defines this cluster. However, looking at the dataframe, we can see that some of the most common venues among the postal code areas include Gas Stations, Convinience Stores and General Entertainment. It is possible that one of those venues can most likely define this cluster. 

In [176]:
scar_merged.loc[scar_merged['Cluster Labels'] == 2]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2.0,Fast Food Restaurant,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Electronics Store,Discount Store


This cluster only has one postal code area, so we can assume it is defined by Fast Food Restaurants as its 1st most common venue

In [178]:
scar_merged.loc[scar_merged['Cluster Labels'] == 3]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,3.0,Spa,Playground,Vietnamese Restaurant,Coffee Shop,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
14,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577,3.0,Playground,Park,Vietnamese Restaurant,Coffee Shop,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


In this cluster, the first two common venues are dominated by playgrounds and parks, and the rest of the venues share the same categories. We can assume this cluster is defined by Parks. 

In [179]:
scar_merged.loc[scar_merged['Cluster Labels'] == 4]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Coffee Shop,Korean Restaurant,Vietnamese Restaurant,Convenience Store,Grocery Store,General Entertainment,Gas Station,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant


This cluster only has one neighborhood, so we can assume it is defined by Coffee Shop as its 1st most common venue