<h1>Pavel's Coursera Applied Data Science Capstone Project</h1>

Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto 

<span style="color:blue">
     <h1> Part 1. Toronto Neighborhoods Dataframe Creation </h1>

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import requests as request
#!pip install beautifulsoup4
from bs4 import BeautifulSoup

In [4]:
# Scrape the Wikipedia page and parse it with the BeautifulSoup package
wikipage_soup = BeautifulSoup(request.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text,'html.parser')

# Find a table with postal codes in soup
postal_codes_table = wikipage_soup.find('table', class_='wikitable sortable')

# Transform HTML with table into dataframe containing postcodes dfPC
dfPC = pd.read_html(str(postal_codes_table))[0]

# Let's check what we have got
print('Dataframe shape:', dfPC.shape)
dfPC.head()

Dataframe shape: (288, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
# Keep only rows with an assigned borough (drop rows whose Borough is 'Not assigned')
dfPC = dfPC[dfPC.Borough != 'Not assigned']
# And check our data
print('Dataframe shape:', dfPC.shape)
dfPC.head()

Dataframe shape: (211, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [6]:
# Now combine rows with identical postcodes, joining their neighborhoods into one comma-separated row
dfPC=dfPC.groupby(['Postcode','Borough'],sort=False).agg({'Neighbourhood': ", ".join}).reset_index()

# And check our data again
print('Dataframe shape:', dfPC.shape)
dfPC.head(12)

Dataframe shape: (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [7]:
# If a neighbourhood is not assigned, use the borough name as the neighbourhood name.
dfPC.loc[dfPC.Neighbourhood == 'Not assigned', 'Neighbourhood'] = dfPC.Borough 
print('Dataframe shape:', dfPC.shape)

Dataframe shape: (103, 3)
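
The three cleaning steps of Part 1 (filter, aggregate, fill) can be checked on a tiny frame before running them on the scraped table. The sketch below is only an illustration: the rows are copied from the outputs above, and the names `raw`, `clean`, and `missing` are made up for this example.

```python
import pandas as pd

# Toy rows shaped like the scraped table (values copied from the outputs above).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows without a borough.
clean = raw[raw["Borough"] != "Not assigned"]

# 2. Collapse rows sharing a postcode into one comma-separated row.
clean = (
    clean.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Fall back to the borough name where the neighbourhood is unassigned.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

print(clean)
```

Selecting the same rows on both sides of the assignment in step 3 makes the intent explicit instead of relying on index alignment.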


<span style="color:blue">
<p></p>
<h1> Part 2. Get the latitude and longitude coordinates of each neighborhood </h1>


In [8]:
# Read geographical coordinates from http://cocl.us/Geospatial_data
!wget -q -O 'geospatial_data.csv' http://cocl.us/Geospatial_data
print('Data with geographical coordinates downloaded!')

# Now that the data is downloaded, let's read it into a pandas dataframe.
geo_coordinates_df = pd.read_csv('geospatial_data.csv')

# Let's check what we have got
print(geo_coordinates_df.shape)
geo_coordinates_df.head()

Data with geographical coordinates downloaded!
(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Merge the neighbourhood data with the geographical coordinates
dfPC_geo = dfPC.join(geo_coordinates_df.set_index('Postal Code'), on='Postcode')
# Let's check what we have got
print(dfPC_geo.shape)
dfPC_geo.head(12)

(103, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
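
The join above silently leaves `NaN` coordinates for any postcode missing from the CSV. A possible way to make that visible is `DataFrame.merge` with its `validate` and `indicator` options. This is a sketch on toy data, not the notebook's own method: the postcode `M0Z` and the borough `Hypothetical` are invented to force a mismatch.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M0Z"],   # "M0Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Hypothetical"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = neighbourhoods.merge(
    coordinates,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",  # raises if a postcode is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Postcodes that found no matching row in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", unmatched)
```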


<span style="color:blue">
<p></p>
<h1> Part 3. Explore and cluster the neighborhoods in Toronto </h1>


In [11]:
# Import libraries

import numpy as np # library to handle data in a vectorized manner
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')




In [12]:
# We know that the dataframe has 103 unique postal codes (neighborhoods grouped by postal code). Let's check the number of boroughs.
print('The dataframe has ',len(dfPC_geo['Borough'].unique()),' boroughs: ',dfPC_geo['Borough'].unique())

# Let's count the neighborhood groups (postal codes) in each borough and select the 5 boroughs with the most groups

df_toExplore = dfPC_geo[['Borough','Postcode']].groupby(['Borough'])['Postcode'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(5)
print('We will explore these 5 boroughs')
df_toExplore

The dataframe has  11  boroughs:  ['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'Central Toronto'
 'Mississauga']
We will explore these 5 boroughs


Unnamed: 0,Borough,count
6,North York,24
1,Downtown Toronto,18
8,Scarborough,17
4,Etobicoke,12
0,Central Toronto,9


In [13]:
# Let's create a map of Toronto with the groups of neighborhoods (grouped by postal code) superimposed on top.
# create map of Toronto using Toronto latitude and longitude values
TorontoLocation = [43.6529,-79.3849]
map_Toronto = folium.Map(TorontoLocation, zoom_start=11)
map_Toronto

In [14]:
# add markers of neighborhoods groups to map
for lat, lng, borough, neighborhood, postcode in zip(dfPC_geo['Latitude'], dfPC_geo['Longitude'], dfPC_geo['Borough'], dfPC_geo['Neighbourhood'], dfPC_geo['Postcode']):
    label = '(Post code: {}) Neighbourhoods: {} (Borough: {})'.format(postcode, neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
map_Toronto

In [15]:
# Let's simplify the above map and segment and cluster only the neighborhoods in the 5 selected boroughs.
# So let's slice the original dataframe and create a new dataframe of the 5 selected boroughs for exploration
# (it is named dfNY, but it holds all 5 boroughs, not only North York).

dfNY=pd.merge(dfPC_geo, df_toExplore, on='Borough', how='inner')
dfNY

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,count
0,M3A,North York,Parkwoods,43.753259,-79.329656,24
1,M4A,North York,Victoria Village,43.725882,-79.315572,24
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,24
3,M3B,North York,Don Mills North,43.745906,-79.352188,24
4,M6B,North York,Glencairn,43.709577,-79.445073,24
5,M3C,North York,"Flemingdon Park, Don Mills South",43.725900,-79.340923,24
6,M2H,North York,Hillcrest Village,43.803762,-79.363452,24
7,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259,24
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,24
9,M3J,North York,"Northwood Park, York University",43.767980,-79.487262,24


In [16]:
# Let's visualize the neighborhood groups in the 5 selected boroughs
map_NY = folium.Map(TorontoLocation, zoom_start=11)
# add markers of neighborhoods groups to map
for lat, lng, neighborhood, postcode in zip(dfNY['Latitude'], dfNY['Longitude'], dfNY['Neighbourhood'], dfNY['Postcode']):
    label = '(Post code: {}) Neighbourhoods: {}'.format(postcode, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NY)  
map_NY

In [17]:
# Next, we are going to use the Foursquare API to explore the neighborhoods and segment them.
# Define the Foursquare version and credentials (the credentials are set in the next hidden block)
VERSION = '20180605' # Foursquare API version
LIMIT=30

In [18]:
{
    "tags": [
        "remove_input",
    ]
}
CLIENT_ID = 'EDTFWJEZYDXJ4URCIECIVXYDYCJEEXIGTAZJPG0QMWX223HA' # my Foursquare ID
CLIENT_SECRET = 'CQH2L1KWZV13PCE2FAY5XWIWQAZKMZUW1YERFG5HUYYDX4OU' # my Foursquare Secret 
print ('Foursquare credentials defined')

Foursquare credentials defined
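
Credentials written into a cell travel with every copy of the notebook. One common alternative is to read them from environment variables. The sketch below is an assumption-laden illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helper `load_foursquare_credentials`, and the demo values are all hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Set {} before running the notebook".format(missing)) from None

# Usage with an explicit mapping, so the sketch runs without real credentials.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)
print("Foursquare credentials loaded for", CLIENT_ID_DEMO)
```

In the notebook itself, calling the helper without arguments would read from the real environment.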


In [19]:
# Let's create a function to explore the neighborhoods in the selected boroughs using Foursquare data

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = request.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
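
`getNearbyVenues` indexes the response directly (`["response"]['groups'][0]['items']` and `categories[0]`), so an error payload or a venue without categories would raise an exception in the middle of the loop. A more tolerant parser might look like the sketch below. The helper name `extract_venues` and both sample payloads are hypothetical; they only mimic the structure that the function above relies on.

```python
def extract_venues(payload):
    """Flatten a venues/explore-style payload into (name, lat, lng, category) tuples.

    Missing groups or items yield fewer rows, and a missing category yields
    None, instead of raising KeyError or IndexError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        rows.append((
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0]["name"] if categories else None,
        ))
    return rows

# Hand-written payloads that mimic the structure indexed by getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "KFC",
               "location": {"lat": 43.754387, "lng": -79.333021},
               "categories": [{"name": "Fast Food Restaurant"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({"meta": {"code": 429}}))  # hypothetical error payload without "response"
```

Passing `timeout=` to `requests.get` is also worth considering, so that one stalled request does not block the whole loop.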

In [20]:
# Now run the above function on each neighborhood and create a new dataframe called NY_venues.

NY_venues = getNearbyVenues(names= dfNY['Postcode'], latitudes=dfNY['Latitude'], longitudes=dfNY['Longitude'])
NY_venues

M3A
M4A
M6A
M3B
M6B
M3C
M2H
M3H
M2J
M3J
M2K
M3K
M2L
M3L
M6L
M9L
M2M
M3M
M5M
M9M
M2N
M3N
M2P
M2R
M5A
M5B
M5C
M5E
M5G
M6G
M5H
M5J
M5K
M5L
M5S
M5T
M5V
M4W
M5W
M4X
M5X
M4Y
M9A
M9B
M9C
M9P
M9R
M8V
M9V
M8W
M9W
M8X
M8Y
M8Z
M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M4N
M5N
M4P
M5P
M4R
M5R
M4S
M4T
M4V


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M3A,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
4,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
5,M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
6,M4A,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
7,M4A,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.313620,Intersection
8,M4A,43.725882,-79.315572,Pizza Nova,43.725824,-79.312860,Pizza Place
9,M4A,43.725882,-79.315572,Cash Money,43.725486,-79.312665,Financial or Legal Service


In [21]:
# Let's check how many venues were returned for each neighborhood
NY_venues.groupby('Neighborhood').count()


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M1B,2,2,2,2,2,2
M1C,2,2,2,2,2,2
M1E,7,7,7,7,7,7
M1G,3,3,3,3,3,3
M1H,7,7,7,7,7,7
M1J,1,1,1,1,1,1
M1K,7,7,7,7,7,7
M1L,7,7,7,7,7,7
M1M,2,2,2,2,2,2
M1N,4,4,4,4,4,4


In [22]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(NY_venues['Venue Category'].unique())))

There are 210 unique categories.


In [23]:
# Analyze Each Neighborhood
# one hot encoding
NY_onehot = pd.get_dummies(NY_venues[['Venue Category']], prefix="", prefix_sep="")

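# add neighborhood column back to dataframe
# (caution: a venue category that is itself named 'Neighborhood' would be overwritten here;
# store the postcode under a different column name if that category must be kept)
NY_onehot['Neighborhood'] = NY_venues['Neighborhood']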

# move neighborhood column to the first column
fixed_columns = [NY_onehot.columns[-1]] + list(NY_onehot.columns[:-1])
NY_onehot = NY_onehot[fixed_columns]

NY_onehot.head()


Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# And let's examine the new dataframe size
NY_onehot.shape

(950, 210)

In [25]:
# Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category
NY_grouped = NY_onehot.groupby('Neighborhood').mean().reset_index()
NY_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,M1B,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,M1C,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,M1E,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,M1G,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,M1H,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,M1J,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,M1K,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,M1L,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,M1M,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.500000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,M1N,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [26]:
# Let's confirm the new size of dataframe
NY_grouped.shape

(76, 210)
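
To see what the grouped means represent, it helps to run the same two steps on a handful of rows. The toy data below reuses the venue counts and categories reported above for M1G and M1B. The frame names are illustrative, and `DataFrame.insert` is used here instead of plain assignment because it raises an error if a column called `Neighborhood` already exists.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["M1G", "M1G", "M1G", "M1B", "M1B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Korean Restaurant",
                       "Fast Food Restaurant", "Print Shop"],
})

# One indicator column per category: 1.0 where the venue has that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category.
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies.round(2))
```

Each row sums to 1, so the values are category shares per postcode, which is why postcodes with only one or two venues show frequencies such as 0.5 or 1.0.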

In [27]:
# Let's print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in NY_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = NY_grouped[NY_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----M1B----
                      venue  freq
0      Fast Food Restaurant   0.5
1                Print Shop   0.5
2             Moving Target   0.0
3  Mediterranean Restaurant   0.0
4               Men's Store   0.0


----M1C----
                      venue  freq
0             Moving Target   0.5
1                       Bar   0.5
2               Yoga Studio   0.0
3  Mediterranean Restaurant   0.0
4               Men's Store   0.0


----M1E----
                 venue  freq
0          Pizza Place  0.14
1  Rental Car Location  0.14
2       Medical Center  0.14
3       Breakfast Spot  0.14
4    Electronics Store  0.14


----M1G----
                venue  freq
0         Coffee Shop  0.67
1   Korean Restaurant  0.33
2      Medical Center  0.00
3         Men's Store  0.00
4  Mexican Restaurant  0.00


----M1H----
                  venue  freq
0                  Bank  0.14
1    Athletics & Sports  0.14
2      Hakka Restaurant  0.14
3       Thai Restaurant  0.14
4  Caribbean Restaurant  0.14




In [28]:
# Let's put that into a pandas dataframe
# First, let's write a function to sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Now let's create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = NY_grouped['Neighborhood']

for ind in np.arange(NY_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(NY_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Print Shop,Wings Joint,Deli / Bodega,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store
1,M1C,Bar,Moving Target,Wings Joint,Department Store,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
2,M1E,Electronics Store,Medical Center,Pizza Place,Breakfast Spot,Mexican Restaurant,Rental Car Location,Intersection,Department Store,Drugstore,Donut Shop
3,M1G,Coffee Shop,Korean Restaurant,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
4,M1H,Caribbean Restaurant,Fried Chicken Joint,Bakery,Thai Restaurant,Bank,Athletics & Sports,Hakka Restaurant,Drugstore,Donut Shop,Electronics Store
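
The column names above are built with a `try`/`except` around a three-item suffix list, which is enough for ten columns. If more columns were ever needed, a small helper could produce the suffix directly. This is a sketch; the function name `ordinal` is made up for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns[:4])
```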


In [29]:
# Cluster Neighborhoods
# Run k-means to cluster the neighborhood into 5 clusters.

# set number of clusters
kclusters = 5

NY_grouped_clustering = NY_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(NY_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 0], dtype=int32)
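
Before interpreting the clusters, it can be useful to count how many postcodes fall into each one, because very small clusters often point to outliers rather than meaningful groups. The sketch below uses only the ten labels printed above. In the notebook, the full list would presumably come from `kmeans.labels_.tolist()`.

```python
from collections import Counter

# First ten labels, copied from the kmeans.labels_[0:10] output above.
labels = [0, 0, 0, 0, 0, 2, 0, 0, 0, 0]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print("cluster {}: {} postcode(s)".format(cluster, size))

# Flag clusters with a single member.
singletons = [cluster for cluster, size in sizes.items() if size == 1]
print("single-member clusters:", singletons)
```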

In [30]:
#Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,M1B,Fast Food Restaurant,Print Shop,Wings Joint,Deli / Bodega,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store
1,0,M1C,Bar,Moving Target,Wings Joint,Department Store,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
2,0,M1E,Electronics Store,Medical Center,Pizza Place,Breakfast Spot,Mexican Restaurant,Rental Car Location,Intersection,Department Store,Drugstore,Donut Shop
3,0,M1G,Coffee Shop,Korean Restaurant,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
4,0,M1H,Caribbean Restaurant,Fried Chicken Joint,Bakery,Thai Restaurant,Bank,Athletics & Sports,Hakka Restaurant,Drugstore,Donut Shop,Electronics Store


In [31]:
NY_merged = dfNY

# join the cluster labels and top venues onto dfNY, which already holds latitude/longitude for each postcode;
# the inner join drops postcodes for which no venues were returned

NY_merged = NY_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Postcode', how='inner')

NY_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,24,4,Park,Fast Food Restaurant,Food & Drink Shop,Bus Stop,Wings Joint,Department Store,Electronics Store,Drugstore,Donut Shop,Dog Run
1,M4A,North York,Victoria Village,43.725882,-79.315572,24,0,Coffee Shop,Pizza Place,Financial or Legal Service,Hockey Arena,Portuguese Restaurant,Intersection,Dance Studio,Drugstore,Donut Shop,Dog Run
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,24,0,Clothing Store,Furniture / Home Store,Arts & Crafts Store,Miscellaneous Shop,Boutique,Shoe Store,Event Space,Coffee Shop,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills North,43.745906,-79.352188,24,0,Gym / Fitness Center,Basketball Court,Baseball Field,Caribbean Restaurant,Café,Japanese Restaurant,Dim Sum Restaurant,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
4,M6B,North York,Glencairn,43.709577,-79.445073,24,4,Park,Bakery,Pub,Japanese Restaurant,Italian Restaurant,Department Store,Electronics Store,Drugstore,Donut Shop,Dog Run


In [32]:
# Finally, let's visualize the resulting clusters

# create map
map_clusters = folium.Map(location=[43.761539, -79.411079], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NY_merged['Latitude'], NY_merged['Longitude'], NY_merged['Postcode'], NY_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
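
The map code indexes the palette with `rainbow[cluster-1]`, so label 0 picks the last colour through negative indexing. That works, but a direct label-to-colour mapping is easier to follow. The sketch below builds such a palette with the standard-library `colorsys` module instead of matplotlib's `cm.rainbow`. That substitution is made here only to keep the example free of dependencies, and the function name `cluster_palette` is hypothetical.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hex colours, one per cluster label."""
    palette = []
    for k in range(n_clusters):
        # Spread the hues evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            int(round(r * 255)), int(round(g * 255)), int(round(b * 255))))
    return palette

palette = cluster_palette(5)
for label in range(5):
    print(label, palette[label])  # label k uses palette[k] directly
```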


In [33]:
# Examine Clusters
# Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. 
# Based on the defining categories, we can then assign a name to each cluster.

#Cluster 0
NY_merged.loc[NY_merged['Cluster Labels'] == 0, NY_merged.columns[[0] + list(range(5, NY_merged.shape[1]))]]


Unnamed: 0,Postcode,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M4A,24,0,Coffee Shop,Pizza Place,Financial or Legal Service,Hockey Arena,Portuguese Restaurant,Intersection,Dance Studio,Drugstore,Donut Shop,Dog Run
2,M6A,24,0,Clothing Store,Furniture / Home Store,Arts & Crafts Store,Miscellaneous Shop,Boutique,Shoe Store,Event Space,Coffee Shop,Accessories Store,Vietnamese Restaurant
3,M3B,24,0,Gym / Fitness Center,Basketball Court,Baseball Field,Caribbean Restaurant,Café,Japanese Restaurant,Dim Sum Restaurant,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
5,M3C,24,0,Gym,Coffee Shop,Grocery Store,Beer Store,Asian Restaurant,Italian Restaurant,Clothing Store,Chinese Restaurant,Dim Sum Restaurant,Restaurant
6,M2H,24,0,Mediterranean Restaurant,Golf Course,Pool,Dog Run,Wings Joint,Deli / Bodega,Electronics Store,Drugstore,Donut Shop,Discount Store
7,M3H,24,0,Coffee Shop,Supermarket,Sushi Restaurant,Middle Eastern Restaurant,Bank,Fast Food Restaurant,Restaurant,Bridal Shop,Fried Chicken Joint,Frozen Yogurt Shop
8,M2J,24,0,Clothing Store,Coffee Shop,Tea Room,Fast Food Restaurant,Bank,Bakery,Department Store,Smoothie Shop,Japanese Restaurant,Food Court
9,M3J,24,0,Falafel Restaurant,Coffee Shop,Massage Studio,Bar,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop
10,M2K,24,0,Chinese Restaurant,Bank,Café,Japanese Restaurant,Wings Joint,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore
15,M9L,24,0,Pharmacy,Pizza Place,Empanada Restaurant,Coffee Shop,Deli / Bodega,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store


In [None]:
# I would call cluster 0, the biggest one, the FOOD & DRINK CLUSTER, as it has a lot of restaurants, coffee shops, etc.

In [34]:
#Cluster 1
NY_merged.loc[NY_merged['Cluster Labels'] == 1, NY_merged.columns[[0] + list(range(5, NY_merged.shape[1]))]]

Unnamed: 0,Postcode,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
43,M9B,12,1,Bank,Wings Joint,Falafel Restaurant,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store


In [None]:
# Cluster 1 is very strange: it has only one neighborhood, with Bank as its 1st most common venue. Let's call it the STRANGE CLUSTER.



In [35]:
#Cluster 2
NY_merged.loc[NY_merged['Cluster Labels'] == 2, NY_merged.columns[[0] + list(range(5, NY_merged.shape[1]))]]

Unnamed: 0,Postcode,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
59,M1J,17,2,Playground,Wings Joint,Deli / Bodega,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store,Diner


In [None]:
# Cluster 2 is strange too, but with Playground as its 1st most common venue. Let's call it PLAYGROUND.



In [36]:
#Cluster 3
NY_merged.loc[NY_merged['Cluster Labels'] == 3, NY_merged.columns[[0] + list(range(5, NY_merged.shape[1]))]]

Unnamed: 0,Postcode,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,M3M,24,3,Baseball Field,Home Service,Food Truck,Wings Joint,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop
19,M9M,24,3,Baseball Field,Wings Joint,Falafel Restaurant,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store
72,M5N,9,3,Garden,Home Service,Deli / Bodega,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store


In [None]:
# Cluster 3 is interesting because Baseball Field is the 1st most common venue in two of its neighborhoods. Let's call it the BASEBALL CLUSTER.



In [37]:
#Cluster 4
NY_merged.loc[NY_merged['Cluster Labels'] == 4, NY_merged.columns[[0] + list(range(5, NY_merged.shape[1]))]]

Unnamed: 0,Postcode,count,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,24,4,Park,Fast Food Restaurant,Food & Drink Shop,Bus Stop,Wings Joint,Department Store,Electronics Store,Drugstore,Donut Shop,Dog Run
4,M6B,24,4,Park,Bakery,Pub,Japanese Restaurant,Italian Restaurant,Department Store,Electronics Store,Drugstore,Donut Shop,Dog Run
11,M3K,24,4,Park,Airport,Wings Joint,Department Store,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
13,M3L,24,4,Grocery Store,Park,Shopping Mall,Bank,Deli / Bodega,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store
14,M6L,24,4,Park,Basketball Court,Construction & Landscaping,Bakery,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop
22,M2P,24,4,Park,Bank,Convenience Store,Wings Joint,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop
37,M4W,18,4,Park,Playground,Trail,Building,Deli / Bodega,Empanada Restaurant,Electronics Store,Drugstore,Donut Shop,Dog Run
46,M9R,12,4,Bus Line,Park,Mobile Phone Shop,Deli / Bodega,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store,Diner
51,M8X,12,4,Park,River,Pool,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store,Diner,Dim Sum Restaurant
68,M1V,17,4,Coffee Shop,Park,Playground,Dance Studio,Electronics Store,Drugstore,Donut Shop,Dog Run,Discount Store,Diner


In [None]:
# Cluster 4 could be beautiful, as it has a lot of parks. Let's call it the BEAUTIFUL CLUSTER.
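
The cluster names above were chosen by reading the tables. A rough cross-check is to count the most frequent "1st Most Common Venue" per cluster. The sketch below does this for a few rows copied from the tables above (all of cluster 3, the first four rows of cluster 4, and the single rows of clusters 1 and 2). It is an illustration on a subset, not a replacement for inspecting the full output.

```python
from collections import Counter

# (cluster label, 1st most common venue) pairs copied from the tables above.
first_venues = [
    (3, "Baseball Field"), (3, "Baseball Field"), (3, "Garden"),
    (4, "Park"), (4, "Park"), (4, "Park"), (4, "Grocery Store"),
    (1, "Bank"), (2, "Playground"),
]

# Tally the top venue categories separately for each cluster.
by_cluster = {}
for cluster, venue in first_venues:
    by_cluster.setdefault(cluster, Counter())[venue] += 1

for cluster in sorted(by_cluster):
    venue, count = by_cluster[cluster].most_common(1)[0]
    total = sum(by_cluster[cluster].values())
    print("cluster {}: {} ({} of {} rows)".format(cluster, venue, count, total))
```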