<h1 align=center><font size = 5>Capstone Week 3: Segmenting and clustering Neighborhoods in Toronto</font></h1>

## Introduction

This notebook is my solution to the Capstone project week3 of the Coursera Course "Applied Data Science with Python" from IBM. 

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Task 1: Create Dataframe of neighbourhoods (web scraping and data cleaning)</a>

2. <a href="#item2">Task 2: Add Geolocation to neighbourhoods</a>

3. <a href="#item3">Exploration of neighbourhoods (EDA)</a>

4. <a href="#item4">Cluster Neighborhoods (k-means w sklearn)</a>

5. <a href="#item5">Examine Clusters</a>
</font>
</div>

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Web-requests are done via module requests
import requests

# For webscraping I will need BeautifulSoup and requests
from bs4 import BeautifulSoup

# For counting in dataframes
from collections import Counter

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

<h1>Task 1: Create Dataframe of neighbourhoods</h1>

The course gave the following source for this project:<br>
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

It is an overview of all postal codes starting with "M".<br>
<b>"Postal codes beginning with M are located within the city of Toronto in the province of Ontario."</b>

Webscraping will be done with the help of BeautifulSoup<br>
https://beautiful-soup-4.readthedocs.io/en/latest/

I found these ressources helpful:<br>
https://pythonprogramminglanguage.com/web-scraping-with-pandas-and-beautifulsoup/

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
data=r.text
soup = BeautifulSoup(data)
# Selects the first <table></table> element in the html file
table = soup.find_all('table')[0]

In [3]:
# Returns a list of elements
df = pd.read_html(str(table))
# The first element is our dataframe
neighbourhoods=df[0]
print(neighbourhoods.head())
print(neighbourhoods.shape)
# 289 Postcodes

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
(289, 3)


In [4]:
# Clean the data 1

# Only cells which have an assigned borough
neighbourhoods=neighbourhoods.loc[neighbourhoods["Borough"] !="Not assigned"]
print(neighbourhoods.head())
print(neighbourhoods.shape)
# 212 Boroughs

  Postcode           Borough     Neighbourhood
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights
(212, 3)


In [5]:
# Clean the data 2

# More than one neighborhood can exist in one postal code area
# Combine these neighborhoods into one row with the neighborhoods separated with a comma

# Make a new column with the combined neighbourhoods
neighbourhood_list=neighbourhoods.groupby("Postcode")["Neighbourhood"].apply(lambda x: ', '.join(x)).reset_index()
#neighbourhood_list.columns=[["Postcode","Neighbourhood"]]

print(neighbourhood_list.head())
print(neighbourhood_list.shape)
# 103 Postal codes left

  Postcode                           Neighbourhood
0      M1B                          Rouge, Malvern
1      M1C  Highland Creek, Rouge Hill, Port Union
2      M1E       Guildwood, Morningside, West Hill
3      M1G                                  Woburn
4      M1H                               Cedarbrae
(103, 2)


In [6]:
# Add boroughs back in
neighbourhoods=neighbourhood_list.merge(neighbourhoods[["Postcode","Borough"]]).drop_duplicates()
print(neighbourhoods.head())
print(neighbourhoods.shape)

  Postcode                           Neighbourhood      Borough
0      M1B                          Rouge, Malvern  Scarborough
2      M1C  Highland Creek, Rouge Hill, Port Union  Scarborough
5      M1E       Guildwood, Morningside, West Hill  Scarborough
8      M1G                                  Woburn  Scarborough
9      M1H                               Cedarbrae  Scarborough
(103, 3)


In [7]:
# Clean the data 3

# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
# So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

print(neighbourhoods[neighbourhoods["Neighbourhood"]=="Not assigned"])
# Only 1 case

    Postcode Neighbourhood       Borough
160      M7A  Not assigned  Queen's Park


In [8]:
neighbourhoods.loc[160,"Neighbourhood"]="Queen's Park"
print(neighbourhoods[neighbourhoods["Neighbourhood"]=="Not assigned"])

Empty DataFrame
Columns: [Postcode, Neighbourhood, Borough]
Index: []


In [9]:
# Reorder the columns
neighbourhoods=neighbourhoods[["Postcode","Borough","Neighbourhood"]].reset_index()
neighbourhoods.drop("index",axis=1, inplace=True)
print(neighbourhoods.head())

  Postcode      Borough                           Neighbourhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae


<b>So this is the solution for Task 1: The Dataframe of the Toronto Neighbourhoods:</b>

In [10]:
print(neighbourhoods.shape)

(103, 3)


In [11]:
print(neighbourhoods.head(10))

  Postcode      Borough                                    Neighbourhood
0      M1B  Scarborough                                   Rouge, Malvern
1      M1C  Scarborough           Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough                Guildwood, Morningside, West Hill
3      M1G  Scarborough                                           Woburn
4      M1H  Scarborough                                        Cedarbrae
5      M1J  Scarborough                              Scarborough Village
6      M1K  Scarborough      East Birchmount Park, Ionview, Kennedy Park
7      M1L  Scarborough                  Clairlea, Golden Mile, Oakridge
8      M1M  Scarborough  Cliffcrest, Cliffside, Scarborough Village West
9      M1N  Scarborough                      Birch Cliff, Cliffside West


<h1>Task 2: Adding latitude and the longitude coordinates of each neighborhood.</h1>

The data is provided in the following csv file: http://cocl.us/Geospatial_data 

In [12]:
url='http://cocl.us/Geospatial_data'

In [13]:
location=pd.read_csv(url)
location.columns=["Postcode","Latitude","Longitude"]
print(location.head())

  Postcode   Latitude  Longitude
0      M1B  43.806686 -79.194353
1      M1C  43.784535 -79.160497
2      M1E  43.763573 -79.188711
3      M1G  43.770992 -79.216917
4      M1H  43.773136 -79.239476


In [14]:
neighbourhoods2=neighbourhoods.merge(location)
print(neighbourhoods2.head())
print(neighbourhoods2.shape)

  Postcode      Borough                           Neighbourhood   Latitude  \
0      M1B  Scarborough                          Rouge, Malvern  43.806686   
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union  43.784535   
2      M1E  Scarborough       Guildwood, Morningside, West Hill  43.763573   
3      M1G  Scarborough                                  Woburn  43.770992   
4      M1H  Scarborough                               Cedarbrae  43.773136   

   Longitude  
0 -79.194353  
1 -79.160497  
2 -79.188711  
3 -79.216917  
4 -79.239476  
(103, 5)


<h1>Task 3: Exploring the neighbourhood</h1>

Quickly examine the resulting dataframe.

#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [15]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [16]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(neighbourhoods2['Latitude'], neighbourhoods2['Longitude'], neighbourhoods2['Borough'], neighbourhoods2['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [17]:
# I decided to look into the area around the airport
North_York = neighbourhoods2[neighbourhoods2['Borough'] == 'North York'].reset_index(drop=True)
North_York.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493


Let's get the geographical coordinates of North York.

In [18]:
address = 'North York, Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of North York are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of North York are 43.7708175, -79.4132998.


Let's visualize the North York area

In [19]:
# create map of Manhattan using latitude and longitude values
map_North_York = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(North_York['Latitude'], North_York['Longitude'], North_York['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_North_York)  
    
map_North_York

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [20]:
CLIENT_ID = 'GO4Z4KGOTWHWX2HCNJZNTHQOXOLLCSUOMCUFL4OE2W1NU5PI' # your Foursquare ID
CLIENT_SECRET = 'C2KBW0KV3TLTVY1F5LZ3NXI1DD5H1BRXHUBAKXVTKIDSKFYG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: GO4Z4KGOTWHWX2HCNJZNTHQOXOLLCSUOMCUFL4OE2W1NU5PI
CLIENT_SECRET:C2KBW0KV3TLTVY1F5LZ3NXI1DD5H1BRXHUBAKXVTKIDSKFYG


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [21]:
North_York.loc[0, 'Neighbourhood']

'Hillcrest Village'

Get the neighborhood's latitude and longitude values.

In [22]:
neighbourhood_latitude = North_York.loc[0, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = North_York.loc[0, 'Longitude'] # neighborhood longitude value

neighbourhood_name = North_York.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Hillcrest Village are 43.8037622, -79.3634517.


#### Now, let's get the top 100 venues that are in Hillcrest Village within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [23]:
radius = 500
latitude = neighbourhood_latitude
longitude = neighbourhood_longitude
LIMIT = 100

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
results = requests.get(url).json()
print(results)

{'meta': {'code': 200, 'requestId': '5c7eccfc351e3d13a545d02b'}, 'response': {'headerLocation': 'Toronto', 'headerFullLocation': 'Toronto', 'headerLocationGranularity': 'city', 'totalResults': 4, 'suggestedBounds': {'ne': {'lat': 43.808262204500004, 'lng': -79.3572281853783}, 'sw': {'lat': 43.7992621955, 'lng': -79.3696752146217}}, 'groups': [{'type': 'Recommended Places', 'name': 'recommended', 'items': [{'reasons': {'count': 0, 'items': [{'summary': 'This spot is popular', 'type': 'general', 'reasonName': 'globalInteractionReason'}]}, 'venue': {'id': '4ad9dce6f964a520651b21e3', 'name': "Eagle's Nest Golf Club", 'location': {'address': '10000 Dufferin Rd', 'lat': 43.805454826002794, 'lng': -79.36418592243415, 'labeledLatLngs': [{'label': 'display', 'lat': 43.805454826002794, 'lng': -79.36418592243415}], 'distance': 197, 'cc': 'CA', 'city': 'Toronto', 'state': 'ON', 'country': 'Canada', 'formattedAddress': ['10000 Dufferin Rd', 'Toronto ON', 'Canada']}, 'categories': [{'id': '4bf58dd8d

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [24]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [25]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Eagle's Nest Golf Club,Golf Course,43.805455,-79.364186
1,AY Jackson Pool,Pool,43.804515,-79.366138
2,Villa Madina,Mediterranean Restaurant,43.801685,-79.363938
3,Duncan Creek Park,Dog Run,43.805539,-79.360695


And how many venues were returned by Foursquare?

In [26]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


<a id='item2'></a>

## More exploration of the Neighborhoods in North York

#### Using the function from previous assignment to repeat the same process to all the neighborhoods in North York

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Running the function on each neighborhood and create a new dataframe *North_York_venues*.

In [28]:
North_York_venues = getNearbyVenues(names=North_York['Neighbourhood'],
                                   latitudes=North_York['Latitude'],
                                   longitudes=North_York['Longitude']
                                  )

Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Bedford Park, Lawrence Manor East
Lawrence Heights, Lawrence Manor
Glencairn
Maple Leaf Park, North Park, Upwood Park
Humber Summit
Emery, Humberlea


#### Let's check the size of the resulting dataframe

In [29]:
print(North_York_venues.shape)
North_York_venues.head()

(233, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
2,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
3,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run
4,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,CF Fairview Mall,43.777803,-79.344226,Shopping Mall


Let's check how many venues were returned for each neighborhood

In [30]:
North_York_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Downsview North, Wilson Heights",17,17,17,17,17,17
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
"CFB Toronto, Downsview East",4,4,4,4,4,4
Don Mills North,5,5,5,5,5,5
Downsview Central,3,3,3,3,3,3
Downsview Northwest,4,4,4,4,4,4
Downsview West,4,4,4,4,4,4
"Emery, Humberlea",1,1,1,1,1,1
"Fairview, Henry Farm, Oriole",66,66,66,66,66,66


#### Let's find out how many unique categories can be curated from all the returned venues

In [31]:
print('There are {} uniques categories.'.format(len(North_York_venues['Venue Category'].unique())))

There are 108 uniques categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [32]:
# one hot encoding
North_York_onehot = pd.get_dummies(North_York_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
North_York_onehot['Neighbourhood'] = North_York_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [North_York_onehot.columns[-1]] + list(North_York_onehot.columns[:-1])
North_York_onehot = North_York_onehot[fixed_columns]

North_York_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Fairview, Henry Farm, Oriole",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [33]:
North_York_onehot.shape

(233, 109)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [34]:
North_York_grouped = North_York_onehot.groupby('Neighbourhood').mean().reset_index()
North_York_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,...,0.058824,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Emery, Humberlea",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Fairview, Henry Farm, Oriole",0.0,0.0,0.015152,0.0,0.030303,0.0,0.030303,0.015152,0.0,...,0.0,0.015152,0.0,0.015152,0.030303,0.015152,0.0,0.0,0.015152,0.030303


#### Let's confirm the new size

In [35]:
North_York_grouped.shape

(23, 109)

#### Let's print each neighborhood along with the top 5 most common venues

In [36]:
num_top_venues = 5

for hood in North_York_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = North_York_grouped[North_York_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
         venue  freq
0  Coffee Shop  0.12
1  Supermarket  0.06
2     Pharmacy  0.06
3  Pizza Place  0.06
4        Diner  0.06


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                  venue  freq
0           Coffee Shop  0.08
1    Italian Restaurant  0.08
2  Fast Food Restaurant  0.08
3           Pizza Place  0.08
4     Indian Restaurant  0.04


----CFB Toronto, Downsview East----
                        venue  freq
0                     Airport  0.25
1  Construction & Landscaping  0.25
2                        Park  0.25
3                    Bus Stop  0.25
4                       Plaza  0.00


----Don Mills North----
                  venue  freq
0  Gym / Fitness Center   0.2
1  Caribbean Restaurant   0.2
2                  Café 

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = North_York_grouped['Neighbourhood']

for ind in np.arange(North_York_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(North_York_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Frozen Yogurt Shop,Supermarket,Pharmacy,Pizza Place,Deli / Bodega,Bridal Shop,Restaurant,Diner,Shopping Mall
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
2,"Bedford Park, Lawrence Manor East",Coffee Shop,Fast Food Restaurant,Pizza Place,Italian Restaurant,Grocery Store,Indian Restaurant,Comfort Food Restaurant,Café,Liquor Store,Butcher
3,"CFB Toronto, Downsview East",Airport,Park,Construction & Landscaping,Bus Stop,Women's Store,Discount Store,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop,Deli / Bodega
4,Don Mills North,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Café,Basketball Court,Dog Run,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [39]:
# set number of clusters
kclusters = 5

North_York_grouped_clustering = North_York_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(North_York_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 4, 0, 0, 0, 0, 1, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [40]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

North_York_merged = North_York

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
North_York_merged = North_York_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

North_York_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,0.0,Golf Course,Pool,Mediterranean Restaurant,Dog Run,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,0.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Asian Restaurant,Bakery,Cosmetics Shop,Women's Store,Toy / Game Store,Deli / Bodega,Candy Store
2,M2K,North York,Bayview Village,43.786947,-79.385975,0.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,3.0,Cafeteria,Women's Store,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,,,,,,,,,,,


In [41]:
print(North_York_merged[North_York_merged["Cluster Labels"].isnull()])

  Postcode     Borough            Neighbourhood   Latitude  Longitude  \
4      M2M  North York  Newtonbrook, Willowdale  43.789053 -79.408493   

   Cluster Labels 1st Most Common Venue 2nd Most Common Venue  \
4             NaN                   NaN                   NaN   

  3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue  \
4                   NaN                   NaN                   NaN   

  6th Most Common Venue 7th Most Common Venue 8th Most Common Venue  \
4                   NaN                   NaN                   NaN   

  9th Most Common Venue 10th Most Common Venue  
4                   NaN                    NaN  


In [42]:
North_York_merged.drop(4,inplace=True)

In [43]:
print(North_York_merged[North_York_merged["Cluster Labels"].isnull()])

Empty DataFrame
Columns: [Postcode, Borough, Neighbourhood, Latitude, Longitude, Cluster Labels, 1st Most Common Venue, 2nd Most Common Venue, 3rd Most Common Venue, 4th Most Common Venue, 5th Most Common Venue, 6th Most Common Venue, 7th Most Common Venue, 8th Most Common Venue, 9th Most Common Venue, 10th Most Common Venue]
Index: []


Finally, let's visualize the resulting clusters

In [44]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(North_York_merged['Latitude'], North_York_merged['Longitude'], North_York_merged['Neighbourhood'], North_York_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [45]:
North_York_merged.loc[North_York_merged['Cluster Labels'] == 0, North_York_merged.columns[[1] + list(range(5, North_York_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0.0,Golf Course,Pool,Mediterranean Restaurant,Dog Run,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop
1,North York,0.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Asian Restaurant,Bakery,Cosmetics Shop,Women's Store,Toy / Game Store,Deli / Bodega,Candy Store
2,North York,0.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
5,North York,0.0,Ramen Restaurant,Restaurant,Sandwich Place,Pizza Place,Coffee Shop,Café,Middle Eastern Restaurant,Bubble Tea Shop,Plaza,Pharmacy
7,North York,0.0,Pizza Place,Pharmacy,Coffee Shop,Butcher,Chocolate Shop,Clothing Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
8,North York,0.0,Park,Food & Drink Shop,Fast Food Restaurant,Women's Store,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop
9,North York,0.0,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Café,Basketball Court,Dog Run,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
10,North York,0.0,Grocery Store,Gym,Asian Restaurant,Beer Store,Coffee Shop,Bike Shop,Chinese Restaurant,General Entertainment,Dim Sum Restaurant,Japanese Restaurant
11,North York,0.0,Coffee Shop,Frozen Yogurt Shop,Supermarket,Pharmacy,Pizza Place,Deli / Bodega,Bridal Shop,Restaurant,Diner,Shopping Mall
12,North York,0.0,Coffee Shop,Caribbean Restaurant,Furniture / Home Store,Miscellaneous Shop,Bar,Massage Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop


#### Cluster 2

In [46]:
North_York_merged.loc[North_York_merged['Cluster Labels'] == 1, North_York_merged.columns[[1] + list(range(5, North_York_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,North York,1.0,Baseball Field,Women's Store,Electronics Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop


#### Cluster 3

In [47]:
North_York_merged.loc[North_York_merged['Cluster Labels'] == 2, North_York_merged.columns[[1] + list(range(5, North_York_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,North York,2.0,Empanada Restaurant,Women's Store,Dog Run,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store


#### Cluster 4

In [48]:
North_York_merged.loc[North_York_merged['Cluster Labels'] == 3, North_York_merged.columns[[1] + list(range(5, North_York_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,3.0,Cafeteria,Women's Store,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop


#### Cluster 5

In [49]:
North_York_merged.loc[North_York_merged['Cluster Labels'] == 4, North_York_merged.columns[[1] + list(range(5, North_York_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,North York,4.0,Park,Bank,Women's Store,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
13,North York,4.0,Airport,Park,Construction & Landscaping,Bus Stop,Women's Store,Discount Store,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop,Deli / Bodega
21,North York,4.0,Park,Construction & Landscaping,Bakery,Basketball Court,Women's Store,Dog Run,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop,Deli / Bodega


<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).