# Segmenting and Clustering Neighborhoods in Toronto | Oleksandr Tsapin
Peer-graded Assignment, 19.11.2020

<div class="alert alert-block alert-danger">
<b>( ! ) For a proper view of this notebook, with all the maps, please use "nbviewer" - link provided below ( ! )
    
https://nbviewer.jupyter.org/</b> 
</div>



### PART 1 (LINK 1)
#### Scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Step 1. Fetch the HTML from the URL using urllib.request

In [1]:
# import the library we will be using to connect to the Wikipedia page and fetch the contents of that page
import urllib.request

In [2]:
# specify the URL of the Wikipedia page we are looking to scrape
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

Step 2. Use the power of BeautifulSoup to extract and work with the data

In [4]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

In [5]:
# We use Beautiful Soup to parse the HTML we fetched into the 'page' variable and store the result in a new 
# variable called 'soup'. Beautiful Soup works best when the parser is named explicitly, so we pass 
# the "lxml" option.

soup = BeautifulSoup(page, "lxml")

In [6]:
# To get an idea of the structure of the underlying HTML in our web page, we can view the code in 
# two ways: a) right click on the web page itself and click View Source 
# or b) use Beautiful Soup’s prettify function and check it out right there in our Jupyter Notebook.

#print(soup.prettify())

In [7]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML 
# and store in 'all_tables' variable
all_tables = soup.find_all("table")
#all_tables

In [8]:
# Looking through the output of "all_tables", we can see that the class attribute of our chosen table 
# is "wikitable sortable". We can use this to get Beautiful Soup to bring back only this particular table 
# and keep it in a variable called "right_table".
right_table = soup.find('table', class_='wikitable sortable')
#right_table

In [9]:
# We know that the table is set up in rows (starting with <tr> tags) with the data sitting 
# within <td> tags in each row. We aren’t too worried about the header row with the <th> elements 
# as we know what each of the columns represent by looking at the table.

# Let's start looping through the rows
# There are three columns in our table that we want to scrape the data from so we will set up 
# three empty lists (A, B, and C) to store our data in.

# To start with, we want to use the Beautiful Soup ‘find_all’ function again and set it to look for 
# the string ‘tr’. We will then set up a FOR loop for each row within that array and set Python to loop through 
# the rows, one by one.

# Within the loop we are going to use find_all again to search each row for <td> tags with the ‘td’ string. 
# We will add all of these to a variable called ‘cells’ and then check to make sure that there are 3 items 
# in our ‘cells’ array (i.e. one for each column).

# If there are, we use the find(text=True) option to extract the content string from within each <td> element 
# in that row and add them to the A-C lists we created at the start of this step. Let’s have a look at the code:

A = []
B = []
C = []

for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        A.append(cells[0].find(text = True))
        B.append(cells[1].find(text = True))
        C.append(cells[2].find(text = True))
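
The row-by-row extraction can be tried offline before pointing it at the live page. The sketch below is an illustration only: it swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages, and `SAMPLE_HTML` and `CellCollector` are hypothetical names invented here, not part of the notebook. It keeps the same guard as the loop above (only rows with exactly three `<td>` cells are collected, so the `<th>` header row is skipped) and, unlike the notebook, it strips whitespace right away.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the Wikipedia table, used only for this demo.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # join the fragments and drop the trailing newline
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) == 3:  # skip rows without exactly three <td> cells
                self.rows.append(self._row)
            self._row = None


parser = CellCollector()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows
print(sample_rows)
```

Each inner list corresponds to one table row, in the same column order as the A, B and C lists.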

In [10]:
print(A[:10], B[:10], C[:10])

['M1A\n', 'M2A\n', 'M3A\n', 'M4A\n', 'M5A\n', 'M6A\n', 'M7A\n', 'M8A\n', 'M9A\n', 'M1B\n'] ['Not assigned\n', 'Not assigned\n', 'North York\n', 'North York\n', 'Downtown Toronto\n', 'North York\n', 'Downtown Toronto\n', 'Not assigned\n', 'Etobicoke\n', 'Scarborough\n'] ['Not assigned\n', 'Not assigned\n', 'Parkwoods\n', 'Victoria Village\n', 'Regent Park, Harbourfront\n', 'Lawrence Manor, Lawrence Heights\n', "Queen's Park, Ontario Provincial Government\n", 'Not assigned\n', 'Islington Avenue, Humber Valley Village\n', 'Malvern, Rouge\n']


###### Attention!

We see an unwanted \n at the end of each item in the lists. This is the newline character. Let's remove it before converting the data into a pandas dataframe.

In [11]:
# Strip the trailing newline from every item in each list.
A_clean = [x.strip('\n') for x in A]
B_clean = [x.strip('\n') for x in B]
C_clean = [x.strip('\n') for x in C]

print(A_clean[:10], B_clean[:10], C_clean[:10])

['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A', 'M1B'] ['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto', 'North York', 'Downtown Toronto', 'Not assigned', 'Etobicoke', 'Scarborough'] ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights', "Queen's Park, Ontario Provincial Government", 'Not assigned', 'Islington Avenue, Humber Valley Village', 'Malvern, Rouge']


Step 3. Transform the data into a pandas dataframe

In [12]:
# Pandas lets us convert lists into dataframes which are 2 dimensional data structures with rows and 
# columns, very much like spreadsheets or SQL tables.

# We’ll import pandas and create a dataframe with it, assigning each of the lists A-C into a column 
# with the name of our source table columns i.e. Postal_Code, Borough, Neighbourhood.

import pandas as pd
df = pd.DataFrame(A_clean,columns=['Postal_Code'])
df['Borough'] = B_clean
df['Neighbourhood'] = C_clean
df

Unnamed: 0,Postal_Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Step 4. Clean the data

In [13]:
# Drop rows where the 'Borough' column is 'Not assigned'
Not_assigned = df[df['Borough'] == 'Not assigned'].index
df.drop(Not_assigned, inplace = True)
df

Unnamed: 0,Postal_Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [14]:
# Reset the index and assign the result to df2, the main dataframe we are going to work with.
df2 = df.reset_index(drop = True)
df2

Unnamed: 0,Postal_Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [15]:
# Assignment: If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be 
# the same as the borough.

# Let's check whether any 'Not assigned' values remain in the dataframe.
if 'Not assigned' not in df2.values:
    print('Element does not exist in DataFrame')

Element does not exist in DataFrame
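
No replacement was needed here because the check found nothing. For the case where such rows do exist, the assignment rule could be applied roughly as in the sketch below. The dataframe `toy` and its values are made up for illustration and do not come from the Wikipedia table.

```python
import pandas as pd

# Hypothetical two-row example: the second row has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postal_Code": ["X1A", "X2A"],
    "Borough": ["Example Borough", "Another Borough"],
    "Neighbourhood": ["Example Hood", "Not assigned"],
})

# Boolean mask of the rows whose neighbourhood is missing
missing = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]

print(toy)
```

Rows that already have a neighbourhood are left untouched, because the assignment is restricted to the mask.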


In [16]:
# Use the .shape attribute to print the number of rows and columns of the dataframe.
df2.shape

(103, 3)

### PART 2 (LINK 2)
#### Use the Geocoder Python package to add the latitude and the longitude coordinates of each neighborhood to the dataframe.

Step 1. Get geospatial data

In [17]:
# import geocoder
import geocoder 

latitude=[]
longitude=[]

# The Google geocoding API did not return results, so the ArcGIS provider is used instead.
for postal_code in df2['Postal_Code']:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    print(postal_code, g.latlng)
    while (g.latlng is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        print(postal_code, g.latlng)
    latlng = g.latlng
    latitude.append(latlng[0])
    longitude.append(latlng[1])

M3A [43.75245000000007, -79.32990999999998]
M4A [43.73057000000006, -79.31305999999995]
M5A [43.65512000000007, -79.36263999999994]
M6A [43.72327000000007, -79.45041999999995]
M7A [43.66253000000006, -79.39187999999996]
M9A [43.662630000000036, -79.52830999999998]
M1B [43.811390000000074, -79.19661999999994]
M3B [43.74923000000007, -79.36185999999998]
M4B [43.70718000000005, -79.31191999999999]
M5B [43.65739000000008, -79.37803999999994]
M6B [43.70687000000004, -79.44811999999996]
M9B [43.65034000000003, -79.55361999999997]
M1C [43.78574000000003, -79.15874999999994]
M3C [43.72168000000005, -79.34351999999996]
M4C [43.68970000000007, -79.30681999999996]
M5C [43.65215000000006, -79.37586999999996]
M6C [43.69211000000007, -79.43035999999995]
M9C [43.64857000000006, -79.57824999999997]
M1E [43.765750000000025, -79.17469999999997]
M4E [43.67709000000008, -79.29546999999997]
M5E [43.64536000000004, -79.37305999999995]
M6E [43.68784000000005, -79.45045999999996]
M1G [43.76812000000007, -79.2
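
The `while` loop above keeps asking until the service answers, which can spin forever if a postal code never resolves. A bounded retry is one possible alternative. The sketch below is only an illustration under stated assumptions: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the stand-in lookup replaces the real service so that the example runs offline.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before asking again
    return None  # give up; the caller decides what to do with a missing value


# Stand-in for the real service: returns None twice, then made-up coordinates.
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.0, -79.0]


coords = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
print(coords, "after", calls["n"], "calls")
```

In the notebook, the lookup argument could be `lambda q: geocoder.arcgis(q).latlng`, with a positive `delay_seconds` to avoid hammering the service.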

Step 2. Add new geospatial data to our dataframe df2

In [18]:
df2['Latitude'] = latitude
df2['Longitude'] = longitude
df2

Unnamed: 0,Postal_Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945


In [19]:
# Use the .shape attribute to print the number of rows and columns of the dataframe.
df2.shape

(103, 5)

### PART 3 (LINK 3)
#### Explore and cluster the neighborhoods in Toronto. 

Step 1. Create a map of Toronto with neighborhoods superimposed on top.

In [20]:
# Print number of unique boroughs and neighborhoods in Toronto
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df2['Borough'].unique()),
        len(df2['Neighbourhood'].unique())
    )
)

The dataframe has 10 boroughs and 99 neighborhoods.


In [21]:
# Use geopy library to get the latitude and longitude values of Toronto

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [22]:
# Install Folium to visualize the map of Toronto
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Step 2. I decided to segment and cluster only boroughs that contain the word Toronto. So let's slice the original dataframe and create a new dataframe of the 'Toronto data'.

In [23]:
toronto_data = df2[df2['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal_Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
3,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
4,M4E,East Toronto,The Beaches,43.67709,-79.29547


In [24]:
# Let's get the geographical coordinates of Downtown Toronto.
address = 'Downtown Toronto, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [25]:
# Let's visualize Downtown Toronto with all the neighborhoods from boroughs whose name contains the word Toronto superimposed on top.

# create map of Downtown Toronto using latitude and longitude values
map_DToronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DToronto)  
    
map_DToronto

Step 3. Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [26]:
# Define Foursquare Credentials and Version

CLIENT_ID = 'VA0PCQ2AWBYEFFN0WFOHL3IIZFVLVHPIZRO3CP4RFX4XTRS2' # your Foursquare ID
CLIENT_SECRET = 'K5IV2CCAJNNGAL5TB1STKHKYSUHROOYAU2Z1FSW30HISUFIF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: VA0PCQ2AWBYEFFN0WFOHL3IIZFVLVHPIZRO3CP4RFX4XTRS2
CLIENT_SECRET:K5IV2CCAJNNGAL5TB1STKHKYSUHROOYAU2Z1FSW30HISUFIF
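
Hard-coding and printing API credentials makes them visible to anyone who opens the notebook. One common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are made up for this example. The helper `load_credentials` is hypothetical as well.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping (the process environment by default)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # assumed variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # assumed variable name
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret


# Demo with a plain dict standing in for the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_credentials(demo_env)
print("Credentials loaded:", bool(demo_id and demo_secret))
```

Printing only a confirmation, rather than the values themselves, keeps the secret out of the saved notebook output.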


Step 4. Let's explore the first neighborhood in our dataframe.

In [27]:
# Get the neighborhood's name.
toronto_data.loc[0, 'Neighbourhood']

'Regent Park, Harbourfront'

In [28]:
# Get the neighborhood's latitude and longitude values.
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Regent Park, Harbourfront are 43.65512000000007, -79.36263999999994.


Now, let's get up to 100 venues that are in Regent Park, Harbourfront within a radius of 500 meters.

In [29]:
# First, let's create the GET request URL and store it in the variable url.
latitude = neighborhood_latitude
longitude = neighborhood_longitude
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=VA0PCQ2AWBYEFFN0WFOHL3IIZFVLVHPIZRO3CP4RFX4XTRS2&client_secret=K5IV2CCAJNNGAL5TB1STKHKYSUHROOYAU2Z1FSW30HISUFIF&ll=43.65512000000007,-79.36263999999994&v=20180605&radius=500&limit=100'
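
Building the query string by hand works, but it is easy to misplace an `&` or forget to escape a value. As a hedged alternative, the standard library's `urllib.parse.urlencode` can assemble it from a dictionary. The function name `build_explore_url` and the placeholder credentials below are assumptions for illustration. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Assemble the venues/explore URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude as one value
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are those of the first neighborhood
demo_url = build_explore_url("demo-id", "demo-secret", 43.65512, -79.36264,
                             "20180605", 500, 100)

# Parse the URL back to confirm every parameter survived the round trip
demo_query = parse_qs(urlparse(demo_url).query)
print(demo_query["ll"], demo_query["radius"], demo_query["limit"])
```

`urlencode` percent-encodes the comma inside `ll`. Decoding the query string restores the original value.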

In [30]:
# Send the GET request and examine the results

import requests # library to handle requests

results = requests.get(url).json()
#results

In [31]:
# We know that all the information is in the items key. Let's create the function that extracts the category 
# of the venue.

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [32]:
# Now we are ready to clean the JSON and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']

# flatten the nested JSON into a flat table (pd.json_normalize requires pandas 1.0 or later)
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Figs Breakfast & Lunch,Breakfast Spot,43.655675,-79.364503
3,The Yoga Lounge,Yoga Studio,43.655515,-79.364955
4,Body Blitz Spa East,Spa,43.654735,-79.359874
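
To see what the flattening step does without calling the API, here is a small offline sketch. The list `sample_items` is invented for illustration and only imitates the nesting of the `items` entries used above. It is not a real Foursquare response. The second entry has an empty category list to show why the length check in `get_category_type` matters.

```python
import pandas as pd

# Hypothetical items, shaped like the nested venue records used in this notebook
sample_items = [
    {"venue": {"name": "Example Cafe",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.0, "lng": -79.0}}},
    {"venue": {"name": "Example Park",
               "categories": [],
               "location": {"lat": 43.1, "lng": -79.1}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(sample_items)

# Keep the first category name, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Keep only the last part of each dotted column name
flat.columns = [col.split(".")[-1] for col in flat.columns]

print(flat)
```

The result has the same four columns as `nearby_venues`: name, categories, lat and lng.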


In [33]:
# Let's see how many venues were returned by Foursquare.
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

20 venues were returned by Foursquare.


Step 5. Explore Neighborhoods in boroughs that contain the word Toronto.

In [34]:
# Let's create a function to repeat the same process to all the neighborhoods in boroughs that contain 
# the word Toronto.

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [35]:
# Now let's run the above function on each neighborhood and create a new dataframe called toronto_venues.

toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West,  Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport


In [36]:
# Let's check the size of the resulting dataframe
print(toronto_venues.shape)
toronto_venues.head()

(1738, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65512,-79.36264,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


In [37]:
# Let's check how many venues were returned for each neighborhood
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,60,60,60,60,60,60
"Brockton, Parkdale Village, Exhibition Place",85,85,85,85,85,85
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",100,100,100,100,100,100
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",76,76,76,76,76,76
Central Bay Street,76,76,76,76,76,76
Christie,11,11,11,11,11,11
Church and Wellesley,79,79,79,79,79,79
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,26,26,26,26,26,26
Davisville North,8,8,8,8,8,8


In [38]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 229 unique categories.


Step 6. Analyze Each Neighborhood

In [39]:
# Let's do preprocessing. A big part of preprocessing is encoding - representing every single piece of data 
# in a way that a computer can understand (the name literally means "convert to computer code").
# In many branches of computer science, especially machine learning and digital circuit design, 
# One-Hot Encoding is widely used. One-hot Encoding is a type of vector representation in which all of the elements 
# in a vector are 0, except for one, which has 1 as its value, where 1 represents a boolean specifying a category 
# of the element.

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
# let's examine the new dataframe size.
toronto_onehot.shape

(1738, 230)

In [41]:
# Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category.

toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,0.0,...,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667
1,"Brockton, Parkdale Village, Exhibition Place",0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.023529,0.011765,...,0.0,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.03,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,...,0.013158,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.013158
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.013158,0.013158,0.0,0.0,...,0.0,0.0,0.0,0.013158,0.013158,0.013158,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.012658,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,0.0,0.0,0.012658
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.01,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# Let's confirm the new size
toronto_grouped.shape

(38, 230)
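
The encode-then-average step is easier to follow on a tiny example. The sketch below uses an invented dataframe `toy_venues` with two neighbourhoods, "A" and "B". The names and values are hypothetical and exist only to show how the mean of the one-hot columns becomes a per-neighbourhood category frequency.

```python
import pandas as pd

# Hypothetical venues: four in neighbourhood A, two in neighbourhood B
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bakery", "Park", "Park"],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="")

# Put the neighbourhood name back as the first column
toy_onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby("Neighbourhood").mean().reset_index()

print(toy_grouped)
```

Two of the four venues in A are coffee shops, so its Coffee Shop frequency is 0.5. Both venues in B are parks, so its Park frequency is 1.0.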

In [43]:
# Let's print each neighborhood along with the top 5 most common venues

num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0     Coffee Shop  0.08
1     Cheese Shop  0.03
2  Breakfast Spot  0.03
3    Cocktail Bar  0.03
4        Beer Bar  0.03


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0     Coffee Shop  0.06
1            Café  0.06
2             Bar  0.06
3      Restaurant  0.05
4  Sandwich Place  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
              venue  freq
0       Coffee Shop  0.10
1             Hotel  0.05
2        Restaurant  0.04
3              Café  0.03
4  Asian Restaurant  0.03


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                venue  freq
0  Italian Restaurant  0.08
1         Coffee Shop  0.07
2                Café  0.07
3   French Restaurant  0.04
4                Park  0.04


----Central Bay Street----
            venue  freq
0     Coffee Shop  0.12
1  Clothing Store  0.05


Let's put that into a pandas dataframe.

In [44]:
# First, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
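
Here is a quick check of this function on a single made-up row. The Series `toy_row` is hypothetical. Its first entry plays the role of the Neighbourhood column, which the function skips with `iloc[1:]`.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighbourhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]


# Hypothetical row of a grouped frequency table
toy_row = pd.Series({
    "Neighbourhood": "A",
    "Bakery": 0.1,
    "Coffee Shop": 0.6,
    "Park": 0.3,
})

top_two = return_most_common_venues(toy_row, 2)
print(list(top_two))
```

The function returns category names, not frequencies, ordered from most to least frequent.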

In [45]:
# Now let's create the new dataframe and display the top 10 venues for each neighborhood.

import numpy as np # library to handle data in a vectorized manner

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Farmers Market,Beer Bar,Breakfast Spot,Cocktail Bar,Restaurant,Cheese Shop,Bakery,Seafood Restaurant,Lounge
1,"Brockton, Parkdale Village, Exhibition Place",Coffee Shop,Bar,Café,Restaurant,Gift Shop,Sandwich Place,Nightclub,Japanese Restaurant,Supermarket,Furniture / Home Store
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Restaurant,Café,Italian Restaurant,Bar,Asian Restaurant,Salon / Barbershop,Thai Restaurant,Pub
3,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Café,Coffee Shop,Bar,Park,French Restaurant,Lounge,Sandwich Place,Restaurant,Gym / Fitness Center
4,Central Bay Street,Coffee Shop,Clothing Store,Restaurant,Sandwich Place,Sushi Restaurant,Café,Plaza,Bubble Tea Shop,Cosmetics Shop,Bookstore
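
The column labels above rely on a three-item suffix list with "th" as the fallback, which is fine for ten columns but would produce labels such as "21th" for larger values of num_top_venues. A general ordinal helper is one way around that. The function name `ordinal` is an assumption introduced for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


labels = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(labels[:4])
print(ordinal(11), ordinal(21), ordinal(22))
```

For the first ten columns this gives the same labels as the cell above.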


Step 7. Cluster Neighborhoods

In [46]:
# Run k-means to cluster the neighborhoods into 5 clusters.

# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:100] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 4,
       0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0], dtype=int32)
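
Most neighborhoods land in cluster 0 here, so it helps to see how k-means behaves on data where the groups are obvious. The sketch below is a toy illustration: the array `toy_profiles` holds four invented frequency profiles, two dominated by the first category and two by the last. Passing `n_init=10` explicitly is a choice made for this example and is not taken from the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency profiles (rows: neighbourhoods, columns: venue categories)
toy_profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Fit k-means with two clusters; random_state fixes the initialisation
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_profiles)
toy_labels = toy_kmeans.labels_

# Count how many rows fall into each cluster
toy_sizes = np.bincount(toy_labels)

print(toy_labels, toy_sizes)
```

Counting the labels with `np.bincount` is a quick way to spot a result like the one above, where a single cluster holds almost every neighborhood.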

In [47]:
# Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge neighborhoods_venues_sorted with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal_Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,0.0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Pub,Distribution Center,Restaurant,Electronics Store,Event Space,Food Truck
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,0.0,Coffee Shop,Sandwich Place,Mediterranean Restaurant,Italian Restaurant,Café,Falafel Restaurant,Fried Chicken Joint,Bank,Theater,Gastropub
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,0.0,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Japanese Restaurant,Furniture / Home Store,Theater,Ramen Restaurant,Bookstore,Movie Theater
3,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,0.0,Coffee Shop,Cocktail Bar,Cosmetics Shop,Gastropub,Clothing Store,Restaurant,Café,Hotel,Japanese Restaurant,Beer Bar
4,M4E,East Toronto,The Beaches,43.67709,-79.29547,3.0,Health Food Store,Pub,Trail,Neighborhood,Yoga Studio,Eastern European Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm


In [48]:
# Check types of values in dataframe toronto_merged
toronto_merged.dtypes

Postal_Code                object
Borough                    object
Neighbourhood              object
Latitude                  float64
Longitude                 float64
Cluster Labels            float64
1st Most Common Venue      object
2nd Most Common Venue      object
3rd Most Common Venue      object
4th Most Common Venue      object
5th Most Common Venue      object
6th Most Common Venue      object
7th Most Common Venue      object
8th Most Common Venue      object
9th Most Common Venue      object
10th Most Common Venue     object
dtype: object

In [49]:
# See the shape of the dataframe before dropping NaNs
toronto_merged.shape

(39, 16)

In [50]:
# Drop rows containing NaN (here, any neighbourhood that received no cluster label in the join),
# then reset the index
toronto_merged = toronto_merged.dropna()
toronto_merged = toronto_merged.reset_index(drop = True)

# Change the type of the Cluster Labels column from float to integer
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

In [51]:
# See the shape of the dataframe after dropping NaNs
toronto_merged.shape

(38, 16)
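
The float64 dtype reported for Cluster Labels and the drop from 39 to 38 rows are both consistent with one neighbourhood in toronto_data having no matching row in neighborhoods_venues_sorted (possibly because no venues were returned for it, though the notebook does not confirm this). A left join fills such a row with NaN, and pandas upcasts an integer column to float to hold it. The sketch below reproduces the effect on a tiny made-up dataframe; the names `left` and `right` and all values are illustrative assumptions, not notebook data.

```python
import pandas as pd

# Hypothetical stand-in for toronto_data: three neighbourhoods.
left = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})

# Hypothetical stand-in for neighborhoods_venues_sorted: 'C' has no row.
right = pd.DataFrame({'Neighbourhood': ['A', 'B'],
                      'Cluster Labels': [0, 3]})

# Left join on the key column, in the same way as the notebook cell.
merged = left.join(right.set_index('Neighbourhood'), on='Neighbourhood')
dtype_after_join = merged['Cluster Labels'].dtype  # float, because of the NaN for 'C'

# Drop the unmatched row, renumber the index, and restore the integer dtype.
cleaned = merged.dropna(subset=['Cluster Labels']).reset_index(drop=True)
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)

print(dtype_after_join, len(merged), '->', cleaned['Cluster Labels'].dtype, len(cleaned))
```

Passing `subset=['Cluster Labels']` limits the drop to rows that lack a label, whereas a bare `dropna()` removes rows with a NaN in any column.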

In [52]:
# Finally, let's visualize the resulting clusters

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # labels run from 0 to kclusters-1, so index the palette directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
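
To inspect the marker colours without rendering the map, the palette can be built on its own. This is a minimal sketch assuming the same `kclusters = 5` as in the clustering cell; the variable names `rgba` and `palette` are introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # same value as in the clustering cell

# Sample the rainbow colormap at kclusters evenly spaced points in [0, 1];
# each row of the result is one RGBA tuple.
rgba = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA tuple to a '#rrggbb' string, the format passed to folium.
palette = [colors.rgb2hex(c) for c in rgba]

for label, hex_colour in enumerate(palette):
    print(label, hex_colour)
```

Printing the label next to its colour makes it easier to match markers on the map to the tables in Step 8.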

Step 8. Examine Clusters

Let's examine each cluster and determine the venue categories that distinguish it.

Cluster 1 (Cluster Labels == 0)

In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Pub,Distribution Center,Restaurant,Electronics Store,Event Space,Food Truck
1,Downtown Toronto,0,Coffee Shop,Sandwich Place,Mediterranean Restaurant,Italian Restaurant,Café,Falafel Restaurant,Fried Chicken Joint,Bank,Theater,Gastropub
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Japanese Restaurant,Furniture / Home Store,Theater,Ramen Restaurant,Bookstore,Movie Theater
3,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Cosmetics Shop,Gastropub,Clothing Store,Restaurant,Café,Hotel,Japanese Restaurant,Beer Bar
5,Downtown Toronto,0,Coffee Shop,Farmers Market,Beer Bar,Breakfast Spot,Cocktail Bar,Restaurant,Cheese Shop,Bakery,Seafood Restaurant,Lounge
6,Downtown Toronto,0,Coffee Shop,Clothing Store,Restaurant,Sandwich Place,Sushi Restaurant,Café,Plaza,Bubble Tea Shop,Cosmetics Shop,Bookstore
7,Downtown Toronto,0,Café,Grocery Store,Coffee Shop,Playground,Candy Store,Athletics & Sports,Italian Restaurant,Baby Store,Farm,Escape Room
8,Downtown Toronto,0,Hotel,Coffee Shop,Café,Restaurant,Japanese Restaurant,Gym,American Restaurant,Salad Place,Steakhouse,Asian Restaurant
9,West Toronto,0,Park,Grocery Store,Middle Eastern Restaurant,Brazilian Restaurant,Café,Bar,Bank,Bakery,Athletics & Sports,Furniture / Home Store
10,Downtown Toronto,0,Coffee Shop,Hotel,Japanese Restaurant,Restaurant,Plaza,Aquarium,Park,Deli / Bodega,Boat or Ferry,Electronics Store


Cluster 2 (Cluster Labels == 1)

In [54]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,1,Bus Line,Swim School,Yoga Studio,Elementary School,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm


Cluster 3 (Cluster Labels == 2)

In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Central Toronto,2,French Restaurant,Park,Yoga Studio,Electronics Store,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm


Cluster 4 (Cluster Labels == 3)

In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East Toronto,3,Health Food Store,Pub,Trail,Neighborhood,Yoga Studio,Eastern European Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm


Cluster 5 (Cluster Labels == 4)

In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,4,Playground,Gym Pool,Park,Eastern European Restaurant,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
32,Downtown Toronto,4,Playground,Tennis Court,Park,Bike Trail,Shop & Service,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
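
As a compact complement to the five tables above, the rows can be grouped by cluster label and the most frequent top venue reported for each cluster. The sketch below uses a small dataframe typed in by hand from a few of the rows shown above (not the full toronto_merged), so the result is illustrative only; `sample` and `top_venue` are names introduced here.

```python
import pandas as pd

# A few rows copied by hand from the cluster tables above.
sample = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 2, 3, 4, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Café',
                              'Bus Line', 'French Restaurant',
                              'Health Food Store', 'Playground', 'Playground'],
})

# For each cluster, pick the value that occurs most often in the top-venue column.
top_venue = (sample.groupby('Cluster Labels')['1st Most Common Venue']
                   .agg(lambda s: s.value_counts().idxmax()))

print(top_venue)
```

Applied to toronto_merged instead of `sample`, the same expression would summarize all 38 neighbourhoods in a single table.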
