<h1>Segmenting and Clustering Toronto Neighborhoods</h1>

This project is part of the IBM Data Science Professional Certificate Capstone.

<b>Author</b>: Julian Oellrich

<h2>Table of contents</h2>

<ol>
  <li>Setup and Libraries</li>
  <li>Data Import</li>
  <li>Geo-Data of Toronto Boroughs</li>
  <li>Explore Neighbourhood Venues</li>
  <li>Analyse each neighbourhood</li>
  <li>Cluster Neighbourhoods</li>
  <li>Examine Clusters</li>
</ol>

<h2 id="libraries">1. Setup and Libraries</h2>

Import Libraries

In [76]:
# Computation libraries
import pandas as pd
import numpy as np

# API Communication
import json
import requests

# Plotting 
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# Maps and geocoding
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium

# ML Algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# JSON handling (json itself is already imported above)
import lxml
from pandas import json_normalize # transform JSON into a pandas DataFrame

# data scraping modules
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

# Print Statement
print('Libraries imported!')

Libraries imported!


<h2 id="data_import">2. Data Import</h2>

<h3>Scrape Data from Wikipedia</h3>

Scrape the data from the Wikipedia page with the BeautifulSoup library.

In [77]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly to avoid a parser warning

Extracting data from table on Wikipedia page

In [78]:
table_data = soup.find('div', class_='mw-parser-output')
table = table_data.table.tbody

Write Table data into new dataframe

In [79]:
columns = ['PostalCode', 'Borough', 'Neighbourhood']
data = {key: [] for key in columns}  # one empty list per column

for row in table.find_all('tr'):
    for i,column in zip(row.find_all('td'),columns):
        i = i.text
        i = i.replace('\n', '')
        data[column].append(i)

df = pd.DataFrame.from_dict(data=data)[columns]
print(df.shape)
df.head(10)

(180, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


<h3>Clean Data</h3>

<h4>Remove not assigned Boroughs</h4>

Only the cells that have an assigned borough should be processed. That means cells with a borough that is 'Not assigned' should be removed from the data frame.

1. The first step is to count how many rows have a 'Not assigned' Borough:

In [80]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df[df['Borough'] == 'Not assigned'].shape[0])
print(print_statement)

There are 77 rows where Borough is Not assigned


2. Drop the rows where Borough is 'Not assigned' and write it into new dataframe df_cleaned

In [81]:
df_cleaned = df.drop(df[df['Borough'] == 'Not assigned'].index) 
df_cleaned.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


3. Check if there are any rows with Not assigned Borough in the new dataframe

In [82]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df_cleaned[df_cleaned['Borough'] == 'Not assigned'].shape[0])
print(print_statement)

There are 0 rows where Borough is Not assigned
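
The count, drop, and re-check steps above can also be expressed as a single boolean-mask filter. The following is a minimal sketch on a small hypothetical frame (`toy` and `toy_cleaned` are illustrative names, not part of the notebook); `reset_index` is optional and is not used in the notebook itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned;
# reset_index(drop=True) renumbers the remaining rows from 0
toy_cleaned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

print(toy_cleaned.shape)
```

Filtering with `!=` keeps the wanted rows directly, so no separate index lookup is needed.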


<h4>Fill in not assigned neighborhood</h4>

If a cell has a borough but a 'Not assigned' neighbourhood, the neighbourhood is set to the borough name.

Let's check how many rows have a 'Not assigned' Neighbourhood:

In [83]:
print_statement = 'There are {} rows where Neighbourhood is Not assigned'.format(
    df_cleaned[df_cleaned['Neighbourhood'] == 'Not assigned'].shape[0])
print(print_statement)

There are 0 rows where Neighbourhood is Not assigned


<b>Conclusion:</b> There are <b>no rows</b> with a 'Not assigned' Neighbourhood but an assigned Borough.
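
No fill-in was needed for this data. Had such rows existed, the rule could be applied roughly as sketched below. The frame `toy` is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood
toy = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where Neighbourhood is 'Not assigned', copy the Borough name over
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

print(toy)
```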

<h4>Check for duplicate postal codes</h4>

In [84]:
# create new test dataframe with only neighbourhoods and postal codes
# (.copy() avoids a SettingWithCopyWarning when the new column is added below)
df_neigh = df_cleaned[['PostalCode','Neighbourhood']].copy()

# check PostalCode duplicates and create new bool column that marks duplicates with True
df_neigh['duplicate bool'] = df_neigh['PostalCode'].duplicated(keep = 'first')

# Output value count of duplicate PostalCode values
df_neigh['duplicate bool'].value_counts()

False    103
Name: duplicate bool, dtype: int64

<b>Conclusion:</b> There are no duplicate PostalCode values.

<h4>Final cleaned DataFrame</h4>

Finally, rename the column 'PostalCode' to 'Postal Code'.

In [85]:
df_cleaned.rename(columns={'PostalCode': 'Postal Code'}, inplace = True)

Print the shape and the first rows of the cleaned DataFrame

In [86]:
# define print strings
print_statement1 = '\nThe cleaned dataframe has {} rows and {} columns \n'.format(df_cleaned.shape[0], df_cleaned.shape[1])

# output strings
print(print_statement1)

# output head of dataframe
df_cleaned.head()


The cleaned dataframe has 103 rows and 3 columns 



Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<h2 id="geodata">3. Geo-Data of Toronto Boroughs</h2>

<h3>Import geospatial coordinates</h3>

Import latitude and longitude data from csv file 'Toronto_Geospatial_Coordinates.csv' provided by Coursera

In [87]:
geospatial_toronto = pd.read_csv('Toronto_Geospatial_Coordinates.csv')
geospatial_toronto.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>Merge geospatial coordinates to toronto DataFrame</h3>

Merge 'geospatial_toronto' into 'df_cleaned' on the key 'Postal Code' and save the result to a new DataFrame, 'df_toronto'

In [88]:
df_toronto = pd.merge(df_cleaned, geospatial_toronto, how= 'inner', on ='Postal Code')
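
An inner merge silently drops postal codes that have no coordinates. As a hedged sketch of how such rows could be detected, the example below uses two small hypothetical frames (`left`, `right`, and the postal code `M9Z` are made up for illustration) together with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical inputs: 'M9Z' has no matching coordinates
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Hypothetical']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column;
# validate='one_to_one' raises if a postal code is duplicated on either side
checked = pd.merge(left, right, how='left', on='Postal Code',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used in the notebook loses no rows.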

Explore some facts about the Toronto DataFrame

In [89]:
# define print strings
print_statement0 = '\n* --- Some facts about the toronto DataFrame --- *'
print_statement1 = '\nThe toronto dataframe has {} rows and {} columns'.format(df_toronto.shape[0], df_toronto.shape[1])
print_statement2 = 'Toronto has {} unique Boroughs'.format(len(df_toronto['Borough'].unique()))
print_statement3 = 'Toronto has {} unique Postal Codes'.format(len(df_toronto['Postal Code'].unique()))

print_statement9 = '\nThe Dataframe looks as following: \n'

# output strings
print(print_statement0)
print(print_statement1)
print(print_statement2)
print(print_statement3)
print(print_statement9)

# output head of dataframe
df_toronto.head(8)


* --- Some facts about the toronto DataFrame --- *

The toronto dataframe has 103 rows and 5 columns
Toronto has 10 unique Boroughs
Toronto has 103 unique Postal Codes

The Dataframe looks as following: 



Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188


<h3>Create a Toronto Map</h3>

<h4>Get Toronto Coordinates</h4>

In [90]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
lat = location.latitude
lng = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(lat, lng))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<h4>Create a map of Toronto with neighborhoods superimposed on top.</h4>

In [91]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lat, lng], zoom_start=10)

# add markers to map
# (loop variables are named n_lat/n_lng so the city coordinates in lat/lng are not overwritten)
for n_lat, n_lng, postalcode, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Postal Code'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{} | {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [n_lat, n_lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h2 id="explore_venues">4. Explore Neighbourhood Venues</h2>

<h3>Define Foursquare Credentials and Version</h3>

In [92]:
CLIENT_ID = 'LUYAHPN5TJHESOZ0M0IR4DWVVH3BXMXTECOVF4L2PNCXS2YV' # your Foursquare ID
CLIENT_SECRET = 'JI3SI0ENMW45AYTRQUQQKHALUNFL34NMHW0S0I3DF10QEFCW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LUYAHPN5TJHESOZ0M0IR4DWVVH3BXMXTECOVF4L2PNCXS2YV
CLIENT_SECRET:JI3SI0ENMW45AYTRQUQQKHALUNFL34NMHW0S0I3DF10QEFCW


<h3>Let's explore an example neighbourhood</h3>

To test the venue exploration, I will first explore the venues of a single neighbourhood.
After that, I will define a function that repeats the exploration for all neighbourhoods.

Get the name of the first neighbourhood

In [93]:
neighbourhood_name = df_toronto.loc[0, 'Neighbourhood']
print('The first Neighbourhood is:',neighbourhood_name)

The first Neighbourhood is: Parkwoods


Get the neighbourhood's coordinates

In [94]:
neighbourhood_latitude = df_toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = df_toronto.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


<h4>Get the top 100 venues that are in Parkwoods within a radius of 1000 meters.</h4>

Create the GET request URL

In [95]:
radius = 1000

In [96]:
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=LUYAHPN5TJHESOZ0M0IR4DWVVH3BXMXTECOVF4L2PNCXS2YV&client_secret=JI3SI0ENMW45AYTRQUQQKHALUNFL34NMHW0S0I3DF10QEFCW&v=20180605&ll=43.7532586,-79.3296565&radius=1000&limit=100'
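
The same request URL can be assembled from a parameter dictionary instead of a long format string, with the credentials read from the environment so they do not appear in the notebook output. This is only a sketch: the environment variable names and the placeholder values are assumptions, and no request is sent.

```python
import os
from urllib.parse import urlencode

# Assumed environment variable names; the placeholders are used if they are unset
client_id = os.environ.get('FOURSQUARE_CLIENT_ID', 'YOUR_CLIENT_ID')
client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET', 'YOUR_CLIENT_SECRET')

# Same query parameters as in the format string above
params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'v': '20180605',
    'll': '43.7532586,-79.3296565',
    'radius': 1000,
    'limit': 100,
}

base_url = 'https://api.foursquare.com/v2/venues/explore'

# urlencode percent-encodes the values (the comma in 'll' becomes %2C)
explore_url = base_url + '?' + urlencode(params)
```

Alternatively, `requests.get(base_url, params=params)` builds the query string itself.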

Get results JSON

In [97]:
results = requests.get(url).json()
#results

All the important information is in the **items** key of the result json.

In order to extract the category information of the venues a function **get_category_type** is defined

In [98]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a **pandas** dataframe.

In [99]:
venues = results['response']['groups'][0]['items']

# flatten JSON
nearby_venues = json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Allwyn's Bakery,Caribbean Restaurant,43.75984,-79.324719
1,Brookbanks Park,Park,43.751976,-79.33214
2,Tim Hortons,Café,43.760668,-79.326368
3,Bruno's valu-mart,Grocery Store,43.746143,-79.32463
4,A&W,Fast Food Restaurant,43.760643,-79.326865


<h3>Define a function to get nearby venues</h3>

<p>A function is created that extracts all nearby venues in a neighbourhood and puts them into a dataframe in an organized way.</p>

<p>The resulting dataframe will have the following columns: 
<ul>
  <li>Neighbourhood </li>
  <li>Neighbourhood Latitude  </li>
  <li>Neighbourhood Longitude  </li>
  <li>Venue </li>
  <li>Venue Latitude </li>
  <li>Venue Longitude  </li>
  <li>Venue Category </li>
</ul>
</p>

In [100]:
def getNearbyVenues(names, latitudes, longitudes, radius = 10000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
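
The list comprehension above assumes every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. The sketch below shows one possible defensive variant of the extraction step. `extract_venue_rows` is a hypothetical helper, and `sample_items` is a made-up fragment shaped like `results['response']['groups'][0]['items']`, not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of failing
        category = categories[0]['name'] if categories else None
        location = venue.get('location', {})
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hypothetical response fragment; the second venue has no category
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

rows = extract_venue_rows('Parkwoods', 43.7532586, -79.3296565, sample_items)
print(rows)
```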

<h3>Run the function on all Toronto Neighbourhoods</h3>


In [101]:
toronto_venues = getNearbyVenues(names = df_toronto['Neighbourhood'],
                                   latitudes = df_toronto['Latitude'],
                                   longitudes = df_toronto['Longitude'],
                                   radius = 1000
                                  )

print('\n * ----------------------------- * \n \n Getting nearby venues COMPLETED!')

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

<h5>Examine the resulting dataframe</h5>

In [102]:
# define print strings
print_statement0 = '\n* --- Some facts about the Venue DataFrame --- *'
print_statement1 = '\nThe toronto_venues dataframe has {} rows and {} columns'.format(toronto_venues.shape[0], toronto_venues.shape[1])
print_statement2 = 'Toronto has {} unique venue categories'.format(len(toronto_venues['Venue Category'].unique()))
print_statement3 = 'Toronto has {} unique venues'.format(len(toronto_venues['Venue'].unique()))

print_statement9 = '\nThe Dataframe looks as following: \n'

# output strings
print(print_statement0)
print(print_statement1)
print(print_statement2)
print(print_statement3)
print(print_statement9)

# output head of dataframe
toronto_venues.head(8)


* --- Some facts about the Venue DataFrame --- *

The toronto_venues dataframe has 4878 rows and 7 columns
Toronto has 334 unique venue categories
Toronto has 2800 unique venues

The Dataframe looks as following: 



Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store
4,Parkwoods,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant
5,Parkwoods,43.753259,-79.329656,Shoppers Drug Mart,43.760857,-79.324961,Pharmacy
6,Parkwoods,43.753259,-79.329656,High Street Fish & Chips,43.74526,-79.324949,Fish & Chips Shop
7,Parkwoods,43.753259,-79.329656,Shoppers Drug Mart,43.745315,-79.3258,Pharmacy


<h2 id="analyse_neighbourhoods">5. Analyse each neighbourhood</h2>

<h3>Perform one hot encoding</h3>

Run one-hot encoding on the toronto_venues dataframe

In [103]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix ="", prefix_sep ="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

In [104]:
# define print strings
print_statement0 = '\n* --- Some facts about the Venue DataFrame --- *'
print_statement1 = '\nThe toronto_onehot dataframe has {} rows and {} columns'.format(toronto_onehot.shape[0], toronto_onehot.shape[1])
print_statement9 = '\nThe Dataframe looks as following: \n'

# output strings
print(print_statement0)
print(print_statement1)
print(print_statement9)

# output head of dataframe
toronto_onehot.head(8)


* --- Some facts about the Venue DataFrame --- *

The toronto_onehot dataframe has 4878 rows and 335 columns

The Dataframe looks as following: 



Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<h3>Group rows by neighbourhood</h3>
Group rows by neighbourhood, taking the mean of the frequency of occurrence of each category

In [105]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
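
To make the effect of one-hot encoding followed by a grouped mean concrete, here is a small sketch on hypothetical data (`toy_venues` and its neighbourhoods 'A' and 'B' are made up). The mean of the indicator columns is the share of each category among a neighbourhood's venues, so each row sums to 1.

```python
import pandas as pd

# Hypothetical venues: four in neighbourhood A, two in neighbourhood B
toy_venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Café', 'Park', 'Park', 'Café', 'Bank'],
})

# One indicator column per category; dtype=float gives numeric 0.0/1.0 columns
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', toy_venues['Neighbourhood'])

# Mean of the indicators = share of each category within a neighbourhood
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Neighbourhood A has three parks among four venues, so its `Park` value is 0.75.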

In [106]:
# define print strings
print_statement0 = '\n* --- Some facts about the Venue DataFrame --- *'
print_statement1 = '\nThe toronto_grouped dataframe has {} rows and {} columns'.format(toronto_grouped.shape[0], toronto_grouped.shape[1])
print_statement9 = '\nThe Dataframe looks as following: \n'

# output strings
print(print_statement0)
print(print_statement1)
print(print_statement9)

# output head of dataframe
toronto_grouped.head(10)


* --- Some facts about the Venue DataFrame --- *

The toronto_grouped dataframe has 98 rows and 335 columns

The Dataframe looks as following: 



Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,...,0.02439,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0
5,Berczy Park,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Brockton, Parkdale Village, Exhibition Place",0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h3>Get top venues of each neighbourhood</h3>

<h5>Define function to extract most common venues of a neighbourhood</h5>

In [107]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
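
As a quick illustration of what the function returns, the sketch below repeats its definition so that it runs on its own and applies it to one hypothetical row (`toy_row` is made up). The first entry, the neighbourhood name, is skipped, and the remaining category names come back sorted by frequency.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row shaped like one row of toronto_grouped
toy_row = pd.Series({'Neighbourhood': 'A', 'Bank': 0.0, 'Café': 0.25, 'Park': 0.75})

top = return_most_common_venues(toy_row, 2)
print(list(top))
```

Categories with equal frequency have no guaranteed relative order, which explains the seemingly arbitrary tail entries for neighbourhoods with few venues.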

<h5>Get top 10 venues of each neighbourhood and write it to a dataframe</h5>

In [108]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Shopping Mall,Bakery,Coffee Shop,Caribbean Restaurant,Sandwich Place,Lounge,Skating Rink,Latin American Restaurant,Sushi Restaurant
1,"Alderwood, Long Branch",Discount Store,Pharmacy,Convenience Store,Pizza Place,Gas Station,Shopping Mall,Liquor Store,Donut Shop,Park,Sandwich Place
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Frozen Yogurt Shop,Dog Run,Gas Station,Chinese Restaurant,Sushi Restaurant,Supermarket,Middle Eastern Restaurant,Trail
3,Bayview Village,Bank,Grocery Store,Japanese Restaurant,Gas Station,Chinese Restaurant,Park,Restaurant,Café,Skating Rink,Trail
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Park,Pizza Place,Sandwich Place,Bank,Café,Bagel Shop,Thai Restaurant,Bakery


<h2 id="clustering">6. Cluster Neighbourhoods</h2>

<h5>Run k-means to cluster the neighbourhoods</h5>

In [109]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 3, 2, 2, 0, 2, 2, 2])
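
To show what the labels mean, the sketch below runs k-means on a tiny hypothetical frequency matrix (`X` is made up: two park-heavy and two café-heavy rows). Rows with similar category profiles receive the same label; the label numbers themselves carry no meaning.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category-frequency rows: columns are [park share, café share]
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# n_init is set explicitly because its default has changed across scikit-learn versions
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

Fixing `random_state` makes the run reproducible, as in the notebook cell above.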

<h5>Create a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood.</h5>

In [110]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto

# join the sorted venue table onto the Toronto data to add cluster labels and top venues for each neighbourhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on = 'Neighbourhood')

In [111]:
toronto_merged.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3.0,Park,Convenience Store,Bus Stop,Shopping Mall,Pharmacy,Fish & Chips Shop,Food & Drink Shop,Café,Fast Food Restaurant,Supermarket
1,M4A,North York,Victoria Village,43.725882,-79.315572,3.0,Coffee Shop,Men's Store,Boxing Gym,Gym / Fitness Center,Portuguese Restaurant,Sporting Goods Shop,Golf Course,Hockey Arena,Intersection,Playground
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2.0,Coffee Shop,Pub,Theater,Park,Café,Breakfast Spot,Bakery,Diner,Restaurant,Performing Arts Venue
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,2.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Restaurant,Fried Chicken Joint,Dessert Shop,Vietnamese Restaurant,Furniture / Home Store,Sushi Restaurant,Accessories Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2.0,Coffee Shop,Park,Gay Bar,Italian Restaurant,Sushi Restaurant,Pizza Place,Café,Ramen Restaurant,Gastropub,Yoga Studio
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,3.0,Pharmacy,Bakery,Bank,Shopping Mall,Convenience Store,Grocery Store,Park,Playground,Café,Skating Rink
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2.0,Coffee Shop,Trail,Fast Food Restaurant,Bank,Restaurant,Chinese Restaurant,Paper / Office Supplies Store,Bakery,Gym,Park
7,M3B,North York,Don Mills,43.745906,-79.352188,2.0,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Supermarket,Bank,Burger Joint,Mobile Phone Shop,Café,Asian Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,3.0,Gym / Fitness Center,Pizza Place,Construction & Landscaping,Brewery,Coffee Shop,Rock Climbing Spot,Gastropub,Bank,Bakery,Fast Food Restaurant
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2.0,Coffee Shop,Gastropub,Italian Restaurant,Diner,Japanese Restaurant,Hotel,Restaurant,Sushi Restaurant,Plaza,Pizza Place


Check whether the DataFrame has any rows with NaN values (meaning no venues were found). If so, those rows are dropped from the DataFrame.

In [112]:
# print actual number of NaN rows in DataFrame
number_NaN = toronto_merged['Cluster Labels'].isna().sum()
if number_NaN != 1:
    print_statement1 = '\nThe DataFrame toronto_merged has {} rows with NaN values'.format(number_NaN)
    print(print_statement1)
elif number_NaN == 1:
    print_statement11 = '\nThe DataFrame toronto_merged has {} row with NaN values'.format(number_NaN)
    print(print_statement11)

# drop NaN rows if there are any
if number_NaN > 0:
    toronto_merged.dropna(axis = 'index', inplace = True)
    
    # define print statements
    print_statement2 = '\n{} row(s) dropped from the DataFrame'.format(number_NaN)
    print_statement3 = 'The DataFrame now has {} rows \n'.format(toronto_merged.shape[0])
    
    # print output
    print(print_statement2)
    print(print_statement3)
    
else:
    print('\nNo rows need to be dropped from DataFrame\n')


The DataFrame toronto_merged has 1 row with NaN values

1 row(s) dropped from the DataFrame
The DataFrame now has 102 rows 



Change type of Cluster values in DataFrame to int

In [113]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

In [114]:
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3,Park,Convenience Store,Bus Stop,Shopping Mall,Pharmacy,Fish & Chips Shop,Food & Drink Shop,Café,Fast Food Restaurant,Supermarket
1,M4A,North York,Victoria Village,43.725882,-79.315572,3,Coffee Shop,Men's Store,Boxing Gym,Gym / Fitness Center,Portuguese Restaurant,Sporting Goods Shop,Golf Course,Hockey Arena,Intersection,Playground
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Pub,Theater,Park,Café,Breakfast Spot,Bakery,Diner,Restaurant,Performing Arts Venue
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,2,Clothing Store,Fast Food Restaurant,Coffee Shop,Restaurant,Fried Chicken Joint,Dessert Shop,Vietnamese Restaurant,Furniture / Home Store,Sushi Restaurant,Accessories Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Park,Gay Bar,Italian Restaurant,Sushi Restaurant,Pizza Place,Café,Ramen Restaurant,Gastropub,Yoga Studio


<h5>Visualize the resulting clusters on a map</h5>

In [115]:
# create map centred on the geocoded Toronto coordinates
map_clusters = folium.Map(location=[location.latitude, location.longitude], zoom_start = 11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
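
The colour scheme in the cell above only needs one hex colour per cluster. A minimal sketch of that step on its own follows, assuming five clusters as in this notebook; `palette` is an illustrative name.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample the rainbow colormap at evenly spaced points, one per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA row to a hex string such as '#rrggbb' for folium
palette = [colors.rgb2hex(c) for c in colors_array]
print(palette)
```

Indexing with `rainbow[cluster-1]`, as the map cell does, still gives each cluster its own colour, because index -1 for cluster 0 wraps around to the last entry.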

<h2 id="Examine">7. Examine Clusters</h2>

Now each cluster can be examined to determine the discriminating venue categories that distinguish each cluster.

<h3>Cluster 1</h3>

In [116]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,"West Deane Park, Princess Gardens, Martin Grov...",0,Park,Pizza Place,Restaurant,Bank,Grocery Store,Gym,Mexican Restaurant,Clothing Store,Fish & Chips Shop,Hotel
12,"Rouge Hill, Port Union, Highland Creek",0,Playground,Burger Joint,Italian Restaurant,Park,Breakfast Spot,Food & Drink Shop,Field,Escape Room,Ethiopian Restaurant,Event Space
22,Woburn,0,Coffee Shop,Park,Chinese Restaurant,Fast Food Restaurant,Indian Restaurant,Mobile Phone Shop,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant
57,"Humberlea, Emery",0,Auto Workshop,Park,Golf Course,Convenience Store,Bakery,Intersection,Storage Facility,Discount Store,Gas Station,Zoo
58,"Birch Cliff, Cliffside West",0,Park,Convenience Store,Auto Workshop,Gym Pool,Gym,General Entertainment,Diner,Restaurant,College Stadium,Skating Rink
66,York Mills West,0,Park,Restaurant,Coffee Shop,Gym,Pet Store,Dog Run,Chinese Restaurant,Bowling Alley,Playground,Grocery Store
91,Rosedale,0,Coffee Shop,Park,Grocery Store,Breakfast Spot,Bistro,Bank,Sandwich Place,BBQ Joint,Filipino Restaurant,Athletics & Sports


<h3>Cluster 2</h3>

In [117]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,"Northwest, West Humber - Clairville",1,Lounge,Coffee Shop,Zoo,Field,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


<h3>Cluster 3</h3>

In [118]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Regent Park, Harbourfront",2,Coffee Shop,Pub,Theater,Park,Café,Breakfast Spot,Bakery,Diner,Restaurant,Performing Arts Venue
3,"Lawrence Manor, Lawrence Heights",2,Clothing Store,Fast Food Restaurant,Coffee Shop,Restaurant,Fried Chicken Joint,Dessert Shop,Vietnamese Restaurant,Furniture / Home Store,Sushi Restaurant,Accessories Store
4,"Queen's Park, Ontario Provincial Government",2,Coffee Shop,Park,Gay Bar,Italian Restaurant,Sushi Restaurant,Pizza Place,Café,Ramen Restaurant,Gastropub,Yoga Studio
6,"Malvern, Rouge",2,Coffee Shop,Trail,Fast Food Restaurant,Bank,Restaurant,Chinese Restaurant,Paper / Office Supplies Store,Bakery,Gym,Park
7,Don Mills,2,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Supermarket,Bank,Burger Joint,Mobile Phone Shop,Café,Asian Restaurant
9,"Garden District, Ryerson",2,Coffee Shop,Gastropub,Italian Restaurant,Diner,Japanese Restaurant,Hotel,Restaurant,Sushi Restaurant,Plaza,Pizza Place
13,Don Mills,2,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Supermarket,Bank,Burger Joint,Mobile Phone Shop,Café,Asian Restaurant
14,Woodbine Heights,2,Coffee Shop,Park,Café,Skating Rink,Pizza Place,Sandwich Place,Dance Studio,Athletics & Sports,Curling Ice,Farmers Market
15,St. James Town,2,Café,Coffee Shop,Japanese Restaurant,Restaurant,Beer Bar,Hotel,Gastropub,Seafood Restaurant,Bakery,Italian Restaurant
19,The Beaches,2,Pub,Coffee Shop,Pizza Place,Breakfast Spot,Beach,Japanese Restaurant,Health Food Store,Caribbean Restaurant,Sandwich Place,Bar


<h3>Cluster 4</h3>

In [119]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Parkwoods,3,Park,Convenience Store,Bus Stop,Shopping Mall,Pharmacy,Fish & Chips Shop,Food & Drink Shop,Café,Fast Food Restaurant,Supermarket
1,Victoria Village,3,Coffee Shop,Men's Store,Boxing Gym,Gym / Fitness Center,Portuguese Restaurant,Sporting Goods Shop,Golf Course,Hockey Arena,Intersection,Playground
5,"Islington Avenue, Humber Valley Village",3,Pharmacy,Bakery,Bank,Shopping Mall,Convenience Store,Grocery Store,Park,Playground,Café,Skating Rink
8,"Parkview Hill, Woodbine Gardens",3,Gym / Fitness Center,Pizza Place,Construction & Landscaping,Brewery,Coffee Shop,Rock Climbing Spot,Gastropub,Bank,Bakery,Fast Food Restaurant
10,Glencairn,3,Grocery Store,Fast Food Restaurant,Gas Station,Coffee Shop,Pizza Place,Italian Restaurant,Pub,Gym,Metro Station,Mediterranean Restaurant
16,Humewood-Cedarvale,3,Pizza Place,Coffee Shop,Convenience Store,Optical Shop,Bagel Shop,Grocery Store,Bank,Sandwich Place,Gastropub,Dance Studio
17,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",3,Coffee Shop,Farmers Market,Pharmacy,Liquor Store,Shopping Plaza,Beer Store,Grocery Store,Gas Station,College Rec Center,Shopping Mall
18,"Guildwood, Morningside, West Hill",3,Pizza Place,Fast Food Restaurant,Bank,Coffee Shop,Pharmacy,Liquor Store,Bus Line,Sandwich Place,Supermarket,Greek Restaurant
21,Caledonia-Fairbanks,3,Pharmacy,Park,Pizza Place,Portuguese Restaurant,Food Truck,Bus Stop,Mexican Restaurant,Fast Food Restaurant,Falafel Restaurant,Bakery
26,Cedarbrae,3,Coffee Shop,Pharmacy,Bank,Gas Station,Indian Restaurant,Bakery,Music Store,Hakka Restaurant,Caribbean Restaurant,Chinese Restaurant


<h3>Cluster 5</h3>

In [120]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,"York Mills, Silver Hills",4,Park,Pool,Zoo,Field,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market
101,"Old Mill South, King's Mill Park, Sunnylea, Hu...",4,Park,Ice Cream Shop,Italian Restaurant,Bus Stop,Shopping Mall,Eastern European Restaurant,Gym / Fitness Center,Dessert Shop,Design Studio,Event Space
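
The tables above list up to ten categories per neighbourhood, which makes the differences between clusters hard to see at a glance. One possible way to condense them is to count how often each category appears among the "Most Common Venue" columns within a cluster. The sketch below does this on a small stand-in frame whose rows are copied from the tables above (only the first two venue columns are included to keep it short). The helper name top_categories is hypothetical and not part of the notebook.

```python
import pandas as pd

# Stand-in for toronto_merged: a few rows copied from the cluster tables
sample = pd.DataFrame({
    'Neighbourhood': ['Woburn', 'Rosedale', 'Don Mills',
                      'The Beaches', 'Parkwoods', 'Cedarbrae'],
    'Cluster Labels': [0, 0, 2, 2, 3, 3],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Restaurant',
                              'Pub', 'Park', 'Coffee Shop'],
    '2nd Most Common Venue': ['Park', 'Park', 'Coffee Shop',
                              'Coffee Shop', 'Convenience Store', 'Pharmacy'],
})

def top_categories(df, n=3):
    """Count venue categories per cluster and keep the n most frequent ones."""
    venue_cols = [c for c in df.columns if c.endswith('Most Common Venue')]
    # Reshape to one row per (cluster, category) occurrence
    long = df.melt(id_vars='Cluster Labels', value_vars=venue_cols,
                   value_name='Category')
    counts = (long.groupby(['Cluster Labels', 'Category'])
                  .size()
                  .reset_index(name='Count'))
    # Highest counts first within each cluster; ties broken alphabetically
    counts = counts.sort_values(['Cluster Labels', 'Count', 'Category'],
                                ascending=[True, False, True])
    return counts.groupby('Cluster Labels').head(n).reset_index(drop=True)

summary = top_categories(sample)
print(summary)
```

Applied to the full toronto_merged frame, the same function would give a short per-cluster profile that could be used to name each cluster.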
