# Segmenting and Clustering Toronto Neighborhoods

## Clustering the Neighborhoods

First, we must re-create the dataframe as we did in part 2 of the assignment.

In [1]:
#This cell runs all the necessary code to create the dataframe

import pandas as pd
from bs4 import BeautifulSoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #scrapes the source code
soup = BeautifulSoup(source, 'lxml') #reads the source code
table = soup.find('table') #isolates the source code of the table
body = table.find_all('tr') #converts source code of table into a list of source code of each row

t_headings = [] #creates empty list to be populated with table headings
for th in body[0].find_all('th'):
    t_headings.append(th.text.replace('\n', ' ').strip()) #populates headings list with table headings
    
table_data = [] #creates empty list to be populated with table data
for tr in table.find_all('tr')[1:]:
    t_row = {} #creates empty dictionary to be populated with each row of data
    for td, th in zip(tr.find_all('td'), t_headings):
        t_row[th] = td.text.replace('\n', ' ').strip() #populates dictionary with data
    table_data.append(t_row) #populates data list with each row of data
    
nb_list = pd.DataFrame(table_data) #converts scraped data into Pandas dataframe
nb_list = nb_list[['Postal Code', 'Borough', 'Neighbourhood']] #rearrange the columns
nb_list = nb_list[nb_list.Borough != 'Not assigned'].reset_index(drop = True) #drops all postal codes whose boroughs are not assigned

import pgeocode

country = pgeocode.Nominatim('ca') #sets the country to Canada
lat = [] #create an empty list for latitude coordinates
lng = [] #create an empty list for longitude coordinates

for i in range(nb_list.shape[0]):
    nb = country.query_postal_code(nb_list.iloc[i, 0]) #searches for location data based on postal code in row i of the dataframe
    lat.append(nb.latitude) #appends the latitude coordinate to the latitude coordinates list
    lng.append(nb.longitude) #appends the longitude coordinate to the longitude coordinates list

nb_list['Latitude'] = lat
nb_list['Longitude'] = lng #appends the latitude and longitude coordinates lists as new columns to the dataframe

Before we proceed with our analysis, we should check for and remove any postal codes for which pgeocode returned no coordinates.

In [2]:
nb_list[['Latitude', 'Longitude']].isnull().values.any() #check for any null latitude and longitude values

True

In [3]:
nb_list.dropna(inplace = True) #drop all rows with NaN values
nb_list[['Latitude', 'Longitude']].isnull().values.any() #check for any null latitude and longitude values

False
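The check-and-drop step above can also be restricted to the coordinate columns, so that a missing value in some other column would not remove a row. A minimal sketch on a made-up stand-in for `nb_list` (the `demo` frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for nb_list; the coordinate values are made up.
demo = pd.DataFrame({
    'Postal Code': ['M1A', 'M1B', 'M1C'],
    'Latitude': [43.70, np.nan, 43.78],
    'Longitude': [-79.40, np.nan, -79.16],
})

# Postal codes where either coordinate is missing
missing = demo[demo[['Latitude', 'Longitude']].isnull().any(axis=1)]
print(missing['Postal Code'].tolist())

# Drop only rows lacking coordinates, then renumber the index
cleaned = demo.dropna(subset=['Latitude', 'Longitude']).reset_index(drop=True)
print(cleaned.shape)
```

Listing the affected postal codes before dropping them makes it easier to see which areas are left out of the later analysis.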

Now we can proceed with the rest of our analysis. We begin by loading all the packages we will be using.

In [4]:
from geopy.geocoders import Nominatim

import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

import json
from pandas import json_normalize #top-level import; pandas.io.json.json_normalize is deprecated

Next, let's create a map of Toronto to get a sense of what we're working with.

In [5]:
address = 'Toronto, Ontario'

geo = Nominatim(user_agent = 'toronto_explorer')
toronto = geo.geocode(address)
tlat = toronto.latitude
tlng = toronto.longitude

map_toronto = folium.Map(location = [tlat, tlng], zoom_start = 10)
for lat, lng, borough, neighborhood, code in zip(nb_list['Latitude'], nb_list['Longitude'], nb_list['Borough'], nb_list['Neighbourhood'], nb_list['Postal Code']):
    label = '{} {}: {}'.format(code, borough, neighborhood)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'blue', #fill_color expects a color string, not a boolean
        fill_opacity = 0.7).add_to(map_toronto)
    

map_toronto

Next, we input the credentials needed to use the Foursquare API (this cell is hidden).

We can now define a function to return the venues in a neighborhood.

In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)    
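The function above indexes straight into the response, so it would raise an error if a response had no `groups` entry or a venue had an empty `categories` list. Whether the API ever returns such responses is an assumption here. The sketch below shows one defensive way to flatten a response into rows; `mock_response` and `flatten_items` are hypothetical, and the payload only mimics the fields that `getNearbyVenues` reads.

```python
# Hypothetical payload shaped like the fields getNearbyVenues reads;
# the venue names and coordinates are invented for illustration.
mock_response = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Cafe',
                   'location': {'lat': 43.65, 'lng': -79.38},
                   'categories': [{'name': 'Coffee Shop'}]}},
        {'venue': {'name': 'Example Park',
                   'location': {'lat': 43.66, 'lng': -79.39},
                   'categories': []}},  # venue with no category
    ]}]}
}

def flatten_items(code, lat, lng, payload):
    """Turn one response into a list of row tuples, skipping uncategorized venues."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        if not venue['categories']:
            continue  # no category to record for this venue
        rows.append((code, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     venue['categories'][0]['name']))
    return rows

rows = flatten_items('M5A', 43.6555, -79.3626, mock_response)
print(rows)
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.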

With the function defined, we can get the venues for all our Toronto neighborhoods.

In [8]:
toronto_venues = getNearbyVenues(names = nb_list['Postal Code'],
                                latitudes = nb_list['Latitude'],
                                longitudes = nb_list['Longitude']
                                )

In [9]:
#Examine our venues:
toronto_venues.groupby('Postal Code').count().head()

Postal Code,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,1,1,1,1,1,1
M1C,1,1,1,1,1,1
M1E,34,34,34,34,34,34
M1G,1,1,1,1,1,1
M1H,3,3,3,3,3,3


Now, we must format our data to be usable with k-means clustering.

In [10]:
#One-hot encoding 
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#Adding postal code back in to the dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#Grouping by neighborhood for proportion of nearby venues by category
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped.head()

,Postal Code,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
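To make the averaging step concrete: taking the mean of the one-hot rows for a postal code gives the share of that postal code's venues in each category, so every row sums to 1. A minimal sketch on made-up venue data (the rows in `venues` are invented, not Foursquare results):

```python
import pandas as pd

venues = pd.DataFrame({
    'Postal Code': ['M1E', 'M1E', 'M1E', 'M1E', 'M1H'],
    'Venue Category': ['Pizza Place', 'Pizza Place', 'Bank', 'Coffee Shop', 'Trail'],
})

# dtype=float keeps the indicator columns numeric across pandas versions
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# Mean of 0/1 indicators per postal code = proportion of venues per category
grouped = onehot.groupby('Postal Code').mean().reset_index()
print(grouped)
```

Here M1E has four venues, two of them pizza places, so its `Pizza Place` share is 0.5.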


Next, for a better understanding of our dataset, we create a dataframe to list the most common categories of venues in each neighborhood.

In [11]:
#Create a function to return the most common venues of each neighborhood.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [12]:
#Create a dataframe of each neighborhood and its ten most common venue types
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: #no 'st', 'nd' or 'rd' suffix left, so use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Home Service,Yoga Studio,Eastern European Restaurant,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Financial or Legal Service,Field
1,M1C,Bar,Yoga Studio,Food Truck,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Financial or Legal Service,Field
2,M1E,Pizza Place,Coffee Shop,Grocery Store,Restaurant,Bank,Greek Restaurant,Fast Food Restaurant,Pharmacy,Food & Drink Shop,Medical Center
3,M1G,Korean Restaurant,Yoga Studio,Food Truck,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Financial or Legal Service,Field
4,M1H,Construction & Landscaping,Trail,Lounge,Yoga Studio,Fast Food Restaurant,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Financial or Legal Service
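Note that M1B, M1C and M1G each returned a single venue in the earlier count, yet the table lists ten "most common" categories for them. The extra entries are categories with a share of zero that simply fill the remaining slots. A possible variant, sketched below under the assumption that zero-share categories should be left out, keeps only categories that actually occur (`top_nonzero_categories` is a hypothetical helper, and the example rows are made up):

```python
import pandas as pd

def top_nonzero_categories(row, n):
    # Assumes the first entry is the 'Postal Code' label, as in toronto_grouped
    shares = row.iloc[1:].astype(float)
    # Keep only categories that occur at least once, largest share first
    shares = shares[shares > 0].sort_values(ascending=False)
    return shares.index[:n].tolist()

single = pd.Series({'Postal Code': 'M1B', 'Bank': 0.0, 'Home Service': 1.0, 'Yoga Studio': 0.0})
mixed = pd.Series({'Postal Code': 'M1E', 'Bank': 0.3, 'Coffee Shop': 0.2, 'Pizza Place': 0.5})

print(top_nonzero_categories(single, 10))
print(top_nonzero_categories(mixed, 2))
```

With this variant a postal code can return fewer than `n` categories, so the fixed-width dataframe built above would need padding (for example with `None`) to hold the result.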


Now we can run the k-means clustering algorithm. Since there are over a hundred neighborhoods, we will create 20 clusters.

In [13]:
kclusters = 20

toronto_grouped_clustering = toronto_grouped.drop(columns = 'Postal Code') #the positional axis argument is no longer accepted by recent pandas

kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

array([ 2,  6, 19, 13,  5,  3, 16,  5, 19, 18], dtype=int32)
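One quick sanity check on a choice such as 20 clusters is to count how many postal codes receive each label; many single-member clusters could suggest that the number is too high. A minimal sketch on invented data (the matrix `X` and the choice `k = 3` are illustrative assumptions, not values from this analysis):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category shares for six postal codes (rows) over three categories
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

k = 3  # illustrative choice for this toy data
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Number of rows assigned to each cluster label
sizes = np.bincount(labels, minlength=k)
print(labels, sizes)
```

The same `np.bincount(kmeans.labels_, minlength = kclusters)` call could be applied to the fitted model above.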

Now that we have our cluster labels, we add them into the dataframe.

In [14]:
#Add the cluster labels into a dataframe containing neighborhood location data and most common venue types
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = nb_list

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on = 'Postal Code')

toronto_merged

,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.7545,-79.3300,15.0,Food & Drink Shop,Park,Yoga Studio,Eastern European Restaurant,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Financial or Legal Service,Field
1,M4A,North York,Victoria Village,43.7276,-79.3148,5.0,Hockey Arena,Intersection,Park,Portuguese Restaurant,Coffee Shop,Financial or Legal Service,French Restaurant,Pizza Place,Harbor / Marina,Historic Site
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,16.0,Coffee Shop,Breakfast Spot,Pub,Food Truck,Spa,Event Space,Beer Store,Electronics Store,Restaurant,Thai Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,16.0,Clothing Store,Coffee Shop,Women's Store,Jewelry Store,Restaurant,Toy / Game Store,Men's Store,Sushi Restaurant,Sandwich Place,Food Court
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,16.0,Coffee Shop,Dance Studio,Japanese Restaurant,Café,Burrito Place,Bubble Tea Shop,Mexican Restaurant,Ethiopian Restaurant,Sushi Restaurant,Beer Bar
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282,5.0,Pharmacy,Bank,Park,Skating Rink,Grocery Store,Farmers Market,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.1930,2.0,Home Service,Yoga Studio,Eastern European Restaurant,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Financial or Legal Service,Field
7,M3B,North York,Don Mills,43.7450,-79.3590,3.0,Construction & Landscaping,Pool,Park,Gym,Yoga Studio,Farmers Market,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094,19.0,Pizza Place,Intersection,Breakfast Spot,Gastropub,Gym / Fitness Center,Bank,Pet Store,Pharmacy,Curling Ice,Eastern European Restaurant
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,16.0,Coffee Shop,Clothing Store,Café,Italian Restaurant,Japanese Restaurant,Hotel,Cosmetics Shop,Diner,Lingerie Store,Theater


With the cluster labels now appended to the dataframe, we check to make sure that there are no null values for cluster labels:

In [15]:
toronto_merged['Cluster Labels'].isnull().values.any()

True

The above indicates that there are some null values for cluster labels. Upon examining the code, we can presume that these null values arose because the Foursquare API returned no venues within 500 meters of the neighborhood's coordinates. Unfortunately, this means that these neighborhoods were not included in our k-means clustering analysis. For the purposes of generating a neighborhood map by cluster, it makes sense to place these venue-less neighborhoods in their own cluster.

In [16]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].replace({np.nan: 20})
toronto_merged['Cluster Labels'].isnull().values.any()

False
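The null labels come from the join itself: a postal code present in the left dataframe but absent from the right one gets `NaN`, which also turns the label column into floats (hence values such as `15.0` in the table). A minimal sketch with made-up frames (`left`, `right` and `no_venues_label` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M1A', 'M1B', 'M1C']})
right = pd.DataFrame({'Postal Code': ['M1A', 'M1C'], 'Cluster Labels': [2, 0]})

# Left join on postal code; M1B has no match on the right
merged = left.join(right.set_index('Postal Code'), on='Postal Code')
unmatched = merged.loc[merged['Cluster Labels'].isnull(), 'Postal Code'].tolist()
print(unmatched)

# Give unmatched rows their own label, then restore the integer dtype
no_venues_label = 3  # hypothetical: one past the largest real label in this toy data
merged['Cluster Labels'] = merged['Cluster Labels'].fillna(no_venues_label).astype(int)
print(merged)
```

Listing the unmatched postal codes first is a cheap way to confirm that they are the ones for which no venues were returned.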

Now, with all null cluster labels assigned to their own cluster, we can generate a neighborhood map.

In [17]:
toronto_merged = toronto_merged.astype({'Cluster Labels': int})

map_clusters = folium.Map(location = [tlat, tlng], zoom_start = 10)

#one color per cluster label, including the extra label for venue-less neighborhoods
colors_array = cm.rainbow(np.linspace(0, 1, kclusters + 1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster], #index by label so clusters 0 and 20 do not share a color
        fill = True,
        fill_color = rainbow[cluster],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters

The above map shows us that it is difficult to visualize 20 different clusters well using colors. It may be more useful to perform this analysis on a smaller geographic area using fewer clusters.
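As one possible way to narrow the area, the dataframe could be filtered by borough before repeating the steps above with a smaller `kclusters`. The sketch below assumes that "smaller geographic area" means boroughs with "Toronto" in their name; the rows in `sample` are copied from the merged table shown earlier.

```python
import pandas as pd

# A few rows copied from the merged table above
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M7A', 'M9A', 'M5B'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto',
                'Etobicoke', 'Downtown Toronto'],
})

# Assumption: keep only boroughs whose name contains 'Toronto'
central = sample[sample['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(central['Postal Code'].tolist())
```

The resulting subset could then be passed to `getNearbyVenues` in place of the full `nb_list`.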