**Segmenting and Clustering Neighborhoods in Toronto**
--


**Part 1: Building a dataframe**

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, and transform it into a pandas dataframe as follows:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.


In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Set up our dataframe

In [3]:
df = pd.DataFrame(columns = ['Postalcode','Borough','Neighbourhood'])

Scrape the table on Wikipedia using BeautifulSoup.

In [4]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

source = requests.get(wiki_url).text
# the page is HTML, so use an HTML parser rather than the XML one
soup = BeautifulSoup(source, 'html.parser')
table = soup.find('table')

Collect each row of the table and insert it into the dataframe.

In [5]:
for tr in table.find_all('tr'):
    row=[]
    for td_cell in tr.find_all('td'):
        row.append(td_cell.text.strip())
    if len(row)==3:
        df.loc[len(df)] = row
        
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
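
The row-collection loop can be tried offline on a small inline HTML table. This is a minimal sketch, not the notebook's actual run: the HTML string is made up for illustration (it is not the Wikipedia page), and it assumes the built-in `html.parser` backend.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (illustration only).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# The header row holds only <th> cells, so it yields an empty list and is skipped.
rows = []
for tr in sample_table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == 3:
        rows.append(cells)

df_sample = pd.DataFrame(rows, columns=['Postalcode', 'Borough', 'Neighbourhood'])
print(df_sample)
```

Collecting the rows in a list and building the dataframe once avoids growing the dataframe one row at a time.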


Notice that we have postal codes that are not assigned to any borough. We will remove these. We also have some postal codes that are assigned to a borough but not to a neighbourhood. We will set the names of neighbourhoods of these postal codes to be the same as their borough.

In [6]:
not_assigned_indices = df[ df['Borough'] =='Not assigned'].index

df.drop(not_assigned_indices , inplace=True)

In [7]:
df.loc[df['Neighbourhood'] =='Not assigned' , 'Neighbourhood'] = df['Borough']

In [8]:
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Lastly, we combine rows with the same postal code into one row.

In [9]:
df_final = df.groupby(['Postalcode','Borough'], sort=False).agg( ', '.join)

In [10]:
df_final = df_final.reset_index()
df_final.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


Check the dimensions of the final dataframe:

In [11]:
df_final.shape

(103, 3)
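
The three cleaning rules from Part 1 can be checked on a small hand-made frame where the expected result is easy to verify by eye. The rows below are made up to exercise each rule once; they are not taken from the scraped table.

```python
import pandas as pd

# Made-up rows that exercise each rule once (illustration only).
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['Postalcode', 'Borough', 'Neighbourhood'],
)

# Rule 1: drop rows without a borough.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: an unassigned neighbourhood takes the borough's name.
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# Rule 3: one row per postal code, neighbourhoods joined with a comma.
combined = (
    clean.groupby(['Postalcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```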

**Part 2: Adding longitude and latitude values to our dataframe**

We will use the provided .csv file to create a dataframe of the coordinates of each postal code, then merge our old data frame with it.

In [12]:
df_coords = pd.read_csv('Geospatial_Coordinates.csv')
df_coords.columns = ['Postalcode','Latitude','Longitude']
df_coords.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
df_Toronto = pd.merge(df_final, df_coords,on='Postalcode')

In [14]:
df_Toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
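
`pd.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without warning. One way to check for this, sketched here on made-up frames, is a left join with `indicator=True`; the postal code `M9Z` and its borough are hypothetical.

```python
import pandas as pd

# Made-up frames: 'M9Z' deliberately has no coordinates (illustration only).
left = pd.DataFrame({'Postalcode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
coords = pd.DataFrame({'Postalcode': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; the '_merge' column records match status.
checked = pd.merge(left, coords, on='Postalcode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()
print(unmatched)
```

If the list of unmatched codes is empty, the inner join used above loses nothing.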


**Part 3: Exploring and clustering the neighbourhoods of Toronto**

In [15]:
from geopy.geocoders import Nominatim 
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium 

from ipython_secrets import *


print('Libraries imported.')

Libraries imported.


We will use GeoPy to get the latitude and longitude of Toronto.

In [16]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


With Folium, we can create a map of Toronto with all its neighbourhoods labelled.

In [17]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighbourhood in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Borough'], df_Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='#f37735',
        fill=True,
        fill_color='#ffffff',
        fill_opacity=0.2).add_to(map_toronto)
    
map_toronto

We will now cluster the neighbourhoods of Toronto with respect to the venues that they contain using the K-means algorithm. We first find venues near each neighbourhood using the Foursquare API.

In [18]:
CLIENT_ID = 'VYLFQ02EYZHFIT3YVCSJNZNWMQPRPGHZKLHHURXHKE0QZPQQ' 
CLIENT_SECRET = 'Redacted'
VERSION = '20180605' 
LIMIT = 100
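
Credentials are easier to keep out of a shared notebook when they are read from the environment rather than typed into a cell. The following is a minimal sketch under that assumption; the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, not something Foursquare prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping such as os.environ.

    The variable names are hypothetical; use whatever names you export.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(f'Missing environment variable: {missing}') from None

# Demonstration with a plain dict standing in for os.environ.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_foursquare_credentials(demo_env)
```

In the notebook, calling `load_foursquare_credentials()` with no argument would read the real environment.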


In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
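
The lookup `["response"]['groups'][0]['items']` raises an exception if a response comes back without the expected keys, and `categories[0]` fails for a venue with no category. A more defensive parsing step could look like the sketch below. The helper name `extract_venues`, the `'Unknown'` placeholder, and the payload are made up; the payload only mirrors the nesting that the function above indexes into.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response into rows of
    (neighbourhood, lat, lng, venue, venue lat, venue lng, category)."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder when a venue has no category.
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Made-up payload (illustration only).
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]}]}}

rows = extract_venues('Parkwoods', 43.753259, -79.329656, fake_payload)
empty = extract_venues('Nowhere', 0.0, 0.0, {'response': {}})
print(rows)
```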

In [20]:
Toronto_venues = getNearbyVenues(names=df_Toronto['Neighbourhood'],
                                   latitudes=df_Toronto['Latitude'],
                                   longitudes=df_Toronto['Longitude']
                                  )

In [21]:
Toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Donalda Golf & Country Club,43.752816,-79.342741,Golf Course
2,Parkwoods,43.753259,-79.329656,Graydon Hall Manor,43.763923,-79.342961,Event Space
3,Parkwoods,43.753259,-79.329656,LCBO,43.757774,-79.314257,Liquor Store
4,Parkwoods,43.753259,-79.329656,Island Foods,43.745866,-79.346035,Caribbean Restaurant


Let us see how many venues were returned for each neighbourhood.

In [22]:
Toronto_venues_grouped = Toronto_venues.groupby('Neighbourhood').count()[['Venue']]
Toronto_venues_grouped.columns = ["Venue Count"]
Toronto_venues_grouped

Neighbourhood,Venue Count
"Adelaide, King, Richmond",100
Agincourt,100
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",100
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",80
"Alderwood, Long Branch",100
...,...
Willowdale West,100
Woburn,100
"Woodbine Gardens, Parkview Hill",100
Woodbine Heights,100


In [23]:
print('There are {} unique categories of venues.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 284 unique categories of venues.


Use one-hot encoding to prepare the data for clustering.

In [24]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood'] = Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Aquarium,Arcade,Art Gallery,...,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we group the rows by neighbourhood and take the mean of each one-hot column, which gives the frequency of each venue category within that neighbourhood.

In [25]:
Toronto_freq = Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
Toronto_freq.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Aquarium,Arcade,Art Gallery,...,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,...,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
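
To see why the mean of the one-hot columns is a frequency, here is a small worked example on made-up venues. Neighbourhood `A` has four venues, two of which are coffee shops, so its coffee shop frequency should come out as 0.5.

```python
import pandas as pd

# Made-up venues for two neighbourhoods (illustration only).
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery',
                       'Park', 'Park'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the share of that neighbourhood's venues
# that fall into the category, so each row of freq sums to 1.
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```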


Create a dataframe with the top 10 most common venues by neighbourhood.

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = Toronto_freq['Neighbourhood']

for ind in np.arange(Toronto_freq.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_freq.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Hotel,Theater,Café,Pizza Place,Plaza,Sandwich Place,Japanese Restaurant,Gym,Vegetarian / Vegan Restaurant
1,Agincourt,Chinese Restaurant,Indian Restaurant,Clothing Store,Coffee Shop,Caribbean Restaurant,Restaurant,Pharmacy,Bubble Tea Shop,Bookstore,Gas Station
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Chinese Restaurant,Japanese Restaurant,Coffee Shop,Bubble Tea Shop,Vietnamese Restaurant,Bakery,Gas Station,Asian Restaurant,Hong Kong Restaurant,Dessert Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Coffee Shop,Fast Food Restaurant,Pizza Place,Indian Restaurant,Gas Station,Grocery Store,Sandwich Place,Bank,Department Store,Pharmacy
4,"Alderwood, Long Branch",Coffee Shop,Burger Joint,Department Store,Fast Food Restaurant,Seafood Restaurant,Grocery Store,Burrito Place,Restaurant,Bakery,Furniture / Home Store
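
The ranking step can also be written with `Series.nlargest`, shown here on a made-up frequency table. This is an alternative sketch, not the code used above, and `top_categories` is a hypothetical helper name.

```python
import pandas as pd

# Made-up frequency table, indexed by neighbourhood (illustration only).
freq = pd.DataFrame(
    {'Bakery': [0.1, 0.0], 'Coffee Shop': [0.6, 0.2], 'Park': [0.3, 0.8]},
    index=['A', 'B'],
)

def top_categories(row, n):
    """Return the names of the n largest entries of a frequency row."""
    return row.nlargest(n).index.tolist()

top2 = {name: top_categories(row, 2) for name, row in freq.iterrows()}
print(top2)
```

Note that either approach fills all requested slots even when a neighbourhood has fewer distinct categories than that; the remaining slots then hold categories whose frequency is zero.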


We can now run k-means on this dataset.

In [28]:
# set number of clusters
kclusters = 8

Toronto_freq_clustering = Toronto_freq.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_freq_clustering)
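
As a sanity check of what k-means does with frequency vectors, the sketch below clusters four made-up rows: two dominated by the first category and two dominated by the second. The data and the choice of two clusters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: two rows heavy in the first category,
# two rows heavy in the second (illustration only).
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_
print(labels)
```

Rows with similar venue profiles end up with the same label. The label numbers themselves are arbitrary identifiers and carry no ordering.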

Add labels to the most common venues dataframe.

In [29]:
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = df_Toronto

Toronto_merged = Toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Toronto_merged.head() 

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,5,Middle Eastern Restaurant,Coffee Shop,Chinese Restaurant,Supermarket,Japanese Restaurant,Café,Burger Joint,Mediterranean Restaurant,Gym / Fitness Center,Hakka Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,5,Middle Eastern Restaurant,Supermarket,Burger Joint,Restaurant,Bakery,Japanese Restaurant,Grocery Store,Coffee Shop,Chinese Restaurant,Movie Theater
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,4,Coffee Shop,Café,Park,Japanese Restaurant,Italian Restaurant,Gastropub,Diner,Hotel,Farmers Market,Theater
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,5,Clothing Store,Coffee Shop,Furniture / Home Store,Bagel Shop,Dessert Shop,Cosmetics Shop,Fried Chicken Joint,Italian Restaurant,Men's Store,Toy / Game Store
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,6,Coffee Shop,Café,Pizza Place,Sandwich Place,Bakery,Restaurant,Pharmacy,Sushi Restaurant,Thai Restaurant,Japanese Restaurant
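
One thing to watch for after this join: a neighbourhood for which no venues were returned has no row in the venue table, so its cluster label becomes `NaN` and the whole column turns into floats. The sketch below reproduces this on made-up frames and shows one possible way to handle it before plotting.

```python
import pandas as pd

# Made-up frames: 'C' has no venue data, so it gets no cluster label.
areas = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
labelled = pd.DataFrame({'Neighbourhood': ['A', 'B'],
                         'Cluster Labels': [0, 1]})

merged = areas.join(labelled.set_index('Neighbourhood'), on='Neighbourhood')

# Drop the unlabelled row and restore integer labels for use as list indices.
plottable = merged.dropna(subset=['Cluster Labels']).copy()
plottable['Cluster Labels'] = plottable['Cluster Labels'].astype(int)
print(plottable)
```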


Finally, we display the clustered points on a map.

In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'],Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters