# Clustering of neighborhoods in Toronto

Import the necessary packages. I use requests to fetch the HTML text and BeautifulSoup to parse the table.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, 'html.parser')

Take the first table on the page and collect its rows ("tr" elements). Extract the header names from the "th" elements of the first row.

In [2]:
table_html = soup.table
rows = table_html.find_all("tr")
header_row = rows[0].find_all("th")
headers = [x.get_text().strip() for x in header_row]

Loop through the table rows and extract the values of each row from its "td" elements. Skip rows whose borough is not assigned, as requested.

In [3]:
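# Collect the rows in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
records = []
for row in rows[1:]:
    row_val = [x.get_text().strip() for x in row.find_all("td")]
    # Skip malformed rows and rows whose borough is not assigned.
    if len(row_val) < len(headers) or row_val[1] == 'Not assigned':
        continue
    records.append(dict(zip(headers, row_val)))
df = pd.DataFrame(records, columns=headers)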
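As a minimal, self-contained sketch of the parsing steps above, the snippet below runs the same header and row extraction on a small inline HTML table instead of the live Wikipedia page. The `sample_html` string, its column names, and its values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; values are illustrative only.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_soup.table.find_all("tr")

# Header names come from the "th" cells of the first row.
sample_headers = [th.get_text().strip() for th in sample_rows[0].find_all("th")]

# Data rows come from the "td" cells; rows with an unassigned borough are skipped.
kept = []
for tr in sample_rows[1:]:
    values = [td.get_text().strip() for td in tr.find_all("td")]
    if values[1] == "Not assigned":
        continue
    kept.append(dict(zip(sample_headers, values)))

print(sample_headers)
print(kept)
```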

Rename columns to match those requested for the assignment. Show dataframe.

In [4]:
df.columns = ["PostalCode", "Borough", "Neighborhood"]
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Illustrate that there are no instances where a borough has been assigned, but a neighborhood has not.

In [5]:
df[df["Neighborhood"] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood
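String comparison in pandas is case-sensitive, so a check like this only finds placeholders spelled exactly as written. A hedged sketch of a case-insensitive variant, run on made-up rows rather than the scraped table:

```python
import pandas as pd

# Illustrative rows only; the second neighborhood uses a different capitalisation.
sample = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "not assigned"],
})

# Normalise whitespace and case before comparing with the placeholder text.
is_unassigned = sample["Neighborhood"].str.strip().str.lower() == "not assigned"
unassigned_count = int(is_unassigned.sum())
print(unassigned_count)
```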


In [6]:
df.shape

(103, 3)

Download the latitude and longitude of each postal code from the provided CSV link. I tried the geocoder package first, but it did not work.

In [7]:
lat_long = pd.read_csv("https://cocl.us/Geospatial_data")
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge df with lat_long on Postal Code.

In [8]:
df = pd.merge(df, lat_long, how="left", left_on="PostalCode", right_on="Postal Code")
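A left merge keeps every row of the left frame even when no coordinates are found, which leaves NaN values behind silently. One way to check for that, sketched here on a toy frame, is the `indicator=True` option of `pd.merge`. The postal code `M9Z` is a hypothetical unmatched code added for the example.

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"]})  # M9Z is made up
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = pd.merge(left, right, how="left", left_on="PostalCode",
                  right_on="Postal Code", indicator=True)

# Rows marked "left_only" found no coordinates in the right frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```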

Drop redundant column.

In [9]:
df.drop("Postal Code", axis=1, inplace=True)

In [10]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Keep only the boroughs whose name contains "Toronto".

In [11]:
toronto = df[df["Borough"].str.contains("Toronto")]
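To illustrate what the filter keeps, here is a small sketch on a hand-written list of borough names. `str.contains` treats its pattern as a regular expression by default; `regex=False` and `na=False` are optional arguments that make the match literal and treat missing values as non-matches.

```python
import pandas as pd

boroughs = pd.Series(["North York", "Downtown Toronto", "East Toronto", "Etobicoke"])

# Boolean mask: True where the borough name contains the substring "Toronto".
mask = boroughs.str.contains("Toronto", regex=False, na=False)
print(boroughs[mask].tolist())
```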

Provide the Foursquare credentials and the other request parameters for the URL: a 500 m search radius and a limit of 100 venues.

In [12]:
CLIENT_ID = 'KJE04CZL5523T4QADSED3SRCW0N1BRODX2AMKZBOIAY4NRLD'
CLIENT_SECRET = 'JQA5H4XV3E5BPAVRTXKRUVWNORPQOF3B4OTDA212TZMHMAYA'
VERSION = '20200531'
radius = 500
LIMIT = 100
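As a sketch of how these inputs end up in the request, the query string can be assembled with the standard library's `urlencode`, which also escapes special characters. The credential values below are placeholders, the coordinates are those of M5A, and whether the endpoint still accepts such requests is not verified here.

```python
from urllib.parse import urlencode

params = {
    "client_id": "YOUR_CLIENT_ID",          # placeholder, not a real credential
    "client_secret": "YOUR_CLIENT_SECRET",  # placeholder, not a real credential
    "ll": "43.65426,-79.360636",            # latitude,longitude of the search centre
    "v": "20200531",                        # API version date
    "radius": 500,                          # search radius in metres
    "limit": 100,                           # maximum number of venues returned
}

explore_url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
print(explore_url)
```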

Use same function as in lab to determine category names.

In [13]:
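def get_category_type(row):
    # Rows expose the category list as 'categories' or, after json_normalize, 'venue.categories'.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']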
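A short usage sketch may make the two lookup paths clearer. The function is repeated so the snippet runs on its own, and the two rows are hypothetical dictionaries shaped like the venue records.

```python
def get_category_type(row):
    # Try the plain key first, then the flattened key.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows: one with the flattened key, one with an empty category list.
row_a = {'venue.categories': [{'name': 'Bakery'}, {'name': 'Café'}]}
row_b = {'categories': []}

first = get_category_type(row_a)   # name of the first listed category
second = get_category_type(row_b)  # no categories available
print(first, second)
```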

Loop over every Toronto postal code, query Foursquare with its latitude and longitude, and collect up to 100 venues per postal code in a single dataframe. Some of the code is adapted from the lab.

In [14]:
response_df = pd.DataFrame()
for i in range(toronto.shape[0]):
    print(toronto.iloc[i,0])
    latitude = toronto.iloc[i,3]
    longitude = toronto.iloc[i,4]
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'\
        .format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
    results = requests.get(url).json()
    response = results["response"]["groups"][0]["items"]
    # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0).
    nearby_venues = pd.json_normalize(response)
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    nearby_venues["PostalCode"] = toronto.iloc[i,0]
    response_df = pd.concat([response_df, nearby_venues], axis=0)

M5A
M7A
M5B
M5C
M4E
M5E
M5G
M6G
M5H
M6H
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M4N
M5N
M4P
M5P
M6P
M4R
M5R
M6R
M4S
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y
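The flattening step inside the loop can be tried without calling the API. The sketch below applies `pd.json_normalize` to a hand-written list that imitates the nesting the loop indexes into; the exact payload structure is an assumption, and only the fields used by the notebook are included.

```python
import pandas as pd

# Hand-written items imitating the assumed nesting of the API response.
sample_items = [
    {"venue": {"name": "Roselle Desserts",
               "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.653447, "lng": -79.362017}}},
    {"venue": {"name": "Tandem Coffee",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.653559, "lng": -79.361809}}},
]

# Nested dictionaries become dotted column names such as "venue.location.lat".
flat = pd.json_normalize(sample_items)
flat = flat.loc[:, ["venue.name", "venue.categories",
                    "venue.location.lat", "venue.location.lng"]].copy()

# Keep only the name of the first category, then shorten the column names.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None)
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```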


In [15]:
response_df.shape

(1613, 5)

Set postal code as index of df.

In [16]:
response_df.set_index("PostalCode", inplace=True)

In [17]:
response_df.head()

PostalCode,name,categories,lat,lng
M5A,Roselle Desserts,Bakery,43.653447,-79.362017
M5A,Tandem Coffee,Coffee Shop,43.653559,-79.361809
M5A,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
M5A,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
M5A,Body Blitz Spa East,Spa,43.654735,-79.359874


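Use the same process as in the lab: one-hot encode the venue categories, then group by postal code, take the mean of each category column, and reset the index. Each value is then the share of a postal code's venues that fall in that category.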

In [18]:
toronto_one_hot = pd.get_dummies(response_df[["categories"]], prefix="", prefix_sep="")
toronto_grouped = toronto_one_hot.groupby(toronto_one_hot.index).mean().reset_index()
toronto_grouped.head()

Unnamed: 0,PostalCode,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,...,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
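A toy example may help show why the grouped values are fractions. With four made-up venues, one-hot encoding followed by a per-postal-code mean gives the share of each category. This sketch groups by the index level name, which here does the same job as grouping by the index itself.

```python
import pandas as pd

# Four illustrative venues indexed by postal code.
venues = pd.DataFrame(
    {"categories": ["Bakery", "Coffee Shop", "Coffee Shop", "Park"]},
    index=pd.Index(["M5A", "M5A", "M5A", "M4N"], name="PostalCode"),
)

# One column per category, 1.0 where the venue belongs to it.
one_hot = pd.get_dummies(venues[["categories"]], prefix="", prefix_sep="", dtype=float)

# The mean of the indicator columns is the category share per postal code.
freq = one_hot.groupby(level="PostalCode").mean().reset_index()
print(freq)
```

For M5A, two of three venues are coffee shops, so the "Coffee Shop" value is 2/3.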


In [19]:
toronto_grouped.shape

(39, 240)

Create function to select most common categories for each postal code.

In [20]:
def most_common_cats(postal_code, num_values):
    postal_cats = toronto_grouped.set_index("PostalCode").loc[postal_code,:]
    most_common = postal_cats.sort_values(ascending=False).iloc[:num_values]
    most_common_cats = list(most_common.index)
    return most_common_cats
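The core of the function is a descending sort followed by a slice. A minimal sketch on a made-up frequency series:

```python
import pandas as pd

# Illustrative category shares for a single postal code.
freqs = pd.Series({"Coffee Shop": 0.4, "Park": 0.1, "Bakery": 0.3, "Pub": 0.2})

# Sort from most to least frequent and keep the labels of the first two.
top_two = freqs.sort_values(ascending=False).iloc[:2].index.tolist()
print(top_two)
```

Note that when a postal code has fewer distinct categories than `num_values`, the remaining slots are filled with categories whose frequency is zero, and the order among tied values is not meaningful.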

Use lab code, adjusted with change in common venue function format. Create dataframe of top 10 most common venues.

In [21]:
import numpy as np

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
postal_venues_sorted = pd.DataFrame(columns=columns)
postal_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']
for ind in np.arange(toronto_grouped.shape[0]):
    postal_venues_sorted.iloc[ind, 1:] = most_common_cats(toronto_grouped.iloc[ind, 0], num_top_venues)

postal_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Trail,Health Food Store,Neighborhood,Coffee Shop,Pub,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant
1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Bubble Tea Shop,Indian Restaurant,Spa,Juice Bar
2,M4L,Sandwich Place,Park,Fast Food Restaurant,Ice Cream Shop,Pub,Burrito Place,Board Shop,Fish & Chips Shop,Italian Restaurant,Restaurant
3,M4M,Café,Coffee Shop,Bakery,Gastropub,American Restaurant,Brewery,Yoga Studio,Sandwich Place,Fish Market,Italian Restaurant
4,M4N,Park,Swim School,Bus Line,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
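The column names rely on a try/except to pick between "st", "nd", "rd", and "th". An alternative sketch, not taken from the lab, computes the suffix directly; the helper name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

venue_columns = [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(venue_columns)
```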


Cluster postal codes using K Means.

In [22]:
from sklearn.cluster import KMeans

kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(toronto_grouped_clustering)
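To see what the fitted labels look like, here is a toy run of k-means on four hand-made frequency vectors with two obvious groups. The data and the explicit `n_init=10` are choices made for this sketch, not values from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two rows dominated by the first category, two by the second.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = toy_kmeans.labels_  # one integer cluster label per row of X
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label carries meaning.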

Add K Means labels to venues and merge with data on latitude and longitude.

In [23]:
postal_venues_sorted["Labels"] = kmeans.labels_
toronto_merged = toronto.join(postal_venues_sorted.set_index('PostalCode'), on="PostalCode")
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Labels
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Ice Cream Shop,French Restaurant,Chocolate Shop,3
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Coffee Shop,Sushi Restaurant,Yoga Studio,Creperie,Beer Bar,Smoothie Shop,Sandwich Place,Burger Joint,Burrito Place,Café,3
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,Clothing Store,Coffee Shop,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Italian Restaurant,Café,Bubble Tea Shop,Bookstore,Tea Room,3
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,Coffee Shop,Café,Cocktail Bar,American Restaurant,Restaurant,Gastropub,Creperie,Theater,Gym,Clothing Store,3
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,Trail,Health Food Store,Neighborhood,Coffee Shop,Pub,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,3


Create map of clusters in Toronto. Use same format as used in lab with Folium. Note that the map does not render in GitHub.

In [26]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

latitude = 43.6532
longitude = -79.3832  # Toronto lies west of the prime meridian, so the longitude is negative
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
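The marker colors come from sampling the rainbow colormap once per cluster. A stripped-down sketch of just that step, without Folium, assuming five clusters; here the palette is indexed directly by the cluster label.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters assumed for this sketch

# Sample k evenly spaced points of the colormap and convert them to hex strings.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Map each cluster label to its color.
color_for = {cluster: palette[cluster] for cluster in range(k)}
print(color_for)
```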