# Segmenting and Clustering Neighbourhoods in Toronto - Assignment

## 1) Scraping Canadian postcodes beginning with M (Toronto area) from Wikipedia

First we shall import the necessary libraries, scrape the data from the web page and prepare it for conversion into a DataFrame:

In [1]:
#Import libraries

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import lxml
import requests     #Imported under its own name so it does not shadow the standard-library re module


In [2]:
#Create web address object

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
#Scrape data from Wikipedia page

r = requests.get(url).text

#Create Beautiful Soup object

sp = bs(r, 'lxml')

In [4]:
#Extracting and cleaning rows from html

rows = sp.find_all('tr')
lst = []

for row in rows:
    td = row.find_all('td')             #Finding cells in each row
    cells = str(td)                     #Converting to string
    ct = bs(cells, "lxml").get_text()   #Cleaning text
    ct2 = ct.replace("[", "").replace("]", "").replace("\n", "").replace("Not assigned", "")     #Replace unwanted characters and Not assigned values
    lst.append(ct2)
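
As a hedged alternative to converting each row to a string and stripping brackets, the cells can be kept as a list per row, so that later steps do not depend on splitting on commas. The sketch below runs on a small hypothetical HTML sample (`SAMPLE_HTML`) rather than the live page, and uses the built-in `html.parser` backend instead of `lxml`; the helper name `extract_rows` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the three-column layout of the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

def extract_rows(html):
    """Return one list of stripped cell strings per data row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows hold <th> cells only, so they produce no entries
            records.append(cells)
    return records

records = extract_rows(SAMPLE_HTML)
print(records)
```

Each record could then be passed to `pd.DataFrame(records, columns=[...])` directly, which avoids the later `str.split(',')` step.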

Now we will convert our data from list form to a DataFrame and reformat it to meet the criteria of the assignment:

In [5]:
#Converting rows to dataframe

pc = pd.DataFrame(data = lst)
pc.drop([0, 289, 290, 291, 292, 293], inplace = True)        #Dropping unwanted rows
pc2 = pc[0].str.split(',', expand = True)         #Splitting data
pc2.rename(columns = {0 : "PostalCode", 1 : "Borough", 2 : "Neighbourhood"}, inplace = True)         #Renaming columns
pc3 = pc2[["PostalCode", "Borough" , "Neighbourhood"]]             #Dropping unwanted columns

In [6]:
#Grouping the Postcodes

pc4 = pc3.groupby(['PostalCode', 'Borough'], as_index=False, sort=False).agg(', '.join)

In [7]:
#Dropping Not Assigned Boroughs

pc5 = pc4[pc4.Borough != ' '].copy()     #Copy so that later .loc assignments do not act on a view of pc4

In [9]:
#Setting Not Assigned Neighbourhoods to Boroughs

pc5.loc[pc5.Neighbourhood == ' ', 'Neighbourhood'] = pc5.Borough

In [10]:
#Reset index

pc5.reset_index(drop = True, inplace = True)
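
The cleaning steps above (drop unassigned boroughs, fall back to the borough name, combine neighbourhoods per postcode) can be sketched on a toy table. This is a minimal illustration, assuming the literal string "Not assigned" is still present rather than blanked out as in our cells; the rows and the names `raw`, `clean` and `grouped` are made up for the example.

```python
import pandas as pd

# Toy rows in the same three-column layout; the values are illustrative only.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# 1. Drop rows whose borough is not assigned; .copy() gives an independent frame.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Where the neighbourhood is not assigned, fall back to the borough name.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# 3. Combine neighbourhoods that share a postcode into one comma-separated string.
grouped = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

Doing the fallback before the grouping means the join never has to deal with empty strings.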

### Answer for Question 1:

In [11]:
#Finished dataframe

pc5

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


Finally, the shape of our Dataframe:

In [12]:
pc5.shape

(103, 3)

## 2) Adding the latitude and longitude for each postcode from the Geospatial Data CSV

We shall import the data from the CSV file, load it into a DataFrame and merge that DataFrame with our existing Canadian postcode DataFrame:

In [13]:
#Create csv file object

csv = "https://cocl.us/Geospatial_data"

In [14]:
#Read data to dataframe

ll = pd.read_csv(csv)
ll.rename(columns = {"Postal Code" : "PostalCode"}, inplace = True)    #Renaming column to match

In [15]:
#Merging latitude and longitude data with Canadian postcode dataframe

pl = pc5.merge(right = ll)
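
The default merge is an inner join on the shared column, so any postcode without coordinates would be dropped silently. A hedged sketch of how this could be checked, using the `how`, `validate` and `indicator` options of `DataFrame.merge` on toy data (the postcode `M9Z` and the frame names are hypothetical):

```python
import pandas as pd

postcodes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],            # M9Z is a made-up code
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Left join keeps every postcode; validate raises if a key is duplicated on
# either side; indicator adds a _merge column recording where each row matched.
checked = postcodes.merge(
    coords, on="PostalCode", how="left", validate="one_to_one", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used above loses nothing.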

### Answer for Question 2:

In [16]:
#Finished Dataframe

pl

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |


## 3) Exploring and Clustering Neighbourhoods in Toronto

Importing the necessary packages, libraries and credentials:

In [17]:
#Install packages

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

!conda install -c conda-forge folium=0.5.0 --yes
import folium

In [18]:
#Import libraries

import requests

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

In [19]:
#Importing Foursquare credentials (have since been reset)

CLIENT_ID = 'JIGNDYJU5UYTDOZQMOKXHXZINYDD4OTC0CCHKDXV21OXPYA4' # Foursquare ID
CLIENT_SECRET = 'Z4NS0GCKN2LE1XNO353BG42JXUVXHVERDPJWESZZW1TJTJZB' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
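
Since credentials written into a notebook end up wherever the notebook is shared, one option is to read them from environment variables instead. The following is a minimal sketch; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not part of any Foursquare tooling.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables.

    The two variable names are hypothetical choices for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set {missing.args[0]} before running the notebook"
        ) from None

# Usage with a stand-in mapping instead of the real environment:
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "example-id", "FOURSQUARE_CLIENT_SECRET": "example-secret"}
)
print(demo_id)
```

In the notebook itself the call would take no argument, so the values come from `os.environ`.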

Fetching coordinates and creating map with neighbourhoods marked:

In [20]:
#Fetching Toronto's coordinates

address = 'Toronto'

geolocator = Nominatim(user_agent = "Tor")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The coordinates of Toronto are 43.653963, -79.387207.


In [34]:
#Create map of Toronto
Tmap = folium.Map(location = [latitude + 0.04, longitude], zoom_start=11)        #Added 0.04 latitude to catch all neighbourhoods

#Add markers to map
for lat, lng, label in zip(pl['Latitude'], pl['Longitude'], pl['Neighbourhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html=False).add_to(Tmap)  
    
Tmap
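
The 0.04 latitude offset above was chosen by eye. As a sketch of an alternative, the map centre and bounds can be derived from the marker coordinates themselves. The sample rows are copied from the merged table in section 2; the helper names `bounding_box` and `centre_of` are made up for this example.

```python
import pandas as pd

# Four coordinate pairs copied from the merged postcode table.
sample = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.654260, 43.806686],
    "Longitude": [-79.329656, -79.315572, -79.360636, -79.194353],
})

def bounding_box(df):
    """Return the [south, west] and [north, east] corners of the points."""
    south_west = [df["Latitude"].min(), df["Longitude"].min()]
    north_east = [df["Latitude"].max(), df["Longitude"].max()]
    return south_west, north_east

def centre_of(df):
    """Return the midpoint of the bounding box as [latitude, longitude]."""
    (south, west), (north, east) = bounding_box(df)
    return [(south + north) / 2, (west + east) / 2]

south_west, north_east = bounding_box(sample)
centre = centre_of(sample)
print(centre)

# With folium, Tmap.fit_bounds([south_west, north_east]) could then replace
# the manual offset (not run here, to keep the sketch free of folium).
```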

Fetching nearby venues for each neighbourhood from Foursquare:

In [40]:
#Defining function to get nearby venues

def getNearbyVenues(names, latitudes, longitudes, radius = 500, limit = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
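
The function above indexes straight into the JSON response, so a response without groups, or a venue without categories, would raise an exception partway through the loop. Below is a hedged sketch of a more tolerant parsing step. It works on plain dictionaries with the same nesting as used above; the helper name `parse_explore_items`, the 'Unknown' placeholder and the stand-in payloads are assumptions for illustration.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into row tuples.

    Returns an empty list when the response carries no groups, and labels a
    venue 'Unknown' when it has no category, instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{"name": "Unknown"}]
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows

# Stand-in payloads with the same nesting that getNearbyVenues indexes into.
fake_ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe", "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Example Spot", "location": {"lat": 43.76, "lng": -79.32},
               "categories": []}},
]}]}}
fake_empty = {"response": {}}

rows_ok = parse_explore_items(fake_ok, "Parkwoods", 43.753259, -79.329656)
rows_empty = parse_explore_items(fake_empty, "Parkwoods", 43.753259, -79.329656)
print(len(rows_ok), len(rows_empty))
```

Inside `getNearbyVenues`, the list comprehension could be swapped for a call to such a helper on `requests.get(url).json()`.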

In [39]:
#Fetching venues

venues = getNearbyVenues(names = pl['Neighbourhood'],
                                 latitudes = pl['Latitude'],
                                 longitudes = pl['Longitude']
                                  )

 Parkwoods
 Victoria Village
 Harbourfront,  Regent Park
 Lawrence Heights,  Lawrence Manor
 Queen's Park
 Islington Avenue
 Rouge,  Malvern
 Don Mills North
 Woodbine Gardens,  Parkview Hill
 Ryerson,  Garden District
 Glencairn
 Cloverdale,  Islington,  Martin Grove,  Princess Gardens,  West Deane Park
 Highland Creek,  Rouge Hill,  Port Union
 Flemingdon Park,  Don Mills South
 Woodbine Heights
 St. James Town
 Humewood-Cedarvale
 Bloordale Gardens,  Eringate,  Markland Wood,  Old Burnhamthorpe
 Guildwood,  Morningside,  West Hill
 The Beaches
 Berczy Park
 Caledonia-Fairbanks
 Woburn
 Leaside
 Central Bay Street
 Christie
 Cedarbrae
 Hillcrest Village
 Bathurst Manor,  Downsview North,  Wilson Heights
 Thorncliffe Park
 Adelaide,  King,  Richmond
 Dovercourt Village,  Dufferin
 Scarborough Village
 Fairview,  Henry Farm,  Oriole
 Northwood Park,  York University
 East Toronto
 Harbourfront East,  Toronto Islands,  Union Station
 Little Portugal,  Trinity
 East Birchmount Park,  Ion

One-hot encoding the nearby venues and computing the mean occurrence of each venue category per neighbourhood, then creating a new dataframe of the top ten venue categories per neighbourhood:

In [37]:
#One hot encoding

t1h = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

#Add the Neighbourhood column and move it to the front
t1h['Neighbourhood'] = venues['Neighbourhood'] 
fixed_columns = [t1h.columns[-1]] + list(t1h.columns[:-1])
t1h = t1h[fixed_columns]

In [41]:
#Finding the mean of category occurrence in each neighbourhood

tg = t1h.groupby('Neighbourhood').mean().reset_index()
tg.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergat...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
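
To make the meaning of these numbers concrete: the mean of a one-hot column within a neighbourhood is the share of that neighbourhood's venues falling in that category, so each row sums to 1. A small sketch on made-up data (the frame `venues_toy` is hypothetical), with `pd.crosstab` shown as an equivalent route:

```python
import pandas as pd

venues_toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Café", "Gym", "Café", "Café"],
})

# One-hot encode the category, then average the indicators per neighbourhood.
onehot = pd.get_dummies(venues_toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues_toy["Neighbourhood"])
freq = onehot.groupby("Neighbourhood").mean()

# The same table via a normalised cross-tabulation.
alt = pd.crosstab(
    venues_toy["Neighbourhood"], venues_toy["Venue Category"], normalize="index"
)
print(freq)
```

Here neighbourhood A has two parks among four venues, so its Park frequency is 0.5.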


In [42]:
#Define function to sort neighbourhoods by most common venues

def common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [103]:
#Create dataframe of top ten venues for each neighbourhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

#Create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:      #Positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create the new dataframe
ns = pd.DataFrame(columns=columns)
ns['Neighbourhood'] = tg['Neighbourhood']

for ind in np.arange(tg.shape[0]):
    ns.iloc[ind, 1:] = common_venues(tg.iloc[ind, :], num_top_venues)

In [104]:
ns.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Thai Restaurant,Bar,Steakhouse,Breakfast Spot,Hotel,Restaurant,American Restaurant,Gym
1,Agincourt,Breakfast Spot,Clothing Store,Lounge,Skating Rink,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken,...",Playground,Park,Yoga Studio,Dumpling Restaurant,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore
3,"Albion Gardens, Beaumond Heights, Humbergat...",Grocery Store,Coffee Shop,Discount Store,Japanese Restaurant,Sandwich Place,Beer Store,Fried Chicken Joint,Pizza Place,Pharmacy,Fast Food Restaurant
4,"Alderwood, Long Branch",Pizza Place,Gym,Skating Rink,Pharmacy,Sandwich Place,Coffee Shop,Pub,Pool,Dog Run,Diner
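
The column naming and ranking above can also be sketched without the try/except. This is an illustrative variant, not the method used in our cells: `ordinal` and `top_categories` are made-up helper names, and the frequency table `freq_toy` is invented. In the real table, neighbourhoods with few venues probably have their later positions filled by zero-frequency categories, which would explain the alphabetical runs seen in some rows.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 23rd."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def top_categories(freq, n):
    """For each row of a frequency table, list the n highest-scoring columns."""
    columns = [f"{ordinal(i + 1)} Most Common Venue" for i in range(n)]
    ranked = [row.nlargest(n).index.tolist() for _, row in freq.iterrows()]
    return pd.DataFrame(ranked, index=freq.index, columns=columns)

freq_toy = pd.DataFrame(
    {"Café": [0.5, 0.1], "Gym": [0.2, 0.6], "Park": [0.3, 0.3]},
    index=pd.Index(["A", "B"], name="Neighbourhood"),
)
top2 = top_categories(freq_toy, 2)
print(top2)
```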


Performing k-means clustering on our neighbourhoods, with four clusters, and analysing each cluster to characterise it with a label:

In [105]:
#Set number of clusters
kclusters = 4

tc = tg.drop('Neighbourhood', axis = 1)     #axis passed by keyword, as recent pandas versions require

#Run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 3).fit(tc)
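
The choice of four clusters is fixed by hand here. One common way to compare candidate values of k is the silhouette score; the sketch below shows the idea on synthetic two-dimensional points, not on our venue data. The helper `silhouette_by_k`, the blob centres and the candidate range are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs of 30 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

def silhouette_by_k(data, k_values, seed=3):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores

scores = silhouette_by_k(points, range(2, 7))
best_k = max(scores, key=scores.get)
print(best_k)
```

Applied to `tc`, the same loop would give one indication of whether four clusters is a reasonable choice, although sparse frequency data often yields low scores for every k.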

In [106]:
#Add clustering labels
ns.insert(0, 'Cluster Labels', kmeans.labels_)

pm = pl

#Join the cluster labels and top venues (ns) onto the postcode data (pl), which carries latitude/longitude for each neighbourhood
pm = pm.join(ns.set_index('Neighbourhood'), on = 'Neighbourhood')

In [107]:
#Drop Neighbourhoods missing venue data

pm.dropna(axis = 0, how = "any", inplace = True)
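
Neighbourhoods without venue data receive NaN from the join, which turns the whole label column into floats; this is why the labels appear as 0.0, 1.0 and so on in the cluster tables. A minimal sketch on toy frames (all names and values are made up) of casting back to integers after the drop and counting cluster sizes:

```python
import pandas as pd

places = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.7, 43.8, 43.6],
})
labels = pd.DataFrame({"Neighbourhood": ["A", "C"], "Cluster Labels": [0, 2]})

# "B" has no label, so the joined column holds NaN and becomes float64.
joined = places.join(labels.set_index("Neighbourhood"), on="Neighbourhood")

# Drop the unlabelled rows, then restore an integer dtype.
kept = joined.dropna(subset=["Cluster Labels"]).copy()
kept["Cluster Labels"] = kept["Cluster Labels"].astype(int)

# Cluster sizes, useful for spotting clusters with a single member.
sizes = kept["Cluster Labels"].value_counts().sort_index()
print(sizes)
```

With integer labels, the later `int(cluster)` conversions when building the map would no longer be needed.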

In [109]:
pm.loc[pm['Cluster Labels'] == 0, pm.columns[[1] + list(range(5, pm.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0.0,Fast Food Restaurant,Park,Bus Stop,Food & Drink Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Discount Store
6,Scarborough,0.0,Fast Food Restaurant,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
10,North York,0.0,Park,Japanese Restaurant,Asian Restaurant,Pub,Yoga Studio,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
16,York,0.0,Park,Field,Hockey Arena,Trail,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
21,York,0.0,Park,Fast Food Restaurant,Women's Store,Market,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
35,East York,0.0,Park,Pizza Place,Convenience Store,Yoga Studio,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore
44,Scarborough,0.0,Bus Line,Bakery,Park,Soccer Field,Bus Station,Metro Station,Intersection,Fast Food Restaurant,Cuban Restaurant,Donut Shop
49,North York,0.0,Park,Construction & Landscaping,Bakery,Basketball Court,Yoga Studio,Electronics Store,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
61,Central Toronto,0.0,Park,Swim School,Bus Line,Yoga Studio,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
64,York,0.0,Park,Convenience Store,Yoga Studio,Electronics Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant


In [113]:
#Choose a label for cluster 0 based on the venues

cl0 = "Green Neighbourhoods"

In [110]:
pm.loc[pm['Cluster Labels'] == 1, pm.columns[[1] + list(range(5, pm.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Etobicoke,1.0,Bank,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,Filipino Restaurant


In [117]:
#Choose a label for cluster 1 based on the venues

cl1 = "Unique Neighbourhood"

In [111]:
pm.loc[pm['Cluster Labels'] == 2, pm.columns[[1] + list(range(5, pm.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Scarborough,2.0,Playground,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
40,North York,2.0,Park,Airport,Playground,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore
83,Central Toronto,2.0,Restaurant,Tennis Court,Playground,Drugstore,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
85,Scarborough,2.0,Playground,Park,Yoga Studio,Dumpling Restaurant,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore


In [121]:
#Choose a label for cluster 2 based on the venues

cl2 = "Kid Friendly Neighbourhoods"

In [112]:
pm.loc[pm['Cluster Labels'] == 3, pm.columns[[1] + list(range(5, pm.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,3.0,Coffee Shop,Financial or Legal Service,Hockey Arena,Intersection,Portuguese Restaurant,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
2,Downtown Toronto,3.0,Coffee Shop,Park,Pub,Café,Bakery,Breakfast Spot,Theater,Mexican Restaurant,Ice Cream Shop,French Restaurant
3,North York,3.0,Furniture / Home Store,Event Space,Miscellaneous Shop,Clothing Store,Arts & Crafts Store,Coffee Shop,Accessories Store,Vietnamese Restaurant,Boutique,Women's Store
4,Queen's Park,3.0,Coffee Shop,Park,Gym,Diner,Persian Restaurant,Smoothie Shop,Seafood Restaurant,Burger Joint,Sandwich Place,Burrito Place
7,North York,3.0,Caribbean Restaurant,Gym / Fitness Center,Japanese Restaurant,Café,Baseball Field,Dumpling Restaurant,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
8,East York,3.0,Fast Food Restaurant,Pizza Place,Pet Store,Athletics & Sports,Gastropub,Intersection,Pharmacy,Breakfast Spot,Bank,Gym / Fitness Center
9,Downtown Toronto,3.0,Coffee Shop,Clothing Store,Cosmetics Shop,Middle Eastern Restaurant,Café,Restaurant,Bookstore,Japanese Restaurant,Diner,Ice Cream Shop
12,Scarborough,3.0,History Museum,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
13,North York,3.0,Gym,Beer Store,Coffee Shop,Asian Restaurant,Sandwich Place,Bike Shop,Sporting Goods Shop,Supermarket,Japanese Restaurant,Italian Restaurant
14,East York,3.0,Skating Rink,Curling Ice,Park,Pharmacy,Bus Stop,Video Store,Cosmetics Shop,Beer Store,Yoga Studio,Doner Restaurant


In [116]:
#Choose a label for cluster 3 based on the venues

cl3 = "Dinner and Coffee Neighbourhoods"

In [124]:
#Create tuple of cluster labels, indexed by cluster number

nll = (cl0, cl1, cl2, cl3)

### Answer for Question 3:

Create a new map of Toronto and add the neighbourhood markers, colouring each marker according to our four k-means clusters and labelling each marker according to the characterising label for the nearby venues:

In [125]:
#Create map
clustermap = folium.Map(location=[latitude + 0.04, longitude], zoom_start=11)

#Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pm['Latitude'], pm['Longitude'], pm['Neighbourhood'], pm['Cluster Labels']):
    label = folium.Popup(nll[int(cluster)])
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[int(cluster - 1)],
        fill = True,
        fill_color = rainbow[int(cluster - 1)],
        fill_opacity = 0.7).add_to(clustermap)
       
clustermap
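
The cell above indexes the colour list with `cluster - 1`, so cluster 0 takes the last colour; the colours remain distinct, but the mapping is easy to misread. As a sketch of a more direct mapping, the example below builds one colour per cluster and pairs it with its label. It uses the standard-library `colorsys` module in place of the matplotlib colour map, and the helper name `cluster_palette` is made up for the example.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters hex colours with evenly spaced hues."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette

# The four labels chosen for the clusters in this notebook.
labels = (
    "Green Neighbourhoods",
    "Unique Neighbourhood",
    "Kid Friendly Neighbourhoods",
    "Dinner and Coffee Neighbourhoods",
)
palette = cluster_palette(len(labels))

# Cluster number -> (label, colour), indexed directly by the cluster number.
legend = {cluster: (labels[cluster], palette[cluster]) for cluster in range(len(labels))}
print(legend)
```

In the marker loop, `legend[int(cluster)]` would then supply both the popup text and the colour.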