# IBM Data Science Certificate Capstone

This notebook is part of the capstone project for IBM's Data Science certificate program, offered through Coursera.

The aim of this project is to leverage data from Foursquare to cluster the neighborhoods of Toronto based on similar venues.

In [1]:
import pandas as pd
import numpy as np

# Requests and BeautifulSoup Libraries are used for webscraping
import requests
from bs4 import BeautifulSoup

#Regular expressions library is needed for cleaning some data
import re

#Importing the model to be used for clustering
from sklearn.cluster import KMeans

#Folium will be used to visualize clusters on map
import folium

#Importing styling libraries from matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

#To keep credentials secure, the Client ID and Client Secret are stored as environment variables on the local machine
#The os module is needed to read these environment variables
import os

## Scraping
The first task is to collect data on Toronto's postal codes, boroughs, and neighborhoods.

In [3]:
#Making request for wikipedia page which lists postal codes in Toronto
wikiUrl="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wikiResponse=requests.get(wikiUrl)

In [4]:
#Using BeautifulSoup to parse response
wikiSoup=BeautifulSoup(wikiResponse.text,"html.parser")

In [5]:
print(wikiSoup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3ade1005-914c-482b-85fd-9a2ba97a08f6","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":1029579868,"wgRevisionId":1029579868,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communica

The following code block extracts postal codes with their associated boroughs and neighborhoods from the BeautifulSoup object. Regular expressions are used to format the neighborhoods consistently, regardless of how they appear in the original document.
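
As a rough illustration of this normalization step, the sketch below applies a simplified variant of the same splitting to a few made-up cell strings. The helper name `normalize_neighborhoods` and the sample strings are assumptions for illustration; they are not taken from the Wikipedia page.

```python
import re

def normalize_neighborhoods(raw, borough):
    """Return a comma-separated neighborhood string for one table cell.

    `raw` is the cell text that follows the borough,
    e.g. "(Regent Park / Harbourfront)".
    """
    # Cells such as "Enclave of M4L" name no neighborhood, so fall back to the borough
    if "Enclave" in raw:
        return borough
    # Drop the opening parenthesis and anything from the closing one onward
    inner = raw.strip().lstrip("(").split(")")[0]
    # Split on the "/" separator and tidy surrounding whitespace
    parts = [p.strip() for p in re.split(r"\s*/\s*", inner) if p.strip()]
    return ", ".join(parts)

print(normalize_neighborhoods("(Regent Park / Harbourfront)", "Downtown Toronto"))
print(normalize_neighborhoods("(Parkwoods)", "North York"))
print(normalize_neighborhoods("Enclave of M4L", "East Toronto"))
```

Keeping the logic in a small function like this makes it easy to check individual cell formats before running the full scrape.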

In [7]:
#initializing list variables for postal codes, boroughs and neighborhoods
pcodes=[]
boroughs=[]
neighborhoods=[]

#Looping through table of postal codes
for pcode in wikiSoup.tbody.find_all("p"):
    #Ignoring postal codes that are not assigned
    if pcode.span.string=="Not assigned":
        pass
    else:
        #Postal code always appears in bold, so string is taken from <b> tag
        pcodes.append(pcode.b.string)
        
        #creating list of lines within table cell that follow postal code
        contents=[x.string for x in pcode.b.find_next_sibling('span').contents if getattr(x, 'name', None) != 'br']
        
        #Borough is always the first line after postal code
        boroughs.append(contents[0])
        
        #Combining all lines following the borough into one string
        neighString=""
        for i in contents[1:]:
            neighString=neighString+i
            
        #Some cells do not list a neighborhood, but instead say "Enclave of (postal code)"
        #For these postal codes the Borough is used as the neighborhood
        if re.search("Enclave",neighString):
            neighborhoods.append(contents[0])
        else:
            #Splitting string into list at the characters which separate neighborhoods in original document
            neighList=re.split(r"\(| / ",neighString)

            #Recombining list of neighborhoods into string, with neighborhoods separated by commas
            #Anything from a closing parenthesis onward is discarded
            neighString=", ".join(neigh.split(")")[0] for neigh in neighList[1:])
            neighborhoods.append(neighString)

In [8]:
#Initializing Data Frame
neighborhoodsDF=pd.DataFrame(columns=["Postal Code","Borough","Neighborhoods"])

In [9]:
#Using lists as values for Data Frame
neighborhoodsDF["Postal Code"]=pcodes
neighborhoodsDF["Borough"]=boroughs
neighborhoodsDF["Neighborhoods"]=neighborhoods

In [10]:
neighborhoodsDF["Borough"].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

In [45]:
neighborhoodsDF

Unnamed: 0,Postal Code,Borough,Neighborhoods
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,East Toronto
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [11]:
neighborhoodsDF.shape

(103, 3)

## Adding Coordinates

Coordinates for each postal code are read from a CSV file, then joined onto the neighborhoods data frame.
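
The join can be sketched on a toy example. The two coordinate pairs below are copied from the outputs in this notebook; the postal code `M9Z` and its borough are made up to show what happens when a code has no match.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file
areas = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a hypothetical code
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
}).set_index("Postal Code")

# join() matches the "Postal Code" column of `areas` against the index of `coords`;
# the default how="left" keeps every row of `areas`
joined = areas.join(coords, on="Postal Code")

# Postal codes with no match get NaN coordinates, which is worth checking for
missing = joined[joined["Latitude"].isna()]["Postal Code"].tolist()
print(joined)
print("Codes without coordinates:", missing)
```

Checking for missing coordinates after the join is a cheap safeguard, since rows with NaN latitude or longitude would break the API calls later on.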

In [12]:
geo_cor=pd.read_csv("Geospatial_Coordinates.csv").set_index("Postal Code")

In [13]:
geo_cor.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [14]:
neighborhoodsGeocodes=neighborhoodsDF.join(geo_cor,on="Postal Code")

In [15]:
neighborhoodsGeocodes

Unnamed: 0,Postal Code,Borough,Neighborhoods,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,East Toronto,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


## Using Foursquare API

Now that each group of neighborhoods has an associated set of geocoordinates, the Foursquare API can be used to find the venues nearest to each location.
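
This notebook reads venue records out of a nested JSON response (`response`, then `groups`, then `items`). The sketch below shows one way such a payload could be flattened defensively. The function name `extract_venues` and the sample payload are hypothetical; the payload mimics only the fields this notebook accesses and is not a real API response.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    # Only the first group is read, mirroring the access pattern in this notebook
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # A venue may carry no category, so avoid indexing an empty list
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hypothetical payload for illustration only
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 43.65, "lng": -79.38},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Uncategorized Spot",
                           "location": {"lat": 43.66, "lng": -79.39},
                           "categories": []}},
            ]
        }]
    }
}

print(extract_venues(sample))
print(extract_venues({}))  # an empty or error payload yields no rows
```

Returning an empty list for an unexpected payload lets a long loop over neighborhoods continue instead of stopping on the first failed request.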

In [16]:
#Using environment variables to keep user information secure
ClientID=os.environ.get("fsClientID")
ClientSecret=os.environ.get("fsClientSecret")
Version='20180605'

In [49]:
#Removing entries outside of Toronto
neighborhoodsGeocodes.drop(neighborhoodsGeocodes[neighborhoodsGeocodes['Borough']=="Mississauga"].index,inplace=True)
neighborhoodsGeocodes.shape

(102, 5)

In [17]:
#Defining function to call Foursquare API
#Other Radii were tested and 500 meters seems to produce best results
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            ClientID, 
            ClientSecret, 
            Version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [18]:
torontoVenues=getNearbyVenues(names=neighborhoodsGeocodes["Neighborhoods"],latitudes=neighborhoodsGeocodes["Latitude"],longitudes=neighborhoodsGeocodes["Longitude"],radius=1500)

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills, Flemingdon Park
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview, CFB Toronto

In [20]:
#Counting number of venues found in each Neighborhood
torontoCount=torontoVenues[["Neighborhood",'Venue']].groupby('Neighborhood').count().sort_values("Venue")
torontoCount

Neighborhood,Venue
Upper Rouge,3
"Rouge Hill, Port Union, Highland Creek",11
"Birch Cliff, Cliffside West",15
Bayview Village,16
Islington Avenue,17
...,...
Church and Wellesley,100
Studio District,100
East Toronto,100
The Beaches,100


Some neighborhoods have too few venues for their data to be useful, so any neighborhood with fewer than 10 venues is dropped.
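
A minimal sketch of this filtering, followed by the one-hot averaging applied in the next cells, is shown below on made-up data. The neighborhood names, categories, and the threshold of 2 are illustrative assumptions; the notebook itself uses a threshold of 10.

```python
import pandas as pd

# Illustrative venue table; names and categories are made up
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "C", "C"],
    "Venue Category": ["Park", "Café", "Park", "Bank", "Café", "Pub"],
})

min_venues = 2  # the notebook uses 10; a small value keeps the toy example readable

# Count venues per neighborhood and keep rows whose neighborhood meets the threshold
counts = venues.groupby("Neighborhood")["Venue Category"].transform("count")
kept = venues[counts >= min_venues]

# One-hot encode the categories, then average per neighborhood
# to get the share of each category among that neighborhood's venues
onehot = pd.get_dummies(kept["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", kept["Neighborhood"])
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies)
```

Because each row of the one-hot table has exactly one indicator set, the per-neighborhood means sum to 1 and can be read as category frequencies.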

In [21]:
print(torontoVenues.shape)
#Dropping every venue that belongs to a neighborhood with fewer than 10 venues
sparseNeighborhoods=torontoCount[torontoCount['Venue']<10].index
torontoVenues=torontoVenues[~torontoVenues['Neighborhood'].isin(sparseNeighborhoods)]
print(torontoVenues.shape)

(6849, 7)
(6846, 7)


In [22]:
#One-hot encoding venue categories so that each venue becomes a row of 0/1 indicators
TorontoOnehot= pd.get_dummies(torontoVenues[['Venue Category']])

In [23]:
#Adding the neighborhood column and moving it to the front
TorontoOnehot["Neighborhood"]=torontoVenues["Neighborhood"]
fixed_columns = [TorontoOnehot.columns[-1]] + list(TorontoOnehot.columns[:-1])
TorontoOnehot = TorontoOnehot[fixed_columns]

In [24]:
TorontoOnehot.head(20)

Unnamed: 0,Neighborhood,Venue Category_ATM,Venue Category_Accessories Store,Venue Category_Adult Boutique,Venue Category_Afghan Restaurant,Venue Category_African Restaurant,Venue Category_Airport,Venue Category_Airport Lounge,Venue Category_American Restaurant,Venue Category_Amphitheater,...,Venue Category_Volleyball Court,Venue Category_Warehouse Store,Venue Category_Whisky Bar,Venue Category_Wine Bar,Venue Category_Wings Joint,Venue Category_Women's Store,Venue Category_Xinjiang Restaurant,Venue Category_Yoga Studio,Venue Category_Zoo,Venue Category_Zoo Exhibit
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
#Averaging the indicators gives the frequency of each venue category per neighborhood
toronto_grouped = TorontoOnehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Venue Category_ATM,Venue Category_Accessories Store,Venue Category_Adult Boutique,Venue Category_Afghan Restaurant,Venue Category_African Restaurant,Venue Category_Airport,Venue Category_Airport Lounge,Venue Category_American Restaurant,Venue Category_Amphitheater,...,Venue Category_Volleyball Court,Venue Category_Warehouse Store,Venue Category_Whisky Bar,Venue Category_Wine Bar,Venue Category_Wings Joint,Venue Category_Women's Store,Venue Category_Xinjiang Restaurant,Venue Category_Yoga Studio,Venue Category_Zoo,Venue Category_Zoo Exhibit
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.000000,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.013699,0.0,0.0,0.000000,0.0,0.0
95,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
96,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
97,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.012821,0.0,0.0


In [26]:
#Dropping the label column so that only numeric features are passed to the model
toronto_grouped_clustering=toronto_grouped.drop(columns="Neighborhood")

## Clustering
With the data properly cleaned, I can now begin clustering. Here the neighborhoods of Toronto are grouped using k-means clustering.
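
One way to judge a candidate number of clusters is to look at how many points fall into each cluster for several values of k. The sketch below does this on synthetic two-dimensional data, not on the Toronto features; the data, the range of k, and the `random_state` and `n_init` settings are all assumptions chosen to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 10 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

# Fit k-means for several k and record how many points land in each cluster;
# very small clusters at larger k suggest the extra clusters add little
sizes_by_k = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sizes_by_k[k] = sorted(np.bincount(model.labels_, minlength=k).tolist())
    print(k, sizes_by_k[k])
```

Fixing `random_state` keeps the labels stable between runs, which helps when the cluster numbers are referenced later in a map or table.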

In [27]:
# K=4 seems to give best results. Adding more clusters would result in clusters including only one neighborhood
kClusters=4
kMeanModel=KMeans(n_clusters=kClusters)

In [28]:
kMeanModel.fit(toronto_grouped_clustering)


KMeans(n_clusters=4)

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhoods'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Venue Category_Chinese Restaurant,Venue Category_Cantonese Restaurant,Venue Category_Coffee Shop,Venue Category_Bakery,Venue Category_Hong Kong Restaurant,Venue Category_Shopping Mall,Venue Category_Caribbean Restaurant,Venue Category_Breakfast Spot,Venue Category_Gym / Fitness Center,Venue Category_Bank
1,"Alderwood, Long Branch",Venue Category_Café,Venue Category_Discount Store,Venue Category_Park,Venue Category_Burger Joint,Venue Category_Coffee Shop,Venue Category_Department Store,Venue Category_Grocery Store,Venue Category_Pizza Place,Venue Category_Bank,Venue Category_Burrito Place
2,"Bathurst Manor, Wilson Heights, Downsview North",Venue Category_Park,Venue Category_Pizza Place,Venue Category_Coffee Shop,Venue Category_Bank,Venue Category_Gas Station,Venue Category_Ice Cream Shop,Venue Category_Fried Chicken Joint,Venue Category_Ski Chalet,Venue Category_Baseball Field,Venue Category_Sandwich Place
3,Bayview Village,Venue Category_Park,Venue Category_Gas Station,Venue Category_Bank,Venue Category_Trail,Venue Category_Japanese Restaurant,Venue Category_Playground,Venue Category_Café,Venue Category_Chinese Restaurant,Venue Category_Athletics & Sports,Venue Category_Restaurant
4,"Bedford Park, Lawrence Manor East",Venue Category_Bakery,Venue Category_Italian Restaurant,Venue Category_Sushi Restaurant,Venue Category_Coffee Shop,Venue Category_Bagel Shop,Venue Category_Pharmacy,Venue Category_Pizza Place,Venue Category_Pub,Venue Category_Café,Venue Category_Asian Restaurant


In [31]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kMeanModel.labels_)


## Visualizing Clusters
I now use the folium library to visualize the clusters on a map.
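
Each cluster needs its own marker color. The sketch below shows one way to build such a palette with matplotlib and look colors up by cluster label; the label list is made up for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_clusters = 4  # matches the number of clusters used in this notebook

# Sample k evenly spaced colors from the rainbow colormap and convert them to hex strings
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k_clusters))]

# k-means labels run from 0 to k-1, so they can index the palette directly
labels = [0, 3, 1, 2, 0]  # illustrative labels
marker_colors = [palette[label] for label in labels]
print(marker_colors)
```

Hex strings are used because they are a color format that Folium markers accept.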

In [39]:
# Adding coordinates to Neighborhoods venues dataframe
toronto_merged = neighborhoodsGeocodes.join(neighborhoods_venues_sorted.set_index('Neighborhoods'), on='Neighborhoods',how='right')

toronto_merged.head() 

Unnamed: 0,Postal Code,Borough,Neighborhoods,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
78,M1S,Scarborough,Agincourt,43.7942,-79.262029,1,Venue Category_Chinese Restaurant,Venue Category_Cantonese Restaurant,Venue Category_Coffee Shop,Venue Category_Bakery,Venue Category_Hong Kong Restaurant,Venue Category_Shopping Mall,Venue Category_Caribbean Restaurant,Venue Category_Breakfast Spot,Venue Category_Gym / Fitness Center,Venue Category_Bank
93,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,0,Venue Category_Café,Venue Category_Discount Store,Venue Category_Park,Venue Category_Burger Joint,Venue Category_Coffee Shop,Venue Category_Department Store,Venue Category_Grocery Store,Venue Category_Pizza Place,Venue Category_Bank,Venue Category_Burrito Place
28,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,0,Venue Category_Park,Venue Category_Pizza Place,Venue Category_Coffee Shop,Venue Category_Bank,Venue Category_Gas Station,Venue Category_Ice Cream Shop,Venue Category_Fried Chicken Joint,Venue Category_Ski Chalet,Venue Category_Baseball Field,Venue Category_Sandwich Place
39,M2K,North York,Bayview Village,43.786947,-79.385975,0,Venue Category_Park,Venue Category_Gas Station,Venue Category_Bank,Venue Category_Trail,Venue Category_Japanese Restaurant,Venue Category_Playground,Venue Category_Café,Venue Category_Chinese Restaurant,Venue Category_Athletics & Sports,Venue Category_Restaurant
55,M5M,North York,"Bedford Park, Lawrence Manor East",43.733283,-79.41975,2,Venue Category_Bakery,Venue Category_Italian Restaurant,Venue Category_Sushi Restaurant,Venue Category_Coffee Shop,Venue Category_Bagel Shop,Venue Category_Pharmacy,Venue Category_Pizza Place,Venue Category_Pub,Venue Category_Café,Venue Category_Asian Restaurant


In [40]:
#Centering the map on the average position of all the neighborhoods
latitude,longitude=toronto_merged[["Latitude","Longitude"]].mean()

In [44]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kClusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhoods'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster labels run from 0 to kClusters-1, so they index the color list directly
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [43]:
toronto_merged["Cluster Labels"].value_counts()

2    42
3    38
1    11
0    11
Name: Cluster Labels, dtype: int64