This notebook is part of the IBM Data Science Capstone on Coursera. The goal is to segment and cluster neighborhoods in Toronto. I scrape data about the boroughs from Wikipedia, then use the Foursquare API to explore the places within those boroughs for the cluster analysis.

**Note:** Code commented out throughout the notebook affects how a borough with multiple corresponding neighborhoods selects the latitude and longitude used to query the Foursquare API. Currently, the program simply uses the first neighborhood listed after the borough in the Wikipedia table the data is scraped from.
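As a rough illustration of the "first neighborhood" rule described in the note, the sketch below splits one table cell into a borough and its neighborhoods. The cell format `Borough(Hood A / Hood B)`, the helper name `split_cell`, and the sample text are assumptions for illustration; the sketch handles a single parenthesized group only and is not the scraping code used later in the notebook.

```python
def split_cell(cell_text):
    """Split 'Borough(Hood A / Hood B)' into (borough, [neighborhoods])."""
    borough, _, rest = cell_text.partition("(")
    # Treat both "," and "/" as separators, mirroring the scraping cell.
    parts = rest.replace(")", " ").replace(",", "/").split("/")
    hoods = [p.strip() for p in parts if p.strip()]
    return borough.strip(), hoods


# Illustrative cell text, not copied from the scraped table.
borough, hoods = split_cell("Etobicoke(Alderwood / Long Branch)")
first_hood = hoods[0] if hoods else None
print(borough, "|", first_hood)
```

Keeping the full `hoods` list instead of only `hoods[0]` is what the commented-out code paths would need.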

In [1]:
# webscraping
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
#!conda install requests
import requests
import json

# for gathering latitude and longitude
from geopy.geocoders import Nominatim

# for progress bar (latitude, longitude processing is kind of slow)
from tqdm import tqdm

import numpy as np
import pandas as pd
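try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas
# json_normalize transforms a JSON document into a pandas dataframe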
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
# !conda install -c conda-forge folium=0.5.0 --yes
import folium

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
# open connection and grab page
uClient = ureq(url)

In [4]:
html = uClient.read()


In [5]:
uClient.close()

In [6]:
# parse html
soup = soup(html, "html.parser")

In [7]:
df_dict = {"PostalCode":[], "Borough":[], "Neighborhood":[]}
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    br = False
    for td in tds:
        # make sure we are still in the main table
        if td.b is None or td.b.text[0] != 'M':
            br = True
            break
        # make sure postal code is assigned
        curr_td_code = td.b.text
        if td.span is not None:
            txt = td.span.text
            spl = txt.strip(")").split("(")
            borough = spl[0]
            
            # special case
            if borough == "Queen's Park\n":
                #print("q's park")
                df_dict["PostalCode"].append(td.findAll('b')[0].text)
                df_dict["Borough"].append("Queen's Park")
                df_dict["Neighborhood"].append(td.findAll('b')[1].text)
            
            elif borough != 'Not assigned':
                df_dict["PostalCode"].append(td.b.text)
                # special cases (makes lat, long processing smoother to handle here)
                if td.b.text == "M4J":
                    df_dict["Borough"].append("East York")
                elif td.b.text == "M7R":
                    df_dict["Borough"].append("Mississauga")
                elif td.b.text == "M5W":
                    df_dict["Borough"].append("Downtown Toronto")
                elif td.b.text == "M7Y":
                    df_dict["Borough"].append("East Toronto")
                elif td.b.text == "M9W":
                    df_dict["Borough"].append("Etobicoke")
                else:
                    df_dict["Borough"].append(borough)
                neighborhoods = []
                # loop because some cells have multiple parentheses
                for i in range(1,len(spl)):
                    hoods = spl[i].replace(")", " ").replace(",", "/").split("/")
#                     for hood in hoods:
#                         neighborhoods.append(hood.strip())
                #df_dict["Neighborhood"].append(neighborhoods)
                df_dict["Neighborhood"].append(hoods[0].strip())
    # abort mission once we leave the main table
    if br:
        break

In [8]:
df = pd.DataFrame.from_dict(df_dict)
df = df.reindex(columns=["PostalCode", "Borough", "Neighborhood"])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
3,M6A,North York,Lawrence Manor
4,M7A,Queen's Park,Ontario Provincial Government


In [9]:
df.shape

(103, 3)

In [11]:
lats = []
lons = []
# some boroughs contain multiple neighborhoods in their neighborhoods list
# this keeps track of the index of which one was used to find the lat and lon
# hood_used_indices = []
geolocator = Nominatim(user_agent="GeocodeEarth")
with tqdm(total=len(df)) as pbar:
    for index,row in df.iterrows():
        pbar.update(1)
        # handle special case
        if row['Borough'] == "Queen's Park":
            location = geolocator.geocode("Queen's Park, Toronto")
        else:
#             hood_used_idx = None
#             for i in range(len(row['Neighborhood'])):
#                 hood_used_idx = i
#                 location = geolocator.geocode(row['Borough'] + ', ' + row['Neighborhood'][i])
#                 if location != None:
#                     break
            location = geolocator.geocode(row['Neighborhood'] + ", " + row['Borough'] + ', Toronto')
            if location is None:
                location = geolocator.geocode(row['Borough'] + ', Toronto')
        #print(index, location.address)
        lats.append(location.latitude)
        lons.append(location.longitude)
       # hood_used_indices.append(hood_used_idx)

100%|██████████| 103/103 [01:21<00:00,  1.23it/s]
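The geocoding loop above tries the neighborhood first and falls back to the borough. A minimal sketch of that fallback as a reusable helper follows. The name `geocode_with_fallback` is hypothetical, and `fake_geocode` is a stand-in with placeholder coordinates so the example runs without network access; in the notebook the callable would be `geolocator.geocode`.

```python
def geocode_with_fallback(queries, geocode):
    """Return the first non-None result of geocode(query), or None."""
    for query in queries:
        location = geocode(query)
        if location is not None:
            return location
    return None


# Stand-in geocoder for illustration only; it "knows" a single address
# and returns placeholder coordinates, not real ones.
def fake_geocode(query):
    known = {"North York, Toronto": (1.0, 2.0)}
    return known.get(query)


result = geocode_with_fallback(
    ["Parkwoods, North York, Toronto", "North York, Toronto"],
    fake_geocode,
)
print(result)
```

Because the helper can return `None` when every query fails, the caller should check the result before reading `latitude` and `longitude`.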


In [12]:
df["Latitude"] = lats
df["Longitude"] = lons
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.761224,-79.323986
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.654027,-79.3802
3,M6A,North York,Lawrence Manor,43.722079,-79.437507
4,M7A,Queen's Park,Ontario Provincial Government,43.65998,-79.390369
5,M9A,Etobicoke,Islington Avenue,43.714904,-79.554973
6,M1B,Scarborough,Malvern,43.809196,-79.221701
7,M3B,North York,Don Mills North,43.737178,-79.343451
8,M4B,East York,Parkview Hill,43.670862,-79.372792
9,M5B,Downtown Toronto,Garden District,43.65254,-79.377276


In [14]:
location = geolocator.geocode("Toronto")
lat = location.latitude
lon = location.longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lat, lon], zoom_start=10)

# # add markers to map
# for lat, lon, borough, neighborhood, hood_used_idx in zip(df['Latitude'], df['Longitude'], 
#                                               df['Borough'], df['Neighborhood'], hood_used_indices):
#     label = '{}, {}'.format(neighborhood[hood_used_idx].strip(), borough)
#     label = folium.Popup(label, parse_html=True)
#     folium.CircleMarker(
#         [lat, lon],
#         radius=5,
#         popup=label,
#         color='blue',
#         fill=True,
#         fill_color='#3186cc',
#         fill_opacity=0.7,
#         parse_html=False).add_to(map_toronto)  
    
# map_toronto

# add markers to map
for lat, lon, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [15]:
# TESTING CELL
index = 101
assert index < df.shape[0]
neighborhood_latitude = df.loc[index, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[index, 'Longitude'] # neighborhood longitude value

neighborhood_name = df.loc[index, 'Neighborhood'] # neighborhood name

# print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name[hood_used_indices[index]].strip(), 
#                                                                neighborhood_latitude, 
#                                                                neighborhood_longitude))

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Old Mill South are 43.6513012, -79.4954753.


### FOURSQUARE API
* [Foursquare](https://foursquare.com/) is a company with a massive dataset of accurate and comprehensive location data that powers Apple Maps, Uber, Snapchat, Twitter, and others. The following cells access the Foursquare developer API.

* The developer API gives access to user data and to aggregated or individual venue data, and it allows a variety of queries against the Foursquare database.
    * For example, a search query for an individual venue gives access to the venue's name, unique ID, location, category, available statistics, contact information, tips, ratings, and a URL. A search query for a user may give access to first name, last name, friends on Foursquare, contact info, ID, tips, and gender.
        * Note that some of these features may require a premium account.

* The following code uses the explore endpoint to query the Foursquare database and gather information on multiple venues within a specified radius of the latitude and longitude found for each borough.

Possible follow-up: check out the trending endpoint.
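To make the explore query concrete, here is a small sketch that assembles the request URL from the same parameters the cells below pass (`client_id`, `client_secret`, `v`, `ll`, `radius`, `limit`). The helper name `build_explore_url` and the credential and version strings are placeholders, not real values; the sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lon,
                      radius=500, limit=100):
    """Assemble an explore URL; urlencode escapes the values safely."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                      # API version date string
        "ll": "{},{}".format(lat, lon),    # "latitude,longitude"
        "radius": radius,                  # search radius in meters
        "limit": limit,                    # maximum venues returned
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; the coordinates are the Old Mill South values
# printed earlier in the notebook.
url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "YYYYMMDD",
                        43.6513012, -79.4954753)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

Building the query with `urlencode` rather than string formatting avoids malformed URLs if a value ever contains a reserved character.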

In [16]:
# The code was removed by Watson Studio for sharing.

In [17]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

venue_results = requests.get(url).json()

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [19]:
venues = venue_results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(13) # this is the size of index = 0 neighborhood

Unnamed: 0,name,categories,lat,lng
0,Old Mill Toronto,American Restaurant,43.651011,-79.493222
1,Étienne Brulé Park,Park,43.652814,-79.492178
2,Old Mill Subway Station,Metro Station,43.649892,-79.495322
3,The Spa at The Old Mill,Spa,43.650824,-79.49319
4,Home Smith Park,Park,43.652469,-79.498786
5,King's Mill Park,Park,43.649332,-79.492828


In [20]:
# print('{} venues were returned by Foursquare in {}'.format(nearby_venues.shape[0], neighborhood_name[hood_used_indices[index]].strip()))
print('{} venues were returned by Foursquare in {}'.format(nearby_venues.shape[0], neighborhood_name))

6 venues were returned by Foursquare in Old Mill South


In [21]:
def getNearbyVenues(names, indices, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, hood_used_idx, lat, lng in zip(names, indices, latitudes, longitudes):
        name = name[hood_used_idx].strip()
        # print(name) 
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # print(name)
        
        # handle empty requests
        if results == []:
            venues_list.append([(
            name, 
            lat, 
            lng, 
            None, 
            None, 
            None,  
            None)])
        
        else: 
            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
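The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue whose category list is empty. The sketch below shows one way to flatten the items of a single response defensively. It assumes the item shape used in the cells above (`item['venue']` with `name`, `location`, and `categories`); the helper name `flatten_items` is hypothetical, and the sample item reuses values from the Old Mill South output shown earlier.

```python
def flatten_items(name, lat, lng, items):
    """Turn explore items into row tuples, tolerating missing categories."""
    if not items:
        # Keep a placeholder row so the neighborhood is not lost.
        return [(name, lat, lng, None, None, None, None)]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


sample_items = [
    {"venue": {"name": "Old Mill Toronto",
               "location": {"lat": 43.651011, "lng": -79.493222},
               "categories": [{"name": "American Restaurant"}]}},
    {"venue": {"name": "Uncategorized place",  # made-up venue
               "location": {"lat": 0.0, "lng": 0.0},
               "categories": []}},
]
rows = flatten_items("Old Mill South", 43.6513012, -79.4954753, sample_items)
print(rows[0][3], "|", rows[0][6], "|", rows[1][6])
```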

In [30]:
venues = getNearbyVenues(names=df['Neighborhood'],
                         latitudes=df['Latitude'],
                         longitudes=df['Longitude'])

In [31]:
# TESTING
# check sizes match for num unique neighborhoods
venues['Neighborhood'].nunique(), df['Neighborhood'].nunique()

(101, 101)

In [32]:
# TESTING 
# to see which boroughs (if any) are causing problems
v=venues['Neighborhood'].unique()
d=df['Neighborhood'].unique()
for i in range(90,100):
    print(d[i], ' | ', v[i])

Rosedale  |  Rosedale
Enclave of M5E  |  Enclave of M5E
Alderwood  |  Alderwood
Clairville  |  Clairville
Upper Rouge  |  Upper Rouge
First Canadian Place  |  First Canadian Place
The Kingsway  |  The Kingsway
Church and Wellesley  |  Church and Wellesley
Enclave of M4L  |  Enclave of M4L
Old Mill South  |  Old Mill South


In [33]:
print(venues.shape, venues['Neighborhood'].nunique())
venues.head()

(3303, 7) 101


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.761224,-79.323986,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.761224,-79.323986,Tim Hortons,43.760668,-79.326368,Café
2,Parkwoods,43.761224,-79.323986,A&W Canada,43.760623,-79.326829,Fast Food Restaurant
3,Parkwoods,43.761224,-79.323986,Food Basics,43.760865,-79.326015,Supermarket
4,Parkwoods,43.761224,-79.323986,Shoppers Drug Mart,43.760857,-79.324961,Pharmacy


In [34]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 241 unique categories.


In [35]:
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
print(onehot.shape)
onehot.head()

(3303, 240)


Unnamed: 0,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = ['Neighborhood'] + list(onehot.drop('Neighborhood', axis=1).columns)
onehot = onehot[fixed_columns]

print(onehot.shape)
onehot.head()

(3303, 240)


Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Compute the occurrence frequency of each venue type in each neighborhood.

In [37]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped.shape)
grouped.head()

(101, 240)


Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
1,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bathurst Manor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bedford Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
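Taking the mean of one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues that fall in it. The sketch below reproduces the same arithmetic with the standard library on a tiny made-up sample; the helper name `category_frequencies` and the neighborhood labels `A` and `B` are illustrative only.

```python
from collections import Counter, defaultdict


def category_frequencies(pairs):
    """Map neighborhood -> {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for hood, category in pairs:
        counts[hood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


# Made-up (neighborhood, venue category) pairs.
sample = [("A", "Park"), ("A", "Park"), ("A", "Café"), ("B", "Pizza Place")]
freqs = category_frequencies(sample)
print(freqs)
```

Each neighborhood's shares sum to 1, which is why a neighborhood with very few venues shows large values such as 0.25 or 0.33 in the table above.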


In [38]:
num_top_venues = 5
num2print = 5

for hood in grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = grouped[grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
    num2print -= 1
    if num2print <= 0:
        break

----Agincourt----
                  venue  freq
0    Chinese Restaurant   0.3
1  Cantonese Restaurant   0.1
2           Coffee Shop   0.1
3         Train Station   0.1
4      Asian Restaurant   0.1


----Alderwood----
            venue  freq
0     Pizza Place  0.25
1             Pub  0.12
2     Coffee Shop  0.12
3            Bank  0.12
4  Sandwich Place  0.12


----Bathurst Manor----
                     venue  freq
0        Convenience Store  0.25
1               Playground  0.25
2                     Park  0.25
3           Baseball Field  0.25
4  New American Restaurant  0.00


----Bayview Village----
                 venue  freq
0  Sporting Goods Shop  0.07
1      Bubble Tea Shop  0.07
2          Coffee Shop  0.07
3       Hardware Store  0.07
4       Breakfast Spot  0.07


----Bedford Park----
                        venue  freq
0                   Locksmith  0.33
1  Construction & Landscaping  0.33
2          Seafood Restaurant  0.33
3                        Park  0.00
4           

In [39]:
def return_most_common_venues(row, num_top_venues):
    """sort venues in descending order"""
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [40]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Train Station,Coffee Shop,Vietnamese Restaurant,Hong Kong Restaurant,Korean Restaurant,Asian Restaurant,Cantonese Restaurant,Yoga Studio,Flower Shop
1,Alderwood,Pizza Place,Pool,Bank,Skating Rink,Coffee Shop,Pub,Sandwich Place,Fish Market,Fast Food Restaurant,Filipino Restaurant
2,Bathurst Manor,Playground,Park,Convenience Store,Baseball Field,Yoga Studio,Food & Drink Shop,Fish & Chips Shop,Fish Market,Flower Shop,Food
3,Bayview Village,Pizza Place,Sporting Goods Shop,Grocery Store,Sandwich Place,Fish Market,Coffee Shop,Bank,Hardware Store,Outdoor Supply Store,Breakfast Spot
4,Bedford Park,Construction & Landscaping,Seafood Restaurant,Locksmith,Yoga Studio,Filipino Restaurant,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck


In [41]:
grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
1,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bathurst Manor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bedford Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# set number of clusters
kclusters = 4

grouped_clustering = grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0,
       3, 0, 0, 0, 3, 3, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 3,
       0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int32)

In [46]:
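# average the coordinates per neighborhood (select numeric columns explicitly)
merged = df.groupby("Neighborhood")[["Latitude", "Longitude"]].mean().reset_index()

# add clustering labels (grouped and merged are both sorted by neighborhood, so rows align)
merged['Cluster Labels'] = kmeans.labels_

# join the top-venue table onto the per-neighborhood coordinates and cluster labels
merged = merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')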

merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,43.785353,-79.278549,0,Chinese Restaurant,Train Station,Coffee Shop,Vietnamese Restaurant,Hong Kong Restaurant,Korean Restaurant,Asian Restaurant,Cantonese Restaurant,Yoga Studio,Flower Shop
1,Alderwood,43.601717,-79.545232,0,Pizza Place,Pool,Bank,Skating Rink,Coffee Shop,Pub,Sandwich Place,Fish Market,Fast Food Restaurant,Filipino Restaurant
2,Bathurst Manor,43.763893,-79.456367,3,Playground,Park,Convenience Store,Baseball Field,Yoga Studio,Food & Drink Shop,Fish & Chips Shop,Fish Market,Flower Shop,Food
3,Bayview Village,43.769197,-79.376662,0,Pizza Place,Sporting Goods Shop,Grocery Store,Sandwich Place,Fish Market,Coffee Shop,Bank,Hardware Store,Outdoor Supply Store,Breakfast Spot
4,Bedford Park,43.737388,-79.410925,0,Construction & Landscaping,Seafood Restaurant,Locksmith,Yoga Studio,Filipino Restaurant,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck


In [47]:
# create map
location = geolocator.geocode("Toronto")
lat = location.latitude
lon = location.longitude

map_clusters = folium.Map(location=[lat, lon], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### CLUSTER ANALYSIS

In [56]:
merged['Cluster Labels'].value_counts()

0    79
3    19
1     2
2     1
Name: Cluster Labels, dtype: int64

With four clusters, the current split finds two main groups (clusters 0 and 3, with 79 and 19 neighborhoods) and two outlier groups (clusters 1 and 2, with 2 and 1 neighborhoods).

#### CLUSTER 0
* This cluster seems to cover the typical downtown areas, with lots of restaurants, stores, and cafes.

In [49]:
merged.loc[merged['Cluster Labels'] == 0, merged.columns[[0] + list(range(3, merged.shape[1]))]].head(10)

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,0,Chinese Restaurant,Train Station,Coffee Shop,Vietnamese Restaurant,Hong Kong Restaurant,Korean Restaurant,Asian Restaurant,Cantonese Restaurant,Yoga Studio,Flower Shop
1,Alderwood,0,Pizza Place,Pool,Bank,Skating Rink,Coffee Shop,Pub,Sandwich Place,Fish Market,Fast Food Restaurant,Filipino Restaurant
3,Bayview Village,0,Pizza Place,Sporting Goods Shop,Grocery Store,Sandwich Place,Fish Market,Coffee Shop,Bank,Hardware Store,Outdoor Supply Store,Breakfast Spot
4,Bedford Park,0,Construction & Landscaping,Seafood Restaurant,Locksmith,Yoga Studio,Filipino Restaurant,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
5,Berczy Park,0,Clothing Store,Coffee Shop,Restaurant,Cosmetics Shop,Tea Room,Café,Plaza,Seafood Restaurant,Burger Joint,American Restaurant
6,Birch Cliff,0,Clothing Store,Coffee Shop,Food Court,Fast Food Restaurant,Cosmetics Shop,Sandwich Place,Tea Room,Smoothie Shop,Fish & Chips Shop,Movie Theater
7,Brockton,0,Bar,Park,Vietnamese Restaurant,Jazz Club,Bakery,Dive Bar,Portuguese Restaurant,Coffee Shop,Café,French Restaurant
8,CFB Toronto,0,Ramen Restaurant,Restaurant,Korean Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Café,Japanese Restaurant,Movie Theater,Steakhouse
9,CN Tower,0,Clothing Store,Coffee Shop,Restaurant,Cosmetics Shop,Tea Room,Café,Plaza,Seafood Restaurant,Burger Joint,American Restaurant
10,Caledonia-Fairbanks,0,Sandwich Place,Discount Store,Wine Shop,Grocery Store,Convenience Store,Coffee Shop,Park,Food,Fish Market,Flower Shop


#### CLUSTER 1
* This outlier contains two data points, York Mills and York Mills West. They have exactly the same venue exploration results because both searches used the same latitude and longitude (see the cell below the table), so the Foursquare query could not distinguish between them.

In [50]:
merged.loc[merged['Cluster Labels'] == 1, merged.columns[[0] + list(range(3, merged.shape[1]))]].head(10)

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
99,York Mills,1,Wine Shop,Yoga Studio,Filipino Restaurant,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Court
100,York Mills West,1,Wine Shop,Yoga Studio,Filipino Restaurant,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Court


In [60]:
merged[merged['Neighborhood']=='York Mills'][['Neighborhood', 'Latitude', 'Longitude']], merged[merged['Neighborhood']=='York Mills West'][['Neighborhood', 'Latitude', 'Longitude']]

(   Neighborhood   Latitude  Longitude
 99   York Mills  43.746184 -79.420454,
         Neighborhood   Latitude  Longitude
 100  York Mills West  43.746184 -79.420454)
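Duplicate coordinates like these can be caught before querying Foursquare. The sketch below groups neighborhoods by rounded coordinates and reports any group with more than one name. The helper name `shared_coordinates` and the rounding precision are assumptions; the sample rows reuse coordinates printed in this notebook.

```python
from collections import defaultdict


def shared_coordinates(rows, ndigits=6):
    """Return {(lat, lon): [names]} for coordinates used by 2+ names."""
    groups = defaultdict(list)
    for name, lat, lon in rows:
        groups[(round(lat, ndigits), round(lon, ndigits))].append(name)
    return {coords: names for coords, names in groups.items() if len(names) > 1}


rows = [
    ("York Mills", 43.746184, -79.420454),
    ("York Mills West", 43.746184, -79.420454),
    ("Agincourt", 43.785353, -79.278549),
]
duplicates = shared_coordinates(rows)
print(duplicates)
```

Neighborhoods flagged this way are candidates for merging or for re-geocoding with a more specific query.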

#### CLUSTER 2
* Cluster 2 is another outlier, and the map shows that it is also a geographic outlier. It may be a suburb, and it might be best left out of a final analysis if we are only concerned with the city of Toronto.

In [51]:
merged.loc[merged['Cluster Labels'] == 2, merged.columns[[0] + list(range(3, merged.shape[1]))]].head(10)

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
88,Upper Rouge,2,Trail,Yoga Studio,Filipino Restaurant,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Court


#### CLUSTER 3
* Cluster 3 is all about parks, fitness, the great outdoors, and a little bit of yoga. It's interesting that these locations are fairly well dispersed geographically throughout the city. I was expecting them to be more geographically clustered before I checked their spatial distribution on the map.

In [52]:
merged.loc[merged['Cluster Labels'] == 3, merged.columns[[0] + list(range(3, merged.shape[1]))]].head(10)

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Bathurst Manor,3,Playground,Park,Convenience Store,Baseball Field,Yoga Studio,Food & Drink Shop,Fish & Chips Shop,Fish Market,Flower Shop,Food
16,Clarks Corners,3,Park,Caribbean Restaurant,Yoga Studio,Fish & Chips Shop,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
29,Enclave of M4L,3,Park,Light Rail Station,Gym / Fitness Center,Building,American Restaurant,Trail,Lake,Bus Stop,French Restaurant,Food Truck
31,Eringate,3,Playground,Park,Yoga Studio,Fast Food Restaurant,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Court
35,Forest Hill North & West,3,Park,Light Rail Station,Gym / Fitness Center,Building,American Restaurant,Trail,Lake,Bus Stop,French Restaurant,Food Truck
41,High Park,3,Park,Light Rail Station,Gym / Fitness Center,Building,American Restaurant,Trail,Lake,Bus Stop,French Restaurant,Food Truck
43,Humber Summit,3,Empanada Restaurant,Park,Bakery,Yoga Studio,Fish & Chips Shop,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant
46,India Bazaar,3,Park,Light Rail Station,Gym / Fitness Center,Building,American Restaurant,Trail,Lake,Bus Stop,French Restaurant,Food Truck
50,Kingsview Village,3,Park,Yoga Studio,Filipino Restaurant,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Court
51,Lawrence Manor,3,Bank,Park,Kids Store,Electronics Store,Food Court,Fish Market,Flower Shop,Food,Food & Drink Shop,Food Truck
