# Segmenting and Clustering Neighborhoods in Toronto
 

This is an assignment from the Applied Data Science Capstone course, the last part of the IBM Data Science Specialization on Coursera.  
Requirements:
* Explore, segment, and cluster the neighborhoods in the city of Toronto. 
* Scrape the Wikipedia page containing the Toronto neighbourhood data, wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format.
* Utilize the Foursquare API to explore venues and spots in the neighbourhoods.
* Cluster the neighbourhoods using k-means.

## Part 1 
In this section, I built a dataframe listing each Toronto postal code with its borough and neighbourhood, using the postal code table from Wikipedia.

In [1]:
import pandas as pd  #import the necessary libraries
import numpy as np 

Cn_PostalCodes = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0] #obtain the Canadian tabular postal code data from Wikipedia 
Cn_PostalCodes.head()  #first five elements of the dataframe

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The code below deletes rows where the Boroughs do not have assigned values.

In [2]:
indexNames = Cn_PostalCodes[(Cn_PostalCodes['Borough'] == 'Not assigned')].index #Get indices of rows to be dropped. We want to drop rows where Boroughs do not have assigned values
Cn_PostalCodes.drop(indexNames, inplace = True)
Cn_PostalCodes.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The code below applies the following rule: 'If a cell has a borough but does not have an assigned neighborhood, then the neighborhood will be the same as the borough.' Since all rows with an unassigned borough have already been removed, any row whose neighbourhood is unassigned is guaranteed to have a borough.

In [4]:
# Rows whose neighbourhood is 'Not assigned' take the borough name instead.
# Assigning through .loc writes to the dataframe; assigning to the row yielded by iterrows() may only change a copy.
mask = Cn_PostalCodes['Neighbourhood'] == 'Not assigned'
Cn_PostalCodes.loc[mask, 'Neighbourhood'] = Cn_PostalCodes.loc[mask, 'Borough']
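
As a quick check of the rule, the sketch below applies it to a small made-up frame using a boolean mask with `.loc`. The rows (including the postal code `M9Z` and `Example Borough`) are illustrative only and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy data for illustration only; the second row mimics a borough without a neighbourhood
demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Example Borough'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Select the rows whose neighbourhood is missing and copy the borough into them
mask = demo['Neighbourhood'] == 'Not assigned'
demo.loc[mask, 'Neighbourhood'] = demo.loc[mask, 'Borough']

print(demo)
```

Writing through `.loc` changes the dataframe itself, which is what the rule requires.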

The assignment requires that if a Postal Code has more than one Neighborhood, the rows be combined into one row, with the neighbourhoods separated by commas. From the few rows shown above, this appears to be done already. My assumption is that the Wikipedia table was recently modified to follow this format. To be sure, I wrote the code below to check whether any Postal Code occurs more than once.

In [6]:
countofPostCode = Cn_PostalCodes.groupby(['Postal Code']).count()  #dataframe to count the frequencies of the objects over the Post Code Column 
countofPostCode
for index, row in countofPostCode.iterrows():  #iterate over the frequency dataframe
    if row['Neighbourhood'] != 1:   #Check if any Post Code has more than 1 Neighborhood
        print("We have a culprit")  #If so, print "We have a culprit"

The code above does not generate any output, which means no Postal Code occurs on more than one row.
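
The same check can be written without a loop, and the combining step the assignment asks for can be sketched for the case where duplicates do exist. The frame below is made up for illustration (`M1X` is repeated on purpose); it is not the Wikipedia data.

```python
import pandas as pd

# Toy data for illustration only: 'M1X' appears twice on purpose
demo = pd.DataFrame({
    'Postal Code': ['M1X', 'M1X', 'M2Y'],
    'Borough': ['Borough A', 'Borough A', 'Borough B'],
    'Neighbourhood': ['Alpha', 'Beta', 'Gamma'],
})

# True when at least one postal code occurs on more than one row
has_duplicates = demo['Postal Code'].duplicated().any()

# Collapse repeated postal codes into one row, joining the neighbourhood names with a comma
combined = (
    demo.groupby(['Postal Code', 'Borough'], as_index=False)['Neighbourhood']
        .agg(', '.join)
)

print(has_duplicates)
print(combined)
```

On the real dataframe, `has_duplicates` being `False` would correspond to the loop above printing nothing.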

In [7]:
Cn_PostalCodes.shape[0]  #number of rows of the dataframe

103

## Part 2
In order to use the Foursquare location data, the latitude and longitude coordinates of each neighborhood need to be added to the dataframe. In this section, the coordinates were added from a CSV file that holds the geographical coordinates of each postal code.

In [8]:
Lat_lng = pd.read_csv('http://cocl.us/Geospatial_data')   #File containing the Coordinates of the Postal Codes

In [9]:
Lat_lng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
Cn_PostalCodesCord = Cn_PostalCodes.merge(Lat_lng, on = 'Postal Code')  #Combine the two dataframes to incorporate the Coordinates into the neighbourhood info

In [11]:
Cn_PostalCodesCord.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
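
`merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without warning. A minimal sketch of one way to check for that, using made-up frames (`M0Z` and `Example Borough` are hypothetical) and the `how` and `indicator` parameters of `merge`:

```python
import pandas as pd

# Toy frames for illustration only; 'M0Z' has no coordinates on purpose
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0Z'],
                      'Borough': ['North York', 'North York', 'Example Borough']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; the indicator column records where each row was found
checked = codes.merge(coords, on='Postal Code', how='left', indicator=True)

# Rows marked 'left_only' are postal codes that an inner join would have dropped
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.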


## Part 3
The main analysis was done in this section. The neighbourhoods were visualized on a map using folium.
The venues for each neighbourhood were retrieved from the Foursquare API, explored and thereafter clustered.

In [12]:
import json   #import library to handle json files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim   # for converting an address into latitude and longitude values

import requests   # library to handle requests
from pandas import json_normalize  #for transforming the json file into a dataframe (pandas >= 1.0; older versions expose it under pandas.io.json)

#matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans  #import k-means

!conda install -c conda-forge folium=0.5.0 --yes
import folium   #map rendering library

print('Libraries imported')

Collecting package metadata (repodata.json): ...working... failed



CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/win-64'




Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported


Here, the coordinates of Toronto were obtained using Nominatim from the geopy library.

In [13]:
address = 'Toronto'
geolocator = Nominatim(user_agent = "Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Display Toronto on a map and add the neighbourhoods to it. Click a circle on the map to view the name of the neighbourhood and its borough.

In [14]:
Toronto_Map = folium.Map(location = [latitude, longitude], zoom_start = 10)

for lat, long, borough, neighbourhood in zip(Cn_PostalCodesCord['Latitude'], Cn_PostalCodesCord['Longitude'], Cn_PostalCodesCord['Borough'], Cn_PostalCodesCord['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
    [lat, long],
    radius = 7,
    popup = label,
    color = 'purple',
    fill = True,
    fill_color = '#3186cc',
    fill_opacity = 0.7,
    parse_html = False).add_to(Toronto_Map)
    
    
Toronto_Map

I will be working on Neighbourhoods with Boroughs containing 'Toronto'


In [95]:
Toronto = Cn_PostalCodesCord[Cn_PostalCodesCord['Borough'].str.contains("Toronto")].reset_index(drop=True)
Toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [96]:
Cn_PostalCodesCord.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [97]:
Toronto_Map = folium.Map(location = [latitude, longitude], zoom_start = 10)

for lat, long, borough, neighbourhood in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Borough'], Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
    [lat, long],
    radius = 7,
    popup = label,
    color = 'purple',
    fill = True,
    fill_color = '#3186cc',
    fill_opacity = 0.7,
    parse_html = False).add_to(Toronto_Map)
    
    
Toronto_Map

Start exploring with Foursquare API data

In [98]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are a local convention, not a Foursquare requirement.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'  # Foursquare API version date


In [99]:
Toronto.loc[0, 'Neighbourhood']

'Regent Park, Harbourfront'

In [100]:
neighborhood_latitude = Toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Toronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = Toronto.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Regent Park, Harbourfront are 43.6542599, -79.3606359.


Get up to 50 venues from the Foursquare API within a 500-metre radius of Regent Park, Harbourfront.

In [101]:
LIMIT = 50  #The limit on the number of venues returned by the Foursquare API

radius = 500   #Within radius 500

#create the GET request URL 
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url



'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=50'
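
As an alternative to string formatting, the query string can be assembled from a dictionary with the standard library's `urlencode`, which also escapes the values. This is only a sketch: `build_explore_url` is a hypothetical helper name, and the credentials passed in are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL used above from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value and joins the pairs with '&'
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, not real ones
demo_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.6542599, -79.3606359, 500, 50)
print(demo_url)
```

Note that `urlencode` writes the comma in the `ll` value as `%2C`, which is the percent-encoded form of the same character.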

Send the GET request

In [102]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f1da6a9d8d6542aaa2bac65'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 46,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
 

In [103]:
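#create a function to extract the category of the venue

def get_category_type(row):
    #the category list is stored under 'categories' or, after flattening, under 'venue.categories'
    try: 
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    #return the name of the first category, or None when the venue has no category
    if len(categories_list) == 0:
        return None
    else: 
        return categories_list[0]['name']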
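
To illustrate what the extraction does, the sketch below runs the same "first category name, or nothing" logic on two made-up rows shaped like the flattened Foursquare items. `first_category_name` is a hypothetical stand-in for the function above, and the venue names are invented.

```python
import pandas as pd

def first_category_name(categories):
    """Return the name of the first category in the list, or None if the list is empty."""
    return categories[0]['name'] if categories else None

# Toy rows shaped like the flattened Foursquare items; the venue names are made up
demo = pd.DataFrame({
    'venue.name': ['Example Bakery', 'Example Lot'],
    'venue.categories': [[{'id': 'x1', 'name': 'Bakery'}], []],
})

# Apply the extraction to the category column only
demo['category'] = demo['venue.categories'].apply(first_category_name)
print(demo[['venue.name', 'category']])
```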
    

Now clean the JSON response and structure it into a dataframe.

In [104]:
venues= results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)  #flatten venues

#filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

#replace the raw category list in each row with the name of its first category
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis = 1)

#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


Number of Venues returned by Foursquare

In [105]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

46 venues were returned by Foursquare.


Create a function to repeat the same process for all the neighborhoods with Boroughs containing 'Toronto'.

In [106]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
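
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A minimal sketch of the row-building step with that case guarded is shown below; `venue_rows` is a hypothetical helper, and the sample items are hand-written in the shape of the response shown earlier rather than real API output.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples.

    Venues without any category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written sample in the shape of the response shown earlier; values are illustrative
sample_items = [
    {'venue': {'name': 'Example Cafe', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = venue_rows('Regent Park, Harbourfront', 43.65426, -79.360636, sample_items)
print(rows)
```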

Run the above function on each neighborhood and create a new dataframe called Toronto_venues.

In [108]:
Toronto_venues = getNearbyVenues(names=Toronto['Neighbourhood'],
                                   latitudes=Toronto['Latitude'],
                                   longitudes=Toronto['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [109]:
Toronto_venues

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.654260,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.654260,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.654260,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.654260,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.654260,-79.360636,Impact Kitchen,43.656369,-79.356980,Restaurant
5,"Regent Park, Harbourfront",43.654260,-79.360636,Corktown Common,43.655618,-79.356211,Park
6,"Regent Park, Harbourfront",43.654260,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
7,"Regent Park, Harbourfront",43.654260,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
8,"Regent Park, Harbourfront",43.654260,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
9,"Regent Park, Harbourfront",43.654260,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


Check the size of the dataframe

In [111]:
print(Toronto_venues.shape)
Toronto_venues.head()

(1192, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Check how many venues were returned for each neighborhood

In [113]:
Toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,50,50,50,50,50,50
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",17,17,17,17,17,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,50,50,50,50,50,50
Christie,17,17,17,17,17,17
Church and Wellesley,50,50,50,50,50,50
"Commerce Court, Victoria Hotel",50,50,50,50,50,50
Davisville,34,34,34,34,34,34
Davisville North,9,9,9,9,9,9


Let's find out how many unique categories can be curated from all the returned venues.

In [114]:
print('There are {} uniques categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 213 uniques categories.


Analyze each neighbourhood

In [115]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood'] = Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Examine the new dataframe size

In [118]:
Toronto_onehot.shape

(1192, 214)

Next, group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [119]:
Toronto_grouped = Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.066667,0.066667,0.133333,0.2,0.066667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [120]:
Toronto_grouped.shape

(39, 214)
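
To make the one-hot and group-by steps concrete, here is a small sketch on made-up data (two neighbourhoods, `A` and `B`, with a handful of invented venues). It shows why the mean of each 0/1 column equals the share of a neighbourhood's venues that fall in that category.

```python
import pandas as pd

# Toy venue list for illustration only
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the share of the neighbourhood's venues in that category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

In this toy case, half of the venues in `A` are coffee shops, so its `Coffee Shop` value is 0.5, while every venue in `B` is a park.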

Print each neighborhood along with the top 5 most common venues

In [121]:
num_top_venues = 5

for hood in Toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Toronto_grouped[Toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0     Coffee Shop  0.08
1        Beer Bar  0.04
2          Bakery  0.04
3      Restaurant  0.04
4  Farmers Market  0.04


----Brockton, Parkdale Village, Exhibition Place----
                   venue  freq
0                   Café  0.13
1         Breakfast Spot  0.09
2  Performing Arts Venue  0.09
3            Coffee Shop  0.09
4                    Gym  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venue  freq
0  Light Rail Station  0.12
1         Yoga Studio  0.06
2       Auto Workshop  0.06
3          Comic Shop  0.06
4         Pizza Place  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0  Airport Service  0.20
1   Airport Lounge  0.13
2          Airport  0.07
3              Bar  0.07
4      Coffee Shop  0.07


----Central Bay Street----
                venue  freq
0       

Putting that into a dataframe

In [122]:
#function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
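
A short usage sketch of the function on a single made-up frequency row (the function is repeated so the example runs on its own):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as above: skip the neighbourhood name, sort the frequencies, keep the labels
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy frequency row for illustration only
row = pd.Series({'Neighbourhood': 'A', 'Coffee Shop': 0.5, 'Park': 0.3, 'Pub': 0.2})
top_two = return_most_common_venues(row, 2)
print(list(top_two))
```

When several categories share the same frequency, their relative order after sorting is not guaranteed, so tied venues may appear in a different order from one listing to another.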

Create the new dataframe and display the top 10 venues for each neighborhood.

In [142]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoodss_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoodss_venues_sorted['Neighbourhood'] = Toronto_grouped['Neighbourhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighbourhoodss_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoodss_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Café,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Cocktail Bar,Seafood Restaurant,Bistro
1,"Brockton, Parkdale Village, Exhibition Place",Café,Performing Arts Venue,Coffee Shop,Breakfast Spot,Bakery,Convenience Store,Pet Store,Climbing Gym,Restaurant,Burrito Place
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Yoga Studio,Spa,Fast Food Restaurant,Farmers Market,Comic Shop,Park,Pizza Place,Recording Studio,Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport,Bar,Harbor / Marina,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry,Coffee Shop
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Café,Comic Shop,Salad Place,Burger Joint,Poke Place,Pizza Place


# Cluster the Neighbourhoods

Run k-means to group the neighborhoods into 5 clusters.

In [143]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighbourhood')  # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init = 'k-means++', random_state=0, n_init = 12).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])
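
The number of clusters is fixed at 5 here. One common heuristic for examining that choice, which is not part of the original analysis, is to fit k-means for several values of k and compare the within-cluster sum of squares (`inertia_`). The sketch below does this on synthetic data, since the venue frequencies are not reproduced here; the blob centres and sizes are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration only: three well-separated blobs in two dimensions
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centres])

# Fit k-means for several values of k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init='k-means++', n_init=12, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The value of k after which the inertia stops dropping sharply is often taken as a reasonable candidate; applying the same loop to `Toronto_grouped_clustering` would be one way to revisit the choice of 5.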

Create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.

In [144]:
# add clustering labels
neighbourhoodss_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = Toronto

# join the sorted venues (with cluster labels) onto the Toronto dataframe, which holds latitude/longitude for each neighbourhood
Toronto_merged = Toronto_merged.join(neighbourhoodss_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Toronto_merged.head() # first five rows

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Dessert Shop,Shoe Store,Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Diner,Park,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,College Auditorium
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Café,Tea Room,Bookstore,Clothing Store,Ramen Restaurant,Theater,Cosmetics Shop,Fast Food Restaurant,Japanese Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Café,Cosmetics Shop,Coffee Shop,Seafood Restaurant,Hotel,Creperie,Gastropub,Restaurant,Farmers Market,Bookstore
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Neighborhood,Pub,Coffee Shop,Asian Restaurant,Health Food Store,Trail,Department Store,Dessert Shop,Dim Sum Restaurant,Yoga Studio


Visualizing the Clusters

In [145]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Neighbourhoods in each cluster and their top 10 most common venues

In [146]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 0, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Dessert Shop,Shoe Store,Restaurant
1,"Queen's Park, Ontario Provincial Government",0,Coffee Shop,Diner,Park,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,College Auditorium
4,The Beaches,0,Neighborhood,Pub,Coffee Shop,Asian Restaurant,Health Food Store,Trail,Department Store,Dessert Shop,Dim Sum Restaurant,Yoga Studio
6,Central Bay Street,0,Coffee Shop,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Café,Comic Shop,Salad Place,Burger Joint,Poke Place,Pizza Place
13,"Toronto Dominion Centre, Design Exchange",0,Coffee Shop,Café,Hotel,Seafood Restaurant,Restaurant,Japanese Restaurant,Beer Bar,Ice Cream Shop,Salad Place,Pub
31,"Summerhill West, Rathnelly, South Hill, Forest...",0,Pub,Coffee Shop,Fried Chicken Joint,Supermarket,Sushi Restaurant,Bank,Sports Bar,Pizza Place,Restaurant,Liquor Store


In [147]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 1, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Garden District, Ryerson",1,Coffee Shop,Café,Tea Room,Bookstore,Clothing Store,Ramen Restaurant,Theater,Cosmetics Shop,Fast Food Restaurant,Japanese Restaurant
3,St. James Town,1,Café,Cosmetics Shop,Coffee Shop,Seafood Restaurant,Hotel,Creperie,Gastropub,Restaurant,Farmers Market,Bookstore
5,Berczy Park,1,Coffee Shop,Bakery,Café,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Cocktail Bar,Seafood Restaurant,Bistro
7,Christie,1,Grocery Store,Café,Park,Athletics & Sports,Nightclub,Candy Store,Baby Store,Coffee Shop,Restaurant,Italian Restaurant
8,"Richmond, Adelaide, King",1,Coffee Shop,Café,Steakhouse,Concert Hall,Hotel,American Restaurant,Restaurant,Noodle House,Department Store,Smoke Shop
9,"Dufferin, Dovercourt Village",1,Pharmacy,Pizza Place,Bakery,Bank,Middle Eastern Restaurant,Brewery,Park,Wine Shop,Music Venue,Supermarket
10,"Harbourfront East, Union Station, Toronto Islands",1,Coffee Shop,Aquarium,Café,Plaza,Hotel,Brewery,Park,Sandwich Place,History Museum,Salad Place
11,"Little Portugal, Trinity",1,Bar,Asian Restaurant,Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Café,Coffee Shop,Men's Store,Yoga Studio,Record Shop
12,"The Danforth West, Riverdale",1,Greek Restaurant,Italian Restaurant,Coffee Shop,Furniture / Home Store,Restaurant,Ice Cream Shop,Cosmetics Shop,Brewery,Bubble Tea Shop,Café
14,"Brockton, Parkdale Village, Exhibition Place",1,Café,Performing Arts Venue,Coffee Shop,Breakfast Spot,Bakery,Convenience Store,Pet Store,Climbing Gym,Restaurant,Burrito Place


In [149]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park, Summerhill East",2,Restaurant,Trail,Cupcake Shop,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant


In [150]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 3, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,"Forest Hill North & West, Forest Hill Road Park",3,Jewelry Store,Trail,Sushi Restaurant,Park,Garden,Cupcake Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store
33,Rosedale,3,Park,Playground,Trail,Cupcake Shop,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store,Diner


In [151]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 4, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,4,Garden,Music Venue,Home Service,Yoga Studio,Dance Studio,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store
