# Peer-graded Assignment: Capstone Project Notebook
### Created by: Eric J. Puttock

# Assignment Week 2:

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

### Import all the libraries to be used for later just in case.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Part 1:

We will read in data from the table from the following Wikipedia page listing postal codes of Canada.

In [2]:
WikiLink = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

The data that from the table is read into a pandas dataframe.

In [3]:
import pandas as pd
wikidf = pd.read_html(WikiLink)[0]
wikidf

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


We remove the rows of 'Borough' with 'Not assigned' from the dataframe (and reset index).

In [4]:
wikidf = wikidf[wikidf['Borough'] != 'Not assigned'].reset_index(drop=True)
wikidf

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


If a cell has a 'Borough' but a 'Not assigned' 'Neighborhood', then the 'Neighborhood' should be assigned the same as borough.
However, as shown below, there aren't any 'Neighbourhood' rows with 'Not assigned' after removal of 'Not assigned' 'Borough.

Therefore, no additional effort will be made to code that portion.

In [5]:
wikidf[wikidf['Neighbourhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


Here, we show our cleaned up dataframe contain 103 rows (Postal Codes) from Canada.

In [6]:
wikidf.shape

(103, 3)

## Part 2:

Remark: 
Here is my attempt at creating the Geospatial data dataframe (as instructed by the assignment using a while loop). I added a maximum iteration run and print outs so I can keep track and end the iteration. The following cell requires geocoder using Google API. However, Google-Geocode denies the request.

We are kindly provided with the Geospatial data in case we run into such problems for this assignment.

Please skip to "Import Geospatial data of Toronto".

In [7]:
# Install geocoder if needed:
# !conda install -c conda-forge geocoder --yes

In [8]:
import geocoder # import geocoder
# Attempt to find geolocatin of 'M1B'
postal_code = 'M1B'

# initialize your variable to None
lat_lng_coords = None
kk = 0
maxKK = 10

# loop until you get the coordinates (or max number of iterations).
while(lat_lng_coords is None and kk < maxKK):
    kk +=1
    query = '{}, Toronto, Ontario'.format(postal_code)
    g = geocoder.google(query)
    print(kk, g)
    lat_lng_coords = g.latlng

# Try to get coordinates. If it's empty, the exception would print out 'Did not get coordinates'.
try:
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print(lat_lng_coords)
except:
    print('Did not get coordinates.')

1 <[REQUEST_DENIED] Google - Geocode [empty]>
2 <[REQUEST_DENIED] Google - Geocode [empty]>
3 <[REQUEST_DENIED] Google - Geocode [empty]>
4 <[REQUEST_DENIED] Google - Geocode [empty]>
5 <[REQUEST_DENIED] Google - Geocode [empty]>
6 <[REQUEST_DENIED] Google - Geocode [empty]>
7 <[REQUEST_DENIED] Google - Geocode [empty]>
8 <[REQUEST_DENIED] Google - Geocode [empty]>
9 <[REQUEST_DENIED] Google - Geocode [empty]>
10 <[REQUEST_DENIED] Google - Geocode [empty]>
Did not get coordinates.


### Import Geospatial data of Toronto.
Since, the above geocoder with google failed, I will continue with the provided CSV data as requested.

In [9]:
GS_url = 'http://cocl.us/Geospatial_data'

In [10]:
geodf = pd.read_csv(GS_url)
geodf.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will join latitude & longitude data using Postal Code onto wikidf dataframe.

In [11]:
wikigeodf = wikidf.join(geodf.set_index('Postal Code'), on='Postal Code')
wikigeodf

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


## Part 3:

We will use geocoder to obtain the geographical coordinate of Toronto.
Install geocoder if needed, or geocoder was not used in Part 2.

In [12]:
# !conda install -c conda-forge geocoder --yes

In [13]:
import geocoder
from geopy.geocoders import Nominatim 

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto is {}, {}.'.format(latitude, longitude))

The geographical coordinate of Toronto is 43.6534817, -79.3839347.


Now, we will take the map of Toronto and place markers in each Neighbourhood.

In [14]:
import folium

# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(wikigeodf['Latitude'], wikigeodf['Longitude'], wikigeodf['Borough'], wikigeodf['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_Toronto)  
    
map_Toronto

In [15]:
print(wikigeodf.shape)
wikigeodf.head()

(103, 5)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [16]:
print(wikigeodf['Borough'].unique())

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(wikigeodf['Borough'].unique()),
        wikigeodf.shape[0])
)

['North York' 'Downtown Toronto' 'Etobicoke' 'Scarborough' 'East York'
 'York' 'East Toronto' 'West Toronto' 'Central Toronto' 'Mississauga']
The dataframe has 10 boroughs and 103 neighborhoods.


Let's look into one of the Boroughs of Toronto: North York.

In [17]:
NorthYork_data = wikigeodf[wikigeodf['Borough'] == 'North York'].reset_index(drop=True)
NorthYork_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


In [18]:
NorthYork_data.shape

(24, 5)

In [19]:
address = 'North York, Toronto'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of North York is {}, {}.'.format(latitude, longitude))

The geograpical coordinate of North York is 43.7543263, -79.44911696639593.


In [20]:
# create map of Manhattan using latitude and longitude values
map_NorthYork = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip( NorthYork_data['Latitude'], NorthYork_data['Longitude'], NorthYork_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NorthYork)  
    
map_NorthYork

Here is a sample of data extraction of the first row from NorthYork_data.

In [21]:
neighbourhood_latitude = NorthYork_data.loc[12, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = NorthYork_data.loc[12, 'Longitude'] # neighborhood longitude value
neighbourhood_name = NorthYork_data.loc[12, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of York Mills, Silver Hills are 43.7574902, -79.37471409999999.


Enter Foursquare Credentials.
My credentials are hidden for obvious reasons!

In [22]:
## Hide:
#<--
CLIENT_ID = '-' # your Foursquare ID
CLIENT_SECRET = '-' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)
#-->

Create getNearbyVenues to obtain venues from North York.

Send the get request to FourSquare API and examine the results for Nearby Venues.

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        if len(results)==0:
            print(r'**WARNING: {} does not have any categories within {} radius.**'.format(name,radius))
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [24]:
NorthYork_venues = getNearbyVenues(names=NorthYork_data['Neighbourhood'],
                                   latitudes=NorthYork_data['Latitude'],
                                   longitudes=NorthYork_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


Let's check out the some of the venues (and their locations)

Remark: After reviewing the data, there are no recommendations for 'York Mills, Silver Hills' within radius of 500.

In [25]:
NorthYork_venues[NorthYork_venues['Neighbourhood']=='York Mills, Silver Hills']

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


Example: 'York Mills West' showing two categories within radius of 500.

In [26]:
NorthYork_venues[NorthYork_venues['Neighbourhood']=='York Mills West']

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
236,York Mills West,43.752758,-79.400049,Kitchen Food Fair,43.751298,-79.401393,Convenience Store
237,York Mills West,43.752758,-79.400049,Tournament Park,43.751257,-79.399717,Park


Let's see how many unique categories are in each neighbourhood.

Remark 1: We have duplicated Neighbourhoods within 'North York'. Example: Downsview is duplicated, but have different area codes. This presents a problem in terms of simply grouping and counting. This needs additional work to separate out correctly by area codes, or even the coodrinates themselves with Neighborhoods. See the second set dataframe table.

Remark 2: We lost at exactly one neighbourhood 'York Mills, Silver Hills'. There were no categories recommended within radius of 500.

In [27]:
NYvCounts = NorthYork_venues[['Neighbourhood','Venue Category']].groupby('Neighbourhood').count()
print("Originally, we considered {} neighborhoods of 'North York' borough.".format(NorthYork_data.shape[0]))
print('We only have {} rows neighborhoods.'.format(NYvCounts.shape[0]))
print('This is due to duplications of same named neighborhoods and some neighborhoods having no categories.')
NYvCounts

Originally, we considered 24 neighborhoods of 'North York' borough.
We only have 19 rows neighborhoods.
This is due to duplications of same named neighborhoods and some neighborhoods having no categories.


Unnamed: 0_level_0,Venue Category
Neighbourhood,Unnamed: 1_level_1
"Bathurst Manor, Wilson Heights, Downsview North",23
Bayview Village,4
"Bedford Park, Lawrence Manor East",27
Don Mills,28
Downsview,13
"Fairview, Henry Farm, Oriole",64
Glencairn,4
Hillcrest Village,5
Humber Summit,2
"Humberlea, Emery",2


In [28]:
NYvCounts2 = NorthYork_venues[['Neighbourhood','Neighbourhood Longitude','Venue Category']].groupby(['Neighbourhood','Neighbourhood Longitude']).count()
print("Originally, we considered {} neighborhoods of 'North York' borough.".format(NorthYork_data.shape[0]))
print('There are only {} neighborhoods that have at least one category for recommendation.'.format(NYvCounts2.shape[0]))
print('We have lost {} neighborhoods.'.format(NorthYork_data.shape[0]-NYvCounts2.shape[0]))
NYvCounts2

Originally, we considered 24 neighborhoods of 'North York' borough.
There are only 23 neighborhoods that have at least one category for recommendation.
We have lost 1 neighborhoods.


Unnamed: 0_level_0,Unnamed: 1_level_0,Venue Category
Neighbourhood,Neighbourhood Longitude,Unnamed: 2_level_1
"Bathurst Manor, Wilson Heights, Downsview North",-79.442259,23
Bayview Village,-79.385975,4
"Bedford Park, Lawrence Manor East",-79.41975,27
Don Mills,-79.352188,6
Don Mills,-79.340923,22
Downsview,-79.520999,4
Downsview,-79.506944,4
Downsview,-79.495697,2
Downsview,-79.464763,3
"Fairview, Henry Farm, Oriole",-79.346556,64


In [29]:
print('There are {} unique categories.'.format(len(NorthYork_venues['Venue Category'].unique())),'within')

There are 106 unique categories. within


In [30]:
NorthYork_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


Let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category.

In [31]:
# one hot encoding
NorthYork_onehot = pd.get_dummies(NorthYork_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
NorthYork_onehot['Neighbourhood'] = NorthYork_venues['Neighbourhood']
# We also add Longitude to prevent duplicated names from different area codes grouping together.
NorthYork_onehot['Neighbourhood Longitude'] = NorthYork_venues['Neighbourhood Longitude'] 
# move neighborhood column to the first column
fixed_columns = [NorthYork_onehot.columns[-1]] + list(NorthYork_onehot.columns[:-1])
NorthYork_onehot = NorthYork_onehot[fixed_columns]

# let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
#NorthYork_grouped = NorthYork_onehot.groupby('Neighbourhood').mean().reset_index()
# let's group rows by neighborhood and longtigude by taking the mean of the frequency of occurrence of each category
NorthYork_grouped = NorthYork_onehot.groupby(['Neighbourhood','Neighbourhood Longitude']).mean().reset_index()
NorthYork_grouped.shape

(23, 108)

In [32]:
NorthYork_grouped.head()

Unnamed: 0,Neighbourhood,Neighbourhood Longitude,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Store,Bike Shop,Bookstore,Boutique,Breakfast Spot,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Station,Bus Stop,Butcher,Café,Caribbean Restaurant,Chinese Restaurant,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Electronics Store,Event Space,Fast Food Restaurant,Financial or Legal Service,Food & Drink Shop,Food Court,Food Service,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gas Station,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gym,Hobby Shop,Hockey Arena,Hotel,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Liquor Store,Lounge,Luggage Store,Massage Studio,Mediterranean Restaurant,Metro Station,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Movie Theater,Park,Pet Store,Pharmacy,Pizza Place,Plaza,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Shoe Store,Shopping Mall,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",-79.442259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,-79.385975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",-79.41975,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.074074,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.037037,0.0,0.0,0.0,0.037037,0.0,0.111111,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.037037,0.0,0.074074,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0
3,Don Mills,-79.352188,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills,-79.340923,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.045455,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.045455,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.045455,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's find the top 5 venues from each neighbourhood.

Note: Some neighbourhoods may not have 5 venues. If they don't have additional venues 'No More Venues' will be returned.

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # Some neighboorhoods don't have enough venues. In this case, we want to return 'No More Venues'.
    LRCS = len(row_categories_sorted.index.values[row_categories_sorted!=0])
    
    if LRCS < num_top_venues:
        output = row_categories_sorted.index.values[0:LRCS]
        for k in range(LRCS,num_top_venues):
            output=np.insert(output,len(output),'No More Venues')
        return output
    else:
        output = row_categories_sorted.index.values[0:num_top_venues]
        return output

In [34]:
num_top_venues = 5
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
#columns = ['Neighbourhood']
columns = ['Neighbourhood','Neighbourhood Longitude']

for ind in np.arange(num_top_venues):
    #columns.append('Most Common Venue')
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = NorthYork_grouped['Neighbourhood']
neighbourhoods_venues_sorted['Neighbourhood Longitude'] = NorthYork_grouped['Neighbourhood Longitude']

for ind in np.arange(NorthYork_grouped.shape[0]):
    #neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(NorthYork_grouped.iloc[ind, :], num_top_venues)
    neighbourhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(NorthYork_grouped.iloc[ind, :], num_top_venues)
    
neighbourhoods_venues_sorted.tail()

Unnamed: 0,Neighbourhood,Neighbourhood Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,Victoria Village,-79.315572,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
19,"Willowdale, Newtonbrook",-79.408493,Park,Women's Store,No More Venues,No More Venues,No More Venues
20,"Willowdale, Willowdale East",-79.408493,Ramen Restaurant,Café,Restaurant,Pizza Place,Sandwich Place
21,"Willowdale, Willowdale West",-79.442259,Coffee Shop,Pharmacy,Pizza Place,Bank,Women's Store
22,York Mills West,-79.400049,Park,Convenience Store,Women's Store,No More Venues,No More Venues


Remark: Since we joined them using on group, we have a few duplicates of the data. We need to drop data that wasn't joined correctly. For exampe, Don Mills with duplicated Neighbourhoods, have few incorect joints. We can drop rows with Longitudes that do not agree with Neighbourhood Longitude (like a foregin key to a database). I need to figure out how to connect them using multiple index keys in the future.

In [35]:
NorthYork_merged = NorthYork_data
NorthYork_merged = NorthYork_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on=['Neighbourhood'])
NYm = NorthYork_merged.reset_index(drop=True)
print(NYm.shape)
NYm

(38, 11)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Neighbourhood Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,-79.329656,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,-79.315572,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,-79.464763,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills,43.745906,-79.352188,-79.352188,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
4,M3B,North York,Don Mills,43.745906,-79.352188,-79.340923,Gym,Restaurant,Beer Store,Coffee Shop,Bike Shop
5,M6B,North York,Glencairn,43.709577,-79.445073,-79.445073,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store
6,M3C,North York,Don Mills,43.7259,-79.340923,-79.352188,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
7,M3C,North York,Don Mills,43.7259,-79.340923,-79.340923,Gym,Restaurant,Beer Store,Coffee Shop,Bike Shop
8,M2H,North York,Hillcrest Village,43.803762,-79.363452,-79.363452,Mediterranean Restaurant,Golf Course,Fast Food Restaurant,Dog Run,Pool
9,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,-79.442259,Coffee Shop,Bank,Frozen Yogurt Shop,Sandwich Place,Ice Cream Shop


In [36]:
NYMedit = NYm[['Longitude','Neighbourhood Longitude']]
NYMedit.head()

Unnamed: 0,Longitude,Neighbourhood Longitude
0,-79.329656,-79.329656
1,-79.315572,-79.315572
2,-79.464763,-79.464763
3,-79.352188,-79.352188
4,-79.352188,-79.340923


Identify the rows where the two columns don't agree.

In [37]:
drop_list = []
keep_list = []
for rownum in range(0,NYMedit.shape[0]):
    if NYMedit['Longitude'][rownum] != NYMedit['Neighbourhood Longitude'][rownum]:
        # Just to keep the special cases of when we don't have any venues categories.
        if np.isnan(NYMedit['Neighbourhood Longitude'][rownum]):
                keep_list.append(rownum)
                continue
        drop_list.append(rownum)
    else:
        keep_list.append(rownum)

print('We keep the numbers in rows' , keep_list)

We keep the numbers in rows [0, 1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 16, 17, 19, 22, 23, 24, 27, 29, 30, 31, 32, 36, 37]


In [38]:
NYm_fixed = NYm.iloc[keep_list,:].reset_index(drop=True)
NYm_fixed.drop(columns=['Neighbourhood Longitude'],inplace=True)

Here's our nice organized dataframe!

In [39]:
NYm_fixed

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills,43.745906,-79.352188,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
4,M6B,North York,Glencairn,43.709577,-79.445073,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store
5,M3C,North York,Don Mills,43.7259,-79.340923,Gym,Restaurant,Beer Store,Coffee Shop,Bike Shop
6,M2H,North York,Hillcrest Village,43.803762,-79.363452,Mediterranean Restaurant,Golf Course,Fast Food Restaurant,Dog Run,Pool
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,Coffee Shop,Bank,Frozen Yogurt Shop,Sandwich Place,Ice Cream Shop
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262,Caribbean Restaurant,Metro Station,Massage Studio,Coffee Shop,Bar


In [40]:
NYm_fixed.loc[:,'1st Most Common Venue':] = NYm_fixed.loc[:,'1st Most Common Venue':].fillna('No More Venues')

Now we have successfully wrangled with North York Borough data to extract the most common venues (if any).
If there aren't any venues, or enough venues, those features were filled with 'No More Venues'.

In [41]:
NYm_fixed.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills,43.745906,-79.352188,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
4,M6B,North York,Glencairn,43.709577,-79.445073,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store


Now, let's try to see if we can group these into clusters using their top 5 most common venues.

### Make the Training set. This will consist of the top 5 venues of each row.
### We will use KMeans unsupervised clustering to attempt to cluster.

In [42]:
DataToFit = NYm_fixed.loc[:,'1st Most Common Venue':'5th Most Common Venue']
print(DataToFit.shape)

# Position of Missing Values: DataToFit.iloc[[12],:]

(24, 5)


Train the Model using 6 (arbitrary) clusters.

In [43]:
NorthYork_grouped.head()

Unnamed: 0,Neighbourhood,Neighbourhood Longitude,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Store,Bike Shop,Bookstore,Boutique,Breakfast Spot,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Station,Bus Stop,Butcher,Café,Caribbean Restaurant,Chinese Restaurant,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Electronics Store,Event Space,Fast Food Restaurant,Financial or Legal Service,Food & Drink Shop,Food Court,Food Service,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gas Station,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gym,Hobby Shop,Hockey Arena,Hotel,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Liquor Store,Lounge,Luggage Store,Massage Studio,Mediterranean Restaurant,Metro Station,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Movie Theater,Park,Pet Store,Pharmacy,Pizza Place,Plaza,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Shoe Store,Shopping Mall,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",-79.442259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,-79.385975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",-79.41975,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.074074,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.037037,0.0,0.0,0.0,0.037037,0.0,0.111111,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.037037,0.0,0.074074,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0
3,Don Mills,-79.352188,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills,-79.340923,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.045455,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.045455,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.045455,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Drop the columns that we don't need for Kmeans. We're only interested in the scoring of each venue.

In [44]:
NorthYork_grouped_clustering = NorthYork_grouped.drop(columns = ['Neighbourhood', 'Neighbourhood Longitude'],axis=1)
NorthYork_grouped_clustering.shape

(23, 106)

Adding in a row of 0's to take place for the one missing neighborhood (and re-order dataframe appropriately) 'York Mills, Silver Hills'.

In [45]:
NorthYork_grouped_clustering = NorthYork_grouped_clustering.append(pd.Series(np.zeros(NorthYork_grouped_clustering.shape[1]), index=NorthYork_grouped_clustering.columns), ignore_index=True)
order = [1,2,3,4,5,6,7,8,9,10,11,-1,12,13,14,15,16,17,18,19,20,21,22]
NorthYork_grouped_clustering.iloc[order,:].reset_index(drop=True,inplace=True)
NorthYork_grouped_clustering.shape

(24, 106)

In [46]:
NorthYork_grouped_clustering.head()

Unnamed: 0,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Store,Bike Shop,Bookstore,Boutique,Breakfast Spot,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Station,Bus Stop,Butcher,Café,Caribbean Restaurant,Chinese Restaurant,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Electronics Store,Event Space,Fast Food Restaurant,Financial or Legal Service,Food & Drink Shop,Food Court,Food Service,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gas Station,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gym,Hobby Shop,Hockey Arena,Hotel,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Liquor Store,Lounge,Luggage Store,Massage Studio,Mediterranean Restaurant,Metro Station,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Movie Theater,Park,Pet Store,Pharmacy,Pizza Place,Plaza,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Restaurant,Salon / Barbershop,Sandwich Place,Shoe Store,Shopping Mall,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Vietnamese Restaurant,Women's Store
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.0,0.043478,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.043478,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.074074,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.037037,0.0,0.0,0.0,0.037037,0.0,0.111111,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.037037,0.0,0.074074,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.045455,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.045455,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.045455,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Run the Kmeans Algorithm to classify their groups.

In [47]:
# Set number of clusters
kclusters = 6
DataToFit = NorthYork_grouped_clustering
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(DataToFit)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 1, 1, 1, 3, 5, 1])

In [48]:
NYm_fixed.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills,43.745906,-79.352188,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
4,M6B,North York,Glencairn,43.709577,-79.445073,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store


Add clustering labels to the fixed data table for summarization.

In [49]:
# add clustering labels
NYm_fixed.insert(5, 'Cluster Labels', kmeans.labels_)
NYm_fixed.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
3,M3B,North York,Don Mills,43.745906,-79.352188,0,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
4,M6B,North York,Glencairn,43.709577,-79.445073,1,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store


Visualize clustering on the Map.

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NYm_fixed['Latitude'], NYm_fixed['Longitude'], NYm_fixed['Neighbourhood'], NYm_fixed['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's take a look at the data tables of the clusters.

In [51]:
Group0 = NYm_fixed[NYm_fixed['Cluster Labels']==0]
Group1 = NYm_fixed[NYm_fixed['Cluster Labels']==1]
Group2 = NYm_fixed[NYm_fixed['Cluster Labels']==2]
Group3 = NYm_fixed[NYm_fixed['Cluster Labels']==3]
Group4 = NYm_fixed[NYm_fixed['Cluster Labels']==4]
Group5 = NYm_fixed[NYm_fixed['Cluster Labels']==5]

In [52]:
Group0

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
3,M3B,North York,Don Mills,43.745906,-79.352188,0,Caribbean Restaurant,Café,Construction & Landscaping,Athletics & Sports,Gym
12,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,0,No More Venues,No More Venues,No More Venues,No More Venues,No More Venues


In [53]:
Group1

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Food & Drink Shop,Convenience Store,Women's Store,No More Venues
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,French Restaurant,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Clothing Store,Furniture / Home Store,Boutique,Accessories Store,Vietnamese Restaurant
4,M6B,North York,Glencairn,43.709577,-79.445073,1,Park,Sushi Restaurant,Japanese Restaurant,Pub,Women's Store
5,M3C,North York,Don Mills,43.7259,-79.340923,1,Gym,Restaurant,Beer Store,Coffee Shop,Bike Shop
6,M2H,North York,Hillcrest Village,43.803762,-79.363452,1,Mediterranean Restaurant,Golf Course,Fast Food Restaurant,Dog Run,Pool
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262,1,Caribbean Restaurant,Metro Station,Massage Studio,Coffee Shop,Bar
10,M2K,North York,Bayview Village,43.786947,-79.385975,1,Café,Bank,Japanese Restaurant,Chinese Restaurant,Women's Store
11,M3K,North York,Downsview,43.737473,-79.464763,1,Airport,Park,Bus Stop,Women's Store,No More Venues
14,M6L,North York,"North Park, Maple Leaf Park, Upwood Park",43.713756,-79.490074,1,Basketball Court,Trail,Park,Construction & Landscaping,Bakery


In [54]:
Group2

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
19,M9M,North York,"Humberlea, Emery",43.724766,-79.532242,2,Food Service,Baseball Field,Women's Store,No More Venues,No More Venues


In [55]:
Group3

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,3,Coffee Shop,Bank,Frozen Yogurt Shop,Sandwich Place,Ice Cream Shop
13,M3L,North York,Downsview,43.739015,-79.506944,3,Park,Grocery Store,Bank,Shopping Mall,Women's Store


In [56]:
Group4

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
17,M3M,North York,Downsview,43.728496,-79.495697,4,Food Truck,Baseball Field,Women's Store,No More Venues,No More Venues
22,M2P,North York,York Mills West,43.752758,-79.400049,4,Park,Convenience Store,Women's Store,No More Venues,No More Venues


In [57]:
Group5

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,5,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar


# Report:

As there aren't many distinguishing factors, seems like many neighbourhoods in North York appear to be similar (assuming there was enough data collected). However, there are many factors that I feel makes this analysis inconclusive. Many of these neighbourhoods don't have many Venues. Had there been many more, the analysis would make a lot more sense, and perhaps had been more interesting. There may be a better Borough to analyze in the future.

Group 1: Quite possible that the magnitudes of both groups were small or near 0 (compared to the rest).
Group 2: Inconlusive, seems similar to group 4. Shopping centers?
Group 3: They share a bank. Not sure otherwise.
Group 4: Inconclusive, seems similar to group 2. Shopping centers.
Group 5: Many food shops.

Rather than comparing what's in the top group (lack of data), the separation between these clusters are perhaps more related to the magnitudes of the top five fractions. Additional study is needed.