First, I load the packages I need.

In [93]:
import io
import requests
import pandas as pd
import numpy as np

Next, I use the pandas read_html function to scrape the table of Toronto neighbourhoods from the Wikipedia page. read_html returns a list of dataframes, one per table on the page, so I select the table I want with [0].

In [94]:
Toronto = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0,
                     )
Toronto[0]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Queen's Park


The scraped table contains 287 rows and 3 columns. Toronto[0] is already a pandas dataframe; I assign it to a new dataframe called df.

In [95]:
df = pd.DataFrame(Toronto[0])
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Queen's Park


I then delete all rows whose Borough is Not assigned, so that I only process rows that have an assigned borough.

In [96]:
df = df[df.Borough != 'Not assigned']
df

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


This reduces my dataframe from 287 rows to 210 rows, meaning that 77 rows had a Borough that was Not assigned.

I then combine all Neighbourhoods that share a Postcode into one row, separated by commas. To do that, I group the rows by Postcode with the .groupby method, use .apply with ','.join to concatenate the Neighbourhood names in each group, and save the result in a new dataframe called df1.

In [97]:
df1 = df.groupby('Postcode')['Neighbourhood'].apply(','.join).reset_index()
df1

Unnamed: 0,Postcode,Neighbourhood
0,M1B,"Rouge,Malvern"
1,M1C,"Highland Creek,Rouge Hill,Port Union"
2,M1E,"Guildwood,Morningside,West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae
5,M1J,Scarborough Village
6,M1K,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,"Clairlea,Golden Mile,Oakridge"
8,M1M,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,"Birch Cliff,Cliffside West"


The output above no longer contains the Borough column. I use the pandas merge function to join df1 with the Postcode and Borough columns of df on Postcode, and save the result in a new dataframe called df2. Because df still has one row per neighbourhood, the merge repeats each combined row once for every neighbourhood in that Postcode.

In [98]:
df2 = pd.merge(df1, df[["Postcode", "Borough"]], on="Postcode")
df2

Unnamed: 0,Postcode,Neighbourhood,Borough
0,M1B,"Rouge,Malvern",Scarborough
1,M1B,"Rouge,Malvern",Scarborough
2,M1C,"Highland Creek,Rouge Hill,Port Union",Scarborough
3,M1C,"Highland Creek,Rouge Hill,Port Union",Scarborough
4,M1C,"Highland Creek,Rouge Hill,Port Union",Scarborough
5,M1E,"Guildwood,Morningside,West Hill",Scarborough
6,M1E,"Guildwood,Morningside,West Hill",Scarborough
7,M1E,"Guildwood,Morningside,West Hill",Scarborough
8,M1G,Woburn,Scarborough
9,M1H,Cedarbrae,Scarborough


I then rearrange the order of my columns to start with Postcode, followed by Borough and thirdly Neighbourhood.

In [99]:
df2 = df2[['Postcode', 'Borough', 'Neighbourhood']]
df2

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1B,Scarborough,"Rouge,Malvern"
2,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
3,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
4,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
5,M1E,Scarborough,"Guildwood,Morningside,West Hill"
6,M1E,Scarborough,"Guildwood,Morningside,West Hill"
7,M1E,Scarborough,"Guildwood,Morningside,West Hill"
8,M1G,Scarborough,Woburn
9,M1H,Scarborough,Cedarbrae


I then drop all resulting duplicates based on the same Postcode and save the results in a new dataframe called df3.

In [100]:
df3 = df2.drop_duplicates(subset="Postcode")
df3

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
2,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
5,M1E,Scarborough,"Guildwood,Morningside,West Hill"
8,M1G,Scarborough,Woburn
9,M1H,Scarborough,Cedarbrae
10,M1J,Scarborough,Scarborough Village
11,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
14,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
17,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
20,M1N,Scarborough,"Birch Cliff,Cliffside West"
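
The merge and drop_duplicates steps above could possibly be avoided by grouping on both Postcode and Borough at once. The following is a minimal sketch on a toy dataframe (the name toy and its three rows are illustrative, copied from the tables above), and it assumes that each Postcode belongs to exactly one Borough.

```python
import pandas as pd

# Toy rows in the same shape as df (values taken from the tables above)
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Grouping on Postcode and Borough keeps Borough in the result,
# so no merge or drop_duplicates step is needed afterwards
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .apply(",".join)
       .reset_index()
)
print(combined)
```

If a Postcode ever appeared under two different Boroughs, this approach would produce two rows for it, so the assumption is worth checking on the real data first.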


The code below is to find out how many Neighbourhoods are Not assigned. The result is just one, with a Borough named Queen's Park.

In [101]:
print(df3.loc[df3['Neighbourhood'] == 'Not assigned'])

    Postcode       Borough Neighbourhood
158      M7A  Queen's Park  Not assigned


I then replace the Not assigned Neighbourhood with its Borough name by using the .loc indexer.

In [102]:
df3.loc[df3['Neighbourhood'] == 'Not assigned'] = "Queen's Park"
df3

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
2,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
5,M1E,Scarborough,"Guildwood,Morningside,West Hill"
8,M1G,Scarborough,Woburn
9,M1H,Scarborough,Cedarbrae
10,M1J,Scarborough,Scarborough Village
11,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
14,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
17,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
20,M1N,Scarborough,"Birch Cliff,Cliffside West"
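
The assignment in In [102] selects rows but names no column, so pandas writes the value into every column of the matching row. The sketch below, on a two-row toy dataframe (toy, whole_row and one_column are illustrative names), contrasts that with an assignment restricted to the Neighbourhood column. Working on an explicit .copy() is one common way to avoid the warning shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Queen's Park"],
})

# Without a column label, .loc overwrites every column of the matching rows
whole_row = toy.copy()
whole_row.loc[whole_row["Neighbourhood"] == "Not assigned"] = "Queen's Park"

# With a column label, only Neighbourhood changes and Postcode is preserved
one_column = toy.copy()
mask = one_column["Neighbourhood"] == "Not assigned"
one_column.loc[mask, "Neighbourhood"] = one_column.loc[mask, "Borough"]

print(whole_row)
print(one_column)
```

In the first result the Postcode M7A is replaced by the borough name; in the second it is kept.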


To check the number of rows and columns in my df3 dataframe, I use the .shape attribute, which shows that the dataframe has 103 rows and 3 columns.

In [103]:
df3.shape

(103, 3)

Next, I use the pandas read_csv function to read the csv file containing the Latitudes and Longitudes of the Toronto Neighbourhoods by Postcode, and I save the data in a pandas dataframe called Coords.

In [104]:
Coords = pd.read_csv("https://cocl.us/Geospatial_data") 
Coords

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


I then rename the Postal Code column in the Coords dataframe to Postcode, matching the df3 dataframe, so that the two dataframes can be merged on a shared column name.

In [105]:
Coords.rename(columns={'Postal Code':'Postcode'}, inplace=True)

print(Coords.columns)

Index(['Postcode', 'Latitude', 'Longitude'], dtype='object')


I then use the pandas merge function to combine the two dataframes (df3 and Coords) on matching Postcode values, and I save the result in a new dataframe called df4.

In [106]:
df4 = pd.merge(df3, Coords[["Postcode", "Latitude", "Longitude"]], on="Postcode")
df4

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


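The resulting dataframe after the merge has one row fewer than df3. The most likely cause is the assignment in In [102]: it does not name a column, so it overwrote every column of the M7A row, including Postcode, with "Queen's Park", and that row no longer matches any Postcode in Coords.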
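
One way to find out which rows fail to match in a merge is a left join with indicator=True, which adds a _merge column. This is a minimal sketch on toy data (left, right, checked and unmatched are illustrative names; the coordinates are copied from the Coords table above), not the merge actually used in this notebook.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "Queen's Park"],
    "Borough": ["Scarborough", "Queen's Park"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row of the left dataframe;
# indicator=True records whether each row found a partner
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

Rows marked left_only are the ones an inner merge would silently drop.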

## Exploring and clustering the neighbourhoods in Toronto

First I install and import the remaining dependencies that I have not loaded yet.

In [107]:
import sys
!{sys.executable} -m pip install geopy



In [108]:
import sys
!{sys.executable} -m pip install folium



In [109]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Checking the number of Boroughs and Neighbourhoods in the dataset.

In [110]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df4['Borough'].unique()),
        df4.shape[0]
    )
)

The dataframe has 11 boroughs and 102 neighborhoods.


Firstly, in order to define an instance of the geocoder, we need to define a user_agent. I will name my user agent toronto_explorer, as shown below.

In [111]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### I then create a map of Toronto with neighborhoods superimposed on top.

In [112]:
# First I create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# then I add markers to map
for lat, lng, Borough, Neighbourhood in zip(df4['Latitude'], df4['Longitude'], df4['Borough'], df4['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The map shows the neighbourhoods of Toronto superimposed on the map of Toronto.

I then slice the original dataframe (df4) to keep only the neighbourhoods in the North York Borough, and I save the data in a new dataframe called northyork_data.

In [113]:
northyork_data = df4[df4['Borough'] == 'North York'].reset_index(drop=True)
northyork_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview,Henry Farm,Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills,York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493


Then I get the coordinates of North York.

In [114]:
address = 'North York, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7708175, -79.4132998.


I then visualise North York with the neighbourhoods in it, as shown below.

In [115]:
# create map of North York using latitude and longitude values
map_northyork = folium.Map(location=[latitude, longitude], zoom_start=11)

# I then add markers to map
for lat, lng, label in zip(northyork_data['Latitude'], northyork_data['Longitude'], northyork_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyork)  
    
map_northyork

#### Defining Foursquare Credentials and Version

In [116]:
CLIENT_ID = 'DK0DZVYAYUF2IWJNSSGAFNWQNELVW13HV2OJOMBFB04RXRIL' # your Foursquare ID
CLIENT_SECRET = '0DOANYV3WCUVAFTLL5K5WYC2CG0F2L4MQMKBRDW2V3EWTDRM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DK0DZVYAYUF2IWJNSSGAFNWQNELVW13HV2OJOMBFB04RXRIL
CLIENT_SECRET:0DOANYV3WCUVAFTLL5K5WYC2CG0F2L4MQMKBRDW2V3EWTDRM


## 1. Exploring Neighborhoods in North York

In [117]:
radius = 500
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
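        # create the API request URL
        # NOTE: the string below contains no {} placeholders, so .format() leaves it unchanged;
        # every request therefore uses the hardcoded ll, radius and limit values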
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id=DK0DZVYAYUF2IWJNSSGAFNWQNELVW13HV2OJOMBFB04RXRIL&client_secret=0DOANYV3WCUVAFTLL5K5WYC2CG0F2L4MQMKBRDW2V3EWTDRM&v=20180605&ll=43.7708175,-79.4132998&radius=500&limit=30'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
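
Because the URL string in getNearbyVenues has no {} placeholders, the lat, lng, radius and LIMIT arguments never reach the request. The sketch below shows one way the URL could be built per neighbourhood instead. The helper name build_explore_url and the dummy "ID" and "SECRET" values are hypothetical; the endpoint and parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build one explore request URL for a single coordinate pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)

# Two neighbourhoods from northyork_data now produce two different URLs
url_a = build_explore_url("ID", "SECRET", "20180605", 43.803762, -79.363452, 500, 100)
url_b = build_explore_url("ID", "SECRET", "20180605", 43.786947, -79.385975, 500, 100)
print(url_a)
print(url_b)
```

An alternative with the same effect is to pass the dictionary directly, as in requests.get(EXPLORE_URL, params=params), and let requests do the encoding.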

The code below runs the above function on each neighbourhood and creates a new dataframe called northyork_venues.

In [118]:
northyork_venues = getNearbyVenues(names=northyork_data['Neighbourhood'],
                                   latitudes=northyork_data['Latitude'],
                                   longitudes=northyork_data['Longitude']
                                  )



Hillcrest Village
Fairview,Henry Farm,Oriole
Bayview Village
Silver Hills,York Mills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Bathurst Manor,Downsview North,Wilson Heights
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Bedford Park,Lawrence Manor East
Lawrence Heights,Lawrence Manor
Glencairn
Downsview,North Park,Upwood Park
Humber Summit
Emery,Humberlea


##### I then check the size of the resulting dataframe.

In [119]:
print(northyork_venues.shape)
northyork_venues.head()

(720, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,The Captain's Boil,43.773255,-79.413805,Seafood Restaurant
1,Hillcrest Village,43.803762,-79.363452,Aroma Espresso Bar,43.769449,-79.413081,Café
2,Hillcrest Village,43.803762,-79.363452,Loblaws,43.768722,-79.412101,Grocery Store
3,Hillcrest Village,43.803762,-79.363452,Cineplex Cinemas Empress Walk,43.768625,-79.412613,Movie Theater
4,Hillcrest Village,43.803762,-79.363452,Konjiki Ramen,43.766998,-79.412222,Ramen Restaurant
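
A quick sanity check on these results is to compare each venue's distance from its neighbourhood with the 500 m search radius. The sketch below uses the haversine formula on the first row of the table above; the function name haversine_m is illustrative, and a spherical Earth of radius 6,371 km is assumed.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, assuming a spherical Earth of radius 6,371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

# Hillcrest Village versus The Captain's Boil (first row of northyork_venues)
distance = haversine_m(43.803762, -79.363452, 43.773255, -79.413805)
print("Within the 500 m radius:", distance <= 500)
```

For this row the venue lies several kilometres from the neighbourhood, far outside the requested radius, which suggests the request did not use the neighbourhood's own coordinates.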


The code below checks how many venues were returned for each neighbourhood.

In [120]:
northyork_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor,Downsview North,Wilson Heights",30,30,30,30,30,30
Bayview Village,30,30,30,30,30,30
"Bedford Park,Lawrence Manor East",30,30,30,30,30,30
"CFB Toronto,Downsview East",30,30,30,30,30,30
Don Mills North,30,30,30,30,30,30
Downsview Central,30,30,30,30,30,30
Downsview Northwest,30,30,30,30,30,30
Downsview West,30,30,30,30,30,30
"Downsview,North Park,Upwood Park",30,30,30,30,30,30
"Emery,Humberlea",30,30,30,30,30,30


I want to find out how many unique categories can be curated from all the returned venues.

In [121]:
print('There are {} unique categories.'.format(len(northyork_venues['Venue Category'].unique())))

There are 23 unique categories.


## 2. Analyzing Each Neighbourhood

In [122]:
# one hot encoding
northyork_onehot = pd.get_dummies(northyork_venues[['Venue Category']], prefix="", prefix_sep="")

# I add neighbourhood column back to dataframe
northyork_onehot['Neighbourhood'] = northyork_venues['Neighbourhood'] 

# then I move neighbourhood column to the first column
fixed_columns = [northyork_onehot.columns[-1]] + list(northyork_onehot.columns[:-1])
northyork_onehot = northyork_onehot[fixed_columns]

northyork_onehot.head()

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Burrito Place,Café,Coffee Shop,Fast Food Restaurant,Grocery Store,Indonesian Restaurant,Japanese Restaurant,Juice Bar,Korean Restaurant,Movie Theater,Pet Store,Plaza,Ramen Restaurant,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Sports Bar,Steakhouse,Supermarket,Sushi Restaurant,Vietnamese Restaurant
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,Hillcrest Village,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [123]:
# The resulting dataframe
northyork_onehot.shape

(720, 24)

### I then group rows by neighbourhood, taking the mean of the frequency of occurrence of each category.

In [124]:
northyork_grouped = northyork_onehot.groupby('Neighbourhood').mean().reset_index()
northyork_grouped

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Burrito Place,Café,Coffee Shop,Fast Food Restaurant,Grocery Store,Indonesian Restaurant,Japanese Restaurant,Juice Bar,Korean Restaurant,Movie Theater,Pet Store,Plaza,Ramen Restaurant,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Sports Bar,Steakhouse,Supermarket,Sushi Restaurant,Vietnamese Restaurant
0,"Bathurst Manor,Downsview North,Wilson Heights",0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
1,Bayview Village,0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
2,"Bedford Park,Lawrence Manor East",0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
3,"CFB Toronto,Downsview East",0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
4,Don Mills North,0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
5,Downsview Central,0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
6,Downsview Northwest,0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
7,Downsview West,0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
8,"Downsview,North Park,Upwood Park",0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333
9,"Emery,Humberlea",0.033333,0.033333,0.066667,0.1,0.033333,0.033333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333


In [125]:
# The new size is:
northyork_grouped.shape

(24, 24)
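
To illustrate what the grouped means represent: the mean of a one-hot column within a neighbourhood is the share of that neighbourhood's venues in that category. This is a minimal sketch on a toy dataframe (the neighbourhood labels A and B are made up; the category names are taken from the table above).

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Pet Store", "Plaza"],
})

# One-hot encode the category, then put the neighbourhood back as the first column
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighbourhood", toy["Neighbourhood"])

# The mean of each 0/1 column is the fraction of venues in that category
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

Each row of the result sums to 1 across the category columns, which is also true of every row of northyork_grouped.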

#### Below I print each neighbourhood along with the top 3 most common venues.

In [126]:
num_top_venues = 3

for hood in northyork_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = northyork_grouped[northyork_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor,Downsview North,Wilson Heights----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Bayview Village----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Bedford Park,Lawrence Manor East----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----CFB Toronto,Downsview East----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Don Mills North----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Downsview Central----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Downsview Northwest----
              venue  freq
0  Ramen Restaurant  0.13
1       Coffee Shop  0.10
2              Café  0.07


----Downsview West----
   

The following function sorts a neighbourhood's venue categories in descending order of frequency and returns the names of the top num_top_venues categories.

In [127]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

I then create the new dataframe and display the top 7 venues for each neighbourhood.

In [128]:
num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# I create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# I create a new dataframe
ny_neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
ny_neighbourhoods_venues_sorted['Neighbourhood'] = northyork_grouped['Neighbourhood']

for ind in np.arange(northyork_grouped.shape[0]):
        ny_neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northyork_grouped.iloc[ind, :], num_top_venues)

ny_neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"Bathurst Manor,Downsview North,Wilson Heights",Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
1,Bayview Village,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
2,"Bedford Park,Lawrence Manor East",Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
3,"CFB Toronto,Downsview East",Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
4,Don Mills North,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
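
The column names above are built with a try/except around a three-item suffix list. As a possible alternative, a small helper can compute the suffix directly; the name ordinal is hypothetical, and the sketch also covers 11th to 13th, which the list-based approach does not need for seven columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 7
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```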


## 3. Clustering Neighbourhoods

I run k-means to group the neighbourhoods into 4 clusters.

In [129]:
# I set number of clusters
kclusters = 4

northyork_grouped_clustering = northyork_grouped.drop('Neighbourhood', axis=1)

# I run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northyork_grouped_clustering)

# I then check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
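
When every label comes back the same, it can help to count how many distinct feature rows k-means was given. The following is a sketch on a toy feature table (features and n_distinct are illustrative names; the two columns and their values are copied from northyork_grouped above).

```python
import pandas as pd

# Three neighbourhoods with identical category frequencies
features = pd.DataFrame({
    "Coffee Shop": [0.1, 0.1, 0.1],
    "Ramen Restaurant": [0.133333, 0.133333, 0.133333],
})
kclusters = 4

# Identical rows collapse to one point, so they cannot be split into several clusters
n_distinct = len(features.drop_duplicates())
if n_distinct < kclusters:
    print("Only {} distinct feature row(s) for {} requested clusters.".format(n_distinct, kclusters))
```

Running the same check on northyork_grouped_clustering before fitting would show whether the data can support 4 clusters at all.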

I create a new dataframe that includes the cluster as well as the top 7 venues for each neighbourhood.

In [130]:
# add clustering labels
ny_neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

northyork_merged = northyork_data

# I join the cluster labels and top venues onto northyork_merged (a copy of northyork_data, which holds latitude/longitude)
northyork_merged = northyork_merged.join(ny_neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

northyork_merged.head() # to check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
1,M2J,North York,"Fairview,Henry Farm,Oriole",43.778517,-79.346556,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
2,M2K,North York,Bayview Village,43.786947,-79.385975,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
3,M2L,North York,"Silver Hills,York Mills",43.75749,-79.374714,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place


### Then we visualize the resulting clusters

In [131]:
# First I create map
map_northyork_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# I add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(northyork_merged['Latitude'], northyork_merged['Longitude'], northyork_merged['Neighbourhood'], northyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_northyork_clusters)
       
map_northyork_clusters

## 4. Examining Clusters

Now I examine each cluster and determine the discriminating venue categories that distinguish each cluster.

### Cluster 1

In [132]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 0, northyork_merged.columns[[1] + list(range(2, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,North York,Hillcrest Village,43.803762,-79.363452,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
1,North York,"Fairview,Henry Farm,Oriole",43.778517,-79.346556,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
2,North York,Bayview Village,43.786947,-79.385975,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
3,North York,"Silver Hills,York Mills",43.75749,-79.374714,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
4,North York,"Newtonbrook,Willowdale",43.789053,-79.408493,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
5,North York,Willowdale South,43.77012,-79.408493,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
6,North York,York Mills West,43.752758,-79.400049,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
7,North York,Willowdale West,43.782736,-79.442259,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
8,North York,Parkwoods,43.753259,-79.329656,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place
9,North York,Don Mills North,43.745906,-79.352188,0,Ramen Restaurant,Coffee Shop,Café,Korean Restaurant,Vietnamese Restaurant,Movie Theater,Burrito Place


### Cluster 2

In [133]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 1, northyork_merged.columns[[1] + list(range(2, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue


### Cluster 3

In [134]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 2, northyork_merged.columns[[1] + list(range(2, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue


### Cluster 4

In [135]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 3, northyork_merged.columns[[1] + list(range(2, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue


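The result shows that all neighbourhoods have fallen into the same cluster, Cluster 1 (label 0). This is most likely an artefact of the data collection and not evidence that the neighbourhoods are alike: the request URL in getNearbyVenues is a fixed string, so every neighbourhood was queried at the same coordinates (43.7708175, -79.4132998) and received the same 30 venues, which leaves k-means with identical feature rows to cluster.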