### Part A: Explore and Cluster the Neighbourhoods in Toronto

Importing some of the important libraries for the assignment.

In [14]:
import pandas as pd
import numpy as np
import urllib.request

The Wikipedia page to be scraped is assigned to the variable url.

In [15]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

The page at url is opened and the response is assigned to the variable wpage.

In [16]:
wpage = urllib.request.urlopen(url)

The BeautifulSoup class is imported to parse the HTML returned from the website.

In [17]:
from bs4 import BeautifulSoup

Parse the HTML in the wpage variable and store the resulting BeautifulSoup object in wscrp.

In [18]:
wscrp = BeautifulSoup(wpage, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

The required table is scraped from the page using the find method.

In [19]:
wtable = wscrp.find('table', class_='wikitable sortable')

Then each row is extracted using the 'tr' tag.

In [20]:
wtable_rows = wtable.find_all('tr')

Then the cells of each row are read and appended to the lists A, B and C.

In [21]:
A = []
B = []
C = []
for row in wtable_rows:
    dat = row.find_all('td')
    if len(dat)==3:
        A.append(dat[0].find(text=True))
        B.append(dat[1].find(text=True))
        C.append(dat[2].find(text=True))

The lists A, B and C are assembled into a data frame with named columns.

In [22]:
import pandas as pd
df=pd.DataFrame(A,columns=['PostalCode'])
df['Borough']=B
df['Neighborhood']=C
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


The trailing \n is removed from each entry.

In [23]:
df["PostalCode"] = df["PostalCode"].str.replace("\n","")
df["Borough"] = df["Borough"].str.replace("\n","")
df["Neighborhood"] = df["Neighborhood"].str.replace("\n","")
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
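
The scraping and newline clean-up steps above can also be combined. The following is a minimal sketch, assuming a small inline HTML string that stands in for the Wikipedia table (the real page is not fetched here). It uses `get_text(strip=True)`, which returns the cell text without surrounding whitespace, so no separate `str.replace` pass is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page is not fetched.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has <th> cells only and is skipped
        # get_text(strip=True) drops the trailing newline of each cell
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```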


The rows with "Not assigned" in the column Borough are dropped.

In [24]:
# filter and reset the index in one step, so that no in-place change is made on a slice of df
df_A = df[~df.Borough.str.contains("Not assigned")].reset_index(drop=True)
df_A.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


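The data are sorted on postal code so that rows sharing a postal code sit together and can be merged into one row. Note that sort_values returns a new data frame; because its result is not assigned back to df_A here, df_A keeps its original order, as the output below shows.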

In [25]:
df_A.sort_values(by=['PostalCode'])
df_A.reset_index(drop=True, inplace=True)
df_A.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
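
A minimal sketch of how `sort_values` behaves, using a small made-up data frame rather than the scraped table: calling it without assigning the result leaves the original untouched, while assigning the result gives the sorted frame.

```python
import pandas as pd

# Toy data, not the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M1B", "M4A"],
    "Borough": ["North York", "Scarborough", "North York"],
})

# The sorted copy is discarded, so toy keeps its original order
toy.sort_values(by="PostalCode")
unchanged = toy["PostalCode"].tolist()

# Assigning the result (and resetting the index) keeps the sorted order
toy_sorted = toy.sort_values(by="PostalCode").reset_index(drop=True)

print(unchanged)
print(toy_sorted["PostalCode"].tolist())
```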


The data are checked for any "Not assigned" in the column 'Neighborhood'.

In [26]:
print (df_A[df_A['Neighborhood'].str.contains('Not assigned')])

Empty DataFrame
Columns: [PostalCode, Borough, Neighborhood]
Index: []


There is no row with "Not assigned" in the column 'Neighborhood', so no further cleaning is needed.
Finally, the dimensions of the data set are checked.

In [27]:
df_A.shape

(103, 3)
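
In the table scraped here each postal code already appears once, with its neighborhoods comma-separated. If a version of the table listed one neighborhood per row, the rows could be merged per postal code along the following lines. This is a sketch on made-up rows, not part of the original workflow.

```python
import pandas as pd

# Toy data with one neighborhood per row (hypothetical layout)
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by postal code and borough, joining the neighborhood names with commas
merged = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(merged)
```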

### Part B: Explore and Cluster the Neighbourhoods in Toronto

In [28]:
import os
curr_dir = os.getcwd()
curr_dir

'C:\\Users\\Krishno'

After checking the current working directory, the CSV file containing the latitudes and longitudes is saved in that directory and then read with pd.read_csv.

In [29]:
import pandas as pd
file = "C:\\Users\\Krishno\\Geospatial_Coordinates.csv"
df_B = pd.read_csv(file)
df_B.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The column "Postal Code" is renamed so that it has the same name as in the first data frame. This allows the two data frames to be merged on that column.

In [30]:
df_B.rename(columns = {"Postal Code":"PostalCode"}, inplace = True)
df_B

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Both data frames are then merged so that all the information is in a single data frame. The common merge column is PostalCode.

In [31]:
df_C = pd.merge(df_A, df_B, how = 'inner', on = 'PostalCode')
df_C.head(102)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
97,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.382280
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
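
An inner merge silently drops postal codes that have no coordinates. One way to check for such rows, sketched here on toy data (the code M9Z and its borough are made up for illustration), is a left merge with `indicator=True`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],   # M9Z is a hypothetical unmatched code
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Keep every row of the left frame and record whether a match was found
check = pd.merge(left, right, how="left", on="PostalCode", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```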


### Part C: Explore and cluster the neighborhoods in Toronto

Selecting the boroughs whose names contain "Toronto".

In [32]:
df_T = df_C[df_C['Borough'].str.contains("Toronto")].reset_index(drop=True)
df_T.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [33]:
df_T.shape

(39, 5)

Preprocessing the data

In [34]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming JSON into a pandas dataframe (top-level import, available in pandas 1.0 and later)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Folium installed
Libraries imported.


In [39]:
CLIENT_ID = 'GXXD1GVTUKJS0JBDMSG3WRXN2PNCEV34SNEQIJSI4HXEXMO4' # your Foursquare ID
CLIENT_SECRET = '40NUW1HKHJLRBGD5WOBR4SZRS3FBI4DB340YEGP1ELSP5QJ4' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: GXXD1GVTUKJS0JBDMSG3WRXN2PNCEV34SNEQIJSI4HXEXMO4
CLIENT_SECRET:40NUW1HKHJLRBGD5WOBR4SZRS3FBI4DB340YEGP1ELSP5QJ4
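
Printing credentials in full stores them in the notebook output. A possible alternative, sketched below, reads them from environment variables and prints only a masked form. The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper `mask` are assumptions for illustration, not part of the Foursquare API.

```python
import os

# Hypothetical environment variable names; set them outside the notebook
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")


def mask(secret, visible=4):
    """Return the secret with all but the first `visible` characters replaced by '*'."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


print("CLIENT_ID:", mask(CLIENT_ID))
print("CLIENT_SECRET:", mask(CLIENT_SECRET))
```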


In [41]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [154]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)
map_toronto

In [43]:
for lat, lng, borough, neighborhood in zip(
        df_T['Latitude'], 
        df_T['Longitude'], 
        df_T['Borough'], 
        df_T['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
 
map_toronto

#### Explore the first neighborhood in our data frame "df_T"

In [55]:
neigh_name = df_T.loc[0, 'Neighborhood']
print(F"The first neighborhood is {neigh_name}")

The first neighborhood is Regent Park, Harbourfront


Getting Neighborhood's latitude and longitude values.

In [61]:
neigh_lat = df_T.loc[0, 'Latitude'] # neighborhood latitude value
neigh_lon = df_T.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and Longitude values of {} are {}, {}.'.format(neigh_name, 
                                                               neigh_lat, 
                                                               neigh_lon))

Latitude and Longitude values of Regent Park, Harbourfront are 43.6542599, -79.3606359.


Getting the top 100 venues that are in 'Regent Park, Harbourfront' within a radius of 900 meters.

In [65]:
LIM = 100 # limit of number of venues returned by Foursquare API
RAD = 900 # defining radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, neigh_lat, neigh_lon, RAD, LIM)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=GXXD1GVTUKJS0JBDMSG3WRXN2PNCEV34SNEQIJSI4HXEXMO4&client_secret=40NUW1HKHJLRBGD5WOBR4SZRS3FBI4DB340YEGP1ELSP5QJ4&v=20180604&ll=43.6542599,-79.3606359&radius=900&limit=100'
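
The query string above is assembled by hand with `format`. A sketch of the same request URL built from a dictionary with the standard library's `urlencode` follows; the function name `build_explore_url` and the placeholder credentials are made up for illustration.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Build a venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes reserved characters, e.g. the comma in "ll"
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


example_url = build_explore_url("ID", "SECRET", "20180604",
                                43.6542599, -79.3606359, 900, 100)
print(example_url)
```

Note that the comma between latitude and longitude appears as %2C in the result; this is ordinary percent-encoding, and the value decodes back to the same text.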

In [67]:
# send the GET request and parse the response body as JSON
results = requests.get(url).json()
'There are {} venues around Regent Park, Harbourfront neighborhood.'.format(len(results['response']['groups'][0]['items']))

'There are 100 venues around Regent Park, Harbourfront neighborhood.'

Getting the relevant part of the JSON

In [68]:
venues = results['response']['groups'][0]['items']
venues[0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '54ea41ad498e9a11e9e13308',
  'name': 'Roselle Desserts',
  'location': {'address': '362 King St E',
   'crossStreet': 'Trinity St',
   'lat': 43.653446723052674,
   'lng': -79.3620167174383,
   'labeledLatLngs': [{'label': 'display',
     'lat': 43.653446723052674,
     'lng': -79.3620167174383}],
   'distance': 143,
   'postalCode': 'M5A 1K9',
   'cc': 'CA',
   'city': 'Toronto',
   'state': 'ON',
   'country': 'Canada',
   'formattedAddress': ['362 King St E (Trinity St)',
    'Toronto ON M5A 1K9',
    'Canada']},
  'categories': [{'id': '4bf58dd8d48988d16a941735',
    'name': 'Bakery',
    'pluralName': 'Bakeries',
    'shortName': 'Bakery',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
     'suffix': '.png'},
    'primary': True}],
  'photos': {'count': 0, 'groups': []}},
 'referralId': 'e-0

Processing JSON and converting it to clean dataframe

In [76]:
dataframe = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.distance']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# function that extracts the category of the venue
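def get_category_type(row):
    # the list of categories sits under 'categories' or 'venue.categories',
    # depending on whether the columns have already been renamed
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']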

# filter the category for each row
dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean columns
dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

dataframe_filtered.head(12)


Unnamed: 0,name,categories,lat,lng,distance
0,Roselle Desserts,Bakery,43.653447,-79.362017,143
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809,122
2,Impact Kitchen,Restaurant,43.656369,-79.35698,376
3,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008,239
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149,54
5,Corktown Common,Park,43.655618,-79.356211,387
6,The Distillery Historic District,Historic Site,43.650244,-79.359323,459
7,Distillery Sunday Market,Farmers Market,43.650075,-79.361832,475
8,SOMA chocolatemaker,Chocolate Shop,43.650622,-79.358127,452
9,Souk Tabule,Mediterranean Restaurant,43.653756,-79.35439,506


#### Exploring other neighborhood areas in Toronto City
Exploring Downtown Toronto, East Toronto, North Toronto and Central Toronto.
Creating a function to repeat the same process for all the neighborhoods in Toronto.

In [127]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    dataframe_filtered = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    dataframe_filtered.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return dataframe_filtered

In [129]:
near_locn = getNearbyVenues(names=df_T['Neighborhood'],
                                   latitudes=df_T['Latitude'],
                                   longitudes=df_T['Longitude']
                                  )

In [130]:
near_locn.head(100)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.654260,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.654260,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.654260,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.654260,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.654260,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
...,...,...,...,...,...,...,...
95,"Garden District, Ryerson",43.657162,-79.378937,Hokkaido Ramen Santouka らーめん山頭火,43.656435,-79.377586,Ramen Restaurant
96,"Garden District, Ryerson",43.657162,-79.378937,306 Yonge Street - Jordan Store,43.656495,-79.381015,Sporting Goods Shop
97,"Garden District, Ryerson",43.657162,-79.378937,Solei Tanning Salon,43.654734,-79.380248,Tanning Salon
98,"Garden District, Ryerson",43.657162,-79.378937,Five Guys,43.657117,-79.380853,Burger Joint
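
getNearbyVenues reads `categories[0]` directly, which would raise an IndexError for a venue whose category list is empty. A sketch of the row-building step with that case guarded is given below. The helper name `flatten_items` and the sample items are made up for illustration, and no API call is made.

```python
def flatten_items(name, lat, lng, items):
    """Turn a list of Foursquare-style items into flat tuples (hypothetical helper)."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when the venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


sample_items = [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Hypothetical uncategorised venue",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]

rows = flatten_items("Regent Park, Harbourfront", 43.654260, -79.360636, sample_items)
print(rows)
```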


Number of venues returned for each neighborhood

In [136]:
near_locn.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,56,56,56,56,56,56
"Brockton, Parkdale Village, Exhibition Place",24,24,24,24,24,24
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",18,18,18,18,18,18
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",18,18,18,18,18,18
Central Bay Street,64,64,64,64,64,64
Christie,17,17,17,17,17,17
Church and Wellesley,80,80,80,80,80,80
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,8,8,8,8,8,8


Number of unique categories that can be curated from all the returned venues

In [137]:
print('There are {} unique categories.'.format(len(near_locn['Venue Category'].unique())))

There are 238 unique categories.


Analyze Each Neighborhood

In [138]:
# one hot encoding
near_locn_onehot = pd.get_dummies(near_locn[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
near_locn_onehot['Neighborhood'] = near_locn['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [near_locn_onehot.columns[-1]] + list(near_locn_onehot.columns[:-1])
near_locn_onehot = near_locn_onehot[fixed_columns]

near_locn_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category

In [139]:
near_locn_grouped = near_locn_onehot.groupby('Neighborhood').mean().reset_index()
near_locn_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0,0.0


Checking the 10 most common venues in each neighborhood.

In [141]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = near_locn_grouped['Neighborhood']

for ind in np.arange(near_locn_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(near_locn_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Beer Bar,Bakery,Café,Restaurant,Eastern European Restaurant,Department Store
1,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Breakfast Spot,Coffee Shop,Gym,Grocery Store,Pet Store,Performing Arts Venue,Nightclub,Italian Restaurant
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Yoga Studio,Auto Workshop,Gym / Fitness Center,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Comic Shop,Pizza Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Bar,Harbor / Marina,Coffee Shop,Boat or Ferry,Rental Car Location,Boutique,Plane
4,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Burger Joint,Japanese Restaurant,Department Store,Salad Place,Bubble Tea Shop,Yoga Studio
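
The column labels and the ranking can also be written as two small helpers. This is a sketch with hypothetical names (`ordinal`, `top_categories`) and a made-up frequency row; unlike the three-item suffix list above, `ordinal` also handles numbers such as 11 and 21.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def top_categories(freq_row, n):
    """Return the n category names with the highest frequency in the row."""
    return freq_row.sort_values(ascending=False).index[:n].tolist()


# Made-up frequencies for a single neighborhood
row = pd.Series({"Bakery": 0.2, "Coffee Shop": 0.5, "Park": 0.3})

labels = [f"{ordinal(i)} Most Common Venue" for i in range(1, 4)]
print(labels)
print(top_categories(row, 2))
```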


#### Cluster neighborhoods
Run k-means to cluster the neighborhood into 5 clusters.

In [143]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [144]:
# set number of clusters
kclusters = 5

near_locn_grouped_clustering = near_locn_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(near_locn_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
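
As a small illustration of what k-means does with frequency vectors, the sketch below clusters four made-up rows into two groups. The data and the choice of two clusters are assumptions for the example only; `n_init` is passed explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies: rows 0-1 are dominated by the first category,
# rows 2-3 by the third
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans_toy.labels_

# The label numbers themselves are arbitrary; only the grouping matters
print(labels)
```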

Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [145]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

near_locn_merged = df_T

# join the cluster labels and top venues onto df_T, which holds latitude/longitude for each neighborhood
near_locn_merged = near_locn_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

near_locn_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Park,Pub,Bakery,Café,Breakfast Spot,Theater,Restaurant,Ice Cream Shop,Spa
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Sushi Restaurant,Gym,Discount Store,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Wings Joint,Fried Chicken Joint
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Bubble Tea Shop,Café,Middle Eastern Restaurant,Japanese Restaurant,Italian Restaurant,Cosmetics Shop,Tea Room,Ramen Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Gym,Italian Restaurant,Restaurant,Clothing Store,Cosmetics Shop
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Pub,Health Food Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Women's Store


Visualize the resulting clusters

In [146]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(
        near_locn_merged['Latitude'], 
        near_locn_merged['Longitude'], 
        near_locn_merged['Neighborhood'], 
        near_locn_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
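
In the cell above, `rainbow[cluster-1]` still gives each cluster its own colour, because index -1 for label 0 wraps around to the last entry. A simpler variant, sketched here under the assumption that every label is an integer from 0 to kclusters - 1, builds one hex colour per cluster and indexes it by the label directly. The name `colour_for` is made up for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced colour from the rainbow colormap per cluster
palette = [colors.to_hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]


def colour_for(cluster):
    """Return the hex colour for an integer cluster label (hypothetical helper)."""
    return palette[int(cluster)]


print(palette)
```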

#### Examining Clusters
Examining each cluster to determine the venue categories that distinguish it.

#### Cluster 1

In [148]:
near_locn_merged.loc[near_locn_merged['Cluster Labels'] == 0, near_locn_merged.columns[[1] + list(range(5, near_locn_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East Toronto,0,Trail,Pub,Health Food Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Women's Store


#### Cluster 2

In [150]:
near_locn_merged.loc[near_locn_merged['Cluster Labels'] == 1, near_locn_merged.columns[[1] + list(range(5, near_locn_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,1,Park,Swim School,Bus Line,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
21,Central Toronto,1,Park,Jewelry Store,Trail,Sushi Restaurant,Bus Line,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


#### Cluster 3

In [151]:
near_locn_merged.loc[near_locn_merged['Cluster Labels'] == 2, near_locn_merged.columns[[1] + list(range(5, near_locn_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,2,Coffee Shop,Park,Pub,Bakery,Café,Breakfast Spot,Theater,Restaurant,Ice Cream Shop,Spa
1,Downtown Toronto,2,Coffee Shop,Sushi Restaurant,Gym,Discount Store,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Wings Joint,Fried Chicken Joint
2,Downtown Toronto,2,Clothing Store,Coffee Shop,Bubble Tea Shop,Café,Middle Eastern Restaurant,Japanese Restaurant,Italian Restaurant,Cosmetics Shop,Tea Room,Ramen Restaurant
3,Downtown Toronto,2,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Gym,Italian Restaurant,Restaurant,Clothing Store,Cosmetics Shop
5,Downtown Toronto,2,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Beer Bar,Bakery,Café,Restaurant,Eastern European Restaurant,Department Store
6,Downtown Toronto,2,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Burger Joint,Japanese Restaurant,Department Store,Salad Place,Bubble Tea Shop,Yoga Studio
7,Downtown Toronto,2,Grocery Store,Café,Park,Restaurant,Diner,Baby Store,Candy Store,Nightclub,Coffee Shop,Athletics & Sports
8,Downtown Toronto,2,Coffee Shop,Café,Restaurant,Thai Restaurant,Hotel,Deli / Bodega,Clothing Store,Gym,Bookstore,Bakery
9,West Toronto,2,Bakery,Pharmacy,Bank,Bar,Middle Eastern Restaurant,Café,Supermarket,Pizza Place,Park,Pet Store
10,Downtown Toronto,2,Coffee Shop,Aquarium,Hotel,Café,Sporting Goods Shop,Restaurant,Brewery,Scenic Lookout,Italian Restaurant,Fried Chicken Joint


#### Cluster 4

In [152]:
near_locn_merged.loc[near_locn_merged['Cluster Labels'] == 3, near_locn_merged.columns[[1] + list(range(5, near_locn_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,3,Garden,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


#### Cluster 5

In [153]:
near_locn_merged.loc[near_locn_merged['Cluster Labels'] == 4, near_locn_merged.columns[[1] + list(range(5, near_locn_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,4,Lawyer,Restaurant,Trail,Park,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Deli / Bodega
33,Downtown Toronto,4,Park,Playground,Trail,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
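
Rather than filtering one cluster at a time, the clusters can be summarised in a single step. The sketch below uses made-up labels and venues, not the results above, to count the neighborhoods per cluster and pick the most frequent first-ranked venue in each.

```python
import pandas as pd

# Toy stand-in for near_locn_merged
toy = pd.DataFrame({
    "Cluster Labels": [2, 2, 0, 1, 1, 2],
    "1st Most Common Venue": ["Coffee Shop", "Café", "Trail", "Park", "Park", "Coffee Shop"],
})

# Number of neighborhoods in each cluster
sizes = toy["Cluster Labels"].value_counts().sort_index()

# Most frequent top-ranked venue category within each cluster
top_first = toy.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.mode().iloc[0]
)

print(sizes)
print(top_first)
```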
