<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

# Part One

In this section, I will **scrape a website for data** to create a dataframe of boroughs and neighborhoods of Toronto.

Start by importing the needed libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Use `requests` to download the page, then use BeautifulSoup to parse it so the table can be extracted.

In [2]:
url_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(url_text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"ab5f5a06-6529-4c7d-849a-efc84880765f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":967921175,"wgRevisionId":967921175,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toron

Now extract the table from the parsed page.

In [3]:
source_table = soup.find('table', {'class': 'wikitable sortable'})
source_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

Create a blank dataframe for our data.

In [4]:
# Initialize the dataframe
toronto_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood


Populate the blank dataframe, one line at a time, with data from the source_table parsed from site.

In [5]:
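# Extract the three cells of each table row and collect them as records.
rows = []
for tag in source_table('tr'):
    tagtext = [td.get_text().strip() for td in tag.find_all('td')]
    # The header row holds <th> cells only, so it yields no <td> text; skip it.
    if len(tagtext) >= 3:
        rows.append({'PostalCode': tagtext[0], 'Borough': tagtext[1],
                     'Neighborhood': tagtext[2]})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step.
toronto_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])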

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
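
The row-by-row extraction above can also be sketched without a third-party parser. The block below is a minimal illustration using Python's standard-library `html.parser`, run against a small inline fragment shaped like the Wikipedia table. The class name `TableRowCollector` and the sample fragment are assumptions made for this sketch and are not part of the original notebook.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of <td> cells, one list per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            # Header rows contain only <th> cells, so they end up empty; skip them.
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Inline fragment that imitates the layout of the scraped table.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

collector = TableRowCollector()
collector.feed(sample_html)
print(collector.rows)
```

Each inner list can then be turned into one dataframe row, in the same way as the `tagtext` list in the cell above.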


Drop rows where the Borough is 'Not assigned' and reset index.

In [6]:
toronto_df.drop(toronto_df.index[toronto_df['Borough'] == 'Not assigned'], inplace = True)
toronto_df.reset_index(drop = True, inplace = True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
toronto_df.shape

(103, 3)
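
The same filtering can be written with a boolean mask instead of an in-place `drop`. This is a small sketch on four rows copied from the head of the scraped table; the names `sample_df` and `assigned_df` are made up for the illustration.

```python
import pandas as pd

# Toy rows copied from the head of the scraped table.
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose borough is assigned, then renumber the index from 0.
assigned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned_df)
```

Because the mask returns a new dataframe, the original stays untouched, which makes it easier to re-run the cell while experimenting.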

# Part Two

In this section, I will add longitude and latitude columns to the dataframe created in part one.

Start by importing the needed library.

In [8]:
import geocoder # import geocoder

Download the CSV file of coordinates from the provided link with wget and load it into a dataframe.

In [9]:
import wget
wget.download('https://cocl.us/Geospatial_data', 'coordinates.csv')
coord_df = pd.read_csv('coordinates.csv')

 Rename the postal code column to make it similar to the toronto_df column name.  
 
 Take a look at the coordinates dataframe.

In [10]:
coord_df.rename(columns = {'Postal Code': 'PostalCode'}, inplace = True)
coord_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


 Join coord_df dataframe to the toronto_df dataframe, to add latitude and longitude to the toronto_df dataframe.
 
 Take a look at the toronto_df dataframe

In [11]:
# merge toronto_df with coord_df to add latitude/longitude for each neighborhood in Toronto
toronto_df = toronto_df.join(coord_df.set_index('PostalCode'), on='PostalCode')
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Take a look at the dataframe now.

In [12]:
toronto_df.shape

(103, 5)
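
Since the join relies on every postal code having a match, it can be worth checking for rows that came back without coordinates. The sketch below uses `DataFrame.merge` on toy data; the first two coordinate rows are taken from the output above, while the code `X0X` and the borough `Nowhere` are invented purely to show what an unmatched row looks like.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'X0X'],   # 'X0X' is a made-up code with no match
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left merge keeps every row of `left`; validate raises an error if a
# postal code appears more than once in `right`.
merged = left.merge(right, on='PostalCode', how='left', validate='many_to_one')

# Rows that found no coordinates carry NaN in Latitude and Longitude.
missing = merged[merged['Latitude'].isna()]
print(merged.shape)
print(missing['PostalCode'].tolist())
```

An empty `missing` dataframe would indicate that every postal code received coordinates.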

# Part Three

In this section, I'll cluster the Toronto neighborhoods as done for the New York neighborhoods.

Start by importing the required libraries.

In [13]:
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

#### Define Foursquare Credentials and Version

In [41]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: 
CLIENT_SECRET:


#### Use geopy to get the longitude and latitude of Toronto.

In [15]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [16]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Given how densely packed the neighborhoods in downtown Toronto are, focus on the neighborhoods in the Downtown Toronto borough.

In [17]:
dt_toronto = toronto_df[toronto_df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


#### Get the coordinates for downtown Toronto.

In [18]:
address = 'Downtown Toronto, ON'

geolocator = Nominatim(user_agent="dt_to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


#### Create a map of Toronto with downtown neighborhoods superimposed on top.

In [19]:
# create map of Downtown Toronto using latitude and longitude values
map_dt_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(dt_toronto['Latitude'], dt_toronto['Longitude'], dt_toronto['Borough'], dt_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_toronto)  
    
map_dt_toronto

## Now exploring the neighborhoods in downtown Toronto.

In [20]:
dt_toronto.shape

(19, 5)

In [21]:
address = 'Downtown Toronto, ON'

geolocator = Nominatim(user_agent="dt_to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


#### Creating a function to pull up to 100 venues within a 500 m radius of each neighborhood in Downtown Toronto.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue, first for neighborhood and then
        # for each venue from results above in each neighborhood
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
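
The two fragile spots in the function above are the hand-formatted query string and the chained indexing into the JSON response. The sketch below shows one way these could be separated out using only the standard library. The helper names `build_explore_url` and `extract_venues` are hypothetical, and `sample_payload` is a hand-written dictionary that only mirrors the keys indexed in `getNearbyVenues`; it is not taken from the official API schema.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a response-shaped dict.

    Returns an empty list when no groups are present and uses None
    when a venue has no category.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        venues.append((venue['name'], venue['location']['lat'],
                       venue['location']['lng'], category))
    return venues


# Hand-written payload that mirrors the keys indexed in getNearbyVenues.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
]}]}}

url = build_explore_url('ID', 'SECRET', '20180605', 43.65426, -79.360636)
print(url)
print(extract_venues(sample_payload))
print(extract_venues({'response': {}}))
```

With this split, a neighborhood that returns no venues or a venue without a category yields an empty list or `None` instead of raising an `IndexError` or `KeyError` halfway through the loop.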

#### Calling the function to get venues for the neighborhoods in Downtown Toronto.

In [23]:
toronto_venues = getNearbyVenues(names=dt_toronto['Neighborhood'],
                                   latitudes=dt_toronto['Latitude'],
                                   longitudes=dt_toronto['Longitude']
                                  )

toronto_venues.head()

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Checking how many venues were returned for each neighborhood in Downtown Toronto.

In [24]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,57,57,57,57,57,57
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,66,66,66,66,66,66
Christie,16,16,16,16,16,16
Church and Wellesley,72,72,72,72,72,72
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"Kensington Market, Chinatown, Grange Park",66,66,66,66,66,66


Finding how many unique categories there are.

In [25]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 207 unique categories.


## Analyzing each neighborhood.

In [26]:
# one hot encoding
dt_toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dt_toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [dt_toronto_onehot.columns[-1]] + list(dt_toronto_onehot.columns[:-1])
# print(list(dt_toronto_onehot.columns[:-1])) shows the names of the remaining columns
dt_toronto_onehot = dt_toronto_onehot[fixed_columns]

dt_toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Examining the dataframe.

In [27]:
dt_toronto_onehot.shape

(1238, 207)

#### Group rows by neighborhood and find the mean of the frequency of occurrence of each category.

In [28]:
dt_toronto_grouped = dt_toronto_onehot.groupby('Neighborhood').mean().reset_index()
dt_toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.015152,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.027778,0.0,0.0,0.0,0.0,0.0,0.013889,0.0,0.0,...,0.013889,0.013889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.045455,0.015152,0.0


Confirming the new size.

In [29]:
dt_toronto_grouped.shape

(19, 207)
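
To see why the mean of the one-hot columns can be read as a frequency, here is a small sketch on toy data. The neighborhood names `A` and `B` are placeholders, and the six venues are invented for the illustration.

```python
import pandas as pd

# Toy venue list; neighborhood names are placeholders.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# One indicator column per category (dtype=int keeps the values as 0/1).
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column within a group is the share of that group's
# venues that belong to the category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Neighborhood `A` has two coffee shops out of four venues, so its `Coffee Shop` value is 0.5, and each row of category shares sums to 1.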

### Printing each neighborhood with top 5 most common venues.

In [30]:
num_top_venues = 5

for hood in dt_toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dt_toronto_grouped[dt_toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue  freq
0   Coffee Shop  0.07
1  Cocktail Bar  0.05
2    Restaurant  0.04
3   Cheese Shop  0.04
4      Beer Bar  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12
3          Boutique  0.06
4           Airport  0.06


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.17
1   Italian Restaurant  0.06
2  Japanese Restaurant  0.05
3       Sandwich Place  0.05
4                 Café  0.05


----Christie----
           venue  freq
0  Grocery Store  0.25
1           Café  0.19
2           Park  0.12
3    Candy Store  0.06
4          Diner  0.06


----Church and Wellesley----
                 venue  freq
0          Coffee Shop  0.10
1  Japanese Restaurant  0.06
2     Sushi Restaurant  0.06
3           Restaurant  0.04
4              Gay Bar  0.04


----Comm

## Putting the above step into a dataframe for the top 10 venues.

Write a function sorting the venues in descending order.

In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now creating a dataframe with the top 10 venues in each neighborhood.

In [32]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dt_toronto_grouped['Neighborhood']

for ind in np.arange(dt_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Seafood Restaurant,Farmers Market,Bakery,Restaurant,Pharmacy,Beer Bar
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Coffee Shop,Boutique,Bar,Plane,Rental Car Location,Boat or Ferry,Harbor / Marina
2,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Japanese Restaurant,Thai Restaurant,Bubble Tea Shop,Burger Joint,Bar,Salad Place
3,Christie,Grocery Store,Café,Park,Candy Store,Diner,Italian Restaurant,Restaurant,Baby Store,Coffee Shop,Nightclub
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Pub,Mediterranean Restaurant,Hotel,Men's Store
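
The column labels above are built by falling back to `th` whenever the `indicators` list runs out. A small helper can make that rule explicit. This is only a sketch: the function name `ordinal` is made up here, and it also covers numbers such as 11 or 21, which the notebook never needs with ten columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For the first ten positions this produces the same labels as the cell above.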


# Clustering the neighborhoods

Running k-means to cluster the neighborhoods into five clusters.

In [33]:
# set number of clusters
kclusters = 5

dt_toronto_grouped_clustering = dt_toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 3, 2, 4, 2, 2, 2, 2, 2, 2])
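
To make the labels above less of a black box, the sketch below walks through the two steps that k-means alternates between: assigning each point to its nearest center, then moving each center to the mean of its points. It is a toy NumPy illustration, not scikit-learn's implementation (which by default uses k-means++ initialization, among other refinements). The function name `simple_kmeans` and the six sample vectors are assumptions made for this example.

```python
import numpy as np


def simple_kmeans(points, k, n_iter=20, seed=0):
    """Tiny Lloyd's-algorithm sketch; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:   # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels, centers


# Two well-separated toy groups of frequency-like vectors.
points = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],
])
labels, centers = simple_kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is also why the cluster numbers in the notebook carry no ranking.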

Creating a new dataframe that contains the clusters and the top 10 venues for each neighborhood.

In [34]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_toronto_merged = dt_toronto

# join neighborhoods_venues_sorted onto dt_toronto so each neighborhood keeps its latitude/longitude
dt_toronto_merged = dt_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

dt_toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Café,Breakfast Spot,Theater,Electronics Store,Event Space,Dessert Shop
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,College Cafeteria,Diner,Smoothie Shop,Beer Bar,Sandwich Place,Burrito Place,Café,Park,College Auditorium
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Cosmetics Shop,Bubble Tea Shop,Café,Japanese Restaurant,Italian Restaurant,Hotel,Fast Food Restaurant,Bookstore
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Café,Coffee Shop,Restaurant,Clothing Store,Cocktail Bar,American Restaurant,Cosmetics Shop,Italian Restaurant,Seafood Restaurant,Beer Bar
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Seafood Restaurant,Farmers Market,Bakery,Restaurant,Pharmacy,Beer Bar


Visualizing the resulting clusters.

In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_toronto_merged['Latitude'], dt_toronto_merged['Longitude'], dt_toronto_merged['Neighborhood'], dt_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Examining the Clusters

### Cluster 1

In [36]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 0, dt_toronto_merged.columns[[1] + list(range(5, dt_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Park,Bakery,Pub,Café,Breakfast Spot,Theater,Electronics Store,Event Space,Dessert Shop
1,Downtown Toronto,0,Coffee Shop,College Cafeteria,Diner,Smoothie Shop,Beer Bar,Sandwich Place,Burrito Place,Café,Park,College Auditorium


### Cluster 2

In [37]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 1, dt_toronto_merged.columns[[1] + list(range(5, dt_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,1,Park,Trail,Playground,Cupcake Shop,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store,Diner


### Cluster 3

In [38]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 2, dt_toronto_merged.columns[[1] + list(range(5, dt_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,2,Clothing Store,Coffee Shop,Cosmetics Shop,Bubble Tea Shop,Café,Japanese Restaurant,Italian Restaurant,Hotel,Fast Food Restaurant,Bookstore
3,Downtown Toronto,2,Café,Coffee Shop,Restaurant,Clothing Store,Cocktail Bar,American Restaurant,Cosmetics Shop,Italian Restaurant,Seafood Restaurant,Beer Bar
4,Downtown Toronto,2,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Seafood Restaurant,Farmers Market,Bakery,Restaurant,Pharmacy,Beer Bar
5,Downtown Toronto,2,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Japanese Restaurant,Thai Restaurant,Bubble Tea Shop,Burger Joint,Bar,Salad Place
7,Downtown Toronto,2,Coffee Shop,Café,Hotel,Restaurant,Clothing Store,Thai Restaurant,Gym,Deli / Bodega,Bookstore,Concert Hall
8,Downtown Toronto,2,Coffee Shop,Aquarium,Café,Hotel,Restaurant,Sporting Goods Shop,Scenic Lookout,Fried Chicken Joint,Brewery,Bakery
9,Downtown Toronto,2,Coffee Shop,Hotel,Café,Restaurant,Italian Restaurant,Seafood Restaurant,American Restaurant,Japanese Restaurant,Salad Place,Steakhouse
10,Downtown Toronto,2,Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Italian Restaurant,Seafood Restaurant,Deli / Bodega,Japanese Restaurant
11,Downtown Toronto,2,Café,Sandwich Place,Bar,Japanese Restaurant,Bookstore,Bakery,Restaurant,Yoga Studio,Beer Bar,Beer Store
12,Downtown Toronto,2,Café,Coffee Shop,Vietnamese Restaurant,Bar,Mexican Restaurant,Dessert Shop,Grocery Store,Park,Burger Joint,Bakery


### Cluster 4

In [39]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 3, dt_toronto_merged.columns[[1] + list(range(5, dt_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Downtown Toronto,3,Airport Lounge,Airport Service,Airport Terminal,Coffee Shop,Boutique,Bar,Plane,Rental Car Location,Boat or Ferry,Harbor / Marina


### Cluster 5

In [40]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 4, dt_toronto_merged.columns[[1] + list(range(5, dt_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Downtown Toronto,4,Grocery Store,Café,Park,Candy Store,Diner,Italian Restaurant,Restaurant,Baby Store,Coffee Shop,Nightclub


#### OBSERVATION: Most of the neighborhoods fall into Cluster 3, which spans the stretch from the University of Toronto down to the bay around Union Station. This makes sense given the higher concentration of students and offices in that area.