# Toronto Neighborhoods Segmenting and Clustering

Table of Contents:
1. <a href="#Toronto-Neighborhoods-Data-Scraping-(Part-1)">Toronto Neighborhoods Data Scraping (Part 1)</a>
2. <a href="#Toronto-Neighborhoods-Geographical-Coordinates-(Part-2)">Toronto Neighborhoods Geographical Coordinates (Part 2)</a>
3. <a href="#Toronto-Neigborhood-Clustering-(Part-3)">Toronto Neighborhood Clustering (Part 3)</a>

## Toronto Neighborhoods Data Scraping (Part 1)

This is the first part of the second Data Science Capstone project for my Coursera IBM Data Specialization course.

This first part does the following:
    1. Scrape Toronto neighborhood data (Postal Code, Borough, Neighborhood) from Wikipedia with BeautifulSoup
    2. Turn the scraped data into a pandas DataFrame
    
The Toronto neighborhood data is taken from [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

### 1.1 Import the required libraries

In [1]:
import numpy as np
import pandas as pd

#!conda install -c anaconda beautifulsoup4
from bs4 import BeautifulSoup as bs

import requests

### 1.2 Load the Wikipedia page html code to a variable

In [2]:
data_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_code = requests.get(data_link).text

### 1.3 Convert the html code to a more readable version

We will use BeautifulSoup's prettify function to see how the HTML tags are nested.

In [3]:
soup = bs(html_code, 'lxml')
# print(soup.prettify())

Based on the HTML code, our data is in a table with the class "wikitable sortable".

### 1.4 Extract table data to Pandas DataFrame

Find the html code for the 'wikitable sortable' class

In [4]:
data_table = soup.find('table', {'class':'wikitable sortable'})

Find the header of the table

In [5]:
header = data_table.findAll('th')
header

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
 </th>]

Extract the header to a list

In [6]:
df_head = []
for head in header:
    j = head.text.strip()
    df_head.append(j)

df_head

['Postcode', 'Borough', 'Neighbourhood']
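As a small offline illustration of the table lookup and header extraction above, the sketch below parses a hand-written HTML fragment instead of the live Wikipedia page. The fragment and the `sample_*` names are made up for this example. It assumes `beautifulsoup4` is installed and uses the built-in `html.parser` backend rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (not the real page content).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
  </th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
  </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match on the full class string, as in the notebook cell.
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# get_text(strip=True) drops the trailing newline inside the last <th>.
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]
print(sample_headers)
```

The list comprehension does the same job as the explicit loop with `head.text.strip()`.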

The header of the last column is 'Neighbourhood' instead of 'Neighborhood', so let's change it. Python lists are mutable, so the entry can be replaced in place.

In [7]:
df_head[2] = 'Neighborhood'

Find the row data from the table

In [8]:
hood_data = data_table.findAll('td')
hood_data[0:6]

[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>]

Extract each row to a list with its corresponding column.

**Skip the row** which has 'Not assigned' value in the Borough column.

In [9]:
df_post_code = []
df_borough = []
df_neighborhood = []
order = 0

extract = True

for i in range(len(hood_data)):
    if(order == 0):
        if (hood_data[i+1].text.strip() == 'Not assigned'):
            extract = False
        else:
            df_post_code.append(hood_data[i].text.strip())
            
        order += 1
    elif(order == 1):
        if (extract == True):
            df_borough.append(hood_data[i].text.strip())
            
        order += 1
    elif(order == 2):
        if (extract == False):
            extract = True
        else:
            df_neighborhood.append(hood_data[i].text.strip())
            
        order = 0
        
print(df_post_code[0:6])
print(df_borough[0:6])
print(df_neighborhood[0:6])

['M3A', 'M4A', 'M5A', 'M5A', 'M6A', 'M6A']
['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto', 'North York', 'North York']
['Parkwoods', 'Victoria Village', 'Harbourfront', 'Regent Park', 'Lawrence Heights', 'Lawrence Manor']
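The counter-based loop above can also be written by cutting the flat list of cells into rows of three. The following is only a sketch on made-up cell texts, not the scraped data; the names `cells`, `rows` and `kept` are hypothetical.

```python
# Flat list of cell texts in table order: postcode, borough, neighbourhood, ...
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M7A", "Queen's Park", "Not assigned",
]

# Walk the list three cells at a time so each step sees one full row.
rows = [tuple(cells[i:i + 3]) for i in range(0, len(cells), 3)]

# Keep only rows whose borough is assigned.
kept = [row for row in rows if row[1] != "Not assigned"]

post_codes = [row[0] for row in kept]
boroughs = [row[1] for row in kept]
neighborhoods = [row[2] for row in kept]
print(kept)
```

Because each row is handled as a unit, the three output lists cannot get out of step with each other.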


In [10]:
df_tor = pd.DataFrame({
    df_head[0]:df_post_code,
    df_head[1]:df_borough,
    df_head[2]:df_neighborhood
})


df_tor.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Print the DataFrame shape

In [11]:
df_tor.shape

(211, 3)

If a row has a Borough but a 'Not assigned' Neighborhood, the Neighborhood is set to the Borough value.

Let's see which rows have a 'Not assigned' Neighborhood.

In [12]:
df_tor[df_tor['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighborhood
6,M7A,Queen's Park,Not assigned


Only 1 row has a 'Not assigned' value.

Let's replace it with the Borough value, "Queen's Park".

In [13]:
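# use .loc with a boolean mask to avoid chained assignment
not_assigned = df_tor['Neighborhood'] == 'Not assigned'
df_tor.loc[not_assigned, 'Neighborhood'] = df_tor.loc[not_assigned, 'Borough']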

df_tor.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


### 1.5 Group different rows with same post code.

Rows that share a post code but have different Neighborhood values are combined into one row, with the Neighborhood values separated by commas.

In [14]:
# df_tor.groupby(by = ['Postcode', 'Borough']).agg(lambda x:','.join(x))
df_tor = df_tor.groupby(['Postcode','Borough'], sort = False).agg(lambda x: ','.join(x))

df_tor.reset_index(inplace = True)

df_tor

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
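To see the grouping step in isolation, here is a minimal sketch on a three-row toy frame (made-up rows in the same shape as the scraped data). It uses `as_index=False` instead of a separate `reset_index` call, which is a stylistic choice of this example and not what the notebook cell does.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# sort=False keeps groups in order of first appearance;
# as_index=False keeps the grouping keys as ordinary columns.
toy_grouped = (
    toy.groupby(["Postcode", "Borough"], sort=False, as_index=False)
       .agg({"Neighborhood": ",".join})
)
print(toy_grouped)
```

The two M5A rows collapse into one row with a comma-separated Neighborhood, while M7A is left unchanged.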


### 1.6 Print the shape of the final dataframe

In [15]:
print("DataFrame row is {}, column is {}".format(df_tor.shape[0], df_tor.shape[1]))

DataFrame row is 103, column is 3


## Toronto Neighborhoods Geographical Coordinates (Part 2)

In this part, we need to find the geographical coordinates of each postal code and merge them into our previous dataframe.

### 2.1 Import geocoder

In [17]:
#!pip install geocoder 
import geocoder


### 2.2 Find the latitude and longitude data of each postal code

Find the latitude and longitude of each postal code, and then put them in the dataframe.

In this case I use the geocoder.arcgis function because geocoder.google returns None.

In [18]:
df_lat = []
df_long = []

for postal in df_tor['Postcode']:
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal))
        lat_lng_coords = g.latlng

    df_lat.append(lat_lng_coords[0])
    df_long.append(lat_lng_coords[1])
    
df_tor['Latitude'] = df_lat
df_tor['Longitude'] = df_long

df_tor.head(5)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75242,-79.329242
1,M4A,North York,Victoria Village,43.7306,-79.313265
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.650295,-79.359166
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.72327,-79.451286
4,M7A,Queen's Park,Queen's Park,43.66115,-79.391715
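The `while` loop above keeps calling the geocoder until it gets an answer, so it would never finish if a postal code could not be resolved. A bounded retry is one way to guard against that. The sketch below is an assumption-laden illustration: `geocode_with_retry` and its parameters are hypothetical names, and a fake lookup stands in for the real service so that it runs offline.

```python
import time

def geocode_with_retry(lookup, address, max_attempts=5, delay_seconds=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None; with
    geocoder it could be `lambda a: geocoder.arcgis(a).latlng`.
    """
    for _ in range(max_attempts):
        coords = lookup(address)
        if coords is not None:
            return coords
        # wait before the next attempt
        time.sleep(delay_seconds)
    return None

# Fake lookup that fails twice, then succeeds.
responses = iter([None, None, [43.75242, -79.329242]])
result = geocode_with_retry(lambda address: next(responses), "M3A, Toronto, Ontario")
print(result)
```

When every attempt fails the function returns None, and the caller can decide whether to skip that postal code or stop.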


Let's print out all of the dataframe data

In [19]:
df_tor

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752420,-79.329242
1,M4A,North York,Victoria Village,43.730600,-79.313265
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.650295,-79.359166
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.723270,-79.451286
4,M7A,Queen's Park,Queen's Park,43.661150,-79.391715
5,M9A,Etobicoke,Islington Avenue,43.662299,-79.528195
6,M1B,Scarborough,"Rouge,Malvern",43.811525,-79.195517
7,M3B,North York,Don Mills North,43.749055,-79.362227
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.707535,-79.311773
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657363,-79.378180


The coordinates look complete, so let's move on to the next and last part.

## Toronto Neigborhood Clustering (Part 3)

In the last part, we will use our data to explore and cluster Toronto neighborhoods.

I am interested in exploring Toronto's restaurants, so let's find the most common restaurant types in the Downtown Toronto borough.

Then I will cluster the post codes in that borough based on the similarity of their restaurant types.

### 3.1 Import the required libraries

In [20]:
#!pip install folium
import folium
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0


### 3.2 Visualize the map of Toronto

We will use arcgis to find Toronto's latitude and longitude. We will use Downsview Airport as our map center.

In [21]:
g = geocoder.arcgis('Downsview Airport, Toronto, Ontario')

toronto_map = folium.Map(
    location = g.latlng,
    zoom_start = 12
)

toronto_map

### 3.3 Visualize Toronto map with markers for each postal code

In [22]:
for lat, long, post, bor in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Postcode'], df_tor['Borough']):
    label = post + ", " + bor
    folium.CircleMarker(
        location = [lat, long],
        popup = label,
        radius = 5,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(toronto_map)

toronto_map

### 3.4 Load Foursquare API Data

I will query the Foursquare API and gather the necessary data from it.

The cell that contains my Foursquare credentials is hidden.

In [23]:
# The code was removed by Watson Studio for sharing.

In [24]:
down_tor = df_tor[df_tor['Borough'] == 'Downtown Toronto']
down_tor.reset_index(drop = True, inplace = True)

down_tor

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.650295,-79.359166
1,M5B,Downtown Toronto,"Ryerson,Garden District",43.657363,-79.37818
2,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481
3,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675
4,M5G,Downtown Toronto,Central Bay Street,43.656091,-79.38493
5,M6G,Downtown Toronto,Christie,43.668781,-79.42071
6,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.6497,-79.382582
7,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.63021,-79.362433
8,M5K,Downtown Toronto,"Design Exchange,Toronto Dominion Centre",43.6471,-79.381531
9,M5L,Downtown Toronto,"Commerce Court,Victoria Hotel",43.648395,-79.378865


There are **18** different post codes in Downtown Toronto.

Let's try a restaurant search query for M5G.

In [25]:
m5g = down_tor[down_tor['Postcode'] == 'M5G']
m5g_lat = m5g['Latitude'].item()
m5g_long = m5g['Longitude'].item()
query = 'Restaurant'
VERSION = '20190605'
radius = 500
limit = 50

m5g_url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        m5g_lat, 
        m5g_long, 
        VERSION, 
        query, 
        radius, 
        limit
    )

results = requests.get(m5g_url).json()

results

{'meta': {'code': 200, 'requestId': '5da882f0cf72a00039ad2b0b'},
 'response': {'venues': [{'id': '4ad4c05ff964a52048f720e3',
    'name': 'Hemispheres Restaurant & Bistro',
    'location': {'address': '110 Chestnut Street',
     'lat': 43.65488413420439,
     'lng': -79.38593077371578,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.65488413420439,
       'lng': -79.38593077371578}],
     'distance': 156,
     'postalCode': 'M5G 1R3',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['110 Chestnut Street',
      'Toronto ON M5G 1R3',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d14e941735',
      'name': 'American Restaurant',
      'pluralName': 'American Restaurants',
      'shortName': 'American',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1571324657',
    'hasPerk': False},
   {'id': '
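The request URL above is assembled with `str.format`. As an alternative sketch, the query string can be built from a dictionary with the standard library's `urlencode`, which also escapes special characters. `build_search_url` is a hypothetical helper and the credentials are placeholders; the parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Build a venues/search URL with the same parameters as the notebook cell."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials, for illustration only.
example_url = build_search_url("MY_ID", "MY_SECRET", 43.656091, -79.38493,
                               "20190605", "Restaurant", 500, 50)
print(example_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`. Passing the same dictionary to `requests.get` through its `params` argument should give an equivalent request.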

In [26]:
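def get_category_type(row):
    # search results store the list under 'categories',
    # explore results store it under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    
    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']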

In [27]:
restaurant = results['response']['venues']

nearby_restaurant = json_normalize(restaurant)

important_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_restaurant = nearby_restaurant.loc[:, important_columns]


nearby_restaurant['categories'] = nearby_restaurant.apply(get_category_type, axis=1)
nearby_restaurant.columns = [col.split('.')[-1] for col in nearby_restaurant.columns]

nearby_restaurant

Unnamed: 0,name,categories,lat,lng
0,Hemispheres Restaurant & Bistro,American Restaurant,43.654884,-79.385931
1,Richtree Natural Market Restaurants,Restaurant,43.652614,-79.380231
2,Hong Shing Chinese Restaurant,Chinese Restaurant,43.654925,-79.387089
3,Yueh Tung Chinese Restaurant,Chinese Restaurant,43.655281,-79.385337
4,The Senator Restaurant,Diner,43.655641,-79.379199
5,New Treasure Restaurant,Dim Sum Restaurant,43.655384,-79.385362
6,Kyoto House Japanese Restaurant,Sushi Restaurant,43.655381,-79.38527
7,The Elm Tree Restaurant,Modern European Restaurant,43.657397,-79.383761
8,Spring Rolls | Japanese Restaurant in Toronto,Theme Restaurant,43.656105,-79.383495
9,Adega Restaurant,Restaurant,43.657519,-79.383462


There are 34 restaurants in M5G, Toronto.

Define a function to return the restaurant type / category

In [28]:
def getTypeRestaurant(result):
    rest_type = 'None'

    for v in result:
        try:
            rest_type = v['name']
        except:
            rest_type = 'None'
            
    return rest_type

Define a function to run the query for all 18 post codes.

In [29]:
def getNearbyRestaurant(postcodes, names, latitudes, longitudes, radius=500, limit=50):
    
    restaurant_list=[]
    for postcode, name, lat, long in zip(postcodes, names, latitudes, longitudes):
        # look up the neighborhood string that belongs to this post code
        neigh_name = down_tor.loc[down_tor['Postcode'] == postcode, 'Neighborhood'].iloc[0]
        print(postcode + ", " +name)
            
        # create the API request URL (query and VERSION come from the earlier cell)
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            long, 
            VERSION, 
            query, 
            radius, 
            limit
        )

        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        restaurant_list.append([(
            postcode,
            neigh_name, 
            lat, 
            long, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            getTypeRestaurant(v['categories'])) for v in results])


    # flatten the per-post-code lists into one list of rows
    nearby_restaurant = pd.DataFrame([item for venue_rows in restaurant_list for item in venue_rows])
    nearby_restaurant.columns = ['Post Code',
                  'Neighborhood', 
                  'Post Code Latitude', 
                  'Post Code Longitude', 
                  'Restaurant', 
                  'Restaurant Latitude', 
                  'Restaurant Longitude',
                  'Restaurant Category']    
    
    return(nearby_restaurant)

toronto_restaurant = getNearbyRestaurant(postcodes = down_tor['Postcode'],
                                         names = down_tor['Borough'],
                                         latitudes = down_tor['Latitude'],
                                         longitudes = down_tor['Longitude']
                                  )

M5A, Downtown Toronto
M5B, Downtown Toronto
M5C, Downtown Toronto
M5E, Downtown Toronto
M5G, Downtown Toronto
M6G, Downtown Toronto
M5H, Downtown Toronto
M5J, Downtown Toronto
M5K, Downtown Toronto
M5L, Downtown Toronto
M5S, Downtown Toronto
M5T, Downtown Toronto
M5V, Downtown Toronto
M4W, Downtown Toronto
M5W, Downtown Toronto
M4X, Downtown Toronto
M5X, Downtown Toronto
M4Y, Downtown Toronto


In [30]:
toronto_restaurant.head(5)

Unnamed: 0,Post Code,Neighborhood,Post Code Latitude,Post Code Longitude,Restaurant,Restaurant Latitude,Restaurant Longitude,Restaurant Category
0,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Site Of Great Canary Restaurant,43.653323,-79.357883,Breakfast Spot
1,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Archeo,43.650667,-79.359431,Italian Restaurant
2,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,M5B,"Ryerson,Garden District",43.657363,-79.37818,Hemispheres Restaurant & Bistro,43.654884,-79.385931,American Restaurant
4,M5B,"Ryerson,Garden District",43.657363,-79.37818,Studio Restaurant,43.6615,-79.379319,Breakfast Spot


In [31]:
toronto_restaurant.shape

(455, 8)

Our dataset has a total of 455 rows.

Let's see if there are any **None** values in the restaurant category.

In [32]:
toronto_restaurant[toronto_restaurant['Restaurant Category'] == 'None']

Unnamed: 0,Post Code,Neighborhood,Post Code Latitude,Post Code Longitude,Restaurant,Restaurant Latitude,Restaurant Longitude,Restaurant Category
186,M5H,"Adelaide,King,Richmond",43.6497,-79.382582,Tropical Desires,43.648298,-79.38788,
230,M5K,"Design Exchange,Toronto Dominion Centre",43.6471,-79.381531,Tropical Desires,43.648298,-79.38788,
328,M5T,"Chinatown,Grange Park,Kensington Market",43.65353,-79.397233,Full Moon Vegetarian Restaurant,43.652131,-79.40252,
372,M5W,Stn A PO Boxes 25 The Esplanade,43.64869,-79.38544,Tropical Desires,43.648298,-79.38788,
389,M4X,"Cabbagetown,St. James Town",43.66816,-79.366602,Plum 226 Restaurant & Lounge,43.66405,-79.36906,
430,M5X,"First Canadian Place,Underground city",43.64828,-79.381461,Tropical Desires,43.648298,-79.38788,


So there are **6 rows** with a None category, and 4 of them are the same venue, 'Tropical Desires', returned for different post codes.

Let's drop them from the dataset.

In [33]:
toronto_restaurant = toronto_restaurant[toronto_restaurant['Restaurant Category'] != 'None'].reset_index(drop = True)

toronto_restaurant.head()

Unnamed: 0,Post Code,Neighborhood,Post Code Latitude,Post Code Longitude,Restaurant,Restaurant Latitude,Restaurant Longitude,Restaurant Category
0,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Site Of Great Canary Restaurant,43.653323,-79.357883,Breakfast Spot
1,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Archeo,43.650667,-79.359431,Italian Restaurant
2,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,M5B,"Ryerson,Garden District",43.657363,-79.37818,Hemispheres Restaurant & Bistro,43.654884,-79.385931,American Restaurant
4,M5B,"Ryerson,Garden District",43.657363,-79.37818,Studio Restaurant,43.6615,-79.379319,Breakfast Spot


In [34]:
toronto_restaurant.shape

(449, 8)

Ok, I have successfully removed the 6 rows.

Next, let's see how many unique categories there are.

In [35]:
print('There are {} unique categories.'.format(toronto_restaurant['Restaurant Category'].nunique()))

There are 51 unique categories.


Now I will one-hot encode the restaurant category, so that the data can be grouped by post code and the most frequent restaurant categories in each post code can be found.

In [36]:
toronto_onehot = pd.get_dummies(toronto_restaurant[['Restaurant Category']], prefix="", prefix_sep="")

toronto_onehot['Post Code'] = toronto_restaurant['Post Code']

# move post code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

Unnamed: 0,Post Code,American Restaurant,Asian Restaurant,Bar,Beer Bar,Bistro,Breakfast Spot,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,...,Sandwich Place,Spanish Restaurant,Steakhouse,Sushi Restaurant,Szechuan Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar
0,M5A,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5B,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5B,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,M5B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6,M5B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,M5B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,M5B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,M5B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group by post code and take the mean of each category's occurrence.

In [37]:
toronto_grouped = toronto_onehot.groupby('Post Code').mean().reset_index()
toronto_grouped

Unnamed: 0,Post Code,American Restaurant,Asian Restaurant,Bar,Beer Bar,Bistro,Breakfast Spot,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,...,Sandwich Place,Spanish Restaurant,Steakhouse,Sushi Restaurant,Szechuan Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar
0,M4X,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.142857,...,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4Y,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.1,...,0.15,0.0,0.0,0.2,0.0,0.05,0.0,0.0,0.0,0.0
2,M5A,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M5B,0.069767,0.069767,0.0,0.0,0.023256,0.023256,0.0,0.023256,0.093023,...,0.046512,0.0,0.0,0.069767,0.0,0.069767,0.023256,0.0,0.0,0.0
4,M5C,0.05,0.075,0.0,0.025,0.0,0.025,0.0,0.025,0.0,...,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0
5,M5E,0.117647,0.0,0.058824,0.058824,0.0,0.058824,0.0,0.0,0.0,...,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0
6,M5G,0.085714,0.028571,0.0,0.0,0.0,0.0,0.0,0.028571,0.114286,...,0.085714,0.0,0.0,0.057143,0.0,0.028571,0.028571,0.0,0.028571,0.0
7,M5H,0.021739,0.021739,0.043478,0.0,0.0,0.021739,0.0,0.043478,0.0,...,0.021739,0.0,0.021739,0.021739,0.0,0.021739,0.0,0.0,0.0,0.021739
8,M5K,0.025641,0.051282,0.051282,0.0,0.0,0.025641,0.0,0.025641,0.0,...,0.051282,0.0,0.025641,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641
9,M5L,0.043478,0.021739,0.043478,0.021739,0.0,0.021739,0.0,0.021739,0.0,...,0.021739,0.021739,0.021739,0.021739,0.0,0.021739,0.0,0.0,0.0,0.021739
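The one-hot encoding followed by a grouped mean can be checked on a tiny example. The sketch below uses four made-up restaurants; the `toy_*` names are hypothetical. It passes `dtype=float` to `get_dummies` so the indicators are numeric regardless of the pandas version's default.

```python
import pandas as pd

toy_restaurants = pd.DataFrame({
    "Post Code": ["M5A", "M5A", "M5A", "M5B"],
    "Restaurant Category": ["Breakfast Spot", "Italian Restaurant",
                            "Breakfast Spot", "Thai Restaurant"],
})

# One indicator column per category, one row per restaurant.
toy_onehot = pd.get_dummies(toy_restaurants["Restaurant Category"], dtype=float)
toy_onehot.insert(0, "Post Code", toy_restaurants["Post Code"])

# The mean of 0/1 indicators is the share of restaurants in each category.
toy_share = toy_onehot.groupby("Post Code").mean()
print(toy_share)
```

Two of the three M5A restaurants are breakfast spots, so the M5A share for that category is 2/3, matching the 0.666667 seen for M5A in the table above.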


Get the top 5 restaurant categories for each post code.

In [38]:
num_top_venues = 5

for post in toronto_grouped['Post Code']:
    print("----"+post+"----")
    temp = toronto_grouped[toronto_grouped['Post Code'] == post].T.reset_index()
    temp.columns = ['Post Code','Freq']
    temp = temp.iloc[1:]
    temp['Freq'] = temp['Freq'].astype(float)
    temp = temp.round({'Freq': 2})
    print(temp.sort_values('Freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4X----
            Post Code  Freq
0      Breakfast Spot  0.29
1          Restaurant  0.14
2   Indian Restaurant  0.14
3  Chinese Restaurant  0.14
4    Greek Restaurant  0.14


----M4Y----
              Post Code  Freq
0      Sushi Restaurant  0.20
1        Sandwich Place  0.15
2    Chinese Restaurant  0.10
3  Fast Food Restaurant  0.10
4                 Diner  0.10


----M5A----
                   Post Code  Freq
0             Breakfast Spot  0.67
1         Italian Restaurant  0.33
2        American Restaurant  0.00
3        Peruvian Restaurant  0.00
4  Middle Eastern Restaurant  0.00


----M5B----
             Post Code  Freq
0           Restaurant  0.12
1   Chinese Restaurant  0.09
2  American Restaurant  0.07
3     Asian Restaurant  0.07
4     Sushi Restaurant  0.07


----M5C----
             Post Code  Freq
0           Restaurant  0.28
1                Diner  0.10
2     Asian Restaurant  0.08
3  Japanese Restaurant  0.08
4   Italian Restaurant  0.08


----M5E----
            

Define a function to sort categories in descending order.

In [39]:
def return_most_common_categories(row, num_top_categories):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_categories]

Now let's create the new dataframe and display the top 5 categories for each post code.

In [40]:
num_top_categories = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Post Code']
for ind in np.arange(num_top_categories):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond 3rd all take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
tor_rest_sorted = pd.DataFrame(columns=columns)
tor_rest_sorted['Post Code'] = toronto_grouped['Post Code']

for ind in np.arange(toronto_grouped.shape[0]):
    tor_rest_sorted.iloc[ind, 1:] = return_most_common_categories(toronto_grouped.iloc[ind, :], num_top_categories)

tor_rest_sorted.head()

Unnamed: 0,Post Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4X,Breakfast Spot,Indian Restaurant,Greek Restaurant,Sandwich Place,Restaurant
1,M4Y,Sushi Restaurant,Sandwich Place,Fast Food Restaurant,Diner,Chinese Restaurant
2,M5A,Breakfast Spot,Italian Restaurant,Dumpling Restaurant,Indian Restaurant,Hotel
3,M5B,Restaurant,Chinese Restaurant,American Restaurant,Asian Restaurant,Thai Restaurant
4,M5C,Restaurant,Diner,Japanese Restaurant,Asian Restaurant,Italian Restaurant
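Picking the most common categories for one post code amounts to sorting a row of shares in descending order and keeping the first few labels. A minimal sketch with made-up shares, using `Series.nlargest` as a shortcut for sort-then-slice:

```python
import pandas as pd

# Category shares for one hypothetical post code.
shares = pd.Series({
    "Sushi Restaurant": 0.20,
    "Sandwich Place": 0.15,
    "Diner": 0.10,
    "Wine Bar": 0.00,
})

# nlargest sorts in descending order and keeps the first n entries.
top_two = shares.nlargest(2).index.tolist()
print(top_two)
```

Categories with equal shares have no guaranteed order under the default `sort_values` algorithm, which may be why tied entries appear in a different order here than in the printed top 5 lists earlier.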


### 3.5 *K*-means clustering

Import Kmeans

In [41]:
from sklearn.cluster import KMeans

Run *k*-means to cluster the post codes into 5 clusters.

In [42]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Post Code', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

kmeans

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
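To get a feel for what the fitted labels mean, here is a toy run of `KMeans` on four made-up feature rows that fall into two obvious groups. The data and the `toy_*` names are invented for this sketch, and `n_init` is set explicitly so that the call behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four hypothetical post codes described by two category shares each.
toy_features = np.array([
    [0.90, 0.10],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.20, 0.80],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

The first two rows share one label and the last two share the other. The label numbers themselves are arbitrary identifiers, so only the grouping carries meaning.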

In [43]:
tor_rest_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = down_tor.drop('Borough', axis = 1)

# join the cluster labels and top categories onto the Downtown Toronto post codes, which carry latitude/longitude
toronto_merged = toronto_merged.join(tor_rest_sorted.set_index('Post Code'), on = 'Postcode')

toronto_merged

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,3.0,Breakfast Spot,Italian Restaurant,Dumpling Restaurant,Indian Restaurant,Hotel
1,M5B,"Ryerson,Garden District",43.657363,-79.37818,4.0,Restaurant,Chinese Restaurant,American Restaurant,Asian Restaurant,Thai Restaurant
2,M5C,St. James Town,43.65121,-79.375481,0.0,Restaurant,Diner,Japanese Restaurant,Asian Restaurant,Italian Restaurant
3,M5E,Berczy Park,43.64516,-79.373675,0.0,Restaurant,American Restaurant,Diner,Spanish Restaurant,Fast Food Restaurant
4,M5G,Central Bay Street,43.656091,-79.38493,4.0,Restaurant,Chinese Restaurant,American Restaurant,Sandwich Place,Italian Restaurant
5,M6G,Christie,43.668781,-79.42071,0.0,Korean Restaurant,Restaurant,Middle Eastern Restaurant,Ethiopian Restaurant,Nightclub
6,M5H,"Adelaide,King,Richmond",43.6497,-79.382582,0.0,Restaurant,Italian Restaurant,Japanese Restaurant,Fast Food Restaurant,New American Restaurant
7,M5J,"Harbourfront East,Toronto Islands,Union Station",43.63021,-79.362433,,,,,,
8,M5K,"Design Exchange,Toronto Dominion Centre",43.6471,-79.381531,0.0,Restaurant,Japanese Restaurant,Italian Restaurant,Asian Restaurant,Bar
9,M5L,"Commerce Court,Victoria Hotel",43.648395,-79.378865,0.0,Restaurant,Fast Food Restaurant,Italian Restaurant,Japanese Restaurant,American Restaurant


After the filtering above, no restaurants remain for **M5J** and **M4W**, so these post codes have no match in the join and their cluster label and venue columns are NaN. I will remove those rows now.

We will also cast the cluster label back to integer.

In [44]:
toronto_merged = toronto_merged.dropna(axis = 0).reset_index(drop = True)

toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

toronto_merged

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,"Harbourfront,Regent Park",43.650295,-79.359166,3,Breakfast Spot,Italian Restaurant,Dumpling Restaurant,Indian Restaurant,Hotel
1,M5B,"Ryerson,Garden District",43.657363,-79.37818,4,Restaurant,Chinese Restaurant,American Restaurant,Asian Restaurant,Thai Restaurant
2,M5C,St. James Town,43.65121,-79.375481,0,Restaurant,Diner,Japanese Restaurant,Asian Restaurant,Italian Restaurant
3,M5E,Berczy Park,43.64516,-79.373675,0,Restaurant,American Restaurant,Diner,Spanish Restaurant,Fast Food Restaurant
4,M5G,Central Bay Street,43.656091,-79.38493,4,Restaurant,Chinese Restaurant,American Restaurant,Sandwich Place,Italian Restaurant
5,M6G,Christie,43.668781,-79.42071,0,Korean Restaurant,Restaurant,Middle Eastern Restaurant,Ethiopian Restaurant,Nightclub
6,M5H,"Adelaide,King,Richmond",43.6497,-79.382582,0,Restaurant,Italian Restaurant,Japanese Restaurant,Fast Food Restaurant,New American Restaurant
7,M5K,"Design Exchange,Toronto Dominion Centre",43.6471,-79.381531,0,Restaurant,Japanese Restaurant,Italian Restaurant,Asian Restaurant,Bar
8,M5L,"Commerce Court,Victoria Hotel",43.648395,-79.378865,0,Restaurant,Fast Food Restaurant,Italian Restaurant,Japanese Restaurant,American Restaurant
9,M5S,"Harbord,University of Toronto",43.66311,-79.401801,4,Thai Restaurant,Sandwich Place,Restaurant,Chinese Restaurant,Italian Restaurant


And now there are 15 rows left in the data.

### 3.6 Visualize map with clustering and marker

Import the required libraries

In [45]:
from matplotlib import cm
from matplotlib import colors

In [46]:
tor_cluster_map = folium.Map(location=[toronto_merged['Latitude'][0], toronto_merged['Longitude'][0]], zoom_start=14)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, long, post, neigh, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postcode'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(post) + ' : ' + str(neigh) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7,
        reset = True).add_to(tor_cluster_map)
       
tor_cluster_map

In [47]:
toronto_merged_cluster_sorted = toronto_merged.sort_values(by = 'Cluster Labels').reset_index(drop = True)
toronto_merged_cluster_sorted

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5C,St. James Town,43.65121,-79.375481,0,Restaurant,Diner,Japanese Restaurant,Asian Restaurant,Italian Restaurant
1,M5E,Berczy Park,43.64516,-79.373675,0,Restaurant,American Restaurant,Diner,Spanish Restaurant,Fast Food Restaurant
2,M6G,Christie,43.668781,-79.42071,0,Korean Restaurant,Restaurant,Middle Eastern Restaurant,Ethiopian Restaurant,Nightclub
3,M5H,"Adelaide,King,Richmond",43.6497,-79.382582,0,Restaurant,Italian Restaurant,Japanese Restaurant,Fast Food Restaurant,New American Restaurant
4,M5K,"Design Exchange,Toronto Dominion Centre",43.6471,-79.381531,0,Restaurant,Japanese Restaurant,Italian Restaurant,Asian Restaurant,Bar
5,M5L,"Commerce Court,Victoria Hotel",43.648395,-79.378865,0,Restaurant,Fast Food Restaurant,Italian Restaurant,Japanese Restaurant,American Restaurant
6,M5W,Stn A PO Boxes 25 The Esplanade,43.64869,-79.38544,0,Restaurant,Indian Restaurant,Japanese Restaurant,Italian Restaurant,Sandwich Place
7,M5X,"First Canadian Place,Underground city",43.64828,-79.381461,0,Restaurant,Japanese Restaurant,Italian Restaurant,New American Restaurant,Bar
8,M4X,"Cabbagetown,St. James Town",43.66816,-79.366602,1,Breakfast Spot,Indian Restaurant,Greek Restaurant,Sandwich Place,Restaurant
9,M5T,"Chinatown,Grange Park,Kensington Market",43.65353,-79.397233,2,Chinese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Korean Restaurant,Asian Restaurant


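The table above lists the Downtown Toronto post codes sorted by cluster label, together with the five most common restaurant categories in each.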


Thanks for reading my Jupyter Notebook.