# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

This assignment is submitted as one Jupyter Notebook in a GitHub repository and consists of the following three parts:
1. Web scraping the list of postal codes of Canada from **[Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)**
2. Appending latitude and longitude for each neighborhood
3. Exploring and clustering the neighborhoods in Toronto


The rest of this notebook follows that structure, with markdown cells explaining the steps carried out in the code cells that follow them.

## 1. Web scraping the list of postal codes of Canada from Wikipedia

In this assignment, the `BeautifulSoup` package is used to scrape the table of postal codes from the **[Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)** page.


We therefore install the latest version of `beautifulsoup4`, together with the `lxml` package, which is needed to parse the downloaded HTML file in one of the following steps.

Once the packages have been downloaded, load the required libraries.

In [1]:
# install required packages (comment out these lines if they are already installed)
!pip install beautifulsoup4
!pip install lxml



In [2]:
# import relevant libraries
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Download the web page as **List_of_postal_codes_of_Canada.html** file to the current workspace. Parse the html content using `BeautifulSoup` library and `lxml` parser, and store it as a BeautifulSoup object named **soup**.

In [3]:
# download html file from wikipedia site
!wget -q -O 'List_of_postal_codes_of_Canada.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
with open('List_of_postal_codes_of_Canada.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

Extract the *Postal code*, *Borough*, and *Neighborhood* elements from the **soup** object and collect them in a pandas dataframe named **df**.


Set the values in the first row as column headers and remove the unwanted columns which contain no data. Remove the first row, which holds the headers, and rename *Postal code* to *PostalCode* in line with the assignment instructions.

In [4]:
# convert BeautifulSoup object into a dataframe
rows = []
for tr in soup.find('table', class_='wikitable').tbody.find_all('tr'):
    row = pd.Series([tr.text])
    rows.append(row.str.split('\n', expand=True))
# DataFrame.append was removed in pandas 2.0, so concatenate all rows in one go
df = pd.concat(rows, ignore_index = True)

df.columns = df.iloc[0]
df.drop(columns = ['',np.nan], inplace = True)
df.drop(index = 0, inplace = True)
# rename the first column (Postal code) by position
df.columns = ['PostalCode'] + list(df.columns[1:])
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
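Splitting `tr.text` on newlines works, but it also produces the empty columns that have to be dropped afterwards. As a possible alternative, the sketch below reads each row cell by cell. The inline HTML snippet is a made-up stand-in for the Wikipedia table (the real page layout may differ), and `html.parser` is used only so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the Wikipedia table, for illustration only
sample_html = """
<table class="wikitable">
  <tbody>
    <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
table = sample_soup.find('table', class_='wikitable')

# One list of stripped cell texts per table row (header cells are <th>, data cells are <td>)
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]

# The first row holds the headers; the remaining rows are data
sample_df = pd.DataFrame(rows[1:], columns=rows[0])
sample_df = sample_df.rename(columns={'Postal code': 'PostalCode'})
print(sample_df)
```

Because every cell is read individually, no empty helper columns appear and the header row never ends up in the data.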


Keep only the rows whose *Borough* is not 'Not assigned'.


Neighborhoods that share a *PostalCode* are listed in one cell, separated by ' /'. Split them into a list so that they can be joined with commas in the next step, and reset the index.

In [5]:
# keep rows with an assigned Borough; copy so that later assignments do not act on a slice
df = df[df['Borough'] != 'Not assigned'].copy()
df['Neighborhood'] = df['Neighborhood'].str.split(pat=' /')
df.reset_index(inplace = True, drop = True)
df.tail(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
98,M8X,Etobicoke,"[The Kingsway, Montgomery Road , Old Mill No..."
99,M4Y,Downtown Toronto,[Church and Wellesley]
100,M7Y,East Toronto,[Business reply mail Processing CentrE]
101,M8Y,Etobicoke,"[Old Mill South, King's Mill Park, Sunnylea,..."
102,M8Z,Etobicoke,"[Mimico NW, The Queensway West, South of Blo..."


Join all elements in the list of each *Neighborhood* as a string.

In [6]:
# join the returned list into a string for each Neighborhood,
# stripping the stray spaces left over from the split
df['Neighborhood'] = df['Neighborhood'].apply(lambda parts: ', '.join(p.strip() for p in parts))
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
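The split and join steps above can also be written with pandas string methods alone. The following is a minimal sketch on two made-up rows; the assumption that neighborhoods are separated by a slash with optional spaces comes from the table shown earlier, not from any guarantee about the page.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M4A'],
    'Neighborhood': ['Regent Park / Harbourfront', 'Victoria Village'],
})

# A multi-character pattern is treated as a regular expression by str.split,
# so this splits on the slash and swallows any spaces around it
parts = sample['Neighborhood'].str.split(r'\s*/\s*')

# Join each list back into one comma-separated string
sample['Neighborhood'] = parts.str.join(', ')
print(sample['Neighborhood'].tolist())
```

Rows with a single neighborhood pass through unchanged, so no special case is needed for them.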


If any *Neighborhood* does not have any value or is 'Not assigned', set *Neighborhood* to be *Borough*.

In [7]:
# For Not assigned Neighborhood of an assigned Borough, set Neighborhood to be Borough
unassigned = df['Neighborhood'].isna() | (df['Neighborhood'] == 'Not assigned')
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']

The final **df** dataframe of the first part of the assignment consists of **103** rows and **3** columns.

In [8]:
df.shape

(103, 3)

## 2. Appending latitude and longitude for each neighborhood

In this part of the assignment, the `geopy` package is used instead of the `geocoder` package, as the latter was tested but did not return any coordinates.


After the `geopy` package has been installed, load the **Nominatim** geocoder class into the workspace.

In [9]:
!pip install geopy



In [10]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

A copy of **df** from the first part is taken as **df1**, so that the original dataframe stays intact in case we need to roll back to the start of this part.

In [11]:
df1 = df.copy()

The approach from the previous lab, **Segmenting and Clustering Neighborhoods in New York City**, is adopted here to convert addresses into coordinates. However, most addresses in this exercise still cannot be converted successfully. Therefore, the **Geospatial_data.csv** file is used together with the `geopy` geocoder, so that both approaches can be practiced.


Read the **Geospatial_data.csv** file into a dataframe named **coord**. Set *PostalCode* as the index of **df1** so that rows in **coord** can be referenced easily.


Add two new columns, *Latitude* and *Longitude*, to **df1**. For each row, build an address from the *PostalCode* followed by ', Toronto, Ontario' and convert it to coordinates with the `geopy` geocoder, falling back to the **coord** dataframe when the geocoder returns nothing.


Check that the *Latitude* and *Longitude* columns have been fully populated.

In [12]:
url = 'http://cocl.us/Geospatial_data'
coord = pd.read_csv(url, header = 0, index_col = 0)
df1.set_index('PostalCode', drop = False, inplace = True)
# start with missing values so that unfilled rows can be detected afterwards
df1['Latitude'] = np.nan
df1['Longitude'] = np.nan

geolocator = Nominatim(user_agent='ca_explorer')

for code in df1.index:
    add = f'{code}, Toronto, Ontario'
    location = geolocator.geocode(add, timeout = 3)
    
    if location is None:
        # fall back to the coordinates in Geospatial_data.csv
        lat, lon = coord.loc[code, 'Latitude'], coord.loc[code, 'Longitude']
    else:
        lat, lon = location.latitude, location.longitude
    # .loc assigns in place and avoids chained assignment
    df1.loc[code, ['Latitude', 'Longitude']] = [lat, lon]

# count rows where either coordinate is still missing
count = int(df1[['Latitude', 'Longitude']].isna().any(axis = 1).sum())

print(f'Number of blank latitude and longitude: {count}')

Number of blank latitude and longitude: 0
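The "geocoder first, CSV second" logic can be tried out without any network access. The sketch below is only an illustration: `fake_geocode`, `lookup` and `coord_sample` are hypothetical names that stand in for the real geocoder and for **Geospatial_data.csv**, and the coordinates are copied from the results of this notebook.

```python
import pandas as pd

# Hypothetical fallback table, standing in for Geospatial_data.csv
coord_sample = pd.DataFrame(
    {'Latitude': [43.7259, 43.6543], 'Longitude': [-79.3156, -79.3606]},
    index=pd.Index(['M4A', 'M5A'], name='Postal Code'),
)

def fake_geocode(address):
    """Stand-in for a geocoder: returns (lat, lon), or None when nothing is found."""
    known = {'M4A, Toronto, Ontario': (43.7259, -79.3156)}
    return known.get(address)

def lookup(postal_code, geocode=fake_geocode, fallback=coord_sample):
    """Try the geocoder first, then fall back to the table indexed by postal code."""
    hit = geocode(f'{postal_code}, Toronto, Ontario')
    if hit is not None:
        return hit
    row = fallback.loc[postal_code]
    return (float(row['Latitude']), float(row['Longitude']))

print(lookup('M4A'))  # answered by the stand-in geocoder
print(lookup('M5A'))  # answered by the fallback table
```

Keeping the fallback in one small function makes it easy to swap in a different geocoder or a different lookup table later.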


Reset the index of **df1** to the default integer index and show the first ten rows.

In [13]:
df1.reset_index(drop = True, inplace = True)
df1.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.6535,-79.3839
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6535,-79.3839
5,M9A,Etobicoke,Islington Avenue,43.6679,-79.5322
6,M1B,Scarborough,"Malvern, Rouge",43.6535,-79.3839
7,M3B,North York,Don Mills,43.7459,-79.3522
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7064,-79.3099
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3789
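In the table above, several postal codes (M3A, M7A and M1B) share the coordinates 43.6535, -79.3839. One possible explanation, stated here as an assumption rather than a verified fact, is that the geocoder matched only the 'Toronto, Ontario' part of those addresses. A quick way to list such rows is sketched below on a few rows copied from the table.

```python
import pandas as pd

# Rows copied from the table above
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M7A', 'M1B'],
    'Latitude': [43.6535, 43.7259, 43.6543, 43.6535, 43.6535],
    'Longitude': [-79.3839, -79.3156, -79.3606, -79.3839, -79.3839],
})

# keep=False marks every member of a duplicated group, not only the repeats
shared = sample[sample.duplicated(subset=['Latitude', 'Longitude'], keep=False)]
print(shared['PostalCode'].tolist())
```

Rows flagged this way could be looked up in **coord** instead, if the shared coordinates turn out to be wrong.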


## 3. Exploring and clustering the neighborhoods in Toronto

In this part of the assignment, the `requests` and `folium` packages are used to send data requests to a third-party API, **[Foursquare](https://foursquare.com)** in our case, and to visualise data on a map. So let's install these two packages.

In [14]:
!pip install folium
!pip install requests



The libraries which will be used are imported by the following cell, for implementing tasks including:
- Data visualisation on map
- Sending requests via API
- Parsing content from JSON file
- Using Regular Expression to define string pattern
- Setting colors for data visualisation
- Building clustering model


Then, two lines of code at the end remove the limit of columns and rows to display for the tables which will be created later on.

In [15]:
import folium
import requests
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import re
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

At the start of the third part of this assignment, a copy of the final dataframe **df1** from part 2 is taken, so that the earlier work is not affected by anything done here. A quick summary of the new dataframe, named **df2**, is then printed.

In [16]:
df2 = df1.copy()

In [17]:
print(f'Number of Neighborhoods: {len(df2.Neighborhood.unique())}')
print(f'Number of Borough: {len(df2.Borough.unique())}')
print(f'Shape of df2: {df2.shape}')
df2.head()

Number of Neighborhoods: 98
Number of Borough: 10
Shape of df2: (103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.6535,-79.3839
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6535,-79.3839


First, a map of Canada with the **98 neighborhoods** of postal codes starting with **M** is plotted below. These places lie close to the shore, and a number of them cluster around Toronto.


Unsurprisingly, we will take a closer look at the neighborhoods in Toronto and see whether anything of interest turns up.


Please note that this map is centered on the mean of the coordinates in **df2**.

In [18]:
# Create a map of Canada with neighborhoods superimposed on top

neighborhoods = df2.Neighborhood
latitudes = df2.Latitude
longitudes = df2.Longitude

labels = neighborhoods
ctr_lat = latitudes.mean()
ctr_lon = longitudes.mean()

m = folium.Map(location = [ctr_lat, ctr_lon], zoom_start = 11)

for label, lat, lon in zip(labels, latitudes, longitudes):
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat,lon], 
                        popup = popup,
                        radius = 4,
                        color = 'darkblue',
                        fill = True,
                        fill_opacity=0.6,
                        parse_html=False).add_to(m)

print(f'This map is centered at {ctr_lat}, {ctr_lon} (the average coordinates of all neighborhoods).')
m

This map is centered at 43.70128073041047, -79.40187856307678 (the average coordinates of all neighborhoods).
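Centering on the mean works well when the points are spread evenly. As an alternative sketch, the bounding box of the points can be computed alongside the mean. The coordinates below are a few rows copied from **df1**; the resulting `bounds` list has the `[[south, west], [north, east]]` shape that folium's `Map.fit_bounds` accepts, although it is not called here so that the example runs without folium.

```python
import pandas as pd

pts = pd.DataFrame({
    'Latitude': [43.7259, 43.6543, 43.7185, 43.6679],
    'Longitude': [-79.3156, -79.3606, -79.4648, -79.5322],
})

# Mean of the coordinates, as used for the map above
center = [pts['Latitude'].mean(), pts['Longitude'].mean()]

# Bounding box of all points
bounds = [
    [pts['Latitude'].min(), pts['Longitude'].min()],  # south-west corner
    [pts['Latitude'].max(), pts['Longitude'].max()],  # north-east corner
]
print(center)
print(bounds)
```

Fitting the map to the bounds removes the need to pick `zoom_start` by hand.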


Next, the same map is created for the neighborhoods in Toronto only. The relevant data first needs to be prepared as a pandas dataframe.

It is assumed that boroughs whose names contain the word 'Toronto' belong to the Toronto area. Rows in **df2** whose borough meets this criterion are extracted with the aid of a regular expression and stored in a new dataframe named **toronto_df**. The summary printed at the end shows that **39 neighborhoods** from **4 boroughs** end up in **toronto_df**.

In [19]:
# Same task as above but only for Toronto neighborhoods
borough_list = df2.Borough.unique()

# boroughs whose name contains the word 'Toronto'
toronto = [borough for borough in borough_list
           if re.search(r'Toronto\b', borough) or re.search(r'\bToronto', borough)]

# DataFrame.append was removed in pandas 2.0, so filter with a boolean mask instead
toronto_df = df2[df2.Borough.isin(toronto)].reset_index(drop = True)

print(f'In Toronto there are {len(toronto_df.Borough.unique())} boroughs and {len(toronto_df.Neighborhood.unique())} neighborhoods.')
print(f'The shape of toronto_df: {toronto_df.shape}')

In Toronto there are 4 boroughs and 39 neighborhoods.
The shape of toronto_df: (39, 5)
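The borough filter can also be expressed directly on the column with `str.contains`. This is a minimal sketch on borough names taken from the tables above; using a single `\bToronto\b` pattern is a simplification chosen for the example, not the exact rule used in the cell above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto',
                'Etobicoke', 'Scarborough', 'East York'],
})

# \b anchors the match to the whole word 'Toronto'
is_toronto = sample['Borough'].str.contains(r'\bToronto\b', regex=True)
toronto_sample = sample[is_toronto].reset_index(drop=True)
print(toronto_sample['Borough'].tolist())
```

The boolean mask selects the matching rows in one step, without building an intermediate list of borough names.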


Now all the inputs required for the map are in place. The map is plotted for Toronto in the same way as above, and it is again centered on the average of all coordinates in **toronto_df**.

In [20]:
t_neighborhoods = toronto_df.Neighborhood
t_latitudes = toronto_df.Latitude
t_longitudes = toronto_df.Longitude

t_labels = t_neighborhoods
t_ctr_lat = t_latitudes.mean()
t_ctr_lon = t_longitudes.mean()

t_m = folium.Map(location = [t_ctr_lat, t_ctr_lon], zoom_start = 11)

for label, lat, lon in zip(t_labels, t_latitudes, t_longitudes):
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat,lon], 
                        popup = popup,
                        radius = 4,
                        color = 'darkblue',
                        fill = True,
                        fill_opacity=0.6,
                        parse_html=False).add_to(t_m)

print(f'This map is showing all neighborhoods in Toronto and is centered at {t_ctr_lat}, {t_ctr_lon}.')
t_m

This map is showing all neighborhoods in Toronto and is centered at 43.667380561772696, -79.38911187046659.


In [21]:
from IPython.display import HTML
from IPython.display import display

# Taken from https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook
tag = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
To show/hide this cell's raw code input, click <a href="javascript:code_toggle()">here</a>.''')
display(tag)

# Assign CLIENT_ID & CLIENT_SECRET for Foursquare API
# (placeholders: substitute your own credentials and keep them out of version control)

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'

In the following step, a function is defined for the ETL task of transferring venue data for the 39 Toronto neighborhoods from **[Foursquare](https://foursquare.com)** into a pandas dataframe with the required format and structure, which later feeds into the clustering analysis.


To access Foursquare data, a developer account is needed. It comes with a **CLIENT_ID** and a **CLIENT_SECRET**, both of which are required to gain access. An app then has to be created in the **Foursquare Developer Console** before requests can be sent through the API. The **[Foursquare developer page](https://developer.foursquare.com/docs/api-reference/venues/search/)** documents how to use the API from different programming languages and which data can be extracted.


In this part of the assignment, we replicate the analysis from this week's previous lab, **Segmenting and Clustering Neighborhoods in New York City**. Therefore, we only process the **venue names**, **venue latitude** and **longitude**, and **venue categories** of the Toronto neighborhoods.


The function below first checks whether the radius set for the request is greater than or equal to 1000, a value that would otherwise interrupt the function with an error thrown by the endpoint. In that case the function prints a warning about the excessive radius and no output dataframe is produced. Given a valid radius, the request parameters are assigned and the endpoint returns a response in JSON format, which is parsed and converted into a pandas dataframe with the `json_normalize` function.


Then we use this function to gather venues via Foursquare API based on the neighborhoods within **toronto_df** and assign the output dataframe into **toronto_venue_df** object.

In [22]:
# Define a function for extracting venues (incl. names, latlon, category) from Foursquare API for all neighborhoods in Toronto

def create_merged_venue_df(neighborhood, latitude, longitude, radius = 500, limit = 100):

    if radius >= 1000:
        print('radius cannot be equal to or greater than 1000.')
        return None

    venue_columns = ['venue.name', 'venue.location.lat', 'venue.location.lng', 'venue.categories']
    url = 'https://api.foursquare.com/v2/venues/explore'
    frames = []

    for neigh, lat, lon in zip(neighborhood, latitude, longitude):

        params = dict(
            client_id = CLIENT_ID,
            client_secret = CLIENT_SECRET,
            v = '20200404',
            radius = radius,
            limit = limit,
            ll = f'{lat},{lon}')

        response = requests.get(url, params).json()
        venues = response['response']['groups'][0]['items']
        if not venues:
            continue  # nothing returned for this neighborhood

        venues_df = json_normalize(venues).loc[:, venue_columns].copy()

        # keep only the name of the first category listed for each venue
        venues_df['venue.categories'] = venues_df['venue.categories'].apply(lambda cats: cats[0]['name'])

        # prepend the neighborhood name and coordinates to every venue row
        venues_df.insert(0, 'Neighborhood', neigh)
        venues_df.insert(1, 'Latitude', lat)
        venues_df.insert(2, 'Longitude', lon)
        frames.append(venues_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    merged_venues_df = pd.concat(frames, ignore_index = True)
    merged_venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'Venue_Name', 'Venue_Latitude', 'Venue_Longitude', 'Venue_Categories']
    return merged_venues_df

In [23]:
# Create a df with all neighborhood, latlon, venues, venue latlon, venue category

toronto_venue_df = create_merged_venue_df(neighborhood = toronto_df.Neighborhood, 
                                          latitude = toronto_df.Latitude, 
                                          longitude = toronto_df.Longitude, 
                                          radius = 999, 
                                          limit = 200)
print(f'{toronto_venue_df.shape[0]} venues have been extracted through Foursquare API.')
toronto_venue_df.head(5)


3243 venues have been extracted through Foursquare API.


Unnamed: 0,Neighborhood,Latitude,Longitude,Venue_Name,Venue_Latitude,Venue_Longitude,Venue_Categories
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,"Regent Park, Harbourfront",43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
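The flattening step inside the function can be tried offline. The sketch below uses a hypothetical, heavily trimmed payload shaped like `response['response']['groups'][0]['items']`; a real response carries many more fields, and only the four used in this notebook are kept here.

```python
import pandas as pd

# Hypothetical payload for illustration, trimmed to the fields used in this notebook
items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Roselle Desserts',
               'location': {'lat': 43.653447, 'lng': -79.362017},
               'categories': [{'name': 'Bakery'}]}},
]

# Nested dictionary keys become dotted column names; lists are left as they are
flat = pd.json_normalize(items)
flat = flat[['venue.name', 'venue.location.lat', 'venue.location.lng', 'venue.categories']].copy()

# Each venue carries a list of category dicts; keep the name of the first one
flat['venue.categories'] = flat['venue.categories'].apply(lambda cats: cats[0]['name'])
print(flat)
```

Working on an explicit copy and assigning the whole column at once avoids the "value is trying to be set on a copy of a slice" warning that element-by-element assignment can trigger.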


We group **toronto_venue_df** by neighborhood to show the number of venues for each neighborhood in Toronto.

In [24]:
# Count venues by neighborhood
toronto_venue_df.groupby('Neighborhood').count()

Unnamed: 0_level_0,Latitude,Longitude,Venue_Name,Venue_Latitude,Venue_Longitude,Venue_Categories
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,100,100,100,100,100,100
"Brockton, Parkdale Village, Exhibition Place",60,60,60,60,60,60
Business reply mail Processing CentrE,48,48,48,48,48,48
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst",100,100,100,100,100,100
Central Bay Street,100,100,100,100,100,100
Christie,100,100,100,100,100,100
Church and Wellesley,100,100,100,100,100,100
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,100,100,100,100,100,100
Davisville North,100,100,100,100,100,100
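`count()` repeats the same number in every column. If a single count per neighborhood is enough, `size()` is a compact alternative, sketched here on made-up rows. As a side observation, many neighborhoods above show exactly 100 venues although `limit = 200` was requested, which suggests (as an assumption) that the endpoint caps the number of results per request.

```python
import pandas as pd

sample = pd.DataFrame({
    'Neighborhood': ['Berczy Park', 'Berczy Park', 'Christie'],
    'Venue_Name': ['A', 'B', 'C'],  # placeholder venue names
})

# size() returns one count per group instead of one count per column
venue_counts = sample.groupby('Neighborhood').size().sort_values(ascending=False)
print(venue_counts)
```

Sorting the counts makes it easy to spot neighborhoods with few venues, whose frequency profiles rest on less data.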


We now prepare the data for segmenting these 39 neighborhoods based on patterns identified by the **KMeans** clustering algorithm.


As clustering is an unsupervised machine learning technique, we only need to feed the data into the model object, which is created shortly, to get the clusters. However, the variable *Venue_Categories*, which serves as the input feature, is categorical, and the model accepts only numeric values. It therefore has to be transformed into dummy variables first, one for each category value.


One problem caused the next few lines of code to behave unexpectedly. Within the *Venue_Categories* column, 'Neighborhood' exists as one of the category values. When the dummies are produced, this value becomes a column of its own and duplicates the name of the real *Neighborhood* column. To solve this, we change the venue category 'Neighborhood' to 'Neighborhood_' so that it stands as a unique column header. This assumes that 'Neighborhood' is a valid venue category rather than a typo.


As part of the troubleshooting process, several print statements are embedded in the code to confirm that each step produces the expected output. The result is named **cat_dummies**, in which '1' means that a venue belongs to a particular category and '0' means that it does not.

In [25]:
# Create a df displaying the mean of frequency of occurrence of each category for each neighborhood

# Replace one of venue category values from 'Neighborhood' to 'Neighborhood_' to prevent duplicate column headers
# (assign the result back; calling replace with inplace=True on a selected column may not update the dataframe)
toronto_venue_df['Venue_Categories'] = toronto_venue_df['Venue_Categories'].replace(to_replace = 'Neighborhood', value = 'Neighborhood_')
print(f'Number of Categories: {len(toronto_venue_df.Venue_Categories.unique())}')

# dtype=int keeps the dummies as 0/1 (newer pandas versions default to booleans)
cat_dummies = pd.get_dummies(toronto_venue_df.Venue_Categories, dtype = int)
print(f'Shape of cat_dummies df before concatenation: {cat_dummies.shape}')
col_before = list(cat_dummies.columns) # Used to check in the next cell if the layout of columns have been produced correctly

cat_dummies['Neighborhood'] = toronto_venue_df.Neighborhood
idx = list(cat_dummies.columns.values).index('Neighborhood')
print(f'Index of "Neighborhood" column: {idx}')
print(f'Two columns next to both sides of "Neighborhood" column are: {list(cat_dummies.columns[idx-2:idx+3])}')

fixed_col = [cat_dummies.columns[-1]] + list(cat_dummies.columns[:-1])
cat_dummies = cat_dummies[fixed_col]
print(f'Shape of cat_dummies df after concatenation: {cat_dummies.shape}')

col_after = list(cat_dummies.columns) # Used to check in the next cell if the layout of columns have been produced correctly
cat_dummies.head(1)

Number of Categories: 277
Shape of cat_dummies df before concatenation: (3243, 277)
Index of "Neighborhood" column: 277
Two columns next to both sides of "Neighborhood" column are: ['Yoga Studio', 'Zoo', 'Neighborhood']
Shape of cat_dummies df after concatenation: (3243, 278)


Unnamed: 0,Neighborhood,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bubble Tea Shop,Buffet,Burger Joint,Burrito Place,Bus Stop,Business Service,Butcher,Café,Cajun / Creole Restaurant,Camera Store,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Cemetery,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Church,Churrascaria,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Gym,College Lab,College Quad,College Rec Center,College Theater,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Curling Ice,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Elementary School,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hospital,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kids Store,Korean Restaurant,Lake,Latin American Restaurant,Library,Lingerie Store,Liquor Store,Lounge,Martial Arts Dojo,Massage Studio,Mattress Store,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music School,Music Store,Music Venue,Nail Salon,Neighborhood_,New American Restaurant,Nightclub,Noodle House,Office,Optical Shop,Organic Grocery,Other Great Outdoors,Other Repair Shop,Outdoors & Recreation,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pide Place,Pie Shop,Pilates Studio,Pizza Place,Playground,Plaza,Poke Place,Polish Restaurant,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Restaurant,Rock Climbing Spot,Rock Club,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soup Place,South American Restaurant,Souvlaki Shop,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Tree,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio,Zoo
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
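The name clash described above is easy to reproduce on a tiny example. The sketch below uses made-up rows in which one venue has the category 'Neighborhood'; renaming it before encoding keeps all column names unique.

```python
import pandas as pd

# Made-up rows for illustration; one venue has the category 'Neighborhood'
sample = pd.DataFrame({
    'Neighborhood': ['Christie', 'Christie', 'Davisville'],
    'Venue_Categories': ['Café', 'Neighborhood', 'Park'],
})

# Rename the clashing category before encoding
cats = sample['Venue_Categories'].replace('Neighborhood', 'Neighborhood_')

# One 0/1 column per category, in sorted order
dummies = pd.get_dummies(cats, dtype=int)

# Put the label column first; insert() refuses to add a column name that already exists
dummies.insert(0, 'Neighborhood', sample['Neighborhood'])
print(dummies.columns.tolist())
```

Without the rename, a plain column assignment would silently overwrite the 'Neighborhood' dummy column, whereas `insert` would raise an error, which makes the clash visible early.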


The following cell is also not part of the main analysis. It checks whether **cat_dummies** was created correctly, with a focus on the *Neighborhood* column.

In [26]:
# Check if the layout of columns have been produced correctly from the last cell

for i in np.arange(len(col_before)):
    if col_before[i] == col_after[i+1]:
        pass
    else:
        print(f'There is issue with column {col_before[i]} in cat_dummies df before processing.')
        break
        
print('List of columns before and after getting dummies and merging "Neighborhood" column: ')
print()
print(f'Before: {len(col_before)} / First column: "{col_before[0]}" / Last column: "{col_before[-1]}"')
print(f'After: {len(col_after)} / First column: "{col_after[0]}" / Last column: "{col_after[-1]}"')

List of columns before and after getting dummies and merging "Neighborhood" column: 

Before: 277 / First column: "American Restaurant" / Last column: "Zoo"
After: 278 / First column: "Neighborhood" / Last column: "Zoo"


Then we group **cat_dummies** by *Neighborhood* and take the mean of each venue category column, which gives the relative frequency of every venue category in each neighborhood.

In [27]:
cat_grouped = cat_dummies.groupby('Neighborhood').mean().reset_index()
print(f'Shape of cat_grouped df: {cat_grouped.shape}')
cat_grouped.head()

Shape of cat_grouped df: (39, 278)


Unnamed: 0,Neighborhood,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bubble Tea Shop,Buffet,Burger Joint,Burrito Place,Bus Stop,Business Service,Butcher,Café,Cajun / Creole Restaurant,Camera Store,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Cemetery,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Church,Churrascaria,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Gym,College Lab,College Quad,College Rec Center,College Theater,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Curling Ice,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Elementary School,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hospital,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kids Store,Korean Restaurant,Lake,Latin American Restaurant,Library,Lingerie Store,Liquor Store,Lounge,Martial Arts Dojo,Massage Studio,Mattress Store,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music School,Music Store,Music Venue,Nail Salon,Neighborhood_,New American Restaurant,Nightclub,Noodle House,Office,Optical Shop,Organic Grocery,Other Great Outdoors,Other Repair Shop,Outdoors & Recreation,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pide Place,Pie Shop,Pilates Studio,Pizza Place,Playground,Plaza,Poke Place,Polish Restaurant,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Restaurant,Rock Climbing Spot,Rock Club,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soup Place,South American Restaurant,Souvlaki Shop,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Tree,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio,Zoo
0,Berczy Park,0.01,0.0,0.0,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.01,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.016667,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.016667,0.033333,0.0,0.0,0.0,0.0,0.066667,0.0,0.016667,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.033333,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0
2,Business reply mail Processing CentrE,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0625,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.041667,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104167,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, ...",0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.02,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0
4,Central Bay Street,0.02,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.01,0.02,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.04,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0


With the figures in the **cat_grouped** dataframe, we create another dataframe named **top_venues_df** that shows the 10 most common venue categories for each Toronto neighborhood, ranked by relative frequency.


The code below first creates a list of 10 column labels (*1st Most Common Venue* to *10th Most Common Venue*). Then the row of each neighborhood in **cat_grouped** is extracted, its venue frequencies are sorted in descending order, and the 10 categories with the highest frequencies are inserted into **top_venues_df**.
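
As a compact illustration of the sort-and-take-the-top step described above, here is a minimal sketch on a toy frame. The frame `toy` and the helper `top_n_categories` are hypothetical names used only for this example; they are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for cat_grouped: one row per neighborhood, one column per category
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Bakery': [0.10, 0.50],
    'Bar': [0.60, 0.20],
    'Park': [0.30, 0.30],
})

def top_n_categories(row, n):
    """Return the n category names with the highest frequency in one row."""
    freqs = row.drop('Neighborhood').astype(float)
    return freqs.sort_values(ascending=False).index[:n].tolist()

# Map each neighborhood to its two most frequent categories
top2 = {row['Neighborhood']: top_n_categories(row, 2) for _, row in toy.iterrows()}
print(top2)
```

Categories with equal frequency may come out in either order, so tied ranks should not be over-interpreted.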

In [28]:
# Create a df showing top 10 most common venue categories

def ord_string(n):
    # Return the ordinal suffix for rank n+1 (n is a 0-based index from 0 to 9)
    if int(n)+1 == 1: return 'st'
    elif int(n)+1 == 2: return 'nd'
    elif int(n)+1 == 3: return 'rd'
    else: return 'th'

# Column labels: '1st Most Common Venue' ... '10th Most Common Venue'
col_list = [str(i+1) + ord_string(i) + ' Most Common Venue' for i in range(10)]

# Collect one list of top 10 categories per neighborhood
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
top_rows = []
for neigh in cat_grouped['Neighborhood']:
    temp = cat_grouped[cat_grouped['Neighborhood']==neigh].T.reset_index()
    temp = temp.iloc[1:]                      # drop the row holding the neighborhood name
    temp.columns = 'venue', 'freq'
    temp = temp.sort_values(by='freq', ascending=False).reset_index(drop=True)
    top_rows.append(temp.iloc[0:10,0].tolist())

top_venues_df = pd.DataFrame(top_rows, columns=col_list)
top_venues_df['Neighborhood'] = cat_grouped['Neighborhood']
col_list = ['Neighborhood'] + col_list
top_venues_df = top_venues_df[col_list]
top_venues_df.head(5)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Art Gallery,Park,Deli / Bodega,Gastropub,Cocktail Bar
1,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Breakfast Spot,Restaurant,Gift Shop,Tibetan Restaurant,Coffee Shop,Park,Athletics & Sports,Tea Room
2,Business reply mail Processing CentrE,Park,Pizza Place,Brewery,Sushi Restaurant,Coffee Shop,Italian Restaurant,Pet Store,Fast Food Restaurant,Harbor / Marina,Breakfast Spot
3,"CN Tower, King and Spadina, Railway Lands, ...",Coffee Shop,Italian Restaurant,Hotel,French Restaurant,Restaurant,Sushi Restaurant,Pizza Place,Yoga Studio,Café,Spa
4,Central Bay Street,Coffee Shop,Japanese Restaurant,Park,Café,Italian Restaurant,Art Gallery,Clothing Store,Cosmetics Shop,Theater,Bookstore
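
The ordinal labels in the column headers above only need to cover ranks 1 to 10. If more ranks were ever wanted, ranks such as 11, 12 and 13 take "th" while 21 takes "st". The following is a small sketch of a more general helper; the name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(rank):
    """Return rank with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= rank % 100 <= 20:
        # 10th to 20th (and 110th to 120th, etc.) always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(rank % 10, 'th')
    return f'{rank}{suffix}'

# Rebuild the ten column labels used for the top venues table
labels = [f'{ordinal(r)} Most Common Venue' for r in range(1, 11)]
print(labels[0], '|', labels[9])
```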


Here we use the `KMeans` algorithm from the *scikit-learn* library to segment the neighborhoods. The number of clusters is set to 4, and the number of neighborhoods in each cluster is shown below.

In [29]:
# Run clustering analysis on neighborhoods based on the mean of frequency of occurrence of each category with kmeans
k = 4
kmeans = KMeans(n_clusters=k, init='k-means++').fit(cat_grouped.iloc[:,1:])
pd.Series(kmeans.labels_).value_counts()

1    17
3    12
0     9
2     1
dtype: int64
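
Because k-means starts from a random initialisation, the label numbers and cluster sizes above may differ from run to run. The sketch below shows one way to make a run repeatable by fixing `random_state`. The toy matrix `X`, the helper `fit_labels` and the seed value 0 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: two well-separated groups of three points each
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
              [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])

def fit_labels(X, k, seed):
    """Fit k-means with a fixed seed and return the cluster label of each row."""
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=seed)
    return model.fit(X).labels_

# Two fits with the same seed give identical labels
labels_a = fit_labels(X, k=2, seed=0)
labels_b = fit_labels(X, k=2, seed=0)
print(labels_a, (labels_a == labels_b).all())
```

The same idea would apply to the notebook's model by passing a `random_state` value to `KMeans`, although the resulting labels would not necessarily match the ones printed above.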

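Now we know how many neighborhoods there are in each cluster and the top 10 venues of each neighborhood. The next step is to put the result on a map to visualise the pattern, which first requires preparing the data for the map to be created with the *folium* library. We add the **Cluster** label to a copy of **toronto_df**, which already holds **Borough**, **Latitude** and **Longitude** for each neighborhood, and then merge in the top 10 venue columns from **top_venues_df** with the code below. Note that `kmeans.labels_` follows the row order of **cat_grouped**, so assigning it by position is correct only if **toronto_df** lists the neighborhoods in the same order. The resulting dataframe is named **toronto_merged_df**.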

In [30]:
# Merge df to show borough, neighborhood, latlon, cluster label, top 10 most common venues
toronto_merged_df = toronto_df.copy()
toronto_merged_df['Cluster'] = kmeans.labels_
toronto_merged_df = toronto_merged_df.merge(right=top_venues_df, how='inner', on='Neighborhood')
toronto_merged_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606,1,Coffee Shop,Park,Café,Pub,Theater,Diner,Restaurant,Breakfast Spot,Bakery,Italian Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6535,-79.3839,3,Coffee Shop,Café,Japanese Restaurant,Restaurant,Clothing Store,American Restaurant,Gym,Gastropub,Hotel,Beer Bar
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3789,0,Coffee Shop,Clothing Store,Gastropub,Restaurant,Japanese Restaurant,Italian Restaurant,Theater,Bookstore,Tea Room,Café
3,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754,1,Coffee Shop,Café,Restaurant,Bakery,Italian Restaurant,Clothing Store,Japanese Restaurant,Gastropub,American Restaurant,Theater
4,M4E,East Toronto,The Beaches,43.6764,-79.293,1,Pub,Coffee Shop,Pizza Place,Beach,Breakfast Spot,Japanese Restaurant,Caribbean Restaurant,Burger Joint,Health Food Store,Café
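
One way to avoid relying on row order when attaching cluster labels is to pair each label with its neighborhood name first and then join on that name. The sketch below uses toy frames: `features` stands in for **cat_grouped**, `locations` for **toronto_df**, and all names and values are hypothetical.

```python
import pandas as pd

# `features` plays the role of cat_grouped (the rows that were clustered)
features = pd.DataFrame({'Neighborhood': ['Alpha', 'Beta', 'Gamma']})
labels = [0, 1, 0]  # one label per row of `features`, in the same order

# `locations` plays the role of toronto_df and uses a different row order
locations = pd.DataFrame({
    'Neighborhood': ['Gamma', 'Alpha', 'Beta'],
    'Latitude': [43.3, 43.1, 43.2],
})

# Pair each label with its neighborhood name, then join on the name
labelled = features[['Neighborhood']].assign(Cluster=labels)
merged = locations.merge(labelled, on='Neighborhood', how='inner')
print(merged)
```

With a key-based join, each neighborhood keeps the label of its own feature row regardless of how the two frames are sorted.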


Because accuracy is vital in data science, the following code cross-references the result in **toronto_merged_df** against the total venue counts of each neighborhood in **cat_dummies** to catch any errors. (The value assigned to the variable `i` selects the row of **toronto_merged_df** to examine and can be changed to any number between 0 and 38, as there are 39 neighborhoods in **toronto_merged_df**.)

In [31]:
# The following code checks whether toronto_merged_df has been generated correctly
# Change i to any number between 0 and 38 to pick a neighborhood and cross-reference its result against cat_dummies df

i = 10
n = toronto_merged_df.Neighborhood[i]
cat_sum_df = cat_dummies.groupby('Neighborhood').sum().reset_index(drop=False)
print(f'Neighborhood on toronto_merged_df: {n}')
print(toronto_merged_df.iloc[i,6:].T)
print()

temp = cat_sum_df[cat_sum_df.Neighborhood==n].T.reset_index()
print(f'Neighborhood on cat_dummies: {temp.iloc[0,1]}')
temp = temp[1:]
temp.columns = 'venue', 'count'
temp.sort_values('count', ascending=False, inplace=True)
# Show the 10 categories with the highest counts (the original row labels are kept)
temp[:10]

Neighborhood on toronto_merged_df: Harbourfront East,  Union Station,  Toronto Islands
1st Most Common Venue             Coffee Shop
2nd Most Common Venue                   Hotel
3rd Most Common Venue                    Café
4th Most Common Venue              Restaurant
5th Most Common Venue          Scenic Lookout
6th Most Common Venue     Japanese Restaurant
7th Most Common Venue                 Brewery
8th Most Common Venue                 Theater
9th Most Common Venue                    Park
10th Most Common Venue           Concert Hall
Name: 10, dtype: object

Neighborhood on cat_dummies: Harbourfront East,  Union Station,  Toronto Islands


Unnamed: 0,venue,count
59,Coffee Shop,11
140,Hotel,7
42,Café,4
215,Restaurant,4
223,Scenic Lookout,3
148,Japanese Restaurant,3
33,Brewery,3
257,Theater,3
192,Park,3
68,Concert Hall,3
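
The manual comparison above could also be automated. Since the mean frequency of a category is its count divided by the number of venues in the neighborhood, ranking by count and ranking by mean should agree. A minimal sketch on a toy one-hot table follows; `toy_dummies` and `ranked_categories` are hypothetical names, and tied categories may be ordered differently between the two rankings.

```python
import pandas as pd

# Toy one-hot venue table standing in for cat_dummies (one row per venue)
toy_dummies = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'A', 'A'],
    'Coffee Shop': [1, 1, 1, 0, 0, 0],
    'Hotel': [0, 0, 0, 1, 1, 0],
    'Park': [0, 0, 0, 0, 0, 1],
})

def ranked_categories(table):
    """Rank categories from highest to lowest value for a single-row table."""
    return table.iloc[0].sort_values(ascending=False).index.tolist()

counts = toy_dummies.groupby('Neighborhood').sum()   # venue counts per category
freqs = toy_dummies.groupby('Neighborhood').mean()   # relative frequencies

# Both rankings should list the categories in the same order
by_count = ranked_categories(counts.loc[['A']])
by_freq = ranked_categories(freqs.loc[['A']])
same_order = by_count == by_freq
print(by_count, same_order)
```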


The code below produces the same map as the one for **toronto_df**, but uses **toronto_merged_df** so that the 39 neighborhoods are coloured by their assigned cluster. The map is centred on the average of all coordinates in **toronto_merged_df**.

In [32]:
# Visualise clusters of neighborhoods on map

tm_neighborhoods = toronto_merged_df.Neighborhood
tm_latitudes = toronto_merged_df.Latitude
tm_longitudes = toronto_merged_df.Longitude
tm_cluster = toronto_merged_df.Cluster

tm_labels = tm_neighborhoods
tm_ctr_lat = tm_latitudes.mean()
tm_ctr_lon = tm_longitudes.mean()

rainbow = cm.rainbow(np.linspace(0,1,len(toronto_merged_df.Cluster.unique())))

tm_m = folium.Map(location = [tm_ctr_lat, tm_ctr_lon], zoom_start = 12)

for label, lat, lon, cluster in zip(tm_labels, tm_latitudes, tm_longitudes, tm_cluster):
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat,lon],
                        popup = popup,
                        radius = 5,
                        color = colors.to_hex(rainbow[cluster]),  # one colour per cluster label
                        fill = True,
                        fill_opacity=0.6).add_to(tm_m)

print(f'This map is showing all neighborhoods in Toronto and is centered at {tm_ctr_lat}, {tm_ctr_lon}.')
tm_m

This map is showing all neighborhoods in Toronto and is centered at 43.667380561772696, -79.38911187046659.
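
The map cell picks a colour by indexing an array with the cluster label, which relies on the labels being exactly 0 to k-1. A dictionary keyed by label does not need that assumption. The sketch below builds such a palette with the standard-library `colorsys` module in place of the matplotlib colormap used in the notebook; the helper `cluster_colours` is a hypothetical name.

```python
import colorsys

def cluster_colours(labels):
    """Map each distinct cluster label to a hex colour spread evenly over the hue wheel."""
    unique = sorted(set(labels))
    palette = {}
    for position, label in enumerate(unique):
        hue = position / len(unique)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 0.9)
        palette[label] = '#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255))
    return palette

# Four distinct labels give four distinct colours
palette = cluster_colours([1, 3, 0, 1, 2])
print(palette)
```

The resulting hex strings could be passed as the `color` argument of a marker, looked up by each neighborhood's cluster label.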


Finally, we would like an overview of the top 10 venues for each cluster. The function below wraps this task so that the same process can be repeated for every cluster. It does nothing more than select the rows and relevant columns of **toronto_merged_df** for the chosen cluster. Since the number of clusters was previously set to 4, we run this function four times to show the full result for all four clusters.

In [33]:
# Define cluster_df function to show top 10 venues of chosen cluster
def cluster_df(cluster):
    # Boolean mask selecting the rows that belong to the chosen cluster
    row = toronto_merged_df['Cluster']==cluster
    # Keep Neighborhood, Cluster and the top 10 venue columns
    col = ['Neighborhood'] + list(toronto_merged_df.columns[5:])
    k = len(toronto_merged_df.Cluster.unique())
    result = toronto_merged_df.loc[row,col]
    print(f'Total Number of Clusters: {k}')
    print(f'There are {result.shape[0]} neighborhoods in cluster {cluster}.')
    return result

In [34]:
# Change the parameter in the following function to show the specified cluster.
cluster_df(0)

Total Number of Clusters: 4
There are 9 neighborhoods in cluster 0.


Unnamed: 0,Neighborhood,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Garden District, Ryerson",0,Coffee Shop,Clothing Store,Gastropub,Restaurant,Japanese Restaurant,Italian Restaurant,Theater,Bookstore,Tea Room,Café
6,Central Bay Street,0,Coffee Shop,Japanese Restaurant,Park,Café,Italian Restaurant,Art Gallery,Clothing Store,Cosmetics Shop,Theater,Bookstore
12,"The Danforth West, Riverdale",0,Greek Restaurant,Coffee Shop,Café,Pub,Pizza Place,Italian Restaurant,Fast Food Restaurant,Bank,Bakery,Restaurant
16,"Commerce Court, Victoria Hotel",0,Coffee Shop,Hotel,Café,Restaurant,Japanese Restaurant,Gastropub,Bakery,Beer Bar,Concert Hall,Seafood Restaurant
20,Davisville North,0,Coffee Shop,Italian Restaurant,Pizza Place,Café,Dessert Shop,Pharmacy,Sushi Restaurant,Restaurant,Gym,Fast Food Restaurant
26,Davisville,0,Coffee Shop,Italian Restaurant,Sushi Restaurant,Dessert Shop,Middle Eastern Restaurant,Pizza Place,Café,Pub,Indian Restaurant,Gym
30,"Kensington Market, Chinatown, Grange Park",0,Café,Bar,Vegetarian / Vegan Restaurant,Coffee Shop,Art Gallery,Vietnamese Restaurant,Mexican Restaurant,Chinese Restaurant,Dumpling Restaurant,Burger Joint
33,Rosedale,0,Coffee Shop,Park,Grocery Store,Pie Shop,Metro Station,Breakfast Spot,Sandwich Place,Trail,Bank,Filipino Restaurant
35,"St. James Town, Cabbagetown",0,Park,Coffee Shop,Diner,Japanese Restaurant,Gay Bar,Thai Restaurant,Pub,Restaurant,Gastropub,Men's Store


In [35]:
cluster_df(1)

Total Number of Clusters: 4
There are 17 neighborhoods in cluster 1.


Unnamed: 0,Neighborhood,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",1,Coffee Shop,Park,Café,Pub,Theater,Diner,Restaurant,Breakfast Spot,Bakery,Italian Restaurant
3,St. James Town,1,Coffee Shop,Café,Restaurant,Bakery,Italian Restaurant,Clothing Store,Japanese Restaurant,Gastropub,American Restaurant,Theater
4,The Beaches,1,Pub,Coffee Shop,Pizza Place,Beach,Breakfast Spot,Japanese Restaurant,Caribbean Restaurant,Burger Joint,Health Food Store,Café
7,Christie,1,Korean Restaurant,Café,Coffee Shop,Grocery Store,Cocktail Bar,Mexican Restaurant,Ice Cream Shop,Ethiopian Restaurant,Japanese Restaurant,Indian Restaurant
8,"Richmond, Adelaide, King",1,Coffee Shop,Restaurant,Café,Hotel,Theater,Japanese Restaurant,Gym,Vegetarian / Vegan Restaurant,Clothing Store,Gastropub
9,"Dufferin, Dovercourt Village",1,Café,Coffee Shop,Park,Sushi Restaurant,Bar,Bakery,Pharmacy,Gourmet Shop,Grocery Store,Brewery
11,"Little Portugal, Trinity",1,Café,Bar,Bakery,Pizza Place,Cocktail Bar,Dessert Shop,Mexican Restaurant,Vegetarian / Vegan Restaurant,American Restaurant,Furniture / Home Store
13,"Toronto Dominion Centre, Design Exchange",1,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Concert Hall,Bar,Italian Restaurant,Gastropub,Park
14,"Brockton, Parkdale Village, Exhibition Place",1,Café,Bakery,Breakfast Spot,Restaurant,Gift Shop,Tibetan Restaurant,Coffee Shop,Park,Athletics & Sports,Tea Room
21,Forest Hill North & West,1,Park,Coffee Shop,Bank,Café,Japanese Restaurant,Italian Restaurant,Pharmacy,Liquor Store,Trail,Burger Joint


In [36]:
cluster_df(2)

Total Number of Clusters: 4
There are 1 neighborhoods in cluster 2.


Unnamed: 0,Neighborhood,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,2,Bookstore,College Quad,Café,Coffee Shop,Gym / Fitness Center,College Gym,Trail,Park,Pakistani Restaurant,Outdoors & Recreation


In [37]:
cluster_df(3)

Total Number of Clusters: 4
There are 12 neighborhoods in cluster 3.


Unnamed: 0,Neighborhood,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Queen's Park, Ontario Provincial Government",3,Coffee Shop,Café,Japanese Restaurant,Restaurant,Clothing Store,American Restaurant,Gym,Gastropub,Hotel,Beer Bar
5,Berczy Park,3,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Art Gallery,Park,Deli / Bodega,Gastropub,Cocktail Bar
10,"Harbourfront East, Union Station, Toronto Is...",3,Coffee Shop,Hotel,Café,Restaurant,Scenic Lookout,Japanese Restaurant,Brewery,Theater,Park,Concert Hall
15,"India Bazaar, The Beaches West",3,Coffee Shop,Pub,Beach,Park,Japanese Restaurant,Pizza Place,BBQ Joint,Bakery,Tea Room,Bar
17,Studio District,3,Coffee Shop,Bar,Café,American Restaurant,Bakery,Brewery,Diner,Vietnamese Restaurant,French Restaurant,Italian Restaurant
19,Roselawn,3,Sushi Restaurant,Café,Pharmacy,Coffee Shop,Italian Restaurant,Gym,Bank,Gastropub,Skating Rink,Gym Pool
22,"High Park, The Junction South",3,Café,Bar,Coffee Shop,Thai Restaurant,Italian Restaurant,Convenience Store,Grocery Store,Sushi Restaurant,Park,Ice Cream Shop
28,"Runnymede, Swansea",3,Café,Coffee Shop,Pub,Pizza Place,Park,Bakery,Sushi Restaurant,Diner,Restaurant,Scenic Lookout
32,"CN Tower, King and Spadina, Railway Lands, ...",3,Coffee Shop,Italian Restaurant,Hotel,French Restaurant,Restaurant,Sushi Restaurant,Pizza Place,Yoga Studio,Café,Spa
34,Stn A PO Boxes,3,Coffee Shop,Café,Japanese Restaurant,Restaurant,Beer Bar,Bakery,Gastropub,Hotel,American Restaurant,Creperie
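
Instead of calling the function once per cluster by hand, the clusters can also be summarised in a single pass. The sketch below works on a toy frame standing in for **toronto_merged_df**; the frame `toy_merged` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for toronto_merged_df with a single venue column
toy_merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D', 'E'],
    'Cluster': [0, 1, 0, 1, 1],
    '1st Most Common Venue': ['Coffee Shop', 'Park', 'Coffee Shop', 'Pub', 'Park'],
})

# Number of neighborhoods per cluster, in label order
sizes = toy_merged['Cluster'].value_counts().sort_index()

# Most frequent top-ranked venue within each cluster
leading = toy_merged.groupby('Cluster')['1st Most Common Venue'].agg(lambda s: s.mode().iloc[0])

for label in sizes.index:
    print(f'cluster {label}: {sizes[label]} neighborhoods, leading venue: {leading[label]}')
```

A summary like this gives a quick feel for what distinguishes the clusters before reading the full tables.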
