# Exploring Toronto Neighborhoods

## 1. Introduction

Toronto is the capital of the province of Ontario and the most populous city in Canada, with a population of over 2.5 million as of 2016. It is also known as the financial capital of Canada and as one of the most multicultural and multiracial cities in the world. More than half of its residents belong to a visible minority group, and the city has many ethnic neighborhoods, including Chinatowns, Little Italy, Little India, Little Portugal, Little Jamaica, and many more.

In a diverse city like Toronto, each minority group has brought its traditional cuisine from its home country and keeps it as an expression of cultural identity. Recently I read an article on how the neighborhoods of New York City can be clustered based on their popular restaurants. This gave me the idea of conducting a similar study for the city of Toronto. In this project I look at the popular restaurants in different neighborhoods of Toronto and investigate the relation between food diversity and location. I also investigate which activities are popular in each area and what potential investment opportunities each area offers.

The stakeholders are investors who are interested in opening a new business and can use this analysis to evaluate the risks and opportunities. Readers who want to know more about the city's cultural diversity may also benefit from this study.

## 2. Data

### 2.1. Neighborhoods and Boroughs of Toronto
In this project we use the *List of postal codes of Canada: M* ([Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) to find all the neighborhoods of Toronto and their respective boroughs. That page tabulates the postal codes with their boroughs and neighborhoods as follows (only the first five rows are shown here):

|Postal Code|Borough|Neighborhood|
|-----|-----|-----|
|M1A|Not assigned|Not assigned|
|M2A|Not assigned|Not assigned|
|M3A|North York|Parkwoods|
|M4A|North York|Victoria Village|
|M5A|Downtown Toronto|Regent Park, Harbourfront|

### 2.2. Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML, XML, and other markup languages. When a website shows data you are interested in but offers no direct download link, Beautiful Soup can be used as a web scraping tool to pull the data out and clean it. In this project, since there was no way to download the above table directly, I used Beautiful Soup to extract the table from the Wikipedia page, clean it, and convert it to a dataframe that can be easily manipulated.
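
As a minimal sketch of the kind of extraction described here, the snippet below parses a small hand-written HTML table (a hypothetical stand-in for the Wikipedia markup, not the live page) and collects the cell text of each data row. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (hypothetical markup, not the live page).
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser backend
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Skipping rows without `<td>` cells keeps the header out of the data, which is one way to avoid writing an empty row to the output file.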

### 2.3. Foursquare API
As a location-based social service, Foursquare provides diverse information about venues. Foursquare users can share their opinions in the application and comment on the quality of the service they have received. The Foursquare API allows developers to interact with the Foursquare platform and obtain the information they require. In this project, I used the Foursquare API to get a list of venues created by Foursquare users in each neighborhood.

### 2.4. OpenCage Geocoder
To get the venues of a neighborhood from the Foursquare API, we need the coordinates of its location. Since the Wikipedia page only gives us the names of the neighborhoods, we have to look the coordinates up separately. OpenCage Geocoder is a geocoding service with a Python client library that converts an address to its latitude-longitude coordinates.

## 3. Methodology
Install and import all the required packages.

In [1]:
#!pip install beautifulsoup4
#!pip install lxml
#!pip install html5lib
#!pip install requests
#!pip install opencage
#!pip install folium

from bs4 import BeautifulSoup
import requests
import csv

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np 

from opencage.geocoder import OpenCageGeocode
from pprint import pprint

import folium # map rendering library

import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

from sklearn.cluster import KMeans # import k-means from clustering stage
from sklearn.metrics import silhouette_score

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm # Matplotlib and associated plotting modules
import matplotlib.colors as colors # Matplotlib and associated plotting modules

from collections import Counter # count occurrences 

import numpy as np # library to handle data in a vectorized manner
import pickle    # to save dataframe on the disk and load it for the second time run

### 3.1. Create the Toronto Data Set
In this section, we are going to create a data set for the city of Toronto.
#### 3.1.1. Neighborhoods and Boroughs of Toronto

In [2]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text #get the wikipedia page source
soup = BeautifulSoup(source, 'lxml') # parse the page source
table = soup.find('table') #find the table in the source
pcodes = table.tbody.find_all('tr') 

In [3]:
# save the table in a csv file
csv_file = open('table_file.csv', 'w', newline='')  # newline='' is recommended by the csv module docs
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['PostalCode' , 'Borough' , 'Neighborhood']) #write the first row of the file as titles

33

In [4]:
for item in pcodes:
    subitem_list = []
    for subitem in item.find_all('td'):
        subitem_list.append( subitem.text.split( "\n")[0] )
    print(subitem_list)
    csv_writer.writerow(subitem_list)
csv_file.close()

[]
['M1A', 'Not assigned', '']
['M2A', 'Not assigned', '']
['M3A', 'North York', 'Parkwoods']
['M4A', 'North York', 'Victoria Village']
['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront']
['M6A', 'North York', 'Lawrence Manor, Lawrence Heights']
['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"]
['M8A', 'Not assigned', '']
['M9A', 'Etobicoke', 'Islington Avenue']
['M1B', 'Scarborough', 'Malvern, Rouge']
['M2B', 'Not assigned', '']
['M3B', 'North York', 'Don Mills']
['M4B', 'East York', 'Parkview Hill, Woodbine Gardens']
['M5B', 'Downtown Toronto', 'Garden District, Ryerson']
['M6B', 'North York', 'Glencairn']
['M7B', 'Not assigned', '']
['M8B', 'Not assigned', '']
['M9B', 'Etobicoke', 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale']
['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek']
['M2C', 'Not assigned', '']
['M3C', 'North York', 'Don Mills']
['M4C', 'East York', 'Woodbine Heights']
['M5C', 'Downtown Toronto', 'St. J

In [5]:
#convert the csv file into a dataframe
CanPostCode = pd.read_csv('table_file.csv')
print(CanPostCode.head())

  PostalCode           Borough               Neighborhood
0        M1A      Not assigned                        NaN
1        M2A      Not assigned                        NaN
2        M3A        North York                  Parkwoods
3        M4A        North York           Victoria Village
4        M5A  Downtown Toronto  Regent Park, Harbourfront


Now we need to clean the data a little bit. First we should remove all the rows for which neighborhood or borough is not defined:

In [6]:
CanPostCode.drop( CanPostCode[ CanPostCode['Borough'] == 'Not assigned'].index , inplace = True)
CanPostCode.drop( CanPostCode[ CanPostCode['Neighborhood'] == 'Not assigned'].index , inplace = True)
CanPostCode.head()

| |PostalCode|Borough|Neighborhood|
|-----|-----|-----|-----|
|2|M3A|North York|Parkwoods|
|3|M4A|North York|Victoria Village|
|4|M5A|Downtown Toronto|Regent Park, Harbourfront|
|5|M6A|North York|Lawrence Manor, Lawrence Heights|
|6|M7A|Downtown Toronto|Queen's Park, Ontario Provincial Government|
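
Note that after the round trip through the CSV file, unassigned neighborhoods appear as `NaN` rather than as the string `'Not assigned'`. As a hedged sketch on a small made-up dataframe (not the real table), one way to handle both forms at once is to convert the placeholder to `NaN` and then drop incomplete rows:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [np.nan, "Parkwoods", "Regent Park, Harbourfront"],
})

# Treat the literal placeholder as missing, then drop rows lacking either field.
cleaned = (
    df.replace("Not assigned", np.nan)
      .dropna(subset=["Borough", "Neighborhood"])
      .reset_index(drop=True)
)
print(cleaned)
```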


Note that some postal codes cover more than one neighborhood, so a single cell can list several names. To get the full list of neighborhoods, we split each neighborhood cell into multiple rows as follows:

In [7]:
CanPostCode.set_index(['PostalCode' , 'Borough'] , inplace = True) # keep PostalCode and Borough as the index
CanPostCode = CanPostCode.Neighborhood.str.split(',', expand=True) # one column per neighborhood listed in the cell
# stack the columns into rows (missing entries are dropped), then restore PostalCode and Borough as columns
CanPostCode = CanPostCode.stack().reset_index(-1, drop=True).reset_index().rename( columns = {0:"Neighborhood"})
CanPostCode.head()

| |PostalCode|Borough|Neighborhood|
|-----|-----|-----|-----|
|0|M3A|North York|Parkwoods|
|1|M4A|North York|Victoria Village|
|2|M5A|Downtown Toronto|Regent Park|
|3|M5A|Downtown Toronto|Harbourfront|
|4|M6A|North York|Lawrence Manor|
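
An alternative sketch of the same split, shown on toy data and not the approach used in the cell above, relies on `DataFrame.explode` (available in pandas 0.25 and later). It also strips the whitespace that a split on `','` leaves in front of names such as `' Harbourfront'`, which matters if duplicates are later detected by exact string comparison.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

exploded = (
    df.assign(Neighborhood=df["Neighborhood"].str.split(","))  # cell -> list of names
      .explode("Neighborhood")                                 # one row per name
      .reset_index(drop=True)
)
exploded["Neighborhood"] = exploded["Neighborhood"].str.strip()  # remove stray spaces
print(exploded)
```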


Conversely, a neighborhood can appear under more than one postal code. To obtain a list of unique neighborhoods, duplicate values are removed:


In [8]:
CanPostCode = CanPostCode.drop_duplicates(subset='Neighborhood', keep="first")
Neigh_Bor_dic = dict(zip(CanPostCode.Neighborhood,CanPostCode.Borough)) #make a dictionary whose keys are neighborhood and values are boroughs

Now let's see how many neighborhoods are in each borough.


In [9]:
CanPostCode.groupby('Borough').count()

|Borough|PostalCode|Neighborhood|
|-----|-----|-----|
|Central Toronto|16|16|
|Downtown Toronto|38|38|
|East Toronto|7|7|
|East York|6|6|
|Etobicoke|45|45|
|Mississauga|1|1|
|North York|30|30|
|Scarborough|38|38|
|West Toronto|12|12|
|York|8|8|


Note that among all the postal codes listed for the Toronto area, only one (an Amazon warehouse) belongs to the borough of Mississauga. We therefore remove Mississauga from the dataframe:

In [10]:
CanPostCode.drop( CanPostCode[ CanPostCode['Borough'] == 'Mississauga'].index , inplace = True)
CanPostCode.reset_index(inplace = True)  
CanPostCode.drop(columns = "index" , inplace = True)
CanPostCode.head()

| |PostalCode|Borough|Neighborhood|
|-----|-----|-----|-----|
|0|M3A|North York|Parkwoods|
|1|M4A|North York|Victoria Village|
|2|M5A|Downtown Toronto|Regent Park|
|3|M5A|Downtown Toronto|Harbourfront|
|4|M6A|North York|Lawrence Manor|


In [11]:
CanPostCode.groupby('Borough').count()

|Borough|PostalCode|Neighborhood|
|-----|-----|-----|
|Central Toronto|16|16|
|Downtown Toronto|38|38|
|East Toronto|7|7|
|East York|6|6|
|Etobicoke|45|45|
|North York|30|30|
|Scarborough|38|38|
|West Toronto|12|12|
|York|8|8|


In [12]:
print('Toronto area has {} neighborhoods and {} boroughs'.format( CanPostCode.shape[0] ,len(CanPostCode['Borough'].unique()) ))

Toronto area has 200 neighborhoods and 9 boroughs


#### 3.1.2. Obtain the latitudes and longitudes of the neighborhoods
In this section we use the geocoder to convert the addresses of the neighborhoods to geographical coordinates.

In [13]:
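import os

# Read the OpenCage API key from an environment variable instead of hard-coding it.
# OPENCAGE_API_KEY is an assumed variable name; set it before running the notebook.
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)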

In [14]:
Address = []
for neigh , bor in zip(CanPostCode['Neighborhood'].values.tolist() , CanPostCode['Borough'].values.tolist()):
    Address.append(neigh + ", " + bor + ", Toronto, Canada")

In [15]:
CanPostCode['Address'] = Address
CanPostCode.head()

| |PostalCode|Borough|Neighborhood|Address|
|-----|-----|-----|-----|-----|
|0|M3A|North York|Parkwoods|Parkwoods, North York, Toronto, Canada|
|1|M4A|North York|Victoria Village|Victoria Village, North York, Toronto, Canada|
|2|M5A|Downtown Toronto|Regent Park|Regent Park, Downtown Toronto, Toronto, Canada|
|3|M5A|Downtown Toronto|Harbourfront|Harbourfront, Downtown Toronto, Toronto, Canada|
|4|M6A|North York|Lawrence Manor|Lawrence Manor, North York, Toronto, Canada|


In [16]:
CanPostCode['lat'] = CanPostCode['Address'].apply(lambda item: geocoder.geocode(str(item))[0]['geometry']['lat']) #latitude column

In [None]:
CanPostCode['long'] = CanPostCode['Address'].apply(lambda item: geocoder.geocode(str(item))[0]['geometry']['lng']) #longitude column

In [None]:
# Same geocoding step as the two cells above, combined into one cell.
# It is slow, so it was run once and the result was saved with pickle (see the explanation below).
CanPostCode['lat'] = CanPostCode['Address'].apply(lambda item: geocoder.geocode(str(item))[0]['geometry']['lat']) #latitude column
CanPostCode['long'] = CanPostCode['Address'].apply(lambda item: geocoder.geocode(str(item))[0]['geometry']['lng']) #longitude column
CanPostCode.head()
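
The cells above call the geocoder twice per address, once for the latitude and once for the longitude. A minimal sketch of looking each address up only once is shown below. The helper `geocode_once` and the stand-in `fake_geocode` are hypothetical names introduced for illustration; the only assumption about the real client is the result shape already used in this notebook (`results[0]['geometry']['lat']` and `['lng']`).

```python
def geocode_once(addresses, geocode):
    """Return {address: (lat, lng)} using one lookup per distinct address.

    `geocode` is any callable that returns a list of results shaped like
    the ones used in this notebook: result['geometry']['lat'] / ['lng'].
    """
    coords = {}
    for address in addresses:
        if address in coords:
            continue  # already looked up
        results = geocode(address)
        if results:
            geometry = results[0]["geometry"]
            coords[address] = (geometry["lat"], geometry["lng"])
        else:
            coords[address] = (None, None)  # no match found
    return coords


# Stand-in geocoder for illustration only (made-up data, no network call).
calls = []

def fake_geocode(address):
    calls.append(address)
    table = {"Parkwoods, North York, Toronto, Canada": (43.76, -79.32)}
    if address in table:
        lat, lng = table[address]
        return [{"geometry": {"lat": lat, "lng": lng}}]
    return []


addresses = [
    "Parkwoods, North York, Toronto, Canada",
    "Parkwoods, North York, Toronto, Canada",  # duplicate, not looked up again
    "Nowhere, Toronto, Canada",                # unknown to the stand-in
]
coords = geocode_once(addresses, fake_geocode)
print(coords)
print(len(calls))
```

Handling an empty result list explicitly also avoids the `IndexError` that indexing `[0]` would raise when an address cannot be resolved.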

Converting addresses to their latitude and longitude coordinates can be very time-consuming. Since I did not complete the project in one sitting, at this point I save the above dataframe to disk with **pickle**. The next time I work on the rest of the code, I do not need to spend almost four minutes rebuilding it.

In [None]:
# Save the geocoded dataframe to disk.
# Run this cell only after the dataframe has been rebuilt, to avoid overwriting the cached file by accident.
with open('CanPostCode_file.pkl', 'wb') as f:
    pickle.dump(CanPostCode, f)

In [1]:
with open('CanPostCode_file.pkl', 'rb') as f:
        CanPostCode = pickle.load(f)


In [None]:
CanPostCode.head()
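
The save-then-load pattern used here can be wrapped in a small helper. The sketch below is one possible way to do it, using only the standard library; `load_or_compute` is a hypothetical name, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Load a pickled object from `path`, or build it with `compute()` and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.pkl"
    first = load_or_compute(cache, lambda: {"lat": 43.65})  # computed and saved
    second = load_or_compute(cache, lambda: {"lat": 0.0})   # loaded from disk
    print(first == second)
```

Checking for the file explicitly, rather than catching every exception, keeps unrelated errors (for example a corrupted pickle) visible.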

The map of Toronto is created with folium, a visualization library for interactive maps. You can zoom in and out and click on each circle to see the information for the respective neighborhood.

In [None]:
# create map of Toronto using latitude and longitude values
n = 1
latitude = 43.6534817
longitude = -79.3839347
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(CanPostCode['lat'], CanPostCode['long'], CanPostCode['Borough'], CanPostCode['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### 3.2. Obtain top venues in each neighborhood

After finding all the neighborhoods of Toronto, in this section we use the Foursquare API to find the venues that Foursquare users have listed for each neighborhood.


#### 3.2.1. Setting up the Foursquare API


In [None]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumed variable names.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials loaded for CLIENT_ID: ' + CLIENT_ID) # the secret is deliberately not printed
LIMIT = 100 # maximum number of venues returned by the Foursquare API
radius = 1000 # search radius in meters

The following function accepts the names and coordinates of a set of locations and, for each one, retrieves up to 100 venues within a given radius (500 meters by default) from the Foursquare API.


In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
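
The function above assembles the request URL with `str.format`. As an alternative sketch (not the method used in this notebook), the same query string can be built with the standard library's `urllib.parse.urlencode`, which also percent-encodes the values. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are copied from the function above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude as a single parameter
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials for illustration only.
url = build_explore_url("ID", "SECRET", "20180605", 43.761124, -79.324059)
print(url)
```

Keeping the parameters in a dictionary makes it harder to swap two positional arguments by accident, and the explicit `radius` default makes the 500 meter search radius visible at the call site.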

#### 3.2.2. Example
To understand how the Foursquare API works, let's get all the venues for Parkwoods, the neighborhood in the first row of the dataframe.

In [None]:
CanPostCode.head(1)

By caching results with the pickle library, we can avoid redundant requests to the Foursquare API.

In [None]:
# get all venues in Parkwoods, North York, Toronto, Canada
name_sample = "Parkwoods"
venues_list_sample=[]
lat_sample = 43.761124
lng_sample = -79.324059              
# create the API request URL
url_sample = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat_sample, 
            lng_sample, 
            radius, 
            LIMIT)
            
# make the GET request
results_sample = requests.get(url_sample).json()["response"]['groups'][0]['items']
        
# return only relevant information for each nearby venue
venues_list_sample.append([(
                        name_sample, 
                        lat_sample, 
                        lng_sample, 
                        v['venue']['name'], 
                        v['venue']['location']['lat'], 
                        v['venue']['location']['lng'],  
                        v['venue']['categories'][0]['name']) for v in results_sample])

nearby_venues_sample = pd.DataFrame([item for item in venues_list_sample[0]])
nearby_venues_sample.columns = ['Neighborhood', 
                                'Neighborhood Latitude', 
                                'Neighborhood Longitude', 
                                'Venue', 
                                'Venue Latitude', 
                                'Venue Longitude', 
                                'Venue Category']
nearby_venues_sample['Borough'] = nearby_venues_sample['Neighborhood'].apply(lambda item: Neigh_Bor_dic[item] )
nearby_venues_sample = nearby_venues_sample[ ['Borough'] + [ col for col in nearby_venues_sample.columns if col != 'Borough' ] ]

nearby_venues_sample.head(10)

#### 3.2.3. All venues
After seeing how to get the venues of one neighborhood, we now get up to 100 venues for every neighborhood.

In [None]:
try:
    with open('Toronto_venues.pkl', 'rb') as f:
        toronto_venues = pickle.load(f)
    print("---Dataframe Existed and Deserialized---")
except FileNotFoundError:
    toronto_venues = getNearbyVenues(names=CanPostCode['Neighborhood'],
                                        latitudes=CanPostCode['lat'],
                                        longitudes=CanPostCode['long']
                                       )
    with open('Toronto_venues.pkl', 'wb') as f:
        pickle.dump(toronto_venues, f)
    print("---Dataframe Created and Serialized---")

In [None]:
toronto_venues['Borough'] = toronto_venues['Neighborhood'].apply(lambda item: Neigh_Bor_dic[item] )
toronto_venues = toronto_venues[ ['Borough'] + [ col for col in toronto_venues.columns if col != 'Borough' ] ]

In [None]:
print(f'size of the resulting dataframe is {toronto_venues.shape}')
toronto_venues.head()

## 4. Data Analysis and Results
In this section, we analyze the data we have set up so far. First, we analyze the venues in each neighborhood. Then we group the neighborhoods by borough, cluster the boroughs, and compare the clusters against each other.
### 4.1. Analyze data based on Neighborhoods

Now that we have obtained the top venues of the neighborhoods, let's check how many unique venue categories exist in the area.

In [None]:
unique_venue = len(toronto_venues['Venue Category'].unique())
print(f'There are {unique_venue} unique venue categories')

toronto_venues.groupby('Venue Category')['Venue Category'].count().sort_values(ascending=False)

In [None]:
top10_venues = toronto_venues.groupby('Venue Category')['Venue Category'].count().sort_values(ascending=False).head(10).keys().values.tolist()
top10_venues

In [None]:
dict_10ven_neigh = {}
for ven in top10_venues:
    dict_10ven_neigh[ven] = {}
    for neigh in CanPostCode['Neighborhood']:
        try:
            dict_10ven_neigh[ven][neigh] = toronto_venues.groupby(['Neighborhood','Venue Category']).count().loc[neigh].loc[ven , 'Venue']
        except KeyError:  # this neighborhood has no venue in this category
            dict_10ven_neigh[ven][neigh] = 0
            
df_10ven_neigh = pd.DataFrame(dict_10ven_neigh)
df_10ven_neigh.reset_index(inplace = True) 
df_10ven_neigh.rename(columns = {"index" : "Neighborhood"} , inplace = True)

In [None]:
df_10ven_neigh['address']  = df_10ven_neigh["Neighborhood"] + " , " + df_10ven_neigh["Neighborhood"].apply(lambda item: Neigh_Bor_dic[item])
df_10ven_neigh.head()
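
The nested loop above recomputes the same `groupby` for every neighborhood and category. As a hedged alternative sketch on toy data (not the notebook's real venues), `pandas.crosstab` builds the whole count table in one pass, and `reindex` fills in zeros for neighborhoods or categories with no venues:

```python
import pandas as pd

# Toy data for illustration only.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park", "Pizza Place"],
})
all_neighborhoods = ["A", "B", "C"]       # "C" has no venues at all
top_categories = ["Coffee Shop", "Park"]  # categories of interest

counts = (
    pd.crosstab(venues["Neighborhood"], venues["Venue Category"])
      .reindex(index=all_neighborhoods, columns=top_categories, fill_value=0)
)
print(counts)
```

In the notebook, `all_neighborhoods` would correspond to `CanPostCode['Neighborhood']` and `top_categories` to `top10_venues`.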

As an example, let's see the 10 neighborhoods with the highest number of "Coffee Shop" venues.

In [None]:
df_10ven_neigh.sort_values('Coffee Shop' , ascending=False).head(10)

We loop over the *top 10 venues* to find the top 10 neighborhoods for each venue and plot them on a bar graph. 

In [None]:
fig, axes = plt.subplots(10,1)
fig.set_figheight(30)
fig.set_figwidth(30)
n = 0
for ven in top10_venues:
    df_10ven_neigh.sort_values(ven , ascending=False).head(10).plot(kind = 'barh' , x ='address' , y = ven , rot = 0 , ax = axes[n] , figsize = (5,30) )
    n += 1
plt.show()

In the same way, we can find the neighborhoods with the lowest number of venues in each of the top 10 categories. These neighborhoods may face less competition in a given category and could therefore offer more potential for investment.

In [None]:
test = {}
for ven in top10_venues:
    test[ven] = df_10ven_neigh.sort_values(ven , ascending=False).tail(20)['Neighborhood'].values.tolist()
pd.DataFrame(test)

### 4.2. Analyze data based on Boroughs
In this section, we group venues of the neighborhoods based on the borough they belong to. Then we plot the number of top venues in each borough. 

In [None]:
Boroughs = CanPostCode["Borough"].unique().tolist()
Boroughs

In [None]:
dic_test = {}
for brgh in Boroughs:
    print(brgh)
    brgh_df = CanPostCode[CanPostCode["Borough"] == brgh].reset_index(drop=True)
    brgh_venues = getNearbyVenues(names=brgh_df['Neighborhood'],
                                   latitudes=brgh_df['lat'],
                                   longitudes=brgh_df['long']
                                  )
    dic_test[brgh] = brgh_venues.groupby('Venue Category').count().sort_values("Neighborhood" , ascending=False)

In [None]:
with open('Borough_venue.pkl', 'wb') as b_v:
        pickle.dump(dic_test, b_v)        

In [None]:
with open('Borough_venue.pkl', 'rb') as b_v:
        dic_test = pickle.load(b_v)

In [None]:
dict_10ven_bor = {}
for ven in top10_venues:
    dict_10ven_bor[ven] = {}
    for bor in Boroughs:
        try:
            dict_10ven_bor[ven][bor] = toronto_venues.groupby(['Borough','Venue Category']).count().loc[bor].loc[ven , 'Venue']
        except KeyError:  # this borough has no venue in this category
            dict_10ven_bor[ven][bor] = 0
            
df_10ven_bor = pd.DataFrame(dict_10ven_bor)
df_10ven_bor.reset_index(inplace = True) 
df_10ven_bor.rename(columns = {"index" : "Borough"} , inplace = True)

In [None]:
df_10ven_bor

In [None]:
toronto_venues.head()

In [None]:
fig3, axes3 = plt.subplots(10,1)
fig3.set_figheight(30)
fig3.set_figwidth(30)
n = 0
for ven in top10_venues:
    df_10ven_bor.sort_values(ven , ascending=False).head(9).plot(kind = 'barh' , x ='Borough' , y = ven , rot = 0 , ax = axes3[n] , figsize = (5,30) )
    n += 1
plt.show()

This breakdown shows how the most common venue categories are distributed across the boroughs, which can help investors make better-informed decisions.

#### 4.2.1. Boroughs' food diversity
In this section, we narrow the type of venues down to the different food categories. This way, we can analyze the diversity of each borough based on the variety of popular food categories.

In [None]:
food_categories = ['Pizza Place' , 'Sandwich Place' , 'Fast Food Restaurant' , 'Japanese Restaurant' ,
                  'Italian Restaurant' , 'Vietnamese Restaurant' , 'Sushi Restaurant' , 'Korean Restaurant' , 
                   'Thai Restaurant'  , 'American Restaurant',
                  'Ramen Restaurant' , 'Indian Restaurant' , 'Steakhouse' , 'Seafood Restaurant',
                  'Chinese Restaurant' , 'Burrito Place' , 'Middle Eastern Restaurant' , 'Greek Restaurant' ,
                  'Asian Restaurant' , 'Fried Chicken Joint' , 'Burger Joint' , 'New American Restaurant' ,'Falafel Restaurant',
                  'Caribbean Restaurant' , 'BBQ Joint' , 'Vegetarian / Vegan Restaurant' , 'Turkish Restaurant' ]

In [None]:
toronto_venues_food = toronto_venues[toronto_venues['Venue Category'].isin(food_categories)]
toronto_venues_food.head()

In [None]:
top5_venues_food = toronto_venues_food.groupby('Venue Category')['Venue Category'].count().sort_values(ascending=False).head(5).keys().values.tolist()
top5_venues_food

#### 4.2.2. Boroughs and top 5 restaurant categories

In the previous section, we found the top five restaurant categories in Toronto. Now we count how many restaurants in each of these five categories are located in each borough.

In [None]:
dict_5food_bor = {}
for ven in top5_venues_food:
    dict_5food_bor[ven] = {}
    for bor in Boroughs:
        try:
            dict_5food_bor[ven][bor] = toronto_venues_food.groupby(['Borough','Venue Category']).count().loc[bor].loc[ven , 'Venue']
        except KeyError:  # this borough has no restaurant in this category
            dict_5food_bor[ven][bor] = 0
            
df_5food_bor = pd.DataFrame(dict_5food_bor)
df_5food_bor.reset_index(inplace = True) 
df_5food_bor.rename(columns = {"index" : "Borough"} , inplace = True)

In [None]:
df_5food_bor

In [None]:
fig4, axes4 = plt.subplots(5,1)
fig4.set_figheight(30)
fig4.set_figwidth(30)
n = 0
for ven in top5_venues_food:
    df_5food_bor.sort_values(ven , ascending=False).head(9).plot(kind = 'barh' , x ='Borough' , y = ven , rot = 0 , ax = axes4[n] , figsize = (5,30) )
    n += 1
plt.show()

In [None]:
df_5food_bor.set_index('Borough' , inplace = True)
df_5food_bor

#### Cluster Boroughs
k-means clustering is a simple, popular unsupervised machine learning algorithm that partitions data into clusters whose members share similar characteristics. We use it to group similar boroughs and to find out which boroughs share the same characteristics and how.

Determining the optimal number of clusters is a common problem in data clustering. Two widely used heuristics for estimating it are *the elbow method* and *the silhouette method*.

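**The elbow method** runs k-means clustering for a range of values of k and, for each k, records the within-cluster sum of squared distances (the `inertia_` attribute of the fitted model). The optimal k is taken at the "elbow" of the curve, where adding more clusters stops reducing this quantity substantially.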
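
For reference, the quantity that scikit-learn reports as `inertia_` is the sum of squared distances of the samples to their closest cluster center. With clusters $C_1, \dots, C_k$ and centroids $\mu_1, \dots, \mu_k$ (notation introduced here for illustration), it can be written as:

$$
W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

$W(k)$ never increases as k grows, which is why the method looks for a bend in the curve rather than a minimum.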

In [None]:
sum_of_squared_distances = []
K = range(1,8)
for k in K:
    print(k, end=' ')
    kmeans = KMeans(n_clusters=k).fit(df_5food_bor)
    sum_of_squared_distances.append(kmeans.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method For Optimal k');

Apparently, the elbow method does not reveal a clear optimal number of clusters here. Therefore, we use the other method.

**The silhouette method** measures how well each data point fits within its own cluster compared to the nearest other cluster. Scores range from -1 to 1, and higher values indicate better-separated clusters.
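
As a short worked definition (symbols introduced here for illustration): for a data point $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ the smallest mean distance from $i$ to the points of any other cluster. The silhouette coefficient of the point is then

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

and `silhouette_score` returns the mean of $s(i)$ over all samples. At least two clusters are needed for $b(i)$ to be defined, which is why the loop below starts at k = 2.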

In [None]:
sil = []
K_sil = range(2,9)
# minimum 2 clusters required, to define dissimilarity
for k in K_sil:
    print(k, end=' ')
    kmeans = KMeans(n_clusters = k).fit(df_5food_bor)
    labels = kmeans.labels_
    sil.append(silhouette_score(df_5food_bor, labels, metric = 'euclidean'))
plt.plot(K_sil, sil, 'bx-')
plt.xlabel('k')
plt.ylabel('silhouette_score')
plt.title('Silhouette Method For Optimal k')
plt.show()

Since k = 2 gives only a very coarse classification, we select k = 4, where the silhouette score has a peak.

In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=50).fit(df_5food_bor)

print(Counter(kmeans.labels_))

Now we add a cluster column to the dataframe that shows the cluster label of each borough.

In [None]:
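# add clustering labels (drop any existing 'Cluster' column first so the cell can be re-run)
if 'Cluster' in df_5food_bor.columns:
    df_5food_bor = df_5food_bor.drop(columns='Cluster')
df_5food_bor.insert(0, 'Cluster', kmeans.labels_)
df_5food_bor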

#### Let's take a closer look at each group

#### Group 1: Fast Food and Pizza Place

In [None]:
df_5food_bor[df_5food_bor["Cluster"] == 0]

Restaurants located in the boroughs of Group 1 are mainly **fast food restaurants** and **pizza places**.

#### Group 2: Italian, Asian and Japanese

In [None]:
df_5food_bor[df_5food_bor["Cluster"] == 1]

As expected, the highest concentration of restaurants is in Downtown Toronto, so Downtown Toronto forms a group by itself. **Italian**, **Asian** and **Japanese** restaurants are the three main types of restaurant in Downtown Toronto.

#### Group 3: Low Concentration of Restaurants and No Fast Food

In [None]:
df_5food_bor[df_5food_bor["Cluster"] == 2]

The concentration of restaurants in the boroughs clustered in Group 3 is low, and they have almost the same number of restaurants in each category. There are few fast food restaurants in this group.

#### Group 4: Italian Restaurant and Pizza Place

In [None]:
df_5food_bor[df_5food_bor["Cluster"] == 3]

Restaurants in Group 4 are mainly **Italian Restaurants** and **Pizza Places**. 

In [None]:
df_5food_bor.reset_index(inplace = True)

In [None]:
Cluster_Borough = pd.Series(df_5food_bor.Cluster.values,index=df_5food_bor.Borough).to_dict()
Cluster_Borough

The map of Toronto with all its neighborhoods was shown earlier in this notebook. Now we plot that map again, this time giving the same color to neighborhoods whose boroughs fall into the same cluster.

In [None]:
# set color scheme for the clusters
rainbow = ['red' , 'blue' , 'green' , 'yellow' , 'white' , 'black']
# create map of Toronto using latitude and longitude values
latitude = 43.6534817
longitude = -79.3839347
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood ,  in zip(CanPostCode['lat'], CanPostCode['long'], CanPostCode['Borough'], CanPostCode['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    cluster = Cluster_Borough[borough]
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_tor) 
    
map_tor

#### 4.2.3. Top five restaurants in each Borough

In the previous section, we compared the boroughs against the five most popular restaurant categories in Toronto. Since the concentration of restaurants varies by location, we now explore the top five restaurant categories within each borough. This gives us a better understanding of how diverse each area is.

In [None]:
dict_Bor_5food = {}
for brgh in Boroughs:
    temp = dic_test[brgh].reset_index()
    # food categories found in this borough, most frequent first
    food_cats = temp[temp['Venue Category'].isin(food_categories)]['Venue Category'].tolist()
    if len(food_cats) >= 5:  # keep only boroughs with at least five food categories
        dict_Bor_5food[brgh] = food_cats[:5]
dict_Bor_5food

In [None]:
Bor_5food_df = pd.DataFrame.from_dict(dict_Bor_5food , orient='index')
# assign the result so that the renamed columns are kept
Bor_5food_df = Bor_5food_df.rename(columns = {0: '1st most common' , 1: '2nd most common' , 2: '3rd most common' , 3: '4th most common' ,
                              4 : '5th most common'})
Bor_5food_df

#### Cluster Boroughs

We use `get_dummies` to convert the categorical variables to indicator variables so that we can analyze them with k-means clustering.

In [None]:
toronto_onehot = pd.get_dummies(Bor_5food_df)
toronto_onehot
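
To illustrate what `get_dummies` produces, here is a minimal sketch on a made-up two-borough table (the borough names and rankings are hypothetical). Each original column is expanded into one indicator column per distinct value, named `<column>_<value>`:

```python
import pandas as pd

# Toy ranking table for illustration only.
top_food = pd.DataFrame(
    {
        "1st most common": ["Pizza Place", "Italian Restaurant"],
        "2nd most common": ["Fast Food Restaurant", "Pizza Place"],
    },
    index=["Borough X", "Borough Y"],  # hypothetical boroughs
)

onehot = pd.get_dummies(top_food)
print(onehot.columns.tolist())
print(onehot)
```

One consequence worth keeping in mind: because the encoding is per column, "Pizza Place" ranked first and "Pizza Place" ranked second become two unrelated features, so two boroughs with the same categories in a different order can still look dissimilar to k-means.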

We take exactly the same steps as in the previous section to find the optimal number of clusters.

In [None]:
sum_of_squared_distances = []
K = range(1,10)
for k in K:
    print(k, end=' ')
    kmeans = KMeans(n_clusters=k).fit(toronto_onehot)
    sum_of_squared_distances.append(kmeans.inertia_)

In [None]:
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method For Optimal k');

In [None]:
sil = []
K_sil = range(2,9)
# minimum 2 clusters required, to define dissimilarity
for k in K_sil:
    print(k, end=' ')
    kmeans = KMeans(n_clusters = k).fit(toronto_onehot)
    labels = kmeans.labels_
    sil.append(silhouette_score(toronto_onehot, labels, metric = 'euclidean'))

In [None]:
plt.plot(K_sil, sil, 'bx-')
plt.xlabel('k')
plt.ylabel('silhouette_score')
plt.title('Silhouette Method For Optimal k')
plt.show()


In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=50).fit(toronto_onehot)

print(Counter(kmeans.labels_))

In [None]:
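# add clustering labels (drop any existing 'Cluster' column first so the cell can be re-run)
if 'Cluster' in Bor_5food_df.columns:
    Bor_5food_df = Bor_5food_df.drop(columns='Cluster')
Bor_5food_df.insert(0, 'Cluster', kmeans.labels_)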

In [None]:
Bor_5food_df.reset_index(inplace = True)

In [None]:
Bor_5food_df.rename(columns = {'index' : 'Borough'} , inplace = True)

In [None]:
Bor_5food_df.sort_values(['Cluster'])

#### Group 1: Italian Restaurant

In [None]:
Bor_5food_df[Bor_5food_df["Cluster"] == 0]

#### Group 2: Italian Restaurant, Pizza Place, Indian Restaurant

In [None]:
Bor_5food_df[Bor_5food_df["Cluster"] == 1]

#### Group 3: Fast Food, Sandwich Place, Pizza Place

In [None]:
Bor_5food_df[Bor_5food_df["Cluster"] == 2]

#### Group 4: Italian Restaurant, Vietnamese Restaurant

In [None]:
Bor_5food_df[Bor_5food_df["Cluster"] == 3]

We plot the map of Toronto with all neighborhoods colored according to their boroughs' clusters.

In [None]:
Cluster_Borough = pd.Series(Bor_5food_df.Cluster.values,index=Bor_5food_df.Borough).to_dict()
Cluster_Borough
# set color scheme for the clusters
rainbow = ['red' , 'blue' , 'green' , 'yellow' , 'white' , 'black']
# create map of Toronto using latitude and longitude values
latitude = 43.6534817
longitude = -79.3839347
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood ,  in zip(CanPostCode['lat'], CanPostCode['long'], CanPostCode['Borough'], CanPostCode['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    cluster = Cluster_Borough[borough]
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_tor) 
    
map_tor

## Results