# Capstone Project - The Battle of the Neighborhoods

### Applied Data Science Capstone by IBM/Coursera

## Table of contents



* Introduction: Business Problem
* Data
* Analysis
* Results and Discussion
* Conclusion

## Introduction: Business Problem

The owners of a restaurant chain in Ontario, Canada want to expand their business. They currently operate restaurants in cities such as Ottawa, Brampton and Hamilton.

They expect to make more profit by opening a restaurant in Toronto, since Toronto is the largest city in Canada. They want to open the new restaurant in a pleasant neighbourhood, but they are having trouble deciding which place within Toronto to choose.

We have to help them find places where their business will do well, where competition is low and where the surrounding neighbourhood is pleasant. They want to know about 2-3 such places so that they can decide for themselves which one is best.

## Data

__First Dataset: List of neighbourhoods in Toronto:__


First, I will use data from a Wikipedia page that lists the neighbourhoods of Toronto, Canada. I will use the web scraping library BeautifulSoup to extract the data from this page as a table. The table contains 3 columns: Postal Code, Borough and Neighbourhood. The page is at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M . After preprocessing the table and adding two more columns, the Latitude and Longitude of each neighbourhood, the dataset is ready for use. The final DataFrame will have 5 columns: Postal Code, Borough, Neighbourhood, Latitude and Longitude. It will contain 103 rows, one for each of the 103 unique neighbourhoods of Toronto, spread across 11 unique boroughs.

For example, the first row describes the neighbourhood Parkwoods in the borough of North York, with postal code M3A. The geographical coordinates of this neighbourhood are (43.753259, -79.329656).
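
The text does not say where the latitude and longitude columns come from, so the following is only a sketch of the merge step, assuming a second table keyed by postal code. The frame names `postal` and `coords` are hypothetical; the two rows reuse values shown in this notebook.

```python
import pandas as pd

# Scraped table (hypothetical frame name); rows copied from the notebook output.
postal = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Coordinates keyed by postal code (assumed to come from a separate source).
coords = pd.DataFrame({
    "Postcode": ["M4A", "M3A"],
    "latitude": [43.725882, 43.753259],
    "longitude": [-79.315572, -79.329656],
})

# A left merge keeps every scraped row, even if a coordinate is missing.
merged = postal.merge(coords, on="Postcode", how="left")
print(merged)
```

Merging on the postal code rather than concatenating by position means the row order of the two tables does not have to match.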

**Second Dataset: List of different venues in the neighbourhoods of Toronto:**


This dataset will be formed using the Foursquare API. I will use the Foursquare location data to explore different venues in each neighbourhood of Toronto. These venues can be any place. For example: Parks, Coffee Shops, Hotels, Gyms, etc. Using the Foursquare location data, I can get information about these venues and analyze the neighbourhoods of Toronto easily based on this information.

We will use the geographical coordinates from above dataset to generate this Location dataset.

**In general, I will be using these two datasets to solve the business problem of finding the best place to open a restaurant within Toronto**

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
#Importing Libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Importing the first dataset in form of a DataFrame:

In [29]:
df=pd.read_excel('data1.xlsx')


In [30]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [31]:
df.shape

(103, 5)

In [16]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


Geographical coordinates of Toronto:

In [6]:
latitude=43.6532
longitude=-79.3832

### Creating a map of Toronto with all 103 neighbourhoods marked on this map:

In [8]:
# create map of Toronto using latitude and longitude values:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['latitude'], df['longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto


Folium is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them by creating the second dataset.

Define Foursquare Credentials and Version

In [9]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your Foursquare ID; do not publish real credentials
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100
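
One way to keep credentials out of the notebook is to read them from environment variables. This is a minimal sketch; the helper `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, not something Foursquare prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"Set the environment variable {missing} first") from None
```

In the notebook this would be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` after setting both variables in the shell that starts Jupyter.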


### Explore different venues in different Neighborhoods of Toronto:

#### Let's create a function to find all nearby venues within a radius of 500 metres of each neighbourhood in Toronto:

In [10]:
LIMIT=100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
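
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, which raises an error if a response has no groups or a venue has no category. The sketch below shows a more defensive version of the flattening step. The helper `extract_venues` is hypothetical, and the sample payload is hand-written to mimic the nesting used above; it is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows for a single neighbourhood."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload mimicking the nesting used by getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 43.752, "lng": -79.33},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Parkwoods", 43.753259, -79.329656)
empty = extract_venues({"response": {}}, "No venues here", 0.0, 0.0)
print(rows)
print(empty)
```

Returning an empty list for a neighbourhood without venues keeps the loop running; such neighbourhoods then simply contribute no rows to the final dataframe.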


In [11]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )

#### toronto_venues is a dataframe that contains all the information about different neighbourhoods of Toronto along with their nearby venues like Park, Restaurant, Coffee shop, etc. It is the second dataset that we require to solve the problem:

In [14]:

toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,TTC stop #8380,43.752672,-79.326351,Bus Stop
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [17]:

tor2=toronto_venues.groupby('Neighbourhood').count()
tor2.reset_index(inplace=True)
tor2

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Adelaide, King, Richmond",94,94,94,94,94,94
1,Agincourt,4,4,4,4,4,4
2,"Agincourt North, L'Amoreaux East, Milliken, St...",2,2,2,2,2,2
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",10,10,10,10,10,10
4,"Alderwood, Long Branch",8,8,8,8,8,8
...,...,...,...,...,...,...,...
95,Willowdale West,6,6,6,6,6,6
96,Woburn,3,3,3,3,3,3
97,"Woodbine Gardens, Parkview Hill",11,11,11,11,11,11
98,Woodbine Heights,7,7,7,7,7,7


Note:
Foursquare returned no venues for 3 neighbourhoods in the df dataframe, so these 3 neighbourhoods are missing from toronto_venues (and from tor2). We therefore have to remove them from the df dataframe as well:

In [32]:
list1 =df["Neighbourhood"].to_list()
list2 =tor2['Neighbourhood'].to_list()
set1=set(list1)
set2=set(list2)

In [33]:
set1.difference(set2)

{'Islington Avenue', 'Silver Hills, York Mills', 'Upper Rouge'}

#### The above neighbourhoods are not present in the toronto_venues dataframe, so we delete them from df to keep the same number of rows in both

In [34]:
df1=df[df['Neighbourhood']=='Islington Avenue']
df1

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242


The rows with index 5, 45 and 95 in the df dataframe hold these three neighbourhoods, so we drop them:

In [35]:
df.drop([5,45,95],axis=0,inplace=True)
df.reset_index(drop=True,inplace=True)
df.shape


(100, 5)
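
Dropping rows by hard-coded index works for this snapshot of the data, but the indices change if the source table changes. A sketch of the same step driven by neighbourhood names, using small made-up frames (`df_demo` and `venues_demo` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Postcode": ["M3A", "M9A", "M4A"],
    "Neighbourhood": ["Parkwoods", "Islington Avenue", "Victoria Village"],
})
venues_demo = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue": ["Brookbanks Park", "Variety Store", "Portugril"],
})

# Keep only neighbourhoods that appear at least once in the venues table.
has_venues = df_demo["Neighbourhood"].isin(venues_demo["Neighbourhood"])
dropped = df_demo.loc[~has_venues, "Neighbourhood"].tolist()
df_demo = df_demo[has_venues].reset_index(drop=True)

print(dropped)
print(df_demo)
```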

### Both dataframes, df and tor2, now have the same number of rows

Preprocessing the second dataset, the toronto_venues dataframe, with one-hot encoding so that it can be clustered easily:



In [55]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We're interested in venues in the 'food' category, but only those that are proper restaurants. Coffee shops, pizza places, bakeries etc. are not direct competitors, so we don't care about those. Hence we will include in our list only venues that have 'Restaurant' in the category name, which also captures all the restaurant subcategories in each neighbourhood, for example Afghan Restaurant, Italian Restaurant, etc. To do this, we select only the restaurant columns from the toronto_onehot dataframe:

In [77]:
col=['Neighbourhood']
for column in toronto_onehot.columns:
    if 'Restaurant' in column:
        col.append(column)

In [81]:
toronto_restaurants=toronto_onehot[col]
toronto_restaurants=toronto_restaurants.groupby('Neighbourhood').sum().reset_index()
torr_grouped=toronto_restaurants.groupby('Neighbourhood').mean().reset_index()
torr_grouped.shape

(100, 46)
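
The encoding and filtering steps above can be checked on a tiny example. This is a sketch on four venue rows taken from the `toronto_venues` output shown earlier; the frame name `venues` is hypothetical.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Bus Stop", "Portuguese Restaurant", "Hockey Arena"],
})

# One column per category; dtype=int keeps 0/1 counts across pandas versions.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Keep only categories whose name contains 'Restaurant' (case-sensitive).
restaurant_cols = [c for c in onehot.columns if "Restaurant" in c]

# Sum the restaurant indicator columns per neighbourhood.
counts = (
    onehot[restaurant_cols]
    .assign(Neighbourhood=venues["Neighbourhood"])
    .groupby("Neighbourhood")
    .sum()
)
print(counts)
```

Note that the substring match only catches categories that literally contain the word 'Restaurant'; food venues with other names are excluded, which matches the intent stated above.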

#### Adding a column containing total number of restaurants in that neighbourhood. This will help us in making clusters using K-Means clustering algorithm.



In [82]:
# drop the name column first so that only numeric columns are summed
toronto_restaurants= toronto_restaurants.drop('Neighbourhood',axis=1)
toronto_restaurants['Total']=toronto_restaurants.sum(axis=1)
toronto_restaurants

Unnamed: 0,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,Cuban Restaurant,...,Ramen Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
0,1,1,0,1,0,0,0,1,0,0,...,0,4,1,2,0,3,0,1,0,22
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Using K-Means clustering algorithm to make clusters of dataset so that our analysis is easy:



In [83]:
# set number of clusters
kclusters = 5


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(toronto_restaurants)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([2, 0, 0, 0, 0, 4, 0, 1, 1, 0], dtype=int32)
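
To see what the clustering does in isolation, here is a minimal sketch on made-up restaurant totals with 2 clusters instead of 5. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy restaurant totals per neighbourhood (made-up numbers for illustration).
totals = np.array([[0], [1], [0], [22], [20], [1]])

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(totals)
labels = model.labels_

# Label numbers are arbitrary; only the grouping is meaningful.
for label in np.unique(labels):
    print(label, totals[labels == label].ravel())
```

The cluster numbers carry no ranking: cluster 0 is not "better" than cluster 3, which is why each cluster has to be inspected separately afterwards.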

In [84]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
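
In the tables further down, neighbourhoods without any restaurants still list "most common" venues (for example, Vietnamese Restaurant first), because sorting a row of zeros returns the categories in an arbitrary order. A possible variant that leaves those slots empty is sketched below; `top_venues_nonzero` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

def top_venues_nonzero(row, num_top_venues):
    """Return the top categories with a count above zero, padded with None."""
    counts = row.iloc[1:].astype(float)  # skip the Neighbourhood entry
    counts = counts[counts > 0].sort_values(ascending=False, kind="mergesort")
    top = list(counts.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))

row = pd.Series({
    "Neighbourhood": "Agincourt",
    "Latin American Restaurant": 1,
    "Vietnamese Restaurant": 0,
    "Thai Restaurant": 0,
})
print(top_venues_nonzero(row, 3))
```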

#### Let's print each neighborhood along with the top 5 most common venues

In [86]:
num_top_venues = 5

for hood in torr_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = torr_grouped[torr_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
                    venue  freq
0              Restaurant   4.0
1         Thai Restaurant   3.0
2        Sushi Restaurant   2.0
3     American Restaurant   1.0
4  Gluten-free Restaurant   1.0


----Agincourt----
                       venue  freq
0  Latin American Restaurant   1.0
1        American Restaurant   0.0
2      Indonesian Restaurant   0.0
3        Japanese Restaurant   0.0
4          Korean Restaurant   0.0


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                       venue  freq
0        American Restaurant   0.0
1      Indonesian Restaurant   0.0
2        Japanese Restaurant   0.0
3          Korean Restaurant   0.0
4  Latin American Restaurant   0.0


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                  venue  freq
0  Fast Food Restaurant   1.0
1   American Restaurant   0.0
2   Moroccan Restaurant   0.0
3   Japanese Restaurant   0.

Preparing a dataset venues_sorted in which all neighbourhoods of Toronto are listed along with their top 5 most common venues. This will help in better visualisation of each cluster after they are formed.

In [87]:
torr_grouped.head()

Unnamed: 0,Neighbourhood,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,...,Portuguese Restaurant,Ramen Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,"Adelaide, King, Richmond",1,1,0,1,0,0,0,1,0,...,0,0,4,1,2,0,3,0,1,0
1,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighbourhood'] = torr_grouped['Neighbourhood']


for ind in np.arange(torr_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(torr_grouped.iloc[ind, :], num_top_venues)
    
venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide, King, Richmond",Restaurant,Thai Restaurant,Sushi Restaurant,American Restaurant,Modern European Restaurant
1,Agincourt,Latin American Restaurant,Vietnamese Restaurant,Doner Restaurant,Gluten-free Restaurant,German Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Fast Food Restaurant,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant
4,"Alderwood, Long Branch",Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant


In [89]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#### After adding cluster labels to venues_sorted dataframe:



In [90]:
venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,2,"Adelaide, King, Richmond",Restaurant,Thai Restaurant,Sushi Restaurant,American Restaurant,Modern European Restaurant
1,0,Agincourt,Latin American Restaurant,Vietnamese Restaurant,Doner Restaurant,Gluten-free Restaurant,German Restaurant
2,0,"Agincourt North, L'Amoreaux East, Milliken, St...",Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
3,0,"Albion Gardens, Beaumond Heights, Humbergate, ...",Fast Food Restaurant,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant
4,0,"Alderwood, Long Branch",Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant


#### Creating a dataframe toronto_merged, by merging two dataframes: df and venues_sorted.



In [91]:

toronto_merged = df

toronto_merged = toronto_merged.join(venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged.dropna(axis=0,inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.reset_index(inplace=True)
toronto_merged

Unnamed: 0,index,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
1,1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Portuguese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Gluten-free Restaurant,German Restaurant
2,2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.654260,-79.360636,4,Asian Restaurant,French Restaurant,Restaurant,Mexican Restaurant,Vietnamese Restaurant
3,3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
4,4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,4,Italian Restaurant,Sushi Restaurant,Mexican Restaurant,Vietnamese Restaurant,Doner Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
96,96,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,2,Japanese Restaurant,Sushi Restaurant,Restaurant,Mediterranean Restaurant,Indian Restaurant
97,97,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,0,Fast Food Restaurant,Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Gluten-free Restaurant
98,98,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So...",43.636258,-79.498509,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant


#### Creating a map of Toronto showing all 100 neighbourhoods, with a different colour for the neighbourhoods of each cluster:



In [92]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Cluster-wise segmentation of the main dataset that is toronto_merged dataframe:

In [93]:
df0=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df0.head()

Unnamed: 0,Postcode,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,-79.329656,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
1,M4A,-79.315572,0,Portuguese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Gluten-free Restaurant,German Restaurant
3,M6A,-79.464763,0,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant,French Restaurant
5,M1B,-79.194353,0,Fast Food Restaurant,Vietnamese Restaurant,Hakka Restaurant,Gluten-free Restaurant,German Restaurant
6,M3B,-79.352188,0,Japanese Restaurant,Caribbean Restaurant,Vietnamese Restaurant,Doner Restaurant,Gluten-free Restaurant


In [94]:
df1=toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df1.head()

Unnamed: 0,Postcode,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
19,M5E,-79.373306,1,Restaurant,Seafood Restaurant,Indian Restaurant,Greek Restaurant,Vegetarian / Vegan Restaurant
32,M2J,-79.346556,1,Fast Food Restaurant,Restaurant,Japanese Restaurant,American Restaurant,Asian Restaurant
35,M5J,-79.381752,1,Restaurant,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Sushi Restaurant
36,M6J,-79.41975,1,Vietnamese Restaurant,Asian Restaurant,Restaurant,New American Restaurant,Greek Restaurant
40,M4K,-79.352188,1,Greek Restaurant,Italian Restaurant,Restaurant,Indian Restaurant,Caribbean Restaurant


In [95]:
df2=toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df2.head()

Unnamed: 0,Postcode,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,M5B,-79.378937,2,Japanese Restaurant,Fast Food Restaurant,Italian Restaurant,Middle Eastern Restaurant,Ramen Restaurant
23,M5G,-79.387383,2,Italian Restaurant,Japanese Restaurant,Thai Restaurant,Indian Restaurant,New American Restaurant
29,M5H,-79.384568,2,Restaurant,Thai Restaurant,Sushi Restaurant,American Restaurant,Modern European Restaurant
90,M5W,-79.374846,2,Seafood Restaurant,Restaurant,Italian Restaurant,Japanese Restaurant,Indian Restaurant
96,M4Y,-79.38316,2,Japanese Restaurant,Sushi Restaurant,Restaurant,Mediterranean Restaurant,Indian Restaurant


In [96]:
df3=toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df3.head()

Unnamed: 0,Postcode,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,M5C,-79.375418,3,Restaurant,American Restaurant,Italian Restaurant,Japanese Restaurant,Seafood Restaurant
41,M5K,-79.381576,3,American Restaurant,Japanese Restaurant,Italian Restaurant,Seafood Restaurant,Restaurant
46,M5L,-79.379817,3,Restaurant,American Restaurant,Italian Restaurant,Seafood Restaurant,Japanese Restaurant
94,M5X,-79.38228,3,Restaurant,Japanese Restaurant,American Restaurant,Asian Restaurant,Seafood Restaurant


In [97]:
df4=toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df4.head()

Unnamed: 0,Postcode,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,M5A,-79.360636,4,Asian Restaurant,French Restaurant,Restaurant,Mexican Restaurant,Vietnamese Restaurant
4,M7A,-79.389494,4,Italian Restaurant,Sushi Restaurant,Mexican Restaurant,Vietnamese Restaurant,Doner Restaurant
12,M3C,-79.340923,4,Asian Restaurant,Restaurant,Dim Sum Restaurant,Italian Restaurant,Japanese Restaurant
22,M4G,-79.363452,4,Sushi Restaurant,Restaurant,Mexican Restaurant,Vietnamese Restaurant,Dim Sum Restaurant
25,M1H,-79.239476,4,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Vietnamese Restaurant,Doner Restaurant


## Analysis:

In [98]:
print('Total number of neighbourhoods in cluster 0 is',toronto_restaurants.loc[df0.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df0.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df0.index,:]['Total'].sum()/toronto_restaurants.loc[df0.index,:].shape[0]) )

Total number of neighbourhoods in cluster 0 is 60
Total number of restaurants in this cluster is 251
Ratio of Restaurant/Neighbourhood in this cluster is 4.183333333333334


In [99]:
print('Total number of neighbourhoods in cluster 1 is',toronto_restaurants.loc[df1.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df1.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df1.index,:]['Total'].sum()/toronto_restaurants.loc[df1.index,:].shape[0]) )

Total number of neighbourhoods in cluster 1 is 11
Total number of restaurants in this cluster is 79
Ratio of Restaurant/Neighbourhood in this cluster is 7.181818181818182


In [100]:
print('Total number of neighbourhoods in cluster 2 is',toronto_restaurants.loc[df2.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df2.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df2.index,:]['Total'].sum()/toronto_restaurants.loc[df2.index,:].shape[0]) )

Total number of neighbourhoods in cluster 2 is 5
Total number of restaurants in this cluster is 19
Ratio of Restaurant/Neighbourhood in this cluster is 3.8


In [101]:
print('Total number of neighbourhoods in cluster 3 is',toronto_restaurants.loc[df3.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df3.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df3.index,:]['Total'].sum()/toronto_restaurants.loc[df3.index,:].shape[0]) )

Total number of neighbourhoods in cluster 3 is 4
Total number of restaurants in this cluster is 15
Ratio of Restaurant/Neighbourhood in this cluster is 3.75


In [102]:
print('Total number of neighbourhoods in cluster 4 is',toronto_restaurants.loc[df4.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df4.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df4.index,:]['Total'].sum()/toronto_restaurants.loc[df4.index,:].shape[0]) )

Total number of neighbourhoods in cluster 4 is 20
Total number of restaurants in this cluster is 116
Ratio of Restaurant/Neighbourhood in this cluster is 5.8
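
The five cells above repeat the same three statistics for each cluster. A more compact sketch is shown below, assuming a single frame in which every row carries the neighbourhood name, its cluster label and its restaurant total. Keeping all three in one row avoids relying on two frames sharing the same positional index; note that `toronto_restaurants` is ordered alphabetically by neighbourhood, while `toronto_merged` follows the row order of `df`. The frame `summary_input` and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows; each row carries its own label and total, so they cannot drift apart.
summary_input = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0, 0, 1, 1, 1],
    "Total": [1, 3, 10, 8, 6],
})

# Count, sum and mean of the totals per cluster in one pass.
per_cluster = summary_input.groupby("Cluster Labels")["Total"].agg(
    neighbourhoods="count", restaurants="sum", ratio="mean"
)
lowest = per_cluster["ratio"].idxmin()

print(per_cluster)
print("Cluster with the lowest ratio:", lowest)
```

In the notebook, such a frame could be built by merging the restaurant totals with `toronto_merged` on the `Neighbourhood` column before the name column is dropped.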


## Note: The Restaurant/Neighbourhood ratio is lowest for cluster 3, so we will further analyse only the neighbourhoods belonging to cluster 3.

In [103]:
toronto_restaurants.loc[df3.index,:]

Unnamed: 0,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,Cuban Restaurant,...,Ramen Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,2
94,0,0,0,0,0,0,0,0,0,0,...,3,2,0,2,0,0,0,0,1,13


As we can see, the last row (index 94) has a very high Total number of restaurants (13), so we will remove this neighbourhood from the df3 dataframe:

In [104]:
df3 = df3.drop([94], axis=0)  # reassign instead of modifying a slice in place


In [105]:
toronto_restaurants.loc[df3.index,:]


Unnamed: 0,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,Cuban Restaurant,...,Ramen Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,2


In [106]:
toronto_merged.loc[df3.index,:]


Unnamed: 0,index,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3,Restaurant,American Restaurant,Italian Restaurant,Japanese Restaurant,Seafood Restaurant
41,41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,3,American Restaurant,Japanese Restaurant,Italian Restaurant,Seafood Restaurant,Restaurant
46,46,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,3,Restaurant,American Restaurant,Italian Restaurant,Seafood Restaurant,Japanese Restaurant


#### The above neighbourhoods look well suited for opening a restaurant. Therefore, we finally store the information about these 3 neighbourhoods in a dataframe named final:

In [107]:
final=toronto_merged.loc[df3.index,'Postcode':'longitude']
final


Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576
46,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817


## Visualising these 3 neighbourhoods on a map:

In [108]:
# create map of Toronto using latitude and longitude values:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=15)

# add markers to map
for lat, lng, borough, neighbourhood in zip(final['latitude'], final['longitude'], final['Borough'], final['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=9,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=1,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### The 3 neighbourhoods are depicted by 3 blue dots in the above map.

In this cluster, restaurants are the most sparse of all clusters. Within the cluster, the map suggests that the neighbourhood with **Postal Code: M5C, Borough: Downtown Toronto, Neighbourhood: St. James Town, (Latitude, Longitude): (43.651494, -79.375418)** is the most isolated, and hence the most promising place to open the restaurant and the one most likely to be profitable for the investors.
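
The isolation claim is read off the map by eye. One way to put a number on it is to compute the great-circle distance between the three candidates. The sketch below uses the haversine formula and defines "isolation" as the mean distance to the other two candidates, which is an assumption of this example; it says nothing about the distance to existing restaurants.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, using a mean Earth radius of 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Coordinates of the three candidates, copied from the `final` dataframe.
candidates = {
    "M5C": (43.651494, -79.375418),
    "M5K": (43.647177, -79.381576),
    "M5L": (43.648198, -79.379817),
}

# Mean distance from each candidate to the other two.
mean_distance = {
    code: sum(haversine_km(*point, *other)
              for other_code, other in candidates.items() if other_code != code) / 2
    for code, point in candidates.items()
}
most_isolated = max(mean_distance, key=mean_distance.get)
print(mean_distance, most_isolated)
```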

## Results and Discussion

Our analysis shows that although there is a great number of restaurants in Toronto, there are pockets of low restaurant density fairly close to the city center. To identify these pockets, we used a clustering algorithm and segmented our neighbourhood dataset accordingly.

We used the K-means clustering algorithm to form 5 clusters of neighbourhoods based on the number of restaurants in their vicinity. We then analysed each cluster by calculating its Restaurant/Neighbourhood ratio. Cluster 3 had the lowest ratio (3.75), which means that few restaurants are present in the vicinity of each of its neighbourhoods. A total of 4 neighbourhoods belonged to cluster 3. Upon further analysis, we found that 1 of them already had a high number of restaurants (13) and was therefore not a good choice for a new restaurant. Hence, only 3 neighbourhoods were left.

According to our analysis, we got a total of 3 neighbourhoods where the restaurant business should be good. There are two reasons for that. First, these neighbourhoods do not contain many restaurants in their vicinity, which lowers the competition in the restaurant business. Second, as the above map shows, these 3 neighbourhoods lie in the center of Toronto, which likely means a high population density, more customers and hence more profit.

The final 3 neighbourhoods that are well suited for opening a new restaurant are stored in a dataframe named final, which contains the postal code, borough, name, latitude and longitude of each neighbourhood.

The owners can then choose among these 3 locations according to the type of restaurant they are trying to open.

If we look closely at the distances between them on the map, we find that the neighbourhood with **Postal Code: M5C, Borough: Downtown Toronto, Neighbourhood: St. James Town, (Latitude, Longitude): (43.651494, -79.375418)** is the most isolated of all and hence the best location for opening the restaurant.

## Conclusion

The purpose of this project was to identify neighbourhoods in Toronto with a low number of restaurants, in order to help stakeholders narrow down the search for an optimal location for a new restaurant. Using Foursquare data, we first identified the most common restaurant types near each neighbourhood and counted the restaurants in its vicinity. Then, with the help of clustering and further analysis, we narrowed the search down to 3 neighbourhoods and finally selected the best of them for opening a new restaurant. This concludes the Battle of the Neighbourhoods project.