# Capstone Project - The Battle of the Neighborhoods

## Applied Data Science Capstone by IBM/Coursera

### TABLE OF CONTENTS :
---

* Introduction: Business Problem
* Data
* Analysis
* Results and Discussion
* Conclusion

### __INTRODUCTION : BUSINESS PROBLEM__
---

The owners of a restaurant chain in Ontario, Canada want to expand their business. They currently operate restaurants
in cities such as Ottawa, Brampton and Hamilton. They expect to make more profit by opening a
restaurant in Toronto, the largest city in Canada, so they want to open a new restaurant in a pleasant
Toronto neighbourhood.

They are having trouble deciding which location within Toronto to choose for the new restaurant.

We have to help them find locations where business is likely to be good, competition is low and
the surrounding neighbourhood is pleasant. They want a shortlist of 2-3 such places so that they can decide for themselves which one is
best.

### __DATA__
---

#### __First Dataset: List of neighbourhoods in Toronto__:

First, I will use data from a Wikipedia page that lists the neighbourhoods of Toronto, Canada. I will use the web scraping library BeautifulSoup to extract the data from this page in the form of a table. The table contains 3 columns: Post Code, Borough and Neighbourhood. The link for this Wikipedia page: (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). After preprocessing the table and adding the Latitude and Longitude of each neighbourhood as two more columns, the dataset is ready for use. The final DataFrame has 5 columns: Post Code, Borough, Neighbourhood, Latitude and Longitude. It contains 103 rows for 103 unique neighbourhoods of Toronto and 11 unique boroughs.

__Here is a screenshot of first five rows of the final dataframe:__

<img src='DATA_IMG1.JPG'>

#### __Second Dataset: List of different venues in the neighbourhoods of Toronto__:

This dataset will be built using the Foursquare API. I will use Foursquare location data to 
explore the venues in each neighbourhood of Toronto. A venue can be any kind of place, 
for example parks, coffee shops, hotels or gyms. Using the Foursquare location data, 
I can get information about these venues and analyze the neighbourhoods of Toronto based on it.

We will use the geographical coordinates from the first dataset to generate this location dataset.
Here is the screenshot of the sample dataset:

In general, I will be using these two datasets to solve the business problem of finding the best 
place to open a restaurant within Toronto.

---

__Before we get the data and start exploring it, let's import all the dependencies that we will need.__



In [1]:
# Importing libraries

# library to handle data in a vectorized manner- linear algebra
import numpy as np 

# library for data analysis, data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd 

# library to handle JSON files
import json 

# library to handle requests
import requests 

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 

# Visualization
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

# map rendering library
import folium

print('Libraries imported.')

Libraries imported.



---
__Importing the first dataset in form of a DataFrame:__

In [2]:
# the data set was extracted from the Wikipedia link and saved in data1.csv with columns: Postcode, Borough, Neighbourhood, latitude and longitude
df=pd.read_csv('data1.csv')

In [3]:
# Showing the Data Frame
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [4]:
# And make sure that the dataset has all 11 boroughs and 103 neighborhoods.
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


---
__Geographical coordinates of Toronto:__

In [5]:
# entering Geospatial coordinates of Toronto 
latitude=43.6532
longitude=-79.3832

##### __Visualization with 103 Neighbourhood__

We visualize the data at several stages. First, we plot all the neighbourhoods on a map to confirm that their coordinates fall where we expect within their boroughs.

 __Creating a map of Toronto with all 103 neighbourhoods marked on this map:__

In [6]:
# create map of Toronto using latitude and longitude values:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['latitude'], df['longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Folium is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

---
#### __Foursquare API__

Next, we are going to use Foursquare API to explore the neighborhoods and segment them by creating the second dataset.

__Define Foursquare Credentials and Version__

In [7]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your Foursquare Secret
VERSION = '20180604' # Foursquare API version
LIMIT = 100
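
Credentials written directly into a notebook are visible to anyone who can read the file. One possible alternative, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions chosen for illustration; Foursquare does not require these names.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; use whatever names suit your setup.
    """
    env = os.environ if env is None else env
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None
```

With the variables set in the shell that launches the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would populate the two names used by the request code.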

---
#### Explore different venues in different Neighborhoods of Toronto:

##### Let's create a function to do the same for all the neighborhoods in Toronto:


In [10]:
# Let's create a function to repeat the process to all the neighborhoods in Toronto

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
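
The one-line lookup `["response"]['groups'][0]['items']` in the function above raises an exception if a response contains no groups, and `categories[0]` fails for a venue with an empty category list. A minimal sketch of a more forgiving parser follows. The helper `extract_venue_rows` and the sample payload are hypothetical, and the payload shape is assumed to match the nesting used in the function above.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into rows of venue details.

    Assumes the nesting payload['response']['groups'][0]['items'].
    Missing keys give an empty list or a None field instead of an exception.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Made-up payload for illustration only.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised Spot',
               'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample, 'Parkwoods', 43.753259, -79.329656)
empty = extract_venue_rows({'response': {}}, 'Nowhere', 0.0, 0.0)
```

Keeping the parsing separate from the HTTP call also makes it possible to test the flattening logic without network access.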

In [11]:
# Write the code to run the above function on each neighborhood and create a new dataframe called toronto_venues.
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )

In [12]:
# Let's check the size of the resulting dataframe
print(toronto_venues.shape)

(2255, 7)


__toronto_venues is a dataframe that contains all the information about different neighbourhoods of Toronto along with their nearby venues like Park, Restaurant, Coffee shop, etc. It is the second dataset that we require to solve the problem__:

In [64]:
#showing first rows of dataframe
toronto_venues.head(13)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
5,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
6,Victoria Village,43.725882,-79.315572,The Frig,43.727051,-79.317418,French Restaurant
7,Victoria Village,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
8,Victoria Village,43.725882,-79.315572,Pizza Nova,43.725824,-79.31286,Pizza Place
9,"Harbourfront, Regent Park",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery


---

In [14]:
# Let's check how many venues were returned for each neighborhood

toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",2,2,2,2,2,2
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",11,11,11,11,11,11
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Downsview North, Wilson Heights",18,18,18,18,18,18
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Berczy Park,56,56,56,56,56,56
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [15]:
# Let's find out how many unique categories can be curated from all the returned venues

#toronto_venues['Neighbourhood'].unique()

In [16]:
#df['Neighbourhood'].unique()
#df['Neighbourhood'][95]

#### Note:

We see that Foursquare does not provide any information about 3 specific neighbourhoods from df dataframe, hence 3 rows are missing from toronto_venues dataframe. Therefore, we have to remove these 3 neighbourhoods from df dataframe also:

In [17]:
# only 100 neighborhood details returned in toronto_venues so lets remove missing rows 5,52,95 from first dataframe df
df.drop([5,52,95],axis=0,inplace=True)
df.reset_index(drop=True,inplace=True)

#showing dataframe 
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
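
The row labels 5, 52 and 95 dropped above are hard-coded, so they would need to be looked up again if the Foursquare results changed. As a sketch under that assumption, the missing neighbourhoods can instead be derived by comparing the two dataframes. The toy frames `df_toy` and `venues_toy` below are made-up stand-ins for `df` and `toronto_venues`.

```python
import pandas as pd

# Toy stand-ins for df and toronto_venues; the names and values are made up.
df_toy = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C', 'D']})
venues_toy = pd.DataFrame({'Neighbourhood': ['A', 'A', 'C'],
                           'Venue': ['Cafe 1', 'Park 1', 'Gym 1']})

# Rows of df_toy whose neighbourhood never appears in venues_toy.
has_venues = df_toy['Neighbourhood'].isin(venues_toy['Neighbourhood'])
missing = df_toy.index[~has_venues].tolist()

# Drop them and renumber, mirroring the drop/reset_index step above.
df_clean = df_toy.drop(missing).reset_index(drop=True)
```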


---
#### __ANALYSIS__

We analyze the venues in each neighbourhood through one-hot encoding (a '1' if a venue belongs to a category and a '0' if it does not). On the basis of this encoding, we sum the occurrences of each category per neighbourhood and pick the five most frequent categories for each neighbourhood. These top categories indicate the kinds of places that are most common, and likely most visited, in each neighbourhood.
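
As a small illustration of the encoding described above, the following sketch one-hot encodes a handful of made-up venues and aggregates them per neighbourhood, both as counts (`sum`) and as relative frequencies (`mean`). The data and variable names are invented for the example.

```python
import pandas as pd

# Made-up venues for two neighbourhoods.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Park'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

counts = onehot.groupby('Neighbourhood').sum()    # venues per category
shares = onehot.groupby('Neighbourhood').mean()   # fraction of venues per category
```

Counts favour neighbourhoods with many venues, while shares describe the mix of categories regardless of how many venues were returned.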


__Preprocessing the second dataset that is toronto_venues dataframe so that we can cluster the dataset easily using one hot encoding__ :

In [18]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Next, let's group rows by neighborhood, taking the sum of occurrences of each venue category 
# in each neighborhood, and save it in a new dataframe toronto_grouped
toronto_grouped=toronto_onehot.groupby('Neighbourhood').sum().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0,0,0,0,0,0,0,0,3,...,0,1,0,0,0,0,1,0,0,0
1,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Bathurst Manor, Downsview North, Wilson Heights",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
6,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Bedford Park, Lawrence Manor East",0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
9,"Birch Cliff, Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---
We're interested in venues in 'food' category, but only those that are proper restaurants - coffee shops, pizza places, bakeries etc. are not direct competitors, so we don't care about those. __Hence we will include in our list only venues that have 'restaurant' in category name, and we'll make sure to detect and include all the subcategories of different restaurants in the neighborhood__. For example, Afghan restaurant, Italian restaurant, etc. For this, we locate venues from toronto_onehot dataframe that are restaurants only:

In [20]:
# Collect the columns of toronto_onehot whose category name contains 'Restaurant';
# these are used to build the new dataframe toronto_restaurants
col=['Neighbourhood']
for column in toronto_onehot.columns:
    if 'Restaurant' in column:
        col.append(column)

In [21]:
toronto_restaurants=toronto_onehot[col]
#toronto_grouped = toronto_restaurants.groupby('Neighbourhood').sum().reset_index()
toronto_restaurants=toronto_restaurants.groupby('Neighbourhood').sum().reset_index()
toronto_restaurants.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Restaurant,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,"Adelaide, King, Richmond",0,3,3,0,1,0,0,0,1,...,3,1,0,2,0,0,4,0,1,0
1,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
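
The column selection above can also be written with `DataFrame.filter`, which keeps columns whose label contains a given substring. The sketch below uses a made-up one-hot table; only the column-name pattern is taken from the text.

```python
import pandas as pd

# Made-up one-hot table; column names mimic Foursquare category names.
onehot = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B'],
    'Coffee Shop': [1, 0, 0],
    'Thai Restaurant': [0, 1, 0],
    'Restaurant': [0, 0, 1],
})

# Keep only columns whose name contains 'Restaurant' (case-sensitive).
restaurant_cols = onehot.filter(like='Restaurant').columns.tolist()
restaurants = onehot[['Neighbourhood'] + restaurant_cols]
per_neighbourhood = restaurants.groupby('Neighbourhood').sum().reset_index()
```

Note that a substring match on 'Restaurant' leaves out categories such as Steakhouse or Diner, which appear elsewhere in this notebook's tables; whether those count as competitors is a judgment call.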


---

In [22]:
# Helper that returns the most common venue categories for one neighbourhood row
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
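
To make the helper's behaviour concrete, here is a small usage sketch on a made-up row. The function body repeats the logic above; the row values are invented and contain no ties.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as above: skip the first entry (the neighbourhood name),
    # sort the remaining category counts, and keep the top labels.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up counts for a single neighbourhood.
row = pd.Series({'Neighbourhood': 'Example', 'Park': 3, 'Coffee Shop': 5, 'Gym': 1})
top_two = list(return_most_common_venues(row, 2))
```

One caveat worth keeping in mind: if a neighbourhood has fewer venues than `num_top_venues`, the remaining slots are filled by categories with a count of zero, in whatever order the sort leaves them.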

__Preparing a dataset venues_sorted in which all neighbourhoods of Toronto are listed along with its top 5 most common venues. This will help in better visualisation of each cluster after they are formed.__

In [23]:
# Now let's create the new dataframe and display the top 5 venues for each neighborhood.

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant
1,Agincourt,Skating Rink,Breakfast Spot,Lounge,Clothing Store,Eastern European Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Playground,Park,Yoga Studio,Dumpling Restaurant,Discount Store
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pizza Place,Fried Chicken Joint,Coffee Shop,Sandwich Place
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Pub,Gym,Sandwich Place
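
The column labels above rely on a short suffix list with a fallback to 'th', which works for the five columns used here but would produce labels such as '21th' for longer lists. A possible alternative is a small helper that computes the suffix; `ordinal` is a hypothetical name introduced for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 5
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
```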


---
__Adding a column containing total number of restaurants in that neighbourhood. This will help us in making clusters using K-Means clustering algorithm.__

In [24]:
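# sum only the restaurant count columns; the text column 'Neighbourhood' is excluded
toronto_restaurants['Total'] = toronto_restaurants.drop('Neighbourhood', axis=1).sum(axis=1)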
#toronto_restaurants= toronto_restaurants.drop('Neighbourhood',axis=1)

In [None]:
#toronto_restaurants

---
#### __Using K-Means clustering algorithm to make clusters of dataset so that our analysis is easy:__

In [25]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_restaurants.drop('Neighbourhood',axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 2, 2, 2, 1, 2, 3, 3, 2], dtype=int32)
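
The value `kclusters = 5` is fixed by hand above. One common way to sanity-check such a choice, sketched here on made-up points and not on the Toronto data, is to fit k-means for several values of k and compare the within-cluster sum of squares (`inertia_`). The `n_init=10` argument is an explicit choice made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming three tight, well-separated groups.
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])
centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
points = np.vstack([centre + offsets for centre in centres])

# Fit k-means for several k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

A sharp drop followed by a flat tail suggests that additional clusters add little; on real data the bend is usually less clear-cut than in this toy case.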

In [26]:
# add clustering labels to dataframe venues_sorted
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [63]:
#create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood
toronto_merged = df

#toronto_merged['Cluster Labels'] = kmeans.labels_

# join df with venues_sorted to add the cluster label and top venues to each neighbourhood
toronto_merged = toronto_merged.join(venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

# check the last columns!
toronto_merged.head(5)


Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2,Fast Food Restaurant,Park,Food & Drink Shop,Dumpling Restaurant,Discount Store
1,M4A,North York,Victoria Village,43.725882,-79.315572,2,Pizza Place,Coffee Shop,French Restaurant,Portuguese Restaurant,Hockey Arena
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,1,Coffee Shop,Café,Park,Bakery,Gym / Fitness Center
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,2,Furniture / Home Store,Accessories Store,Coffee Shop,Event Space,Miscellaneous Shop
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,3,Coffee Shop,Park,Gym,Yoga Studio,Smoothie Shop


In [41]:
#check how many rows and column in new dataframe
toronto_merged.shape

(100, 11)
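
A join on the neighbourhood name silently produces missing values for any name that has no match, so it can be worth checking for them after a step like the one above. The sketch below uses made-up frames `left` and `right` as stand-ins for `df` and `venues_sorted`.

```python
import pandas as pd

# Made-up stand-ins for df and venues_sorted.
left = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
right = pd.DataFrame({'Neighbourhood': ['A', 'C'],
                      'Cluster Labels': [0, 2]})

merged = left.join(right.set_index('Neighbourhood'), on='Neighbourhood')

# Neighbourhoods with no match get NaN, which also turns the labels into floats.
unmatched = merged.loc[merged['Cluster Labels'].isna(), 'Neighbourhood'].tolist()

# Keep only matched rows and restore integer labels.
merged = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
```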

---
#### __Creating a map of Toronto showing all 100 neighbourhoods, with different colours representing neighbourhoods that belong to different clusters__:

In [42]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
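
Because k-means labels start at 0, a palette with one colour per cluster can be indexed by the label directly. The following sketch builds such a palette with matplotlib; `palette` and `colour_for` are hypothetical names used only for this example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow colour per cluster, as hex strings.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

def colour_for(cluster):
    """Return the hex colour for a 0-based cluster label (hypothetical helper)."""
    return palette[int(cluster)]
```

The `int(...)` conversion is there because a label column that went through a join with missing values may hold floats.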


#### Cluster-wise segmentation of the main dataset that is toronto_merged dataframe:
---

In [43]:

df0=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df0.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,Downtown Toronto,0,Coffee Shop,Clothing Store,Cosmetics Shop,Fast Food Restaurant,Café
14,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Hotel,Italian Restaurant
23,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Ice Cream Shop
29,Downtown Toronto,0,Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant
36,West Toronto,0,Bar,Coffee Shop,Asian Restaurant,Restaurant,New American Restaurant


#### Cluster 0
---

In [44]:
df1=toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df1.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Downtown Toronto,1,Coffee Shop,Café,Park,Bakery,Gym / Fitness Center
9,North York,1,Park,Pub,Pizza Place,Sushi Restaurant,Japanese Restaurant
22,East York,1,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Sandwich Place
25,Scarborough,1,Hakka Restaurant,Thai Restaurant,Caribbean Restaurant,Fried Chicken Joint,Bank
27,North York,1,Coffee Shop,Frozen Yogurt Shop,Shopping Mall,Fast Food Restaurant,Sandwich Place


---
#### Cluster 1
---

In [45]:
df2=toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df2.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,2,Fast Food Restaurant,Park,Food & Drink Shop,Dumpling Restaurant,Discount Store
1,North York,2,Pizza Place,Coffee Shop,French Restaurant,Portuguese Restaurant,Hockey Arena
3,North York,2,Furniture / Home Store,Accessories Store,Coffee Shop,Event Space,Miscellaneous Shop
5,Scarborough,2,Fast Food Restaurant,Diner,Farmers Market,Falafel Restaurant,Event Space
6,North York,2,Japanese Restaurant,Caribbean Restaurant,Gym / Fitness Center,Café,Baseball Field


---
#### Cluster 2
---

In [46]:
df3=toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df3.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Queen's Park,3,Coffee Shop,Park,Gym,Yoga Studio,Smoothie Shop
12,North York,3,Gym,Asian Restaurant,Sporting Goods Shop,Beer Store,Coffee Shop
19,Downtown Toronto,3,Coffee Shop,Cocktail Bar,Beer Bar,Seafood Restaurant,Cheese Shop
32,North York,3,Clothing Store,Fast Food Restaurant,Coffee Shop,Cosmetics Shop,Food Court
35,Downtown Toronto,3,Coffee Shop,Aquarium,Hotel,Italian Restaurant,Café


---
#### Cluster 3
---

In [47]:
df4=toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df4.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
82,Downtown Toronto,4,Café,Vegetarian / Vegan Restaurant,Chinese Restaurant,Vietnamese Restaurant,Mexican Restaurant


---
#### Cluster 4
---

### Examining the clusters: 
---

Find Restaurant/Neighbourhood ratio of each group of Neighborhood cluster and determine the one with lowest ratio

In [49]:
print('Total number of neighbourhoods in cluster 0 is',toronto_restaurants.loc[df0.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df0.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df0.index,:]['Total'].sum()/toronto_restaurants.loc[df0.index,:].shape[0]) )

Total number of neighbourhoods in cluster 0 is 10
Total number of restaurants in this cluster is 29
Ratio of Restaurant/Neighbourhood in this cluster is 2.9



Cluster 0




In [50]:
print('Total number of neighbourhoods in cluster 1 is',toronto_restaurants.loc[df1.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df1.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df1.index,:]['Total'].sum()/toronto_restaurants.loc[df1.index,:].shape[0]) )

Total number of neighbourhoods in cluster 1 is 17
Total number of restaurants in this cluster is 80
Ratio of Restaurant/Neighbourhood in this cluster is 4.705882352941177


Cluster 1

In [51]:
print('Total number of neighbourhoods in cluster 2 is',toronto_restaurants.loc[df2.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df2.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df2.index,:]['Total'].sum()/toronto_restaurants.loc[df2.index,:].shape[0]) )

Total number of neighbourhoods in cluster 2 is 59
Total number of restaurants in this cluster is 305
Ratio of Restaurant/Neighbourhood in this cluster is 5.169491525423729


Cluster 2

In [52]:

print('Total number of neighbourhoods in cluster 3 is',toronto_restaurants.loc[df3.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df3.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df3.index,:]['Total'].sum()/toronto_restaurants.loc[df3.index,:].shape[0]) )

Total number of neighbourhoods in cluster 3 is 13
Total number of restaurants in this cluster is 89
Ratio of Restaurant/Neighbourhood in this cluster is 6.846153846153846


Cluster 3

In [53]:
print('Total number of neighbourhoods in cluster 4 is',toronto_restaurants.loc[df4.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df4.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df4.index,:]['Total'].sum()/toronto_restaurants.loc[df4.index,:].shape[0]) )

Total number of neighbourhoods in cluster 4 is 1
Total number of restaurants in this cluster is 22
Ratio of Restaurant/Neighbourhood in this cluster is 22.0


Cluster 4
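
The five cells above repeat the same three `print` statements, and they select rows of `toronto_restaurants` using row labels taken from `toronto_merged`. The two dataframes appear to be ordered differently (for example, label 14 is "CN Tower, Bathurst Quay, ..." in the restaurant table and St. James Town in the merged table further below), so a lookup keyed on the neighbourhood name may be safer. The sketch below shows that idea on made-up data; `labels` and `totals` are stand-ins, not the notebook's dataframes.

```python
import pandas as pd

# Made-up stand-ins: cluster labels per neighbourhood, in one row order...
labels = pd.DataFrame({'Neighbourhood': ['C', 'A', 'B', 'D'],
                       'Cluster Labels': [0, 0, 1, 1]})
# ...and restaurant totals per neighbourhood, in a different (alphabetical) order.
totals = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C', 'D'],
                       'Total': [2, 7, 4, 9]})

# Match rows on the neighbourhood name rather than on row position.
combined = labels.merge(totals, on='Neighbourhood', how='left')

# Neighbourhood count, restaurant count and their ratio for every cluster at once.
summary = combined.groupby('Cluster Labels')['Total'].agg(['count', 'sum'])
summary['ratio'] = summary['sum'] / summary['count']
```

With this layout, the cluster with the lowest ratio can be read off with `summary['ratio'].idxmin()`.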

#### Note: Since the Restaurant/Neighbourhood ratio is lowest for Cluster 0, we will further analyse only the neighbourhoods belonging to cluster 0.
---
### Now refine cluster 0 so that only neighbourhoods with few restaurants in the vicinity remain

In [54]:
toronto_restaurants.loc[df0.index,:]

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
8,Berczy Park,0,0,0,1,0,0,0,0,0,...,2,0,0,0,0,1,0,1,0,10
14,"CN Tower, Bathurst Quay, Island airport, Harbo...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23,"Clairlea, Golden Mile, Oakridge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
29,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36,Downsview Central,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,Glencairn,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,3
90,Thorncliffe Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
94,Willowdale South,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1,12
96,Woburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


As we can see, the first and second-to-last rows have a high total number of restaurants (10 and 12), so we remove these neighbourhoods from the df0 dataframe:

In [56]:
df0.drop([8,94],axis=0,inplace=True)

In [57]:
toronto_restaurants.loc[df0.index,:]

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
14,"CN Tower, Bathurst Quay, Island airport, Harbo...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23,"Clairlea, Golden Mile, Oakridge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
29,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36,Downsview Central,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,Glencairn,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,3
90,Thorncliffe Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
96,Woburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [58]:
toronto_merged.loc[df0.index,:]

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Restaurant,Café,Hotel,Italian Restaurant
23,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Ice Cream Shop
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,0,Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant
36,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,0,Bar,Coffee Shop,Asian Restaurant,Restaurant,New American Restaurant
41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,0,Coffee Shop,Café,Hotel,Italian Restaurant,Restaurant
47,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,0,Coffee Shop,Hotel,Café,American Restaurant,Restaurant
90,M5W,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,0,Coffee Shop,Restaurant,Café,Hotel,Fast Food Restaurant
96,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0,Coffee Shop,Gay Bar,Japanese Restaurant,Sushi Restaurant,Restaurant


In the table above, the neighbourhoods with index 14, 36, 41, 47, 90 and 96 list more than one restaurant category among their five most common venues, so they are not suitable for a new restaurant. We therefore remove these rows from the df0 dataframe:

In [59]:
df0.drop([14,36,41,47,90,96],axis=0,inplace=True)
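
The index list passed to `drop` above is written out by hand. As a minimal sketch, the same list can be derived from the table, assuming the rule is "more than one of the five most-common-venue columns contains the word Restaurant" (so Steakhouse does not count). The name `top_venues` is hypothetical; its values are copied from the output of cell 58.

```python
import pandas as pd

# Five most common venues per neighbourhood, copied from the cell 58 output.
rows = {
    14: ["Coffee Shop", "Restaurant", "Café", "Hotel", "Italian Restaurant"],
    23: ["Coffee Shop", "Italian Restaurant", "Sandwich Place", "Café", "Ice Cream Shop"],
    29: ["Coffee Shop", "Café", "Bar", "Steakhouse", "Thai Restaurant"],
    36: ["Bar", "Coffee Shop", "Asian Restaurant", "Restaurant", "New American Restaurant"],
    41: ["Coffee Shop", "Café", "Hotel", "Italian Restaurant", "Restaurant"],
    47: ["Coffee Shop", "Hotel", "Café", "American Restaurant", "Restaurant"],
    90: ["Coffee Shop", "Restaurant", "Café", "Hotel", "Fast Food Restaurant"],
    96: ["Coffee Shop", "Gay Bar", "Japanese Restaurant", "Sushi Restaurant", "Restaurant"],
}
venue_cols = [
    "1st Most Common Venue",
    "2nd Most Common Venue",
    "3rd Most Common Venue",
    "4th Most Common Venue",
    "5th Most Common Venue",
]
# Hypothetical helper frame holding only the venue columns.
top_venues = pd.DataFrame.from_dict(rows, orient="index", columns=venue_cols)

# Count, per row, how many of the five venue names contain "Restaurant".
restaurant_hits = top_venues.apply(lambda col: col.str.contains("Restaurant")).sum(axis=1)

# Rows with more than one restaurant category are the ones to drop.
to_drop = restaurant_hits[restaurant_hits > 1].index.tolist()
print(to_drop)
```

Deriving the list this way keeps the drop step in sync with the table if the venue data is refreshed.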


In [60]:
toronto_merged.loc[df0.index,:]

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
23,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Ice Cream Shop
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,0,Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant


---
__The above neighbourhoods look suitable for opening a restaurant, since each has no more than one restaurant category among its most common venues. We therefore store the information for these 2 neighbourhoods in a dataframe named final:__


In [61]:
# The best places to start a new restaurant, i.e. the Toronto neighbourhoods with the fewest restaurants,
# are stored in this new dataframe called final
final=toronto_merged.loc[df0.index,'Postcode':'longitude']
final

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
23,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
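
The column selection in cell 61 relies on label-based slicing with `.loc`, which includes both endpoints. The sketch below illustrates that behaviour on a small stand-in frame; the name `demo` is hypothetical, and its two rows are copied from the output above.

```python
import pandas as pd

# Stand-in for toronto_merged, restricted to the two remaining rows.
demo = pd.DataFrame(
    {
        "Postcode": ["M5G", "M5H"],
        "Borough": ["Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": ["Central Bay Street", "Adelaide, King, Richmond"],
        "latitude": [43.657952, 43.650571],
        "longitude": [-79.387383, -79.384568],
        "Cluster Labels": [0, 0],
    },
    index=[23, 29],
)

# A label slice keeps every column from 'Postcode' through 'longitude', inclusive,
# so 'Cluster Labels' (which comes after 'longitude') is left out.
subset = demo.loc[[23, 29], "Postcode":"longitude"]
print(list(subset.columns))
```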


---
__Visualising the final 2 neighbourhoods on a map:__



In [62]:
# create a map of the most suitable Toronto neighbourhoods using latitude and longitude values from the final dataframe
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=15)

# add markers to map
for lat, lng, borough, neighbourhood in zip(final['latitude'], final['longitude'], final['Borough'], final['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=9,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=1,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

__The 2 neighbourhoods, Central Bay Street (Downtown Toronto) and Adelaide, King, Richmond (Downtown Toronto), are shown as 2 blue dots on the map above.__

---
## Results and Discussion
---

Our analysis shows that although Toronto has a large number of restaurants, there are pockets of low restaurant density fairly close to the city center. To identify these pockets, we used a clustering algorithm and segmented our neighbourhood dataset accordingly.

We used the K-means clustering algorithm to form 5 clusters, each containing neighbourhoods grouped by the number of restaurants in their vicinity. We then analysed each cluster by calculating its Restaurant/Neighbourhood ratio. Cluster 0 had the lowest ratio, which means that very few restaurants are present in the vicinity of each of its neighbourhoods. A total of 10 neighbourhoods belonged to cluster 0. Upon further analysis, we found that 8 of them were not good candidates for a new restaurant, which left only 2 neighbourhoods.
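
The ratio comparison can be sketched as follows, assuming the Restaurant/Neighbourhood ratio means the total number of restaurants in a cluster divided by the number of neighbourhoods in that cluster. The frame `clustered` and its values are toy data made up for illustration, not the project's results, and the clustering step itself is not reproduced here.

```python
import pandas as pd

# Toy data (hypothetical): one row per neighbourhood, with its cluster label
# and its total restaurant count.
clustered = pd.DataFrame(
    {
        "Cluster Labels": [0, 0, 0, 1, 1, 2],
        "Total": [0, 1, 2, 10, 14, 30],
    }
)

grouped = clustered.groupby("Cluster Labels")["Total"]

# Restaurants per neighbourhood in each cluster: sum of counts / number of rows.
ratio = grouped.sum() / grouped.count()
print(ratio)

# The cluster with the lowest ratio is the candidate for further analysis.
lowest_cluster = ratio.idxmin()
print(lowest_cluster)
```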

Our analysis identified 2 neighbourhoods where a restaurant business is likely to do well, for two reasons. First, these neighbourhoods have few restaurants in their vicinity, which lowers the competition. Second, as the map above shows, both neighbourhoods lie in the center of Toronto, which suggests a high population density and therefore more potential customers and more profit.

The final 2 neighbourhoods recommended for opening a new restaurant are stored in a dataframe named final, which contains the postcode, borough, name, latitude and longitude of each neighbourhood.

The owners can then choose between these 2 locations according to the type of restaurant they plan to open.

---
## Conclusion 
---

The purpose of this project was to identify neighbourhoods in Toronto with a low number of restaurants, in order to help stakeholders narrow down the search for an optimal location for a new restaurant. Using Foursquare data, we first identified the most common nearby venues of each neighbourhood and calculated the restaurant density distribution. Then, with the help of clustering and further analysis, we narrowed the candidates down to 2 neighbourhoods that are good choices for opening a new restaurant. This concludes the Battle of the Neighbourhoods project.