# **Recommendation of the best place to set up a cafe/juice bar in Toronto**

# 1. Introduction

**Background**

Toronto is a major world city that covers a large area and contains many neighborhoods. Each neighborhood has a variety of venues such as restaurants and pubs. The client is new to Toronto and wants a recommendation on the best place to set up a small cafe or juice bar. The main requirement is to find the neighborhood where the business is most likely to flourish. Toronto already has many coffee shops, including chains such as Starbucks. Knowing which locations avoid the heaviest competition, together with other insights about each neighborhood, will help the client make a well-informed decision.

**Problem**

The data that can help determine the best place to set up a cafe/juice bar in Toronto is the availability of cafes and restaurants in each neighborhood. Insight into what sort of people frequent a particular area (outgoing or not) can also help identify the best neighborhood. The project aims to show the best location for the client to set up his business based on this data.

**Interest**

The client is new to Toronto and needs to find a location where he can set up his new business without losing money. He also has a family to support, so he wants to make the best possible decision. A well-informed decision reduces the risk of a loss on his investment.

# 2. Data Acquisition & Cleaning

**Data Sources**

For this problem we need data on all the neighborhoods in Toronto & details of all the places to eat in these areas. Throughout this project I use the word eatery for any place that relates to the client's requirement in some way: restaurants, cafes and food joints are all counted as eateries.

Accordingly, the required data was extracted from the following sources:

* Neighborhood data with postal codes: Wikipedia page "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
* Venue data: Foursquare
* Longitudes & latitudes of Toronto neighborhoods: CSV file provided by Coursera 'http://cocl.us/Geospatial_data'
* Longitude & latitude of Toronto: from the geopy package

The venues dataset doesn't provide detailed insights into the venues, such as reviews or popularity. However, it can be used to identify all venues within a given neighborhood.

**Data Cleansing**

First, the neighborhood data had to be scraped from the table on the wiki page. I used Python for the web scraping. The scraped table then had to be adjusted to provide the data required for the project.

The rows containing 'Not assigned' in the Borough column were removed as they were of no use to the analysis. The table was then checked for missing values, and there were none. The extracted table also contained areas outside the main parts of Toronto, so it was reduced to include only the relevant data.

Because this project uses the Foursquare API to get all venue data, it is important to have the coordinates of the neighborhoods to be analysed. These were obtained by joining the neighborhood table with the CSV file containing the coordinates of the Toronto neighborhoods.
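The cleaning and joining steps described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the two small tables are hand-made stand-ins for the scraped wiki table and the coordinates CSV, the column names are assumptions, and filtering on boroughs whose name contains "Toronto" is one assumed way of keeping only the relevant areas.

```python
import pandas as pd

# Hypothetical stand-in for the table scraped from the wiki page.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M4N", "M4T", "M1B", "M5N"],
    "Borough": ["Not assigned", "Central Toronto", "Central Toronto",
                "Scarborough", "Central Toronto"],
    "Neighborhood": ["Not assigned", "Lawrence Park",
                     "Moore Park, Summerhill East", "Malvern, Rouge", "Roselawn"],
})

# Hypothetical stand-in for the coordinates CSV (illustrative values only).
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M4N", "M4T", "M5N"],
    "Latitude": [43.8067, 43.7280, 43.6896, 43.7117],
    "Longitude": [-79.1944, -79.3888, -79.3832, -79.4169],
})


def clean_neighborhoods(raw, coords):
    """Drop unassigned rows, keep Toronto boroughs, attach coordinates."""
    assigned = raw[raw["Borough"] != "Not assigned"]
    # Assumption: the relevant areas are the boroughs named "... Toronto".
    toronto = assigned[assigned["Borough"].str.contains("Toronto")]
    merged = toronto.merge(coords, on="Postal Code", how="left")
    return merged.reset_index(drop=True)


toronto_df = clean_neighborhoods(raw, coords)
print(toronto_df)
print("Missing values:", int(toronto_df.isna().sum().sum()))
```

A left join keeps every cleaned neighborhood even if a postal code were missing from the coordinates file, so the missing-value check afterwards would expose any gap.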


**Data Selection**

After the initial cleaning there were 39 rows with 6 columns. The objective of the project is to provide the best recommendation for the client in order to start the business.

All the venues in each neighborhood were obtained from the Foursquare dataset. The venue category is used for the analysis in this project as it gives the best insight into a venue. In total, 1605 venues were available for the analysis, retrieved based on their location coordinates.

Venue categories were grouped by neighborhood, and then the proportion of each venue category within a neighborhood was analyzed. There were 228 unique venue categories. The final analysis is based on this data.

When planning the analysis & selecting data for the project, some key assumptions were made:
* All venue details in a neighborhood have been captured by the Foursquare dataset
* People residing in, or using any facility in, a certain neighborhood will prefer to dine/eat/drink in that neighborhood
* All data captured in the Foursquare dataset describes actively used venues


Here I have attached some key snapshots of the data sets used, for a better understanding of the data.

Below is a snapshot of the data from the wiki table used to identify all neighborhoods. NaN values are clearly visible in this snapshot:

   ![image.png](attachment:image.png)

The longitudes & latitudes were extracted from a table of the format below:

   ![image.png](attachment:image.png)

The above two tables were joined to get the coordinate details of the neighborhoods.

All the venue details taken from the Foursquare dataset were then summarized into the form below, to make it clear how they map to the neighborhood data.

![image.png](attachment:image.png)

Using the above formats and data sets, the project was carried out in order to identify the best recommendation according to the client's requirement.

# 3. Methodology

## Background

The problem is to provide a recommendation from an existing dataset. The final goal is to recommend the best neighborhood in which to set up a cafe/juice bar.

The dataset can be used to assess the neighborhoods on the following basis:
* Venue categories present in a neighborhood
* Competition for setting up a new eatery
* Insight into the neighborhood gained by looking at the existing venues

## Model Development

**Basic concept of the model development;**

Our first target is to find the neighborhoods with fewer eateries. For this we have to identify all the venues present in each neighborhood.

Then we identify which neighborhoods can be eliminated from consideration outright because they do not match the client's requirement. Afterwards the best neighborhood in which to set up the business is selected.

**Best approach for Model Development;**

The best way to approach this project is to first cluster the dataset based on similar attributes. I anticipated that this would let me clearly identify the areas which have more eateries.

The clustering method used for this project is K-means. It is used to partition the Foursquare dataset into a given number K of clusters. Neighborhoods with similar attributes are placed in the same cluster.

**Step by Step Model Development;**

* Using the initial data sources for Toronto neighborhoods & location coordinates, build the following merged table

![image.png](attachment:image.png)

* With the Foursquare Dataset identify all the venues in each neighborhood

![image.png](attachment:image.png)
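To illustrate how the venues of one neighborhood can be turned into table rows like those shown above, here is a small sketch that flattens a venue-search response into records. It assumes the response follows the legacy Foursquare v2 `venues/explore` shape (`response` → `groups` → `items` → `venue`); the function name, the output column names and the sample payload with its venue names are all made up for illustration, and no network request is made.

```python
def parse_explore_response(neighborhood, payload):
    """Flatten one explore-style response into a list of venue rows."""
    rows = []
    groups = payload.get("response", {}).get("groups", [])
    for group in groups:
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "Neighborhood": neighborhood,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                # Use the first listed category, if any, as the venue category.
                "Venue Category": categories[0]["name"] if categories else None,
            })
    return rows


# Hand-made payload in the assumed shape (not real API output).
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Gym",
                           "location": {"lat": 43.690, "lng": -79.383},
                           "categories": [{"name": "Gym"}]}},
                {"venue": {"name": "Example Playground",
                           "location": {"lat": 43.689, "lng": -79.384},
                           "categories": [{"name": "Playground"}]}},
            ]
        }]
    }
}

rows = parse_explore_response("Moore Park, Summerhill East", sample_payload)
for row in rows:
    print(row)
```

Repeating this for every neighborhood and concatenating the rows gives one long table with a neighborhood column and a venue category column, which is the form the later steps work from.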

* Identify the proportion of each venue category in the given neighborhood (that is, the probability of finding this venue category among all the venues in the neighborhood)

*The table below has the neighborhoods as the 1st column & all the different venue categories as the other columns. The decimal values show the proportion of each venue category among all venues in the particular neighborhood*


![image.png](attachment:image.png)
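The proportion table can be sketched with a cross-tabulation. This is one possible way to compute it, shown on a tiny made-up venue list; the neighborhood names `A` and `B` and the categories are placeholders, not values from the project.

```python
import pandas as pd

# Made-up venue list: one row per venue, as in the venues table.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Gym", "Park", "Bus Line"],
})

# Count venues per (neighborhood, category) and divide each row by its total,
# so every row sums to 1 and each cell is that category's share of the venues.
proportions = pd.crosstab(
    venues["Neighborhood"], venues["Venue Category"], normalize="index"
)

print(proportions)
```

In this toy data neighborhood `A` has four venues, two of them cafés, so its café proportion is 2/4 = 0.5.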

* Identify the top 10 venue categories for each neighborhood based on the proportion of each venue category in the neighborhood

*The following table consists of 39 rows (the number of neighborhoods considered in Toronto) & the top 10 most common venue categories in these areas. This is the table which is used for clustering*

![image.png](attachment:image.png)
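Ranking the categories per neighborhood can be sketched as below. The helper name `top_categories`, the output column names and the toy proportions are assumptions for illustration; the example keeps the top 2 rather than the top 10 only because the toy table is small.

```python
import pandas as pd

# Toy proportion table: rows are neighborhoods, columns are venue categories.
proportions = pd.DataFrame(
    {
        "Café": [0.5, 0.0],
        "Park": [0.3, 0.6],
        "Gym": [0.2, 0.0],
        "Bus Line": [0.0, 0.4],
    },
    index=pd.Index(["A", "B"], name="Neighborhood"),
)


def top_categories(proportions, n):
    """Return one row per neighborhood with its n most common categories."""
    records = []
    for neighborhood, row in proportions.iterrows():
        ranked = row.sort_values(ascending=False).head(n)
        records.append([neighborhood] + list(ranked.index))
    columns = ["Neighborhood"] + [
        f"Most Common Venue {i}" for i in range(1, n + 1)
    ]
    return pd.DataFrame(records, columns=columns)


top2 = top_categories(proportions, 2)
print(top2)
```

Categories with equal proportions have no guaranteed relative order here, so a neighborhood with very few venues may show arbitrary categories in its lower ranks.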

* Identify the best K value (Number of clusters) to be used for the clustering 

We use the elbow method to identify the best K value for clustering. This is done by plotting the sum of squared distances against different K values. The sum always decreases as K increases, so a lower sum alone does not indicate a better K; with too many clusters the groups become too small to be meaningful. The elbow is the point where the decrease starts to level off, and I used it to identify the best K value from the graph.

![image.png](attachment:image.png)
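The values behind an elbow plot like the one above can be sketched with scikit-learn, whose `KMeans` exposes the sum of squared distances as `inertia_`. The data here is random and only mimics the shape of the real table (39 rows whose values sum to 1); the helper name `inertia_by_k` and the range of K values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the proportion table: 39 rows, each summing to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=39)


def inertia_by_k(X, k_values, seed=0):
    """Fit K-means for each K and record the sum of squared distances."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias[k] = model.inertia_
    return inertias


inertias = inertia_by_k(X, range(1, 11))
for k, value in inertias.items():
    print(f"K={k:2d}  sum of squared distances={value:.4f}")
```

Plotting these values against K and looking for the point where the curve flattens gives the elbow; on random data like this the bend is much less distinct than on data with real cluster structure.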

* Cluster the neighborhoods into the different clusters identified

The K value selected from the above analysis was 8. The dataset of 39 neighborhoods is thereby divided into 8 clusters with cluster labels 0 to 7.
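Assigning cluster labels can be sketched as follows. This assumes the clustering runs on the numeric proportion table and uses scikit-learn's `KMeans`; the toy table has six placeholder neighborhoods and is split into 2 clusters only to keep the example small, whereas the project used K = 8. The helper name `label_clusters` and the column name `Cluster Label` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy proportion table with two clearly different kinds of neighborhood.
proportions = pd.DataFrame(
    {
        "Café": [0.60, 0.50, 0.55, 0.0, 0.0, 0.1],
        "Restaurant": [0.40, 0.40, 0.35, 0.0, 0.1, 0.0],
        "Park": [0.00, 0.10, 0.10, 1.0, 0.9, 0.9],
    },
    index=pd.Index(["N1", "N2", "N3", "N4", "N5", "N6"], name="Neighborhood"),
)


def label_clusters(proportions, k, seed=0):
    """Cluster the rows with K-means and prepend the cluster label column."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    model.fit(proportions.to_numpy())
    labelled = proportions.copy()
    labelled.insert(0, "Cluster Label", model.labels_)
    return labelled


labelled = label_clusters(proportions, k=2)
print(labelled)
```

The label numbers themselves carry no meaning; only which neighborhoods share a label matters.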

* Identify the clusters which have a higher proportion of eateries & eliminate them

After analysing all the clusters I directly eliminated the 5 clusters whose neighborhoods have a higher proportion of eateries. This was done by observation of the clusters. The finally selected clusters were as below:

![image.png](attachment:image.png)
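The elimination in this project was done by observation. As a rough numerical counterpart, the sketch below computes the average eatery share per cluster and keeps only clusters under a cut-off. The list of categories counted as eateries, the threshold of 0.2 and the toy table are all assumptions for illustration, not values used in the project.

```python
import pandas as pd

# Hypothetical list of categories counted as eateries.
EATERY_CATEGORIES = ["Café", "Restaurant", "Coffee Shop"]
# Hypothetical cut-off: clusters above this average eatery share are dropped.
THRESHOLD = 0.2

# Toy labelled proportion table (cluster label plus category proportions).
labelled = pd.DataFrame(
    {
        "Cluster Label": [0, 0, 1, 2],
        "Café": [0.4, 0.5, 0.0, 0.1],
        "Restaurant": [0.3, 0.3, 0.0, 0.0],
        "Park": [0.3, 0.2, 1.0, 0.9],
    },
    index=pd.Index(["N1", "N2", "N3", "N4"], name="Neighborhood"),
)


def eatery_share_by_cluster(labelled, eatery_categories):
    """Average, per cluster, the share of venues that are eateries."""
    present = [c for c in eatery_categories if c in labelled.columns]
    per_neighborhood = labelled[present].sum(axis=1)
    return per_neighborhood.groupby(labelled["Cluster Label"]).mean()


shares = eatery_share_by_cluster(labelled, EATERY_CATEGORIES)
kept = shares[shares <= THRESHOLD].index.tolist()
print(shares)
print("Clusters kept:", kept)
```

A fixed threshold is cruder than inspecting each cluster by hand, but it makes the elimination step repeatable if the venue data is refreshed.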

* Rank the finally selected clusters, which have less competition

At this stage we have already identified that the best borough in which to set up the client's business is Central Toronto. The client will be given a ranking of these 3 neighborhoods.

For this, the exact proportions for each of these 3 neighborhoods are considered & the result is finalized.

# 4. Results

From the model developed in this project, the final 3 neighborhoods suitable for setting up the business are as below:
* Lawrence Park
* Moore Park, Summerhill East
* Roselawn

The individual suggested neighborhoods have the following proportions for the venue categories:

Lawrence Park;

![image.png](attachment:image.png)

Moore Park, Summerhill East;

![image.png](attachment:image.png)

Roselawn;

![image.png](attachment:image.png)

From the above results we can see that these neighborhoods do not have many venues according to the Foursquare dataset. Based on my initial assumption that Foursquare has captured all the venue data of the neighborhoods, I make the following inferences:

1. Lawrence Park appears to have a young population, for which a park & a swim school have been established. The adult population does not seem to go out here, so they might take the bus to the city centers when required (for dining, movies, grocers etc.)

2. Moore Park, Summerhill East appears to have a population weighted towards teens & young adults. They want to keep fit, exercise & play, hence the Gym & Playground based here. Here too, the adult & young populations will visit other neighborhoods for dining, movies, grocers etc.

3. Roselawn has data for only 1 venue. This suggests that the population might be smaller here. The people are most likely to go out only for essential tasks & have to go to other neighborhoods for dining, movies, grocers etc.

# 5. Discussion

From the above results & insights, setting up a small juice shop/cafe in Roselawn is likely to be in vain, as the population seems to be smaller & less outgoing within the neighborhood.

However, both Lawrence Park & Moore Park, Summerhill East are neighborhoods with higher potential.

Children or young adults who play & exercise are bound to get tired, and this will also increase their appetites. As there are no eateries in these 2 neighborhoods, they offer a great opportunity to set up the client's business.

However, based on my inferences from the results, I predict that the crowd using the 'Gym' & 'Playground' in Moore Park, Summerhill East has higher spending power than the crowd using the 'Swim school' & 'Park' in Lawrence Park. This is because the crowd using the facilities in Lawrence Park can be assumed to be younger.

I also do not consider the crowd using the 'Bus lines' in Lawrence Park to be high-potential customers compared to the users of the other 2 facilities.

Therefore the best neighborhood in which to set up the business can be considered to be **Moore Park, Summerhill East**

# 6. Conclusion

The best choice of neighborhood for the client to set up the business is Moore Park, Summerhill East.

The ranking of the 3 best neighborhoods is:
1. Moore Park, Summerhill East
2. Lawrence Park
3. Roselawn

The K-Means clustering method was used to arrive at this conclusion.

It is important to understand that the decision is based on several assumptions. These are listed below:
* The exact area & population of a neighborhood are not considered. Instead, insight into these has been inferred from the venues in the neighborhood. (e.g. the Gym & Playground in Moore Park, Summerhill East suggest that young people live in the neighborhood & use these facilities)
* The insight is based on the venues themselves, not on any review or current status data about them. (e.g. even though there is a gym in Moore Park, Summerhill East, I do not have data on how many people use it (1 or 100) or whether it is open only 1 day of the week etc.)
* I have assumed that all venues in the dataset are actively used, each by an average number of people. This rests on the reasoning that the venue would otherwise have closed & hence been removed from the dataset
* All venue details in a neighborhood have been captured by the Foursquare dataset
* People residing in, or using any facility in, a certain neighborhood will prefer to dine/eat/drink in that neighborhood
