In [1]:
from IPython.display import IFrame
import pandas as pd

#### Coursera Capstone
## The Battle of the Neighborhoods
### Find the best place to start a restaurant in Amsterdam

#### Introduction

Vegetarianism has recently become one of the most popular dietary trends around the world, and more and more people are embracing plant-based lifestyles. Although the outlook for vegetarian and vegan dining seems promising, the hyper-competitive nature of the restaurant market should not be neglected. <br>

A restaurant operator from <b>Amsterdam</b> runs a very successful vegetarian restaurant in the neighborhood 'Oude Pijp'. Over the past few years, however, more and more similar restaurants have opened in the neighborhood. He is therefore looking for a location for a second restaurant. He wants to stay in Amsterdam because of its high number of vegetarians and tourists and its excellent accessibility. <br>

The operator knows from previous experience that <b>location</b> makes as much of a difference to the success of a new restaurant as the menu does, if not more. He has asked us for data-driven advice on the best location for a new restaurant. His main requirements are: <br>

1. A neighborhood similar to that of the current restaurant
2. Low rent prices
3. Not in a suburban area of Amsterdam
4. Few competitors (vegetarian restaurants) so far
5. Safe environment

<img src=https://www.meininger-hotels.com/fileadmin/images/meininger-hotels-in-amsterdam-zentrum-6d28120.jpg width = '100%' height=500 >

#### Target audience
This project is particularly useful for restaurant chain owners and investors looking to open or invest in a vegetarian restaurant.
The primary audience of this report is the restaurant operator from Amsterdam. His restaurant is called <b>De Waaghals</b> and is located just south of the center of Amsterdam.

#### Problem description
Ideally, the operator would open a restaurant in the heart of Amsterdam. Although this would expose his restaurant to a lot of passing tourists each day, it would also mean excessive monthly rent. At this point, he doesn't want to take that risk. Therefore, he is looking for a similar neighborhood with lower average rent prices. <br>

The main problem to be resolved in this project is as follows: <br>
>***Provide a top three of neighborhoods in Amsterdam that are ideal for opening a new restaurant, based on affordability, safety, demographics and competition***

Even though the rising demand for vegetarian food brings profit potential, many factors need to be weighed carefully. So how can we use machine learning to find a suitable location for the new vegetarian restaurant? One approach is to use a clustering method to group neighborhoods by their restaurant categories and combine the result with population data. The following sections walk through this approach step by step.

In [16]:
IFrame(src='data/Waaghals.html', width= '100%', height=600)

## Methodology
The following methodology is used to perform this research:
1. **Data collection**. In this type of research, the center point of a neighborhood is usually used to evaluate its statistics. However, neighborhoods are not points but areas. Therefore, we will collect the polygons of the neighborhoods and add statistics from various sources.
2. **Data preparation**. Combine all the data sources into a single format and extract important features from the larger datasets.
3. **Data visualization**. Plot the collected data on folium maps.
4. **Clustering**. Cluster similar neighborhoods into groups using the k-means clustering algorithm.
5. **Results and recommendations**
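
Step 1 above treats each neighborhood as a polygon rather than a point. As a minimal sketch of what that implies, the code below assigns a coordinate to a neighborhood with a ray-casting point-in-polygon test. The rectangular outlines and the helper names `point_in_polygon` and `assign_neighborhood` are hypothetical and only serve as illustration; the notebook itself may rely on a geospatial library for this step.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(point: Point, polygons: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first polygon containing the point, or None."""
    for name, polygon in polygons.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical rectangular outlines for illustration only (not real boundaries)
polygons = {
    "Neighborhood A": [(4.84, 52.33), (4.85, 52.33), (4.85, 52.35), (4.84, 52.35)],
    "Neighborhood B": [(4.85, 52.33), (4.86, 52.33), (4.86, 52.35), (4.85, 52.35)],
}

venue = (4.844127, 52.340269)  # (longitude, latitude)
print(assign_neighborhood(venue, polygons))
```

Working with polygons means every venue is counted in the area it actually falls in, instead of being attached to the nearest neighborhood center.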

## Data collection

#### Data sources
The data we will collect comes from four sources. Each source addresses one of the criteria from the problem description. <br>

1. **Neighborhood data** - Basic statistics about the boroughs and their neighborhoods are collected from the City of Amsterdam open data portal ([data.amsterdam.nl](https://data.amsterdam.nl))
2. **Demographic data** - Statistics on the people living in each neighborhood will be collected from the website of Statistics Netherlands, CBS ([website CBS](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=37296ned&_theme=63))
3. **Rent costs** - We will estimate the rent costs per borough based on the average value per square meter of the homes (WOZ value). The data will be obtained via data.amsterdam.nl ([dataset: Gemiddelde WOZ-waarde woningen wijken en stadsdelen](https://data.amsterdam.nl/datasets/B4EcyyT9e_AFyQ/stedelijke-ontwikkeling-wijken/))
4. **Venue data** - We will use the Foursquare API to retrieve all restaurants per borough ([Foursquare API](https://foursquare.com/developers/apps)) <br>

#### Neighborhood data
The datasets from the first three sources were acquired and consolidated into a single dataframe. This resulted in a total of 56 neighborhoods and 18 columns of data.
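
A minimal sketch of how such a consolidation might look with pandas, assuming each source has already been reduced to one table indexed by neighborhood name. The three small frames are hypothetical stand-ins whose values are copied from the first two rows of the combined table; the variable names are not taken from the original notebook.

```python
import pandas as pd

names = pd.Index(["Burgwallen-Oude Zijde", "Burgwallen-Nieuwe Zijde"], name="Neighborhood")

# Hypothetical per-source tables, each indexed by neighborhood name
demographics = pd.DataFrame(
    {"Population Density": [12784.0, 7236.0], "Average Household Size": [1.4, 1.4]},
    index=names,
)
house_prices = pd.DataFrame({"House price per m2": [6030.0, 6502.0]}, index=names)
safety = pd.DataFrame({"Perception of safety": [106.73, 96.95]}, index=names)

# Inner joins on the index keep only neighborhoods present in every source
combined = demographics.join(house_prices, how="inner").join(safety, how="inner")
print(combined)
```

An inner join is a conservative choice: a neighborhood missing from any source is dropped instead of being carried along with empty values.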

In [3]:
data = pd.read_pickle('data/neigh_data.pickle')
data.head()

| Neighborhood | Total # Restaurants | Most occuring type | Number of Vegan / Vegatarian Restaurants | % Men | % Woman | % Western | % Non-Western | % Children | % Youth | % Young Urban Professionals | % Boomers | % Eldery | Population Density | Average Household Size | House price per m2 | Safety Index | Criminality Index | Perception of safety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Burgwallen-Oude Zijde | 73 | Chinese Restaurant | 0.0 | 0.552072 | 0.447928 | 0.338186 | 0.174692 | 0.049272 | 0.155655 | 0.491601 | 0.210526 | 0.095185 | 12784.0 | 1.4 | 6030.0 | 178.1 | 102.01 | 106.73 |
| Burgwallen-Nieuwe Zijde | 106 | Italian Restaurant | 1.0 | 0.546005 | 0.453995 | 0.382567 | 0.179177 | 0.047215 | 0.14891 | 0.53753 | 0.190073 | 0.081114 | 7236.0 | 1.4 | 6502.0 | 168.84 | 117.2 | 96.95 |
| Grachtengordel-West | 57 | French Restaurant | 0.0 | 0.523699 | 0.476301 | 0.348096 | 0.112665 | 0.073038 | 0.135198 | 0.351981 | 0.271173 | 0.17094 | 14380.0 | 1.6 | 7339.0 | 79.61 | 52.74 | 69.55 |
| Grachtengordel-Zuid | 109 | Italian Restaurant | 1.0 | 0.538674 | 0.461326 | 0.332413 | 0.134438 | 0.084715 | 0.132597 | 0.384899 | 0.252302 | 0.14733 | 10463.0 | 1.6 | 7145.0 | 137.45 | 139.46 | 74.58 |
| Nieuwmarkt/Lastage | 67 | Sandwich Place | 0.0 | 0.521381 | 0.478104 | 0.281813 | 0.157135 | 0.07831 | 0.113344 | 0.341577 | 0.284389 | 0.183926 | 13652.0 | 1.5 | 5984.0 | 120.79 | 113.35 | 87.59 |


#### Venue data
The Foursquare API was used to collect data on all restaurants in Amsterdam. Collecting them all at once is not possible, because the Foursquare API returns at most 100 results per query, far fewer than the total number of restaurants in the city. <br>

The solution I applied here is to create a grid of locations on the map. For each point in this grid, we perform a search query that returns up to 100 results. This resulted in 2117 restaurants across around 90 different categories. A sample of the data is shown below.
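
The grid approach can be sketched as follows. This is a minimal illustration under stated assumptions: the bounding box, the grid resolution, and the helpers `make_grid` and `collect_venues` are hypothetical, and `fake_search` is a stand-in for the real Foursquare request. Dropping duplicates by venue id is also an assumption; it is needed because neighboring grid points can return the same venue.

```python
import numpy as np


def make_grid(lat_min, lat_max, lon_min, lon_max, n_lat, n_lon):
    """Return evenly spaced (lat, lon) query points covering a bounding box."""
    lats = np.linspace(lat_min, lat_max, n_lat)
    lons = np.linspace(lon_min, lon_max, n_lon)
    return [(float(lat), float(lon)) for lat in lats for lon in lons]


def collect_venues(grid, search):
    """Run `search(lat, lon)` at every grid point and drop venues already seen."""
    seen = {}
    for lat, lon in grid:
        for venue in search(lat, lon):
            seen.setdefault(venue["id"], venue)  # keep the first occurrence only
    return list(seen.values())


def fake_search(lat, lon):
    """Hypothetical stand-in for the API call; returns overlapping results on purpose."""
    return [
        {"id": f"lat-{lat:.3f}", "name": "example venue"},
        {"id": "shared", "name": "venue returned by every query"},
    ]


# Hypothetical bounding box roughly around Amsterdam
grid = make_grid(52.32, 52.42, 4.80, 5.00, n_lat=5, n_lon=8)
venues = collect_venues(grid, fake_search)
print(len(grid), "queries,", len(venues), "unique venues")
```

A finer grid lowers the chance that a dense area hits the 100-result limit, at the cost of more queries.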

In [4]:
venues = pd.read_pickle('data/venue_data.pickle')
venues.head()

| | Restaurant name | Restaurant category | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Jefferson Bar Brasserie | Restaurant | Hoofddorppleinbuurt | 52.340269 | 4.844127 |
| 1 | Oliver's Crazy Kitchen | Italian Restaurant | Hoofddorppleinbuurt | 52.340771 | 4.844395 |
| 2 | Salad Bowl Club | Salad Place | Hoofddorppleinbuurt | 52.339045 | 4.84304 |
| 3 | Lunchroom Etcetera | Cafeteria | Hoofddorppleinbuurt | 52.33955 | 4.843908 |
| 4 | by gusto | Breakfast Spot | Hoofddorppleinbuurt | 52.340607 | 4.84267 |


## Data visualization

To visualize the data, I use the folium package to generate maps with geographical details. Choropleth maps make it possible to plot the shapes of the neighborhoods and color them by a statistic.

### House prices
Let's start with looking at the neighborhoods of Amsterdam and the average house price per m2 in those neighborhoods. As a reference, the restaurant location is included in the map.

In [15]:
IFrame(src='data/houseprices.html', width = '100%', height=800)

### Safety
Three safety and security metrics from data.amsterdam.nl are included: perception of safety, safety index and criminality index. Since the restaurant operator doesn't want to consider the suburban neighborhoods, these are greyed out.

In [6]:
IFrame(src='data/Security.html', width= '100%', height=800)

### Restaurant data

#### Plotting the restaurants on the map
The neighborhood 'Oude Pijp', where De Waaghals is located, has the highest number of restaurants. It also has the highest number of vegetarian restaurants (marked with green dots), with five in total.

In [17]:
IFrame(src='data/restaurants.html', width= '100%', height=800)

#### Heatmap of restaurant types
Italian restaurants are the most common type of restaurant, followed by French restaurants.

In [18]:
IFrame(src='data/heatmap.html', width= '100%', height=800)

## Clustering
Now that we've collected and combined all the features, it is time to perform clustering on the dataset. This will result in groups of neighborhoods that are similar to each other. <br>
Before clustering, we one-hot encode the venue categories using the pandas `pd.get_dummies` function.
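
A minimal sketch of the encoding step, using the five sample venues shown in the venue table. Averaging the one-hot columns per neighborhood, so that each neighborhood gets the share of each restaurant category, is an assumption about how the encoded data is aggregated; the source only states that `pd.get_dummies` is applied.

```python
import pandas as pd

# Sample rows taken from the venue table above
venues = pd.DataFrame(
    {
        "Neighborhood": ["Hoofddorppleinbuurt"] * 5,
        "Restaurant category": [
            "Restaurant",
            "Italian Restaurant",
            "Salad Place",
            "Cafeteria",
            "Breakfast Spot",
        ],
    }
)

# One column per category, 1.0 where the venue belongs to that category
one_hot = pd.get_dummies(venues["Restaurant category"], dtype=float)
one_hot.insert(0, "Neighborhood", venues["Neighborhood"])

# Assumed aggregation: share of each category within a neighborhood
profile = one_hot.groupby("Neighborhood").mean()
print(profile)
```

The resulting per-neighborhood profile has one row per neighborhood and can be joined with the other neighborhood features before clustering.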

### K-means clustering
Using the elbow method, I found k = 6 to be the optimal number of clusters for the K-means clustering algorithm. The map below illustrates the result; each color represents one cluster.
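
The elbow method can be sketched with scikit-learn as follows. The feature matrix here is synthetic (56 random rows standing in for the 56 neighborhoods), and standardizing the features before clustering is an assumption that the source does not confirm. The inertia values printed by this sketch therefore say nothing about the real data; only the procedure is illustrated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the neighborhood feature matrix (56 rows, 2 features)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack(
    [rng.normal(c, 0.5, size=(n, 2)) for c, n in zip(centers, (19, 19, 18))]
)

# Assumption: put features on a comparable scale before computing distances
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for a range of k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))

# Final fit with the k reported in the text
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_scaled)
```

The "elbow" is the value of k after which the inertia stops dropping sharply; plotting `inertias` against k makes that point easier to spot than reading the printed numbers.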

In [19]:
IFrame(src='data/clusters.html', width= '100%', height=800)

## Results
### Finding similar neighborhoods
In the following plot, only the neighborhoods falling in the same cluster as the 'Oude Pijp' are included, colored by the average housing price. When hovering over the neighborhoods, you can see a couple of relevant statistics.

In [20]:
IFrame(src='data/cluster0.html', width= '100%', height=800)

### Selecting top three neighborhoods
Based on the house price, perception of safety and competition level, we can recommend two different options for opening the next vegetarian restaurant.
#### **1. Westindische Buurt** - Safe and young
**Pros:**
* Highest perception of safety of all neighborhoods in cluster
* No vegetarian restaurants, yet
* High percentage of young people (target audience of the restaurant)

**Cons:**
* Few restaurants in the neighborhood so far
* House price is average

#### **2. Jordaan** - Popular with no competition, but a bit more expensive
**Pros:**
* Popular neighborhood with a lot of tourists
* High number of restaurants, but no vegetarian restaurants yet
* High perception of safety

**Cons:**
* House price is high
* High percentage of elderly people (not the target audience)

In [11]:
pd.read_pickle('data/cluster.pickle')

| Neighborhood | Total # Restaurants | Most occuring type | Number of Vegan / Vegatarian Restaurants | % Men | % Woman | % Western | % Non-Western | % Children | % Youth | % Young Urban Professionals | % Boomers | % Eldery | Population Density | Average Household Size | House price per m2 | Safety Index | Criminality Index | Perception of safety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jordaan | 131 | Italian Restaurant | 0.0 | 0.506959 | 0.493041 | 0.281186 | 0.146649 | 0.078093 | 0.097423 | 0.378093 | 0.278351 | 0.168814 | 23247.0 | 1.5 | 6499.0 | 92.96 | 77.02 | 71.03 |
| Frederik Hendrikbuurt | 22 | Tapas Restaurant | 0.0 | 0.502115 | 0.497885 | 0.251964 | 0.221148 | 0.095468 | 0.079758 | 0.461027 | 0.245921 | 0.119637 | 23076.0 | 1.6 | 5965.0 | 84.11 | 68.39 | 75.87 |
| Da Costabuurt | 34 | Vegetarian / Vegan Restaurant | 3.0 | 0.504292 | 0.494635 | 0.280043 | 0.158798 | 0.099785 | 0.08691 | 0.450644 | 0.237124 | 0.127682 | 21439.0 | 1.6 | 6130.0 | 91.21 | 88.19 | 76.94 |
| Helmersbuurt | 35 | Ethiopian Restaurant | 0.0 | 0.487092 | 0.512228 | 0.275815 | 0.138587 | 0.110734 | 0.10462 | 0.417799 | 0.253397 | 0.11481 | 21968.0 | 1.7 | 6638.0 | 95.98 | 107.73 | 73.36 |
| Landlust | 44 | Snack Place | 3.0 | 0.494226 | 0.505774 | 0.180577 | 0.428084 | 0.143832 | 0.120472 | 0.445407 | 0.208399 | 0.082677 | 18647.0 | 1.8 | 5031.0 | 107.58 | 95.15 | 98.26 |
| Westindische Buurt | 7 | French Restaurant | 0.0 | 0.474479 | 0.525521 | 0.233645 | 0.200575 | 0.13156 | 0.101366 | 0.488138 | 0.193386 | 0.086988 | 21845.0 | 1.8 | 5942.0 | 68.62 | 56.1 | 66.85 |
| Oude Pijp | 149 | Italian Restaurant | 5.0 | 0.503544 | 0.496456 | 0.287209 | 0.197098 | 0.085724 | 0.118124 | 0.487344 | 0.217685 | 0.091799 | 23344.0 | 1.5 | 6501.0 | 94.6 | 82.09 | 74.2 |
| Zuid Pijp | 13 | Indian Restaurant | 1.0 | 0.47558 | 0.52381 | 0.190476 | 0.35409 | 0.115995 | 0.104396 | 0.346764 | 0.26862 | 0.165446 | 23213.0 | 1.7 | 5908.0 | 91.5 | 69.01 | 87.06 |
| IJselbuurt | 11 | Fast Food Restaurant | 0.0 | 0.477316 | 0.522684 | 0.20794 | 0.212665 | 0.104915 | 0.102079 | 0.437618 | 0.222117 | 0.136106 | 20649.0 | 1.6 | 5737.0 | 87.8 | 86.02 | 84.56 |
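
The two recommendations are consistent with a simple filter-and-sort over the cluster table. The sketch below rebuilds three of its columns by hand, keeps the neighborhoods without a vegetarian restaurant, and orders them by perception of safety. It assumes that a lower index value means a neighborhood is perceived as safer, which matches how the report reads the numbers (Westindische Buurt has the lowest value and is described as having the highest perception of safety). The actual selection in the report may also have weighed other factors.

```python
import pandas as pd

# Three columns copied from the cluster table above
cluster = pd.DataFrame(
    {
        "Number of Vegan / Vegatarian Restaurants": [0.0, 0.0, 3.0, 0.0, 3.0, 0.0, 5.0, 1.0, 0.0],
        "House price per m2": [6499.0, 5965.0, 6130.0, 6638.0, 5031.0, 5942.0, 6501.0, 5908.0, 5737.0],
        "Perception of safety": [71.03, 75.87, 76.94, 73.36, 98.26, 66.85, 74.2, 87.06, 84.56],
    },
    index=pd.Index(
        [
            "Jordaan",
            "Frederik Hendrikbuurt",
            "Da Costabuurt",
            "Helmersbuurt",
            "Landlust",
            "Westindische Buurt",
            "Oude Pijp",
            "Zuid Pijp",
            "IJselbuurt",
        ],
        name="Neighborhood",
    ),
)

# Keep only neighborhoods without direct competition
candidates = cluster[cluster["Number of Vegan / Vegatarian Restaurants"] == 0]

# Assumption: lower index value = perceived as safer
ranked = candidates.sort_values("Perception of safety")
print(ranked.head(2))
```

With the values from the table, the first two rows are Westindische Buurt and Jordaan, the same two neighborhoods recommended above.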


## Conclusion
In a fast-moving world, data can help solve many real-life problems. In the example above, data was used to cluster the 56 major neighborhoods of Amsterdam. The results can help a restaurant operator decide which neighborhoods best fit his needs. <br>

I used several common Python libraries to scrape web data, used the Foursquare API to explore the major neighborhoods of Amsterdam, and visualized the resulting segmentation of neighborhoods on Folium maps. <br>

Similarly, data can be used to solve other problems that people face in metropolitan cities. This report has discussed the potential of this kind of analysis for a real-life problem, along with some of its drawbacks and opportunities for improvement that would give an even more realistic picture.