# Capstone Project - The Battle of the Neighborhoods

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The goal of this project is to recommend a location for someone looking to open a restaurant in New York City.
We will work on answering the following questions:

1. Where are the popular locations for running a restaurant business? Are there any geographical patterns among these popular restaurants?
    To find the hottest spots, we search for restaurants in the Foursquare location data, cluster them, and locate the center of each cluster.

2. How often are these restaurants mentioned by Foursquare users?
    To check whether the densest spots are also the most popular, we retrieve each restaurant's tips count from Foursquare and look at the relationship between location and popularity.

3. What characterizes the surroundings of these popular restaurants?
    Finally, we explore the nearby venues and discuss the characteristics of each cluster.


## Data <a name="data"></a>

1. To answer the first question, we use latitude and longitude data to search for restaurants on Foursquare, cluster the restaurants to find the dense zones, and then compute the center of each cluster.

2. To see whether restaurants in the dense zones are more popular, we use each restaurant's tips count from Foursquare. In the later analysis we divide the restaurants into levels by tips count.

3. To characterize each cluster, we fetch the venues near each cluster center and compute the 10 most common venue categories per cluster. (This part is shown in the Analysis section.)


In [1]:
# import libraries
import requests 
import pandas as pd 
import numpy as np 

from geopy.geocoders import Nominatim

from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

import folium


In [2]:
# Import previously collected New York neighborhoods data
newyork_data = pd.read_csv('new_york_data.csv')
neighborhoods = newyork_data.iloc[:, 1:]
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [None]:
# create map of New York using latitude and longitude values
# (latitude and longitude are set by the geocoding cell further down; run that cell first)
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [3]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # credential redacted; supply your own
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # credential redacted; supply your own
VERSION = '20210722' 
LIMIT = 100 
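
Hard-coding API credentials in a notebook makes them easy to leak. A minimal sketch of loading them from environment variables instead; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions chosen for this example, not names that Foursquare requires.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from the environment (variable names are hypothetical)."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None
```

With both variables exported in the shell, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would stand in for the literal assignments.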


### Explore Neighborhoods in Manhattan

Because Foursquare limits the number of venue requests, we narrow our analysis down to the neighborhoods in Manhattan.


In [5]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [6]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


### Create a map of the neighborhoods in Manhattan

In [7]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [7]:
# function for getting nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['venue']['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue ID']
    
    return(nearby_venues)
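
The request URL in `getNearbyVenues` is assembled with `str.format`. As a sketch of an alternative, the query string can be built from a dictionary with the standard library, which takes care of escaping. `build_explore_url` is a hypothetical helper; the endpoint and parameter names are the ones already used above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL for one coordinate pair (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # the comma is percent-encoded by urlencode
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

print(build_explore_url("id", "secret", "20210722", 40.876551, -73.91066))
```

Keeping the parameters in one dictionary also makes it easier to pass a `timeout` to `requests.get` and to log a request without the secret.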

In [None]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'], 
                                    latitudes=manhattan_data['Latitude'], 
                                    longitudes=manhattan_data['Longitude']
                                    )

In [9]:
# save the venues location data to a file for later use
# manhattan_venues.to_csv('manhattan_venues.csv')

manhattan_venues = pd.read_csv('manhattan_venues.csv').iloc[:, 1:]

# drop duplicated rows
manhattan_venues.drop_duplicates('Venue ID', inplace=True, ignore_index=True)
print(manhattan_venues.shape)
manhattan_venues.head()

(3091, 8)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place,4b4429abf964a52037f225e3
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio,4baf59e8f964a520a6f93be3
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner,4b79cc46f964a520c5122fe3
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop,4b5357adf964a520319827e3
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop,55f81cd2498ee903149fcc64


In [10]:
# select the rows whose venue category contains 'Restaurant'
manhattan_restaurants = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Restaurant')]
manhattan_restaurants.shape

(873, 8)
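
Filtering on the substring `Restaurant` keeps categories such as `Seafood Restaurant` but drops food venues whose category name lacks that word, for example `Pizza Place` or `Diner` in the venue table above. If those should count as restaurants, one possible sketch is a small keyword tuple. The extra keywords here are an assumption chosen for illustration, not a Foursquare taxonomy, and using them would change the 873 rows reported above.

```python
FOOD_KEYWORDS = ("Restaurant", "Pizza Place", "Diner")  # assumed list, extend as needed

def is_food_venue(category):
    """True if the category name contains any of the assumed food keywords."""
    return any(keyword in category for keyword in FOOD_KEYWORDS)

categories = ["Pizza Place", "Yoga Studio", "Diner", "Donut Shop", "Seafood Restaurant"]
print([c for c in categories if is_food_venue(c)])
```

Applied to the notebook's data, this could look like `manhattan_venues[manhattan_venues['Venue Category'].map(is_food_venue)]`.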

### Create a map of the restaurants

In [11]:
# create map of the restaurants
map_manhattan_restaurants = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(manhattan_restaurants['Venue Latitude'], manhattan_restaurants['Venue Longitude'], manhattan_restaurants['Venue']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_restaurants)  
    
map_manhattan_restaurants

### Get the rating and tips count of the restaurants

Because of the request limit, we have to fetch the data in separate batches.

In [None]:
# get the rating and tips count of the restaurants
ratings = []
tips_counts = []
venue_ids_400 = manhattan_restaurants['Venue ID'].head(400)

for venue_id in venue_ids_400:
	url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)

	result = requests.get(url).json()
	try:
		ratings.append(result['response']['venue']['rating'])
	except:
		ratings.append(np.nan)

	try:
		tips_counts.append(result['response']['venue']['tips']['count'])
	except:
		tips_counts.append(np.nan)

# print(ratings)
# print(tips_counts)

rating_tipsCount_400 = pd.DataFrame(zip(ratings, tips_counts), columns=['Rating', 'Tips Count'])
rating_tipsCount_400.to_csv('rating_tipsCount_400.csv')

In [None]:
ratings_tail = []
tips_counts_tail = []
# venue_ids_tail = manhattan_restaurants['Venue ID'].tail()
venue_ids_tail = manhattan_restaurants['Venue ID'].tail(len(manhattan_restaurants['Venue ID']) - 400)

for venue_id in venue_ids_tail:
	url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)

	result = requests.get(url).json()
	try:
		ratings_tail.append(result['response']['venue']['rating'])
	except:
		ratings_tail.append(np.nan)

	try:
		tips_counts_tail.append(result['response']['venue']['tips']['count'])
	except:
		tips_counts_tail.append(np.nan)

# print(ratings_tail)
# print(tips_counts_tail)

rating_tipsCount_tail = pd.DataFrame(zip(ratings_tail, tips_counts_tail), columns=['Rating', 'Tips Count'])
rating_tipsCount_tail.to_csv('rating_tipsCount_tail.csv')
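
The two cells above split the venue IDs by hand into the first 400 and the remainder. A sketch of the same idea as a generic helper, assuming the batch size is whatever the API quota allows; `batches` is a hypothetical name and the IDs below are placeholders.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

venue_ids = [f"id{i}" for i in range(873)]   # stand-in for manhattan_restaurants['Venue ID']
print([len(batch) for batch in batches(venue_ids, 400)])
```

Each batch could then be fetched and written to its own CSV file, so that an interrupted run only repeats the unfinished batch.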


In [21]:
rating_tipsCount_400 = pd.read_csv('rating_tipsCount_400.csv')
rating_tipsCount = pd.concat([rating_tipsCount_400, rating_tipsCount_tail], axis=0, ignore_index=True).iloc[:, 1:]
rating_tipsCount.to_csv('rating_tipsCount.csv')

rating_tipsCount = pd.read_csv('rating_tipsCount.csv').iloc[:, 1:]
rating_tipsCount

Unnamed: 0,Rating,Tips Count
0,7.1,19
1,,0
2,8.7,180
3,9.3,205
4,8.4,99
...,...,...
868,7.5,3
869,7.2,3
870,6.8,6
871,6.5,7


In [17]:
manhattan_restaurants.reset_index(inplace=True, drop=True)
manhattan_restaurants_merged = pd.concat([manhattan_restaurants, rating_tipsCount], axis=1)
manhattan_restaurants_merged.to_csv('manhattan_restaurants_merged.csv')

manhattan_restaurants_merged = pd.read_csv('manhattan_restaurants_merged.csv').iloc[:, 1:]
manhattan_restaurants_merged

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Rating,Tips Count
0,Marble Hill,40.876551,-73.910660,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant,4b9c9c6af964a520b27236e3,7.1,19
1,Marble Hill,40.876551,-73.910660,Grill 26 at TCR,40.878802,-73.915672,American Restaurant,5012c967e889cf0567e9e2d4,,0
2,Chinatown,40.715618,-73.994279,Spicy Village,40.717010,-73.993530,Chinese Restaurant,4db3374590a0843f295fb69b,8.7,180
3,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187,9.3,205
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快餐店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3,8.4,99
...,...,...,...,...,...,...,...,...,...,...
868,Hudson Yards,40.756658,-74.000111,Via Trenta,40.753004,-74.002898,Italian Restaurant,57e55e46498e04b0dc14dbb0,7.5,3
869,Hudson Yards,40.756658,-74.000111,Nitti’s,40.756726,-73.994175,Italian Restaurant,5bd10c0ca35dce002cb16e6c,7.2,3
870,Hudson Yards,40.756658,-74.000111,Treadwell,40.759964,-73.996284,Restaurant,5bb17b9531ac6c0039f150cf,6.8,6
871,Hudson Yards,40.756658,-74.000111,EDEN Local,40.759909,-73.996301,Restaurant,5a0264e01ffe977e0fea5da3,6.5,7


## Methodology <a name="methodology"></a>

In this project we look for the areas of Manhattan where restaurants are most densely concentrated and check whether restaurants in those areas are also more popular. Because of Foursquare's request limits, the analysis is restricted to the neighborhoods of Manhattan.

In the first step we collected the required **data**: the venues within 500 meters of each Manhattan neighborhood's coordinates, filtered to venues whose category contains 'Restaurant', together with the **rating and tips count** of every restaurant.

In the second step we use **DBSCAN** to separate the restaurants in **dense zones** from the outliers, and compare the rating and tips count inside and outside the dense zones. We also divide the restaurants into four 'hot degree' groups by the quartiles of the tips count.

In the third and final step we run **k-means clustering** on the dense-zone restaurants to locate the **center of each cluster**, look up the address closest to each center, and explore the venues near each center to find the **10 most common venue categories**. These centers should serve as a starting point for the final 'street level' search for a venue location.

## Analysis <a name="analysis"></a>

### Cluster Restaurants

We use DBSCAN to separate out restaurants in the dense zone.
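
Because the coordinates are standardized before clustering, `eps` in the cell below is not a distance in meters. To get a feel for physical scale, a haversine helper can report the great-circle distance between two points. This is a sketch using a mean Earth radius of 6371 km (an approximation); the two sample points are restaurant coordinates from the tables above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Spicy Village and Kiki's, both listed under Chinatown
print(round(haversine_m(40.717010, -73.993530, 40.714476, -73.992036)))
```

The two venues come out a few hundred meters apart, which gives a rough yardstick for judging how tight a DBSCAN neighborhood is.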

In [18]:
# Compute DBSCAN
restaurants_clustering = manhattan_restaurants_merged[['Venue Latitude', 'Venue Longitude']]
clustering_transformed = StandardScaler().fit_transform(restaurants_clustering)
db = DBSCAN(eps=0.1, min_samples=20).fit(clustering_transformed)

manhattan_restaurants_merged['Cluster Label'] = db.labels_
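# note: the unique labels include -1 (noise), so this count is the number of dense clusters plus one
n_clusters = len(manhattan_restaurants_merged['Cluster Label'].unique())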
print('The number of clusters is ', n_clusters)
manhattan_restaurants_merged.head()

The number of clusters is  13


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Rating,Tips Count,Cluster Label
0,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant,4b9c9c6af964a520b27236e3,7.1,19,-1
1,Marble Hill,40.876551,-73.91066,Grill 26 at TCR,40.878802,-73.915672,American Restaurant,5012c967e889cf0567e9e2d4,,0,-1
2,Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant,4db3374590a0843f295fb69b,8.7,180,0
3,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187,9.3,205,0
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快餐店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3,8.4,99,0
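
DBSCAN marks noise points with the label `-1`, so counting unique labels counts the noise group as well; the 13 printed above therefore corresponds to 12 dense clusters plus the outliers. A small sketch of counting the two separately, on made-up labels:

```python
def count_clusters(labels):
    """Return (number of clusters, number of noise points) for DBSCAN-style labels."""
    labels = list(labels)
    n_noise = labels.count(-1)
    n_clusters = len(set(labels) - {-1})
    return n_clusters, n_noise

print(count_clusters([-1, -1, 0, 0, 0, 1, 1, 2]))
```

In the notebook the same function could be called as `count_clusters(db.labels_)`.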


### Create a map of the clusters

Here, gray circles mark the outliers (cluster label -1).

In [19]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(n_clusters-1)
ys = [i + x + (i*x)**2 for i in range(n_clusters-1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array] + ['#a9a9a9']

# add markers to the map
for lat, lon, lab, cluster in zip(manhattan_restaurants_merged['Venue Latitude'], manhattan_restaurants_merged['Venue Longitude'], manhattan_restaurants_merged['Cluster Label'], manhattan_restaurants_merged['Cluster Label']):
    label = folium.Popup('Cluster '+str(lab), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

       
map_clusters
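
In the cell above, `x` and `ys` only serve to give `np.linspace` its length, and the gray appended at the end is what `rainbow[-1]` picks up for the outlier label. A standard-library sketch of the same palette idea, using evenly spaced hues instead of matplotlib's `rainbow` colormap (so the exact colors differ); `cluster_palette` is a hypothetical helper.

```python
import colorsys

def cluster_palette(n_clusters, noise_color="#a9a9a9"):
    """Return n_clusters hex colors plus a trailing gray, so index -1 maps to noise."""
    palette = []
    for k in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(k / max(n_clusters, 1), 0.9, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette + [noise_color]

palette = cluster_palette(12)
print(len(palette), palette[-1])
```

Putting the noise color last relies on Python's negative indexing: label `-1` selects the final entry without any special case in the plotting loop.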

### Take a look at the rating and tips count of each cluster

In [20]:
cluster_mean = manhattan_restaurants_merged.groupby('Cluster Label')[['Rating', 'Tips Count']].mean()
cluster_mean

Unnamed: 0_level_0,Rating,Tips Count
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1
-1,8.126969,59.019417
0,8.684,115.52
1,8.114815,44.333333
2,8.291667,88.0
3,8.006061,47.393939
4,8.730435,125.173913
5,8.678049,128.829268
6,8.351613,100.193548
7,8.675,129.708333
8,8.671429,88.107143


The mean rating and tips count of the restaurants in the dense zones (cluster labels 0 to 11) are generally greater than those of the outliers (cluster label -1), except for clusters 1, 3, and 10.

In [22]:
manhattan_restaurants_merged.describe()[['Rating', 'Tips Count']]

Unnamed: 0,Rating,Tips Count
count,866.0,873.0
mean,8.276905,77.356243
std,0.605025,113.980122
min,5.5,0.0
25%,7.9,10.0
50%,8.3,35.0
75%,8.7,98.0
max,9.5,1050.0


The outliers' mean rating (cluster label -1), 8.126969, is below the overall median rating of 8.3, but their mean tips count, 59.019417, is above the overall median tips count of 35.

Let's see the correlation between rating and tips count.

In [23]:
manhattan_restaurants_merged[['Rating', 'Tips Count']].dropna(axis=0).corr()


Unnamed: 0,Rating,Tips Count
Rating,1.0,0.396506
Tips Count,0.396506,1.0


The correlation between rating and tips count, about 0.40, is not strong.

Let's look at the same correlation across the cluster means.

In [24]:
cluster_mean.corr()

Unnamed: 0,Rating,Tips Count
Rating,1.0,0.87544
Tips Count,0.87544,1.0


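The correlation of 0.87544 across the cluster-level means is strong: clusters with a higher mean rating also tend to have a higher mean tips count. Note that this figure is computed on a handful of group averages (including the outlier group), so it is not directly comparable to the restaurant-level correlation.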
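
A correlation computed on group averages is usually higher than one computed on individual records, because averaging removes the within-group scatter. The toy numbers below (invented for illustration, not taken from the Foursquare data) show the effect with a hand-written Pearson coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# three made-up groups of (rating, tips count) pairs
groups = [
    [(6, 20), (7, 0), (8, 10)],
    [(7, 50), (8, 10), (9, 30)],
    [(8, 60), (9, 40), (10, 80)],
]
points = [p for g in groups for p in g]
means = [(sum(r for r, _ in g) / len(g), sum(t for _, t in g) / len(g)) for g in groups]

r_individual = pearson([r for r, _ in points], [t for _, t in points])
r_group_means = pearson([r for r, _ in means], [t for _, t in means])
print(round(r_individual, 2), round(r_group_means, 2))
```

On this toy data the group-mean correlation is clearly the larger of the two, which is why the cluster-level figure should be read as a statement about clusters rather than about single restaurants.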

### Divide the restaurants into 4 groups by their tips count

The cut points are the 25th, 50th, and 75th percentiles of the tips count (10, 35, and 98).

In [26]:
# divide the restaurants into 4 groups by their tips count
hot_degree = []
for n in manhattan_restaurants_merged['Tips Count']:
	if n <= 10:
		hot_degree.append(1)
	elif n <= 35:
		hot_degree.append(2)
	elif n <= 98:
		hot_degree.append(3)
	else:
		hot_degree.append(4)

manhattan_restaurants_merged['Hot Degree'] = hot_degree
manhattan_restaurants_merged.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Rating,Tips Count,Cluster Label,Hot Degree
0,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant,4b9c9c6af964a520b27236e3,7.1,19,-1,2
1,Marble Hill,40.876551,-73.91066,Grill 26 at TCR,40.878802,-73.915672,American Restaurant,5012c967e889cf0567e9e2d4,,0,-1,1
2,Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant,4db3374590a0843f295fb69b,8.7,180,0,4
3,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187,9.3,205,0,4
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快餐店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3,8.4,99,0,4
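
The `if`/`elif` chain above assigns each restaurant to a quartile group using the cut points 10, 35 and 98. The same binning can be sketched with the standard library's `bisect`, which keeps the thresholds in a single list; `hot_degree_of` is a hypothetical helper name.

```python
from bisect import bisect_left

CUT_POINTS = [10, 35, 98]  # 25th, 50th and 75th percentiles of the tips count

def hot_degree_of(tips_count):
    """Map a tips count to group 1-4; a value equal to a cut point stays in the lower group."""
    return bisect_left(CUT_POINTS, tips_count) + 1

print([hot_degree_of(n) for n in (0, 10, 19, 35, 99, 180)])
```

`bisect_left` is used rather than `bisect_right` so that the boundaries match the `<=` comparisons in the original loop.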


In [27]:
# check the count of each group
manhattan_restaurants_merged.groupby('Hot Degree').count()[['Venue']]

Unnamed: 0_level_0,Venue
Hot Degree,Unnamed: 1_level_1
1,219
2,224
3,213
4,217


### Create a map of 4 groups with different hot degree

In [28]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(4)
ys = [i + x + (i*x)**2 for i in range(4)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, cluster in zip(manhattan_restaurants_merged['Venue Latitude'], manhattan_restaurants_merged['Venue Longitude'], manhattan_restaurants_merged['Hot Degree']):
    label = folium.Popup('Hot Degree '+str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

### Label the restaurants in dense zone

In [30]:
# restaurants in dense zone will be labeled 1
dense_zone = []
for cluster in manhattan_restaurants_merged['Cluster Label']:
	if cluster == -1:
		dense_zone.append(0)
	else:
		dense_zone.append(1)
	
manhattan_restaurants_merged['Dense Zone'] = dense_zone
manhattan_restaurants_merged.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Rating,Tips Count,Cluster Label,Hot Degree,Dense Zone
0,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant,4b9c9c6af964a520b27236e3,7.1,19,-1,2,0
1,Marble Hill,40.876551,-73.91066,Grill 26 at TCR,40.878802,-73.915672,American Restaurant,5012c967e889cf0567e9e2d4,,0,-1,1,0
2,Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant,4db3374590a0843f295fb69b,8.7,180,0,4,1
3,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187,9.3,205,0,4,1
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快餐店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3,8.4,99,0,4,1


In [31]:
manhattan_restaurants_merged.groupby('Dense Zone')[['Rating', 'Tips Count']].mean()

Unnamed: 0_level_0,Rating,Tips Count
Dense Zone,Unnamed: 1_level_1,Unnamed: 2_level_1
0,8.126969,59.019417
1,8.489665,103.734637


Again, the mean rating of the restaurants in the dense zone (label 1), 8.49, is greater than the overall median rating (8.30), and their mean tips count, 103.73, is greater than the 75th percentile of the tips count (98).

Let's also look at how the hot degree groups are distributed inside vs. outside the dense zone.

In [32]:
# distribution of groups with different hot degree in dense zone vs. not in dense zone
venue_counts = manhattan_restaurants_merged.groupby(['Dense Zone', 'Hot Degree']).count()[['Venue']]
venue_sum = manhattan_restaurants_merged.groupby(['Dense Zone']).count()[['Venue']]
percentage = (venue_counts / venue_sum).rename(columns={'Venue': 'Percentage'})
percentage = percentage.round({'Percentage': 2})
percentage

Unnamed: 0_level_0,Unnamed: 1_level_0,Percentage
Dense Zone,Hot Degree,Unnamed: 2_level_1
0,1,0.34
0,2,0.28
0,3,0.21
0,4,0.17
1,1,0.13
1,2,0.22
1,3,0.3
1,4,0.36
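
The percentages above are each group's count divided by the total for its zone. A standard-library sketch of that normalization on made-up `(dense_zone, hot_degree)` pairs; `share_by_zone` is a hypothetical helper.

```python
from collections import Counter

def share_by_zone(pairs):
    """Return {(zone, degree): share of that degree within the zone}, rounded to 2 places."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    zone_totals = Counter(zone for zone, _ in pairs)
    return {key: round(count / zone_totals[key[0]], 2) for key, count in pair_counts.items()}

sample = [(0, 1), (0, 1), (0, 2), (0, 4), (1, 3), (1, 4), (1, 4), (1, 4)]
print(share_by_zone(sample))
```

Because each zone is normalized by its own total, the shares within a zone sum to 1, which is what makes the two zones comparable despite their different sizes.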


### Create a map of dense zone vs. not dense zone

In [21]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(2)
ys = [i + x + (i*x)**2 for i in range(2)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, cluster in zip(manhattan_restaurants_merged['Venue Latitude'], manhattan_restaurants_merged['Venue Longitude'], manhattan_restaurants_merged['Dense Zone']):
    label = folium.Popup('Dense Zone '+str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

### Explore Nearby

In [19]:
# set number of clusters
kclusters = 12

dense_zone_restaurants = manhattan_restaurants_merged[manhattan_restaurants_merged['Dense Zone'] == 1]
restaurants_clustering = dense_zone_restaurants[['Venue Latitude', 'Venue Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(restaurants_clustering)

# # check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:10] 

centers = kmeans.cluster_centers_
# print(centers[0:5])

centers_lat = centers[:, 0]
centers_lon = centers[:, 1]
centers_df = pd.DataFrame(zip(centers_lat, centers_lon), columns=['Latitude', 'Longitude'])
centers_df

Unnamed: 0,Latitude,Longitude
0,40.756047,-73.967811
1,40.717118,-73.992032
2,40.73922,-73.988749
3,40.785104,-73.977145
4,40.727539,-74.001429
5,40.7776,-73.951248
6,40.727326,-73.983917
7,40.748147,-73.97547
8,40.71713,-74.008848
9,40.748068,-73.986258


In [20]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array] + ['#a9a9a9']

# add markers to the map
for lat, lon, cluster in zip(manhattan_restaurants_merged['Venue Latitude'], manhattan_restaurants_merged['Venue Longitude'], manhattan_restaurants_merged['Cluster Label']):
    label = folium.Popup('Cluster '+str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

for lat, lon, center in zip(centers_df['Latitude'], centers_df['Longitude'], centers_df.index):
    label = folium.Popup('Center '+str(center), parse_html=True)
    folium.Marker(
        [lat, lon],
        popup=label).add_to(map_clusters)
       
map_clusters

In [21]:
centers_cluster = [ 10, 0, 11, 2, 4, 1, 5, 3, 6, 9, 7, 8]
centers_df['Cluster'] = centers_cluster
centers_fixed = centers_df.sort_values(['Cluster']).set_index('Cluster')
centers_fixed

Unnamed: 0_level_0,Latitude,Longitude
Cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
0,40.717118,-73.992032
1,40.7776,-73.951248
2,40.785104,-73.977145
3,40.748147,-73.97547
4,40.727539,-74.001429
5,40.727326,-73.983917
6,40.71713,-74.008848
7,40.733589,-74.005268
8,40.721138,-73.987032
9,40.748068,-73.986258
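
The list `centers_cluster` above pairs each k-means center with a DBSCAN cluster label; the notebook does not show how the pairing was obtained. One way to derive such a mapping in code, sketched here on made-up labels, is a majority vote: each k-means group takes the DBSCAN label that is most common among its members. `match_labels` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def match_labels(kmeans_labels, dbscan_labels):
    """For each k-means label, return the most frequent DBSCAN label among its points."""
    votes = defaultdict(Counter)
    for k, d in zip(kmeans_labels, dbscan_labels):
        votes[k][d] += 1
    return {k: counter.most_common(1)[0][0] for k, counter in sorted(votes.items())}

kmeans_labels = [0, 0, 0, 1, 1, 2, 2, 2]
dbscan_labels = [5, 5, 3, 0, 0, 2, 2, 2]
print(match_labels(kmeans_labels, dbscan_labels))
```

A vote like this only gives a clean one-to-one mapping when the two clusterings largely agree, so the result is worth checking against the map.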


### The address of each center

In [None]:
centers_address = []
for i in np.arange(centers_fixed.shape[0]):
    # use separate names so the map-center variables latitude/longitude are not overwritten
    center_lat = centers_fixed.iloc[i, 0]
    center_lon = centers_fixed.iloc[i, 1]
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, center_lat, center_lon, VERSION, 200, 20)

    results = requests.get(url).json()["response"]['groups'][0]['items']
    dataframe = json_normalize(results) # flatten JSON

    # filter columns
    filtered_columns = ['venue.location.distance', 'venue.location.formattedAddress']
    dataframe_filtered = dataframe.loc[:, filtered_columns]

    # keep the address of the venue closest to the center
    closest_address = dataframe_filtered[dataframe_filtered['venue.location.distance'] == dataframe_filtered['venue.location.distance'].min()]['venue.location.formattedAddress']

    centers_address.append([i, closest_address.values])

centers_address_df = pd.DataFrame(centers_address, columns=['Center', 'Address'])

In [71]:
for i in np.arange(centers_address_df.shape[0]):
	print('Center {}: {}'.format(centers_address_df.iloc[i, 0], centers_address_df.iloc[i, 1][0][0]))


Center 0: 295 Grand St (at Broome St)
Center 1: 316 E 86th St (btwn 1st & 2nd Ave.)
Center 2: 460 Amsterdam Ave (82nd St)
Center 3: 216 E 39th St
Center 4: 132 W Houston St (btwn Sullivan & MacDougal St)
Center 5: 109 Saint Marks Pl (btw 1st & Ave A)
Center 6: 50 Hudson St (Thomas St.)
Center 7: 228 W 10th St (btwn Bleecker & Hudson St)
Center 8: 164 Ludlow St (Stanton)
Center 9: 20 W 33rd St (Broadway)
Center 10: 973 2nd Ave (btwn E 51st & E 52nd)
Center 11: 25 E 20th St (btwn Broadway & Park Ave S)


### Analyze each dense zone

In [23]:
def getNearbyVenues(centers, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for center, lat, lng in zip(centers, latitudes, longitudes):
        print('Center ' + str(center))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            center, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Center', 
                  'Center Latitude', 
                  'Center Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [55]:
# dense_zone_venues = getNearbyVenues(centers=centers_fixed.index, 
#                                     latitudes=centers_fixed['Latitude'], 
#                                     longitudes=centers_fixed['Longitude']
#                                     )

Center 0
Center 1
Center 2
Center 3
Center 4
Center 5
Center 6
Center 7
Center 8
Center 9
Center 10
Center 11


In [24]:
# dense_zone_venues.to_csv('dense_zone_venues.csv')
nearby_venues = pd.read_csv('dense_zone_venues.csv').iloc[:, 1:]
print(nearby_venues.shape)
nearby_venues.head()

(1200, 7)


Unnamed: 0,Center,Center Latitude,Center Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,40.717118,-73.992032,Wayla,40.718291,-73.992584,Thai Restaurant
1,0,40.717118,-73.992032,MooShoes NYC,40.717861,-73.990377,Shoe Store
2,0,40.717118,-73.992032,Simple,40.718145,-73.991988,Asian Restaurant
3,0,40.717118,-73.992032,CW Pencil Enterprise,40.717583,-73.990662,Paper / Office Supplies Store
4,0,40.717118,-73.992032,Orchard Grocer,40.717847,-73.990358,Vegetarian / Vegan Restaurant


In [60]:
print('There are {} unique categories.'.format(len(nearby_venues['Venue Category'].unique())))

There are 223 unique categories.


In [25]:
# one hot encoding
nearby_onehot = pd.get_dummies(nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add center column back to dataframe
nearby_onehot['Center'] = nearby_venues['Center'] 

# move center column to the first column
fixed_columns = [nearby_onehot.columns[-1]] + list(nearby_onehot.columns[:-1])
nearby_onehot = nearby_onehot[fixed_columns]

nearby_onehot.head()

Unnamed: 0,Center,Accessories Store,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Australian Restaurant,Austrian Restaurant,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


### Group rows by center, taking the mean frequency of occurrence of each category


In [26]:
nearby_grouped = nearby_onehot.groupby('Center').mean().reset_index()
nearby_grouped

Unnamed: 0,Center,Accessories Store,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Australian Restaurant,Austrian Restaurant,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0.0,0.04,0.0,0.0,0.0,0.0,0.03,0.01,0.01,...,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.01
1,1,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.02,0.03,0.0,0.01,0.02
2,2,0.01,0.03,0.0,0.0,0.0,0.01,0.02,0.0,0.0,...,0.02,0.0,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.02
3,3,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.02,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.01,0.01
4,4,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01
5,5,0.0,0.02,0.0,0.01,0.01,0.01,0.0,0.0,0.0,...,0.04,0.0,0.0,0.03,0.0,0.04,0.0,0.0,0.0,0.0
6,6,0.0,0.06,0.01,0.0,0.02,0.0,0.01,0.01,0.0,...,0.01,0.0,0.0,0.0,0.01,0.02,0.02,0.01,0.0,0.01
7,7,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0
8,8,0.0,0.0,0.0,0.01,0.02,0.0,0.02,0.01,0.0,...,0.01,0.01,0.0,0.02,0.01,0.02,0.03,0.0,0.0,0.01
9,9,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
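
The mean of a one-hot column is simply the share of a center's venues that fall in that category, which is why every row of the grouped table sums to 1. The following is a minimal sketch on made-up data (the `toy_*` names and the six venues are illustrative assumptions, not part of the Foursquare results):

```python
import pandas as pd

# Toy data: hypothetical venues around two cluster centers (not real results)
toy_venues = pd.DataFrame({
    'Center': [0, 0, 0, 0, 1, 1],
    'Venue Category': ['Bakery', 'Bakery', 'Bar', 'Café', 'Bar', 'Bar'],
})

# One-hot encode the category; cast to int so the result is 0/1 on any pandas version
toy_onehot = pd.get_dummies(
    toy_venues[['Venue Category']], prefix="", prefix_sep=""
).astype(int)

# Put the center label back as the first column
toy_onehot.insert(0, 'Center', toy_venues['Center'])

# The mean of a 0/1 column is the fraction of venues in that category
toy_grouped = toy_onehot.groupby('Center').mean().reset_index()
print(toy_grouped)
```

In this toy case, center 0 has four venues of which two are bakeries, so its `Bakery` frequency is 0.5; center 1 contains only bars, so its `Bar` frequency is 1.0.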


### Create the new dataframe and display the top 10 venues for each center

In [27]:
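def return_most_common_venues(row, num_top_venues):
    # drop the leading 'Center' value, keeping only the category frequencies
    row_categories = row.iloc[1:]
    # sort categories from most to least frequent
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]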

In [28]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Center']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' every rank up to 10 takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
nearby_venues_sorted = pd.DataFrame(columns=columns)
nearby_venues_sorted['Center'] = nearby_grouped['Center']

for ind in np.arange(nearby_grouped.shape[0]):
    nearby_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nearby_grouped.iloc[ind, :], num_top_venues)

nearby_venues_sorted

Unnamed: 0,Center,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Bakery,Chinese Restaurant,American Restaurant,Pizza Place,Mexican Restaurant,Cocktail Bar,Asian Restaurant,Bar,Dumpling Restaurant,Coffee Shop
1,1,Coffee Shop,Italian Restaurant,Bar,Ice Cream Shop,Mexican Restaurant,Wine Shop,Bagel Shop,Hot Dog Joint,Dessert Shop,Spa
2,2,Café,Italian Restaurant,Bakery,Coffee Shop,American Restaurant,Wine Bar,Ice Cream Shop,Mediterranean Restaurant,Bar,Pizza Place
3,3,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Pizza Place,Burger Joint,Park,Taco Place,Hotel,Gourmet Shop,Gym
4,4,Italian Restaurant,French Restaurant,Dessert Shop,Cosmetics Shop,American Restaurant,Café,Indian Restaurant,Coffee Shop,Cocktail Bar,Sushi Restaurant
5,5,Bar,Wine Bar,Vegetarian / Vegan Restaurant,Korean Restaurant,Vietnamese Restaurant,Cocktail Bar,Coffee Shop,Sushi Restaurant,Pizza Place,Ice Cream Shop
6,6,American Restaurant,Spa,Italian Restaurant,French Restaurant,Coffee Shop,Café,Falafel Restaurant,Burger Joint,Playground,Gym / Fitness Center
7,7,Italian Restaurant,Coffee Shop,Cocktail Bar,New American Restaurant,French Restaurant,Jazz Club,Speakeasy,Ice Cream Shop,Seafood Restaurant,Chinese Restaurant
8,8,Pizza Place,Bakery,French Restaurant,Café,Rock Club,Wine Shop,Cocktail Bar,Italian Restaurant,Coffee Shop,Candy Store
9,9,Korean Restaurant,Hotel,American Restaurant,Japanese Restaurant,Dessert Shop,Italian Restaurant,Gym / Fitness Center,Hotel Bar,Coffee Shop,Bakery
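
The same ranking can also be computed without a row-by-row loop. The sketch below is one possible alternative, not the method used above: `ordinal` and `top_venues_table` are hypothetical helper names, and the input is a small made-up frequency table with the same layout as `nearby_grouped`.

```python
import numpy as np
import pandas as pd

def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... (11th to 13th handled as well)."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

def top_venues_table(grouped, num_top_venues, key='Center'):
    """Rank the category columns of each row by frequency, highest first."""
    freqs = grouped.set_index(key)
    # Sorting the negated values yields column positions in descending order;
    # a stable sort keeps tied categories in their original column order
    order = np.argsort(-freqs.to_numpy(), axis=1, kind='stable')[:, :num_top_venues]
    names = freqs.columns.to_numpy()[order]
    columns = ['{} Most Common Venue'.format(ordinal(i + 1))
               for i in range(names.shape[1])]
    return pd.DataFrame(names, columns=columns, index=freqs.index).reset_index()

# Hypothetical frequency table, same shape as nearby_grouped but made-up values
toy_grouped = pd.DataFrame({
    'Center': [0, 1],
    'Bakery': [0.5, 0.0],
    'Bar': [0.25, 1.0],
    'Café': [0.25, 0.0],
})
toy_top = top_venues_table(toy_grouped, num_top_venues=2)
print(toy_top)
```

Categories with equal frequency may come out in a different order than with `sort_values`, because this version breaks ties by column position.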


## Results and Discussion <a name="results"></a>

Our analysis shows that although there are a great number of restaurants in Berlin (~2000 in our initial area of interest, a 12x12 km square around Alexanderplatz), there are pockets of low restaurant density fairly close to the city center. The highest concentration of restaurants was detected north and west of Alexanderplatz, so we focused on the areas to the south, south-east and east, corresponding to the boroughs of Kreuzberg and Friedrichshain and the south-east corner of the central Mitte borough. Another borough was identified as potentially interesting (Prenzlauer Berg, north-east of Alexanderplatz), but we concentrated on Kreuzberg and Friedrichshain, which offer a combination of popularity among tourists, closeness to the city center, strong socio-economic dynamics *and* a number of pockets of low restaurant density.

After narrowing our area of interest (approx. 5x5 km south-east of Alexanderplatz), we first created a dense grid of location candidates (spaced 100 m apart); we then removed candidates with more than two restaurants within a 250 m radius and candidates with an Italian restaurant closer than 400 m.
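
The filtering rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `haversine_m` and `keep_candidate` are hypothetical helpers, restaurants are represented as `(lat, lon, is_italian)` tuples, the coordinates are made up, and great-circle distance stands in for whatever distance measure was actually used.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in meters."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def keep_candidate(candidate, restaurants,
                   max_nearby=2, radius_m=250, italian_min_m=400):
    """Keep a candidate with at most max_nearby restaurants inside radius_m
    and no Italian restaurant closer than italian_min_m."""
    lat, lon = candidate
    nearby = 0
    for r_lat, r_lon, is_italian in restaurants:
        distance = haversine_m(lat, lon, r_lat, r_lon)
        if is_italian and distance < italian_min_m:
            return False
        if distance <= radius_m:
            nearby += 1
    return nearby <= max_nearby

# Made-up coordinates; 0.001 degrees of latitude is roughly 111 m
candidate = (52.520, 13.410)
two_close = [(52.521, 13.410, False), (52.519, 13.410, False)]
italian_close = two_close + [(52.523, 13.410, True)]   # Italian at ~330 m
italian_far = two_close + [(52.525, 13.410, True)]     # Italian at ~560 m

print(keep_candidate(candidate, two_close))
print(keep_candidate(candidate, italian_close))
print(keep_candidate(candidate, italian_far))
```

The thresholds are keyword arguments so that the 250 m and 400 m limits from the text can be varied without touching the function body.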

Those location candidates were then clustered to create zones of interest containing the greatest number of location candidates. Addresses of the zone centers were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.

The result is 15 zones containing the largest number of potential new restaurant locations, based on the number of and distance to existing venues, both restaurants in general and Italian restaurants in particular. This, of course, does not imply that those zones are actually optimal locations for a new restaurant! The purpose of this analysis was only to provide information on areas close to the Berlin center that are not crowded with existing restaurants (particularly Italian ones). It is entirely possible that there is a very good reason for the small number of restaurants in any of those areas, one that would make them unsuitable for a new restaurant regardless of the lack of competition. The recommended zones should therefore be considered only as a starting point for more detailed analysis, which could eventually yield a location that not only has no nearby competition but also meets all other relevant conditions.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Berlin close to the center with a low number of restaurants (particularly Italian restaurants), in order to help stakeholders narrow down the search for an optimal location for a new Italian restaurant. By calculating the restaurant density distribution from Foursquare data, we first identified the boroughs that justify further analysis (Kreuzberg and Friedrichshain), and then generated an extensive collection of locations that satisfy some basic requirements regarding existing nearby restaurants. These locations were then clustered to create major zones of interest (containing the greatest number of potential locations), and the addresses of the zone centers were generated to serve as starting points for final exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on the specific characteristics of the neighborhoods and locations in each recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels / proximity to major roads, real estate availability, prices, and the social and economic dynamics of each neighborhood.