# Capstone Project - The Battle of the Neighborhoods
### Specialization in Data Science by IBM and Coursera

## Table of contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)


## Introduction/Business Problem <a name="introduction"></a>

   In this project I analyse and compare the **restaurants** and the **clusters of food venues** in the centers of two European cities, **Sofia and Bucharest**.

   As a Bulgarian, I find it interesting to compare **Sofia and Bucharest**. Both cities are still in a transition period after the fall of the Communist regimes in Eastern Europe. They are changing, and people of different nationalities already live there, but they are still **not as multicultural as many other European capitals**.
    
   I want to analyse whether the **variety of restaurants** with cuisines from different countries is **connected to how multicultural a city is**, or whether people's natural desire to try dishes from all over the world makes a city's venues multicultural before the city itself has become so.
    
   Additionally, I want to **analyse whether either capital has restaurants serving traditional national dishes of the other country** (Bulgarian restaurants in Bucharest and Romanian restaurants in Sofia), and **where in the city center such a restaurant would best be opened**.
   
   This analysis can be of great use to private entrepreneurs from either country who want to open a restaurant with traditional national cuisine in the neighbouring country. It can also help government bodies or tourism companies that want to promote their city and the diversity of restaurants and food venues it offers.

## Data <a name="data"></a>

1. The first step is to select a main tourist landmark in each city: the Court House in Sofia and the Old Town in Bucharest.
   
   
2. Second, the geographic coordinates of the two locations are determined using Google Maps.
  
  
3. Next, an area with a radius of 6 km around the central spot is defined and filled with a grid of small circles with a radius of 300 m. The centers of the small circles are used to retrieve the nearby food venues from Foursquare.


4. The venues are first used to extract the restaurants and analyse them: how many restaurants there are nearby and whether any of them are Bulgarian/Romanian. Based on that, places with few Bulgarian/Romanian restaurants and few restaurants overall are highlighted as good places to open a restaurant.


5. Later, the venues found are used for a broader clustering of the city centre based on all food venues.


6. K-Means clustering is used in both cases (restaurant-based and venue-based). It will help us find places with fewer restaurants nearby and compare how similar the neighborhoods are to each other and how multicultural they are.

### The Places selected:
 - Sofia, the Court House: 42.695138306155314, 23.320175955491948
 - Bucharest, the Old Town: 44.43320239120804, 26.10238305064323
 
The Court House in Sofia stands at the beginning of Vitosha Boulevard, the main commercial street in the centre of Sofia, the capital of Bulgaria. The area abounds in upscale stores, restaurants and bars.

The Old Town of Bucharest, Romania, is located in the center of the city and is popular for its nightlife, restaurants, bars and other venues.

### Collecting the data

As mentioned, the first step is to create a grid of circles ("neighborhoods") with a radius of 300 meters inside a big circle with a radius of 6 km around the Sofia Court House. These circles are the "neighborhoods" whose food venues we will analyse and cluster.

First, we work in meters. Later we convert the coordinates into latitude/longitude degrees so they can be shown on a Folium map. To do so, we use functions that convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).
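
UTM splits the globe into 6-degree longitude zones, so a projection needs a zone number. The sketch below uses the standard zone rule and ignores the special cases around Norway and Svalbard. The helper names `utm_zone` and `utm_epsg_north` are illustrative and not part of the notebook. By this rule Sofia falls in zone 34 and Bucharest in zone 35, whereas the conversion functions in this notebook hard-code `zone=33`. A round trip through one fixed zone is still consistent, but distances are slightly more distorted the farther a point lies from that zone's central meridian.

```python
def utm_zone(longitude_deg):
    """Standard 6-degree UTM zone number (1..60) for a longitude in degrees."""
    return int((longitude_deg + 180) // 6) % 60 + 1


def utm_epsg_north(longitude_deg):
    """EPSG code of the WGS84 / UTM projection for the northern hemisphere."""
    return 32600 + utm_zone(longitude_deg)


# Longitudes of the two landmarks used in this notebook
sofia_lon = 23.320175955491948
bucharest_lon = 26.10238305064323

print('Sofia: zone', utm_zone(sofia_lon), 'EPSG', utm_epsg_north(sofia_lon))
print('Bucharest: zone', utm_zone(bucharest_lon), 'EPSG', utm_epsg_north(bucharest_lon))
```

Whether to switch zones is a judgment call: changing the zone shifts every grid point slightly, so previously cached results would no longer line up exactly.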

In [None]:
import json    # handle JSON payloads
import math
import pickle  # cache API results on disk

import numpy as np   # vectorized numeric operations
import pandas as pd  # data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

import requests  # HTTP requests to the Google Maps and Foursquare APIs

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge geopy --yes  # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes  # uncomment this line if folium is not installed
import folium  # map rendering library
from folium import plugins
from folium.plugins import HeatMap

#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

# k-means for the clustering
from sklearn.cluster import KMeans

print('Libraries imported.')

In [None]:
 # My Foursquare credentials
client_id = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
client_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' 
version = '20180605'
limit = 100 

print('My credentials:')
print('CLIENT_ID: ' + client_id)
print('CLIENT_SECRET:' + client_secret)

In [None]:
# Google Maps Api Key
api_key = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

<h1><center>Sofia</center></h1>

In [None]:
sofia_center = [42.695138306155314, 23.320175955491948]  # [latitude, longitude] as floats

In [None]:
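# Build the projections and transformers once. pyproj.transform() is deprecated in
# newer pyproj releases; Transformer is its replacement.
# zone=33 is kept unchanged here. Sofia (about 23.3°E) actually lies in UTM zone 34,
# so distances are slightly distorted, but the round trip below stays consistent.
proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
to_xy = pyproj.Transformer.from_proj(proj_latlon, proj_xy, always_xy=True)
to_lonlat = pyproj.Transformer.from_proj(proj_xy, proj_latlon, always_xy=True)

def lonlat_to_xy(lon, lat):
    # float() also accepts coordinates that were stored as strings
    x, y = to_xy.transform(float(lon), float(lat))
    return x, y

def xy_to_lonlat(x, y):
    lon, lat = to_lonlat.transform(x, y)
    return lon, lat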

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Sofia center longitude={}, latitude={}'.format(sofia_center[1], sofia_center[0]))
x, y = lonlat_to_xy(sofia_center[1], sofia_center[0])
print('Sofia center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Sofia center longitude={}, latitude={}'.format(lo, la))

Next we create our hexagonal grid of cells. We offset every second row by half a step and shrink the vertical row spacing by a factor of √3/2, so that every cell center is equally distant from its neighbors.
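
The spacing rule can be checked in isolation with a small, library-free sketch. It is a simplified variant that grows the grid symmetrically around the center instead of starting from a corner as the notebook does, and the function name `hex_grid` is illustrative. With a step of 600 m, the nearest neighbors of any center end up exactly 600 m away, both within a row and across rows.

```python
import math


def hex_grid(center_x, center_y, radius, step):
    """Return (x, y) centers of a hexagonal grid that fall inside a circle."""
    row_height = step * math.sqrt(3) / 2  # vertical distance between rows
    n_rows = int(radius / row_height)
    n_cols = int(radius / step) + 1
    points = []
    for i in range(-n_rows, n_rows + 1):
        y = center_y + i * row_height
        # Shift every second row by half a step
        x_offset = step / 2 if i % 2 else 0.0
        for j in range(-n_cols, n_cols + 1):
            x = center_x + j * step + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points


grid = hex_grid(0.0, 0.0, 6000.0, 600.0)
nearest = min(math.hypot(x, y) for x, y in grid if (x, y) != (0.0, 0.0))
print(len(grid), 'centers, nearest neighbor of the middle cell at', round(nearest, 3), 'm')
```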

In [None]:
sofia_center_x, sofia_center_y = lonlat_to_xy(sofia_center[1], sofia_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = sofia_center_x - 6000
x_step = 600
y_min = sofia_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(sofia_center_x, sofia_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

Next we visualize our grid of artificial cells on a map.

In [None]:
map_sofia = folium.Map(location=sofia_center, zoom_start=13)
folium.Marker(sofia_center, popup='Court House').add_to(map_sofia)
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sofia) 
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_sofia)
    #folium.Marker([lat, lon]).add_to(map_sofia)
map_sofia

### Google Maps

We now have the coordinates of the centers which we will use to extract the venues nearby.

We use the Google Maps API to get the approximate addresses of those locations.

In [None]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception:  # network failure, empty result list or unexpected JSON
        return None

addr = get_address(api_key, sofia_center[0], sofia_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(sofia_center[0], sofia_center[1], addr))

In [None]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', Bulgaria', '') # We don't need this part of the address
    addresses.append(address)
    print(' .', end='')
print(' done.')

In [None]:
addresses[0:20]

So far so good. Let's put the data into a Pandas dataframe and save the data into a local file.

In [None]:
df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

In [None]:
df_locations.to_pickle('./locations_bulgaria.pkl')

### Foursquare

With the locations in place, we use the Foursquare API to get the food venues and extract the restaurants in every neighborhood.

In the restaurant analysis we are not interested in every 'food' category, because coffee shops, pizza places, bakeries etc. are not direct competitors. Our filter therefore keeps only venues whose category name contains 'restaurant' (or a close synonym such as 'diner', 'taverna' or 'steakhouse') and drops fast food venues. It additionally flags venues in the specific 'Romanian restaurant' category, since we need information on Romanian restaurants in each neighborhood.
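
Before the filter is applied to real Foursquare responses, its behaviour can be tried on a few made-up venues. The sketch below is a condensed restatement of the rule, not the notebook's own function: `classify_venue` and the sample category ids such as `'id-italian'` are hypothetical, and only the Romanian category id is the one quoted in this notebook.

```python
ROMANIAN_IDS = {'52960bac3cf9994f4e043ac4'}  # category id quoted in this notebook


def classify_venue(categories, specific_ids=frozenset()):
    """Return (is_restaurant, is_specific) for a list of (name, id) tuples."""
    keywords = ('restaurant', 'diner', 'taverna', 'steakhouse')
    is_restaurant = False
    is_specific = False
    for name, category_id in categories:
        lowered = name.lower()
        if any(word in lowered for word in keywords):
            is_restaurant = True
        if 'fast food' in lowered:
            is_restaurant = False  # fast food is not treated as a competitor
        if category_id in specific_ids:
            is_restaurant = True   # the specific category always counts
            is_specific = True
    return is_restaurant, is_specific


# Hypothetical sample venues, one category each
samples = {
    'italian': [('Italian Restaurant', 'id-italian')],
    'fast_food': [('Fast Food Restaurant', 'id-fastfood')],
    'cafe': [('Café', 'id-cafe')],
    'romanian': [('Romanian Restaurant', '52960bac3cf9994f4e043ac4')],
}
for label, cats in samples.items():
    print(label, classify_venue(cats, ROMANIAN_IDS))
```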

In [None]:
# Category IDs were taken from the Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

romanian_restaurant_categories = ['52960bac3cf9994f4e043ac4']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', България', '')
    address = address.replace(', Bulgaria', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=300, limit=100):
    # Returns (venues, nearby_venues). Note: this function also appends to the
    # module-level list `venues_list`, so that list must exist before the first call.
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    results = requests.get(url).json()['response']['groups'][0]['items']
    # One tuple per venue: (id, name, [(category name, category id)], (lat, lon), address, distance)
    venues = [(item['venue']['id'],
               item['venue']['name'],
               get_categories(item['venue']['categories']),
               (item['venue']['location']['lat'], item['venue']['location']['lng']),
               format_address(item['venue']['location']),
               item['venue']['location']['distance']) for item in results]

    # Flat rows for the later venue clustering, keyed by the grid point that was queried
    venues_list.append([(
        lat,
        lon,
        v['venue']['name'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results])

    # Rebuild the dataframe from everything collected so far (all grid points, not only this one)
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])

    return venues, nearby_venues


In [None]:
# Let's now get all the nearby food venues, restaurants and Romanian restaurants
def get_restaurants(lats, lons):    
    restaurants = {}
    romanian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        venues, nearby_venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=300, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_romanian = is_restaurant(venue_categories, specific_filter=romanian_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_romanian, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_romanian:
                    romanian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, romanian_restaurants, location_restaurants, nearby_venues, venues

# Try to load the data from files if they were already saved
restaurants = {}
romanian_restaurants = {}
nearby_venues = {}
location_restaurants = []
venues_list = []
loaded = False
try:
    
    with open('restaurants_bulgaria_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('romanian_restaurants_350.pkl', 'rb') as f:
        romanian_restaurants = pickle.load(f)
    with open('location_restaurants_bulgaria_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    with open('nearby_venues_bulgaria_350.pkl', 'rb') as f:
        nearby_venues = pickle.load(f)
    print('Data loaded.')
    loaded = True
except:
    pass

# In case the files don't exist, load the data from Foursquare
if not loaded:
    
    restaurants, romanian_restaurants, location_restaurants, nearby_venues,venues = get_restaurants(latitudes, longitudes)
    
    # Let's persist this in a local file system
    with open('restaurants_bulgaria_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('romanian_restaurants_350.pkl', 'wb') as f:
        pickle.dump(romanian_restaurants, f)
    with open('location_restaurants_bulgaria_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
    with open('nearby_venues_bulgaria_350.pkl', 'wb') as f:
        pickle.dump(nearby_venues, f)

In [None]:
print('Total number of restaurants:', len(restaurants))
print('Total number of Romanian restaurants:', len(romanian_restaurants))
print('Percentage of Romanian restaurants: {:.2f}%'.format(len(romanian_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

In [None]:
nearby_venues.shape

OK, so it seems that there are no Romanian restaurants in the city center of Sofia. We have 1927 food venues, and 867 of them are classified as restaurants. The average number of restaurants per neighborhood (a circle 600 m in diameter) is around 2.4, which is not that high.

In [None]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

In [None]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

We plot all the restaurants on the map of Sofia.

In [None]:
map_sofia = folium.Map(location=sofia_center, zoom_start=13)
folium.Marker(sofia_center, popup='Court House').add_to(map_sofia)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_romanian = res[6]
    color = 'red' if is_romanian else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_sofia)
map_sofia

Not bad. These are all the restaurants within 6 kilometers of the Court House, and none of them is Romanian. This is a very good niche for any entrepreneur who wants to open such a venue.

We also know the coordinates of every "neighborhood" and restaurant.

This completes the collection and processing of the data. Now we can proceed to the analysis and determine optimal locations for a new Romanian restaurant.

## Methodology <a name="methodology"></a>

### 1. Restaurants Analysis

In the first part of our analysis we use the restaurant data to analyse the **density of restaurants** in the given areas (**within a radius of 6 km from the city center**). The density of Romanian/Bulgarian restaurants would also be of interest, but **as we have already found, there are no Romanian restaurants in Sofia, and as we will see later, there are no Bulgarian restaurants in Bucharest either**.

**Heatmaps** will be used to identify and visualize the areas with low restaurant density.

**Areas with no more than 2 restaurants within a radius of 250 meters** will be identified as the most promising areas for future restaurant owners. The locations will be marked, and **clusters of those locations (K-Means clustering)** will be created as a starting point for deeper exploration of the areas.


### 2. Food Venues Analysis

The **number** of food venues, the unique **categories** in the whole area and the **frequency** of the different types of food venues near every spot (center of a "neighborhood") will be analyzed. The **top 5 and top 10 venue categories** for every given address will be determined. Based on **the top 10 venues, K-Means clustering** will be applied to find the **5 most common clusters of top 10 venues in the city center**. **The 5 clusters from Sofia and their frequency will be compared to the 5 clusters in Bucharest and their frequency to get a first impression of the atmosphere in the center of both cities**.


## Analysis <a name="analysis"></a>

### 1. Restaurants Analysis - Sofia

We continue with the processing and analysis of data in order to prepare it for the heatmaps. First we count the **number of restaurants in every area**:

In [None]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_locations['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

df_locations.head(10)

In [None]:
def boroughs_style(feature):
    return { 'color': 'blue', 'fill': False }

In [None]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

We are ready to plot the data on a **heatmap / density of restaurants**. Also, let's show on our map a few circles indicating distance of 1km, 2km and 3km from the Court House.

In [None]:
map_sofia = folium.Map(location=sofia_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_sofia) 
HeatMap(restaurant_latlons).add_to(map_sofia)
folium.Marker(sofia_center).add_to(map_sofia)
folium.Circle(sofia_center, radius=1000, fill=False, color='white').add_to(map_sofia)
folium.Circle(sofia_center, radius=2000, fill=False, color='white').add_to(map_sofia)
folium.Circle(sofia_center, radius=3000, fill=False, color='white').add_to(map_sofia)
map_sofia

We further define a new, narrower region of interest with a radius of 2500 meters. It covers the low-restaurant-count area south-east of Sofia's Court House, which is rich in parks, has pleasant surroundings and is near the center. This makes it much more attractive than the low-restaurant-density area north-east of the Court House.

In [None]:
roi_x_min = sofia_center_x - 2000
roi_y_max = sofia_center_y + 1000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 2500
roi_center_y = roi_y_max - 2500
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_sofia = folium.Map(location=roi_center, zoom_start=14)
HeatMap(restaurant_latlons).add_to(map_sofia)
folium.Marker(sofia_center).add_to(map_sofia)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sofia)
map_sofia

We get a nice visualization of the low-restaurant-density areas near the Court House.

We then create a new, denser grid of "neighborhoods" only 50 m in radius (with centers 100 m apart).
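
A rough sanity check for the number of centers the grid code should report: in a hexagonal grid each center "owns" an area of spacing² · √3/2, so the expected count is about the circle area divided by that cell area. The helper name `expected_cell_count` is illustrative, and the estimate ignores boundary effects, so the real count will differ by a few cells. It gives roughly 363 centers for the 6 km grid with 600 m spacing and roughly 2267 for the 2.5 km grid with 100 m spacing.

```python
import math


def expected_cell_count(radius, spacing):
    """Approximate number of hexagonal grid centers inside a circle."""
    circle_area = math.pi * radius ** 2
    cell_area = spacing ** 2 * math.sqrt(3) / 2  # area served by one center
    return circle_area / cell_area


coarse = expected_cell_count(6000, 600)  # first grid: 6 km radius, 600 m spacing
fine = expected_cell_count(2500, 100)    # second grid: 2.5 km radius, 100 m spacing
print('Coarse grid: about {:.0f} centers'.format(coarse))
print('Fine grid: about {:.0f} centers'.format(fine))
```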

In [None]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')

We then calculate once more one of the most important factors for the future restaurant owners: **number of restaurants in vicinity** (we'll use radius of **250 meters**).

In [None]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_romanian_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
print('done.')

In [None]:
# Let's put this into a dataframe
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Restaurants nearby':roi_restaurant_counts})

df_roi_locations.head(10)

We then **filter** those locations: we're interested only in **locations with no more than two restaurants within a radius of 250 meters**. The absence of **Romanian restaurants within a given radius would be a second criterion, but as already mentioned, our data shows no Romanian restaurants in Sofia, so it is no longer a factor**.

In [None]:
good_res_count = np.array((df_roi_locations['Restaurants nearby']<=2))
print('Locations with no more than two restaurants nearby:', good_res_count.sum())

df_good_locations = df_roi_locations[good_res_count]

Let's plot this on a map.

In [None]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_sofia = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_sofia)
HeatMap(restaurant_latlons).add_to(map_sofia)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_sofia)
folium.Marker(sofia_center).add_to(map_sofia)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sofia) 
map_sofia

Great. We now have plenty of areas fairly close to the Court House with no more than two restaurants in a radius of 250m. Any of those locations is a potential candidate for a new Romanian restaurant, at least based on nearby competition.

We further highlight these locations on a heatmap:

In [None]:
map_sofia = folium.Map(location=roi_center, zoom_start=14)
HeatMap(good_locations, radius=25).add_to(map_sofia)
folium.Marker(sofia_center).add_to(map_sofia)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sofia)
map_sofia

This is a great visualization of potential regions for a future Romanian restaurant. 

Let us now **cluster** these areas in order to get **centers of areas with good locations for further analysis**. Highlighting these zones and their centers and getting their addresses will be the final part of our analysis.
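
What the clustering step does to the X/Y coordinates can be illustrated with a minimal sketch of the basic k-means loop (Lloyd's algorithm). This is not scikit-learn's implementation: `KMeans` picks its own starting centers (k-means++ by default), while the hypothetical `simple_kmeans` below takes the starting centers as an argument so the result is easy to follow. The sample points are made up.

```python
import math


def simple_kmeans(points, centers, iterations=20):
    """Basic k-means on 2-D points, starting from the given centers."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center
        labels = [
            min(range(len(centers)),
                key=lambda c: math.hypot(p[0] - centers[c][0], p[1] - centers[c][1]))
            for p in points
        ]
        # Update step: move each center to the mean of its points
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical candidate locations in meters, forming two groups
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
found_centers, found_labels = simple_kmeans(sample_points, [(0, 0), (10, 10)])
print(found_centers, found_labels)
```

Each returned center is the mean position of the good locations assigned to it, which is why the cluster centers on the map sit in the middle of the groups of blue dots.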

In [None]:
number_of_clusters = 15

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_sofia = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_sofia)
HeatMap(restaurant_latlons).add_to(map_sofia)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sofia)
folium.Marker(sofia_center).add_to(map_sofia)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_sofia) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sofia)
map_sofia

Our clusters represent groupings of most of the appropriate locations and cluster centers are placed nicely in the middle of the areas.

Finally, we plot those areas without the heatmap, using shaded areas to better indicate our clusters. One can zoom in on the map and observe the areas more closely.

In [None]:
map_sofia = folium.Map(location=roi_center, zoom_start=14)
folium.Marker(sofia_center).add_to(map_sofia)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_sofia)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sofia)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=False).add_to(map_sofia) 
map_sofia

Getting the addresses of those cluster centres will be a good starting point for further exploration of those neighborhoods to find the best possible location for a restaurant.

We get them with reverse geocoding.

In [None]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = (get_address(api_key, lat, lon) or 'NO ADDRESS').replace(', Bulgaria', '')  # get_address returns None on failure
    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, sofia_center_x, sofia_center_y)
    print('{}{} => {:.1f}km from Court House'.format(addr, ' '*(50-len(addr)), d/1000))

This concludes our Restaurants Analysis of Sofia. We have generated 15 addresses representing the centers of areas with a low number of restaurants and no Romanian restaurants nearby, all fairly close to the city center. Although the areas are drawn as green circles, their actual shapes are very irregular, so these addresses, as already mentioned, should be considered only a starting point for a further analysis of each area to define the concrete location of a restaurant.

In [None]:
map_sofia = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(sofia_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_sofia)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_sofia) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_sofia)
map_sofia

### 2. Food Venues Analysis - Sofia


We take the "nearby_venues" data obtained at the beginning and transform it into a data frame for our further analysis. We already have the latitude and longitude of each 'neighbourhood' center, plus the name, type and coordinates of the food venues. What is still missing are the column names and the addresses of the neighborhoods, which we look up by latitude and longitude.

In [None]:
nearby_venues.columns = ['Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

In [None]:
nearby_venues.head()

In [None]:
new_df = pd.merge(nearby_venues, df_locations[['Address','Latitude','Longitude']],  how='left', left_on=['Neighborhood Latitude','Neighborhood Longitude'], right_on = ['Latitude','Longitude'])

In [None]:
new_df.drop('Longitude', inplace=True, axis=1)
new_df.drop('Latitude', inplace=True, axis=1)

In [None]:
new_df = new_df.rename(columns={'Address': 'Neighborhood'})

In [None]:
sofia_center_venues = new_df[['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude','Venue Longitude','Venue Category']]

In [None]:
print(sofia_center_venues.shape)
sofia_center_venues.head()

OK. We now have the data frame we need for the analysis. Let's check how many venues were returned for each neighborhood.

In [None]:
number_of_venues = sofia_center_venues.groupby('Neighborhood').count()
number_of_venues

In [None]:
number_of_venues.shape

Let us check how many unique categories of venues we have.

In [None]:
print('There are {} unique categories.'.format(len(sofia_center_venues['Venue Category'].unique())))

Let's analyse how frequently each venue category occurs at each address.

In [None]:
# one hot encoding
sofia_center_onehot = pd.get_dummies(sofia_center_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sofia_center_onehot['Neighborhood'] = sofia_center_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sofia_center_onehot.columns[-1]] + list(sofia_center_onehot.columns[:-1])
sofia_center_onehot = sofia_center_onehot[fixed_columns]

sofia_center_onehot.head()

And let's examine the new dataframe size.

In [None]:
sofia_center_onehot.shape

Next we group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
sofia_center_grouped = sofia_center_onehot.groupby('Neighborhood').mean().reset_index()
sofia_center_grouped.head()
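
To make the grouping step concrete, here is a minimal sketch on a made-up table (the neighborhood names and categories are hypothetical, not taken from the Foursquare data). Each value in the result is the share of a neighborhood's venues that fall into a category, so every row sums to 1.

```python
import pandas as pd

# Hypothetical toy data: two neighborhoods with a few venues each
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Pizza Place', 'Bakery', 'Pizza Place', 'Pizza Place'],
})

# One-hot encode the category, then average the indicator columns per neighborhood
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()

print(toy_grouped)
```

In this toy table half of the venues in neighborhood A are cafes, while neighborhood B consists of pizza places only.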

Let's check the new size

In [None]:
sofia_center_grouped.shape

Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in sofia_center_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sofia_center_grouped[sofia_center_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

After that we put that into a _pandas_ dataframe

First, we write a function to sort the venues in a descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
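
As a quick illustration of what this function returns, the sketch below applies it to a single hypothetical row of category frequencies (the values are invented for the example). The function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the first field (the neighborhood name)
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row of category frequencies for one neighborhood
toy_row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.25, 'Cafe': 0.5,
                     'Pizza Place': 0.15, 'Bar': 0.10})

top_two = return_most_common_venues(toy_row, 2)
print(list(top_two))
```

Categories with equal frequency, including the many zeros in the real data, come back in no guaranteed order, so the lower ranks of a sparse neighborhood should be read with care.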

After that we create a new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # only the first three ranks have a special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sofia_center_grouped['Neighborhood']

for ind in np.arange(sofia_center_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sofia_center_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
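
The column labels above rely on the first three ranks having special suffixes, which is enough for ten columns. A minimal sketch of a more general labelling helper, in case more columns are ever needed (the name `ordinal` is hypothetical and not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```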

Next we run _K_-Means to group the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

sofia_center_grouped_clustering = sofia_center_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sofia_center_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

We create a new dataframe that includes the cluster and the top 10 venues for each neighborhood.

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sofia_center_merged = df_locations

# join the location data with the sorted venues to add the cluster label and top venues for each address
sofia_center_merged = sofia_center_merged[["Address","Latitude","Longitude"]].join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Address')

sofia_center_merged # check the last columns!

In [None]:
sofia_center_merged = sofia_center_merged.dropna(axis=0)

In [None]:
sofia_center_merged["Cluster Labels"] = sofia_center_merged["Cluster Labels"].astype(int)

sofia_center_merged


In [None]:
latitude = sofia_center[0] 
longitude = sofia_center[1]

Finally, we visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sofia_center_merged['Latitude'], sofia_center_merged['Longitude'], sofia_center_merged['Address'], sofia_center_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examining the clusters, first we can add the number of the venues to every neighborhood

In [None]:
number_of_venues = number_of_venues[['Venue']]
number_of_venues.index.names = ['Address']

In [None]:
sofia_center_merged = pd.merge(sofia_center_merged, number_of_venues, on=["Address"])

In [None]:
sofia_center_merged

Now we can examine each cluster, determine the venue categories that distinguish it, and name it accordingly.

##### 1. We can surely call the first cluster in Sofia the "BBQ Joint" cluster

In [None]:
sofia_center_merged.loc[sofia_center_merged['Cluster Labels'] == 0, sofia_center_merged.columns[[0] + list(range(4, sofia_center_merged.shape[1]))]]

##### 2.The second one is definitely the "Common Restaurants" cluster 

In [None]:
sofia_center_merged.loc[sofia_center_merged['Cluster Labels'] == 1, sofia_center_merged.columns[[0] + list(range(4, sofia_center_merged.shape[1]))]]

##### 3. The third is the "Cafes, Snack, Bakery" cluster

In [None]:
sofia_center_merged.loc[sofia_center_merged['Cluster Labels'] == 2, sofia_center_merged.columns[[0] + list(range(4, sofia_center_merged.shape[1]))]]

##### 4. The fourth is the "Fast food" Cluster 

In [None]:
sofia_center_merged.loc[sofia_center_merged['Cluster Labels'] == 3, sofia_center_merged.columns[[0] + list(range(4, sofia_center_merged.shape[1]))]]

##### 5. We can surely call the fifth cluster in Sofia the "Rich in International Restaurants" cluster, where the number of venues per address is also much higher

In [None]:
sofia_center_merged.loc[sofia_center_merged['Cluster Labels'] == 4, sofia_center_merged.columns[[0] + list(range(4, sofia_center_merged.shape[1]))]]

It is easy to see that the predominant cluster in Sofia is, fortunately, the fifth one, the "Rich in International Restaurants" cluster, followed by the "Common Restaurants" and the "Cafes, Snack, Bakery" clusters. Equally fortunately, far fewer neighborhoods fall into the "BBQ Joint" and "Fast food" clusters. We will discuss the results thoroughly in the final steps of our analysis.
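
A simple way to check which cluster predominates is to tally the cluster labels. The sketch below uses a short, invented list of labels in place of the real `Cluster Labels` column, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical cluster labels, standing in for sofia_center_merged['Cluster Labels']
labels = [4, 4, 1, 4, 2, 1, 4, 0, 3, 4]

# Count how many neighborhoods carry each label, largest cluster first
cluster_sizes = Counter(labels)
for cluster, size in cluster_sizes.most_common():
    print('Cluster label {}: {} neighborhoods'.format(cluster, size))
```

On the real data frame, `sofia_center_merged['Cluster Labels'].value_counts()` gives the same kind of tally. Note that the labels are zero-based, so label 4 is the fifth cluster.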

We first continue with the analysis of the center of Bucharest. Exactly the same technique as for Sofia is used, so I will not lose time commenting on each step; the only comments are references to the previous section.

<h1><center>Bucharest</center></h1>

In [None]:
bucharest_center = [44.43320239120804, 26.10238305064323] # latitude, longitude as floats

In [None]:
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=35, datum='WGS84') # Bucharest (about 26.1°E) lies in UTM zone 35
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=35, datum='WGS84') # must match the zone used in lonlat_to_xy
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('bucharest center longitude={}, latitude={}'.format(bucharest_center[1], bucharest_center[0]))
x, y = lonlat_to_xy(bucharest_center[1], bucharest_center[0])
print('bucharest center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('bucharest center longitude={}, latitude={}'.format(lo, la))
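
A side note on the `zone` argument of the projection: UTM divides the globe into zones that are 6° of longitude wide, and a projection is most accurate inside its own zone. A minimal sketch for looking up the standard zone from a longitude (the helper name `utm_zone` is hypothetical, and the special cases around Norway and Svalbard are ignored):

```python
def utm_zone(longitude_deg):
    """Return the standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    return int((longitude_deg + 180) // 6) % 60 + 1

print(utm_zone(26.10238305064323))   # longitude of the Bucharest center used above
print(utm_zone(13.4))                # a longitude further west, for comparison
```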

In [None]:
bucharest_center_x, bucharest_center_y = lonlat_to_xy(bucharest_center[1], bucharest_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = bucharest_center_x - 6000
x_step = 600
y_min = bucharest_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(bucharest_center_x, bucharest_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')
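
The grid is hexagonal because every other row is shifted by half a step and the rows are `sqrt(3)/2` steps apart, which places each point at the same distance from its neighbours in the same row and in the adjacent rows. A small self-contained check of that property (the helper `grid_point` is hypothetical and only mirrors the offsets used in the loop):

```python
import math

step = 600                      # horizontal spacing between grid points, in metres
k = math.sqrt(3) / 2            # row height as a fraction of the step

def grid_point(i, j, x_min=0.0, y_min=0.0):
    """Return (x, y) of column j in row i, using the same offsets as the loop above."""
    x_offset = step / 2 if i % 2 == 0 else 0
    return x_min + j * step + x_offset, y_min + i * step * k

# A point in an odd row and two of its neighbours: one in the same row, one in the next row
p = grid_point(1, 5)
same_row = grid_point(1, 6)
next_row = grid_point(2, 5)

print(math.dist(p, same_row), math.dist(p, next_row))
```

Both distances come out at one step (up to floating point rounding), so circles of radius 300 m around neighbouring centers just touch.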

**The code is the same as the one used for Sofia**

In [None]:
map_bucharest = folium.Map(location=bucharest_center, zoom_start=13)
folium.Marker(bucharest_center, popup='Old Town').add_to(map_bucharest)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_bucharest)
map_bucharest

In [None]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception:
        return None

addr = get_address(api_key, bucharest_center[0], bucharest_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(bucharest_center[0], bucharest_center[1], addr))

In [None]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', Romania', '') # We don't need country part of address
    addresses.append(address)
    print(' .', end='')
print(' done.')

**The code is the same as the one used for Sofia**

In [None]:
addresses[150:170]

In [None]:
df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

In [None]:
df_locations.to_pickle('./locations_romania.pkl')

In [None]:
# Category IDs corresponding to Bulgarian restaurants were taken from the Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

bulgarian_restaurant_categories = ['56aa371be4b08b9a8d5734f3']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Romania', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=300, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    results = requests.get(url).json()['response']['groups'][0]['items']

    # One tuple per venue: (id, name, categories, (lat, lon), address, distance)
    venues = [(item['venue']['id'],
               item['venue']['name'],
               get_categories(item['venue']['categories']),
               (item['venue']['location']['lat'], item['venue']['location']['lng']),
               format_address(item['venue']['location']),
               item['venue']['location']['distance']) for item in results]

    # Append this location's venues to the global venues_list, which accumulates across calls
    venues_list.append([(
        lat,
        lon,
        v['venue']['name'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results])

    # Flatten everything collected so far into one data frame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])

    return venues, nearby_venues
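
To show how `is_restaurant` classifies venues, here is a small sketch with invented category pairs. The function is a condensed copy of the one above so that the example runs on its own; the IDs `'id-1'` to `'id-3'` are placeholders, not real Foursquare IDs.

```python
def is_restaurant(categories, specific_filter=None):
    # Condensed copy of the function above, repeated so the example is self-contained
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for name, category_id in categories:
        name = name.lower()
        if any(word in name for word in restaurant_words):
            restaurant = True
        if 'fast food' in name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

bulgarian_ids = ['56aa371be4b08b9a8d5734f3']  # ID used above for Bulgarian restaurants

# Hypothetical (name, id) pairs in the shape returned by get_categories
italian = is_restaurant([('Italian Restaurant', 'id-1')], bulgarian_ids)
fast_food = is_restaurant([('Fast Food Restaurant', 'id-2')], bulgarian_ids)
cafe = is_restaurant([('Cafe', 'id-3')], bulgarian_ids)
bulgarian = is_restaurant([('Bulgarian Restaurant', '56aa371be4b08b9a8d5734f3')], bulgarian_ids)

print(italian, fast_food, cafe, bulgarian)
```

The first flag says whether the venue counts as a restaurant at all, the second whether it matches the specific category filter.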

**The code is the same as the one used for Sofia**

In [None]:
def get_restaurants(lats, lons):
    restaurants = {}
    bulgarian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Query venues within 300 m of each grid point; dictionaries keyed by venue ID remove any duplicates returned for more than one location
        venues, nearby_venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=300, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_bulgarian = is_restaurant(venue_categories, specific_filter=bulgarian_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_bulgarian, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_bulgarian:
                    bulgarian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, bulgarian_restaurants, location_restaurants, nearby_venues, venues

# Try to load from local file system in case we did this before
restaurants = {}
bulgarian_restaurants = {}
nearby_venues = {}
location_restaurants = []
venues_list = []
loaded = False
try:
    with open('restaurants_romania_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('bulgarian_restaurants_350.pkl', 'rb') as f:
        bulgarian_restaurants = pickle.load(f)
    with open('location_restaurants_romania_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    with open('nearby_venues_romania_350.pkl', 'rb') as f:
        nearby_venues = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except Exception:
    # Any problem reading the cache simply means we query the API again
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, bulgarian_restaurants, location_restaurants, nearby_venues, venues = get_restaurants(latitudes, longitudes)

    # Let's persist this in the local file system
    with open('restaurants_romania_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('bulgarian_restaurants_350.pkl', 'wb') as f:
        pickle.dump(bulgarian_restaurants, f)
    with open('location_restaurants_romania_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
    with open('nearby_venues_romania_350.pkl', 'wb') as f:
        pickle.dump(nearby_venues, f)
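
The load-or-download pattern above can also be wrapped in a small helper, which keeps each cached object next to the code that produces it. This is only a sketch: the name `load_or_compute` is hypothetical, and the lambdas stand in for the Foursquare download.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at `path`; if it cannot be read, call `compute()`,
    save the result at `path` and return it."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        result = compute()
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result

# Demonstration with a throwaway file in a temporary directory
cache_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.pkl')
first = load_or_compute(cache_path, lambda: {'venue-1': 'computed'})
second = load_or_compute(cache_path, lambda: {'venue-1': 'recomputed'})
print(first, second)
```

The second call finds the file written by the first one, so its `compute` function is never executed and both calls return the same object.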

### 1. Restaurants Analysis - Bucharest

In [None]:
print('Total number of restaurants:', len(restaurants))
print('Total number of Bulgarian restaurants:', len(bulgarian_restaurants))
print('Percentage of Bulgarian restaurants: {:.2f}%'.format(len(bulgarian_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

In [None]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

**The code is the same as the one used for Sofia**

In [None]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

In [None]:
map_bucharest = folium.Map(location=bucharest_center, zoom_start=13)
folium.Marker(bucharest_center, popup='Old Town').add_to(map_bucharest)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_bulgarian = res[6]
    color = 'red' if is_bulgarian else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_bucharest)
map_bucharest

In [None]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_locations['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

df_locations.head(10)

In [None]:
df_locations.shape

In [None]:
def boroughs_style(feature):
    return { 'color': 'blue', 'fill': False }

In [None]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

In [None]:
map_bucharest = folium.Map(location=bucharest_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_bucharest) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_bucharest)
folium.Marker(bucharest_center).add_to(map_bucharest)
folium.Circle(bucharest_center, radius=1000, fill=False, color='white').add_to(map_bucharest)
folium.Circle(bucharest_center, radius=2000, fill=False, color='white').add_to(map_bucharest)
folium.Circle(bucharest_center, radius=3000, fill=False, color='white').add_to(map_bucharest)
map_bucharest

**The code is the same as the one used for Sofia**

In [None]:
roi_x_min = bucharest_center_x - 2000
roi_y_max = bucharest_center_y + 1000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 2500
roi_center_y = roi_y_max - 2500
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_bucharest = folium.Map(location=roi_center, zoom_start=14)
HeatMap(restaurant_latlons).add_to(map_bucharest)
folium.Marker(bucharest_center).add_to(map_bucharest)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_bucharest)
map_bucharest

In [None]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')

In [None]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_bulgarian_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, bulgarian_restaurants)
    roi_bulgarian_distances.append(distance)
print('done.')
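
Because no Bulgarian restaurants were found in Bucharest, `find_nearest_restaurant` keeps its starting value of 100000 for every location. A variant that makes the empty case explicit could look like the sketch below; the helper names and the toy coordinates are hypothetical.

```python
import math

def count_within_radius(x, y, points, radius=250):
    """Count the (px, py) points lying within `radius` metres of (x, y)."""
    return sum(1 for px, py in points if math.hypot(px - x, py - y) <= radius)

def nearest_distance(x, y, points):
    """Distance from (x, y) to the closest point, or None when `points` is empty."""
    return min((math.hypot(px - x, py - y) for px, py in points), default=None)

# Hypothetical restaurant positions in metres (Cartesian, like the X/Y values above)
toy_restaurants = [(0, 0), (100, 0), (0, 240), (300, 400)]

nearby = count_within_radius(0, 0, toy_restaurants)
closest = nearest_distance(1000, 0, toy_restaurants)
closest_when_empty = nearest_distance(1000, 0, [])
print(nearby, closest, closest_when_empty)
```

Returning `None` for an empty collection avoids mistaking the placeholder distance for a real measurement.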

**The code is the same as the one used for Sofia**

In [None]:
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Restaurants nearby':roi_restaurant_counts})

df_roi_locations.head(10)

In [None]:
good_res_count = np.array((df_roi_locations['Restaurants nearby']<=2))
print('Locations with no more than two restaurants nearby:', good_res_count.sum())

df_good_locations = df_roi_locations[good_res_count]

In [None]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_bucharest = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_bucharest)
HeatMap(restaurant_latlons).add_to(map_bucharest)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_bucharest)
folium.Marker(bucharest_center).add_to(map_bucharest)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_bucharest) 
map_bucharest

In [None]:
map_bucharest = folium.Map(location=roi_center, zoom_start=14)
HeatMap(good_locations, radius=25).add_to(map_bucharest)
folium.Marker(bucharest_center).add_to(map_bucharest)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_bucharest)
map_bucharest

**The code is the same as the one used for Sofia**

In [None]:
from sklearn.cluster import KMeans

number_of_clusters = 15

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_bucharest = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_bucharest)
HeatMap(restaurant_latlons).add_to(map_bucharest)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_bucharest)
folium.Marker(bucharest_center).add_to(map_bucharest)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_bucharest) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_bucharest)
map_bucharest

In [None]:
map_bucharest = folium.Map(location=roi_center, zoom_start=14)
folium.Marker(bucharest_center).add_to(map_bucharest)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_bucharest)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_bucharest)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=False).add_to(map_bucharest) 
map_bucharest

In [None]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = (get_address(api_key, lat, lon) or 'NO ADDRESS').replace(', Romania', '') # get_address returns None on failure
    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, bucharest_center_x, bucharest_center_y)
    print('{}{} => {:.1f}km from the Old Town'.format(addr, ' '*(50-len(addr)), d/1000))

**The code is the same as the one used for Sofia**

In [None]:
map_bucharest = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(bucharest_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_bucharest)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_bucharest) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_bucharest)
map_bucharest

### 2. Food Venues Analysis - Bucharest

In [None]:
nearby_venues.columns = ['Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

In [None]:
nearby_venues.head()

In [None]:
new_df = pd.merge(nearby_venues, df_locations[['Address','Latitude','Longitude']],  how='left', left_on=['Neighborhood Latitude','Neighborhood Longitude'], right_on = ['Latitude','Longitude'])

In [None]:
new_df.drop('Longitude', inplace=True, axis=1)
new_df.drop('Latitude', inplace=True, axis=1)

In [None]:
new_df = new_df.rename(columns={'Address': 'Neighborhood'})

In [None]:
bucharest_center_venues = new_df[['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude','Venue Longitude','Venue Category']]

In [None]:
print(bucharest_center_venues.shape)
bucharest_center_venues.head()

In [None]:
number_of_venues = bucharest_center_venues.groupby('Neighborhood').count()
number_of_venues

**The code is the same as the one used for Sofia**

In [None]:
number_of_venues.shape

In [None]:
print('There are {} unique categories.'.format(len(bucharest_center_venues['Venue Category'].unique())))

In [None]:
# one hot encoding
bucharest_center_onehot = pd.get_dummies(bucharest_center_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bucharest_center_onehot['Neighborhood'] = bucharest_center_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [bucharest_center_onehot.columns[-1]] + list(bucharest_center_onehot.columns[:-1])
bucharest_center_onehot = bucharest_center_onehot[fixed_columns]

bucharest_center_onehot.head()

In [None]:
bucharest_center_onehot.shape

In [None]:
bucharest_center_grouped = bucharest_center_onehot.groupby('Neighborhood').mean().reset_index()
bucharest_center_grouped.head()

In [None]:
bucharest_center_grouped.shape

In [None]:
num_top_venues = 5

for hood in bucharest_center_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = bucharest_center_grouped[bucharest_center_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

**The code is the same as the one used for Sofia**

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # only the first three ranks have a special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = bucharest_center_grouped['Neighborhood']

for ind in np.arange(bucharest_center_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bucharest_center_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
# set number of clusters
kclusters = 5

bucharest_center_grouped_clustering = bucharest_center_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bucharest_center_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

bucharest_center_merged = df_locations

# join the location data with the sorted venues to add the cluster label and top venues for each address
bucharest_center_merged = bucharest_center_merged[["Address","Latitude","Longitude"]].join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Address')

bucharest_center_merged # check the last columns!

In [None]:
bucharest_center_merged = bucharest_center_merged.dropna(axis=0)

In [None]:
bucharest_center_merged["Cluster Labels"] = bucharest_center_merged["Cluster Labels"].astype(int)

bucharest_center_merged

In [None]:
latitude = bucharest_center[0] 
longitude = bucharest_center[1]

**The code is the same as the one used for Sofia**

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bucharest_center_merged['Latitude'], bucharest_center_merged['Longitude'], bucharest_center_merged['Address'], bucharest_center_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
number_of_venues = number_of_venues[['Venue']]
number_of_venues.index.names = ['Address']

In [None]:
bucharest_center_merged = pd.merge(bucharest_center_merged, number_of_venues, on=["Address"])

In [None]:
bucharest_center_merged

Now we can examine each Bucharest cluster, determine the venue categories that distinguish it, and name it accordingly.

##### 1. The first is a restaurants cluster, but the Italian Restaurants are so dominant in it that it can easily be called the "Italian Restaurant" cluster 

In [None]:
bucharest_center_merged.loc[bucharest_center_merged['Cluster Labels'] == 0, bucharest_center_merged.columns[[0] + list(range(4, bucharest_center_merged.shape[1]))]]

##### 2. The second cluster is "The Restaurants Cluster": predominantly common restaurants, Romanian, Hungarian and East European, but also many others

In [None]:
bucharest_center_merged.loc[bucharest_center_merged['Cluster Labels'] == 1, bucharest_center_merged.columns[[0] + list(range(4, bucharest_center_merged.shape[1]))]]

##### 3. The third cluster is definitely the "Cafes" cluster

In [None]:
bucharest_center_merged.loc[bucharest_center_merged['Cluster Labels'] == 2, bucharest_center_merged.columns[[0] + list(range(4, bucharest_center_merged.shape[1]))]]

##### 4. This one is rich in Romanian restaurants, but it is also so mixed and rich in other venues that we will simply call it "The Mixed" cluster. It is the one that dominates the selected region of Bucharest

In [None]:
bucharest_center_merged.loc[bucharest_center_merged['Cluster Labels'] == 3, bucharest_center_merged.columns[[0] + list(range(4, bucharest_center_merged.shape[1]))]]

##### 5. The fifth cluster is the "Pizza Place" cluster  

In [None]:
bucharest_center_merged.loc[bucharest_center_merged['Cluster Labels'] == 4, bucharest_center_merged.columns[[0] + list(range(4, bucharest_center_merged.shape[1]))]]

## Results<a name="results"></a>

Our analysis showed that, surprisingly, there are no Romanian restaurants in Sofia and no Bulgarian restaurants in Bucharest within a circle of 6 km radius around the Court House and the Old Town, respectively. I further conducted an Internet search, which gave the same results. Across both cities I found only one sign of a venue serving Bulgarian dishes, in Bucharest, approximately 4-5 km from the Old Town. Its advertisement was not very clear (it stated that it is a fast food venue), so I am not sure whether it can even be found on Foursquare.

I think that this is good news for potential stakeholders/entrepreneurs who want to open such a restaurant in either of the two countries. Although the national dishes in the Balkans are similar, there is still a big variety of cuisines in every country, and the citizens of both capitals would be happy to try some of the national dishes of their neighbour country. I think there are three main reasons why such restaurants do not exist yet: first, the similarity of the two national cuisines; second, the small number of expats from the neighbouring country in each city; and third, the economic situation of both countries during the transition period.

Our analysis can be very useful for future stakeholders, because we highlighted very appealing areas of each city for such an investment. After defining the zone with a 6 km radius around the chosen spot, we extracted data for all the food venues and restaurants in it. Using heatmaps to show where the areas with a high density of restaurants are, we defined new areas south-east of the chosen central spots. Those areas have a low density of restaurants: fewer than two within a radius of 250 m. The initial requirement that the areas should also have no Romanian/Bulgarian restaurant within a radius of 400 m needed no further analysis, because there were no such restaurants in the whole central area we chose.
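
The density criterion described above (fewer than two restaurants within 250 m, and no Romanian/Bulgarian restaurant within 400 m) can be sketched roughly as follows. This is only a minimal illustration and not the notebook's actual code: the helper names `haversine_m`, `count_within` and `is_candidate` are hypothetical, and the coordinates in the usage lines are made up rather than taken from the Foursquare data.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def count_within(point, venues, radius_m):
    """Number of venues no farther than radius_m from point."""
    return sum(haversine_m(*point, *venue) <= radius_m for venue in venues)


def is_candidate(point, restaurants, national_restaurants,
                 max_nearby=2, nearby_radius_m=250, national_radius_m=400):
    """True if point has fewer than max_nearby restaurants within nearby_radius_m
    and no national (Romanian/Bulgarian) restaurant within national_radius_m."""
    few_restaurants = count_within(point, restaurants, nearby_radius_m) < max_nearby
    no_national = count_within(point, national_restaurants, national_radius_m) == 0
    return few_restaurants and no_national


# Made-up coordinates for illustration only (NOT from the Foursquare data)
restaurants = [(44.4300, 26.1000), (44.4301, 26.1002), (44.4500, 26.1300)]
quiet_spot = (44.4400, 26.1200)  # more than 1 km away from every restaurant above
busy_spot = (44.4300, 26.1001)   # a few metres away from two restaurants

print(is_candidate(quiet_spot, restaurants, []))  # True
print(is_candidate(busy_spot, restaurants, []))   # False
```

The list of national restaurants is empty in the usage lines because, as noted above, no such restaurants were found in the chosen central area, so the 400 m condition is always satisfied.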

We clustered these zones with low density of restaurants and got 15 zones/clusters that are promising locations for a new Romanian/Bulgarian restaurant. The addresses of these new best "neighborhoods" were recorded and will be a good starting point for further analysis of each "neighborhood" and the choice of a restaurant location.

We used the obtained food venue data to cluster each city center and define areas with different predominant venues, tastes and national dishes, which gives us an initial picture of the atmosphere of the city. We found that the predominant cluster in the city center of Sofia is the one rich in international restaurants, followed by the cluster rich in common restaurants and cafes. Trailing behind are the BBQ joint and the Fast food clusters, which paints a very nice first impression of the city center of Sofia.

The predominant cluster for Bucharest is also rich in international restaurants, but it is not as well defined. There are a lot of spots where common restaurants, Romanian restaurants and many other food venues predominate, too. That is why we called it a "Mixed" cluster. It also does not dominate the area as strongly as the predominant cluster does in Sofia. The second one is the "Cafes" cluster, with a small edge over the "Restaurants Cluster", which contains mainly common restaurants and some other types such as Romanian, Eastern European, Hungarian and others. The least common are the "Pizza Place" and the "Italian Restaurant" clusters. As a whole, the division of clusters in Bucharest shows less dominance of any single one, as already mentioned. Another particular feature of Bucharest is the very high number of Italian restaurants, which stands out clearly.
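
The "dominance" of a cluster discussed above can be expressed as the share of addresses assigned to each cluster label. The following is a small sketch under that assumption: the function name `cluster_shares` is hypothetical, and the example labels are invented for illustration, so they do not reproduce the notebook's results.

```python
from collections import Counter


def cluster_shares(labels):
    """Return {cluster_label: share of addresses}, ordered from largest to smallest share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Made-up labels for illustration only; they are NOT the notebook's results.
example_labels = [3, 3, 3, 2, 2, 1, 1, 0, 4, 3]
shares = cluster_shares(example_labels)
dominant = next(iter(shares))  # first key is the most frequent label

print(dominant, shares[dominant])  # 3 0.4
```

With the notebook's DataFrame, `bucharest_center_merged['Cluster Labels'].value_counts(normalize=True)` should give the same kind of shares directly.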

## Discussion<a name="discussion"></a>

As already mentioned, the analysis is only a starting point for a more thorough analysis of suitable places for Romanian/Bulgarian restaurants. The 15 defined clusters of neighborhoods with low density of restaurants were highlighted as circles, but they actually have an irregular shape, and the addresses found are only the starting point for further analysis of each area. There can be many reasons why there are so few restaurants in these zones, and some of them may make a zone unsuitable for an investment.

Another topic which should be discussed is the clustering of the food venues. Since K-Means is an unsupervised Machine Learning algorithm that starts from randomly chosen initial centroids, we can get slightly different results every time we run it, especially on data that is not easily separable, like the food venues in Bucharest (that is why we have the "Mixed" cluster). The data for Sofia is pretty stable and gives almost the same results, but the data for Bucharest gives a slightly different clustering every time it is run. The main separation of the clusters and their dominance in the city center is still recognizable, but one should not be surprised if a rerun does not reproduce the discussed results and the shown graphs exactly.
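
The run-to-run variation comes from the random choice of the initial centroids, so fixing the random seed makes a run repeatable. The toy K-Means below is a pure-Python sketch written only to illustrate this point; it is not scikit-learn's implementation and not the code used in this notebook, and the function name `kmeans` and the sample points are hypothetical. In scikit-learn the `random_state` parameter of `KMeans` plays the role of the seed; whether the notebook sets it is not visible in this section.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Toy K-Means on 2-D points; the seed fixes the random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Made-up 2-D points for illustration only
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
run_a = kmeans(points, k=2, seed=0)
run_b = kmeans(points, k=2, seed=0)

print(run_a == run_b)  # True: same seed, same initial centroids, same labels
```

With a different seed the cluster numbering may change, and on poorly separable data the grouping itself may change as well, which matches the behaviour described for Bucharest.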

As already discussed, I was also a little surprised that there are no Romanian restaurants in Sofia and no Bulgarian restaurants in the center of Bucharest, but I think that this is mainly due to the reasons mentioned in the "Results" section. Both Balkan countries and their capitals have undergone a big transformation in the last thirty years, and I believe that the lack of such restaurants is actually a very good niche for future stakeholders. Such venues could be very profitable if their location, interior and service are of the required standard.

The next step in the further analysis of the center of both cities is to analyse all the venues in it (not only the food venues). This will help us see the entire picture and fully discover its atmosphere.

## Conclusion <a name="conclusion"></a>

Our analysis is a good starting point for a business plan for a new restaurant, and for state administration or tourism companies that want to promote both cities. We got 15 addresses in each city, which are a good starting point for analysing potential neighborhoods for a future Romanian or Bulgarian restaurant. We analysed the "flavor" of both city centres and highlighted spots where we can relax and enjoy a nice meal after a long tourist walk through the city. Next time we visit either of the two cities, whether as a tourist or on a business trip, we know where we can find delicious food.