<h1 align=center><font size = 6>Best locations for establishing new hotels in Budapest</font></h1>
<h2 align=center><font size = 5>Applied Data Science Capstone by IBM/Coursera (Week 2)</font></h2>
<h3 align=center><font size = 4>Prepared by Ferenc Farkas, PhD (2019-02-24)</font></h3>

## Abstract

There is a steady state growing in the tourism of Budapest and similar growth is expected in the coming years. Thus, there is a business case to establish new hotel(s) in the city. And several stakeholders are eager to do so. This analysis try to explore the current market and propose possible locations for establishing a new hotels in Budapest in aiming stakeholders to choose the optimal location for a new hotel.

### Table of contents
1. [Introduction](#Intro)
2. [Business understanding](#Business)
3. [Data collection and analysis](#Data)
4. [Methodology](#Methodology)
5. [Model creation](#Model)
6. [Results and Discussion](#Results)
7. [Conclusion](#Conclusion)

## 1. Introduction <a name="Intro"></a>

Every year, more and more people visit Budapest, the capital of Hungary and, even better, those visitors spend increasingly more time in the city ([Budapest tourism](http://www.ksh.hu/gyorstajekoztatok/#/en/document/ksz1812)). The passenger traffic of Budapest International Airport (BUD) increased heavily in the last 5 years (annual growth rate well above 10%) and by the end of this year is expected to almost double compared to 2013 ([BUD traffic](https://www.bud.hu/file/documents/2/2863/bud_international_airport_traffic_2009_2018.pdf)). Several developments have been carried out in the airport and more are planned for the near future. As a result, Budapest Airport has been awarded the Skytrax title for “Best Airport in the region” for the fifth time in a row. In the history of the most prestigious award in the industry based on passengers’ votes, winning the title in five consecutive years by the same airport in the region has been unprecedented. ([Skytrax award](https://www.bud.hu/en/budapest_airport/media/news/actual_press_releases/unprecedented_budapest_airport_receives_skytrax_award_for_the_fifth_time.html)).

And this impressive increase should continue, as Budapest took first place in European best destination' voting for “BEST EUROPEAN TRAVEL DESTINATION”. The notification of the winning the prize states that “no other winning European travel destination has received such international support, i.e. votes from outside the country concerned. 77% of the votes in support of Budapest came from outside Hungary, in particular the UK, USA, Germany, France, Austria and Italy ([BEST EUROPEAN TRAVEL DESTINATION](https://www.europeanbestdestinations.com/european-best-destinations-2019/)). The EU remains an attractive destination for Chinese tourists, and while the U.K. is still the most popular looking at sheer numbers, Hungaryʼs 25.1% growth in arrivals in 2018 puts the country in third place in terms of relative growth ([Chinese tourist arrivals](https://bbj.hu/analysis/hungary-posts-3rd-highest-growth-in-chinese-tourist-arrivals_161863)).

<img src="https://www.budapestinfo.hu/clab2/rest/image/file/10593/page_desktop/european_best_destination.jpg">

## 2. Business understanding <a name="Business"></a>

The increasing number of tourists visiting Budapest need to be accommodated somewhere. Thus, there is a great potential in establishing new hotels in Budapest in the coming years. But before starting to build a new hotel (either from the ground or by renovating an existing old building) requires a good understanding of the best locations which guarantee a good percentage of occupancy of the hotel over the whole year. For this reason, one should avoid locations where there are already plenty of hotels, and choose locations where tourists are still frequent, but hotels are rear. Hotel location shall also count the proximity of the metro station and popular sites, like landmarks, monuments, historic sites, museums, and even spas. This data analysis tries to help stakeholders in selecting the best locations in Budapest for establishing new hotels.

## 3. Data collection and analysis <a name="Data"></a>

Based on definition of our problem, factors that will influence our decision are:
* number of existing hotels in the neighborhood
* distance to nearest metro station
* distance from the city center
* number of nearby popular sites (landmarks, monuments, historic sites, museums, spas)

For gathering data we use the Foursquare API <br>
Subcategory IDs are taken from Foursquare web site https://developer.foursquare.com/docs/resources/categories

Let's start importing the required libraries.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json
import requests

### Finding the geo location of Budapest

Budapest geo location can be obtained from [coordinates of Budapest](https://www.gps-latitude-longitude.com/gps-coordinates-of-budapest) which represents the so called 0km statue from where the counting is started for all main motorways and roads of Hungary going out of Budapest. This 0km statue is on the Clark Adam square near the Chain Bridge (see embedded picture above) and Budavár Castle (from where the above picture was taken). On the picture below you can see the 0km statue with the funicular in the back which takes you to the Budavár castle.

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Nulla_kilometer_siklo.jpg/250px-Nulla_kilometer_siklo.jpg>

In [2]:
BP_longitude=19.040235
BP_latitude=47.497912

Providing credentials for the Foursquare API call (should be a hidden cell):

In [3]:
# @hidden_cell
CLIENT_ID = 'NSQVNDPYZB0FIKJWDSUXH3BTSDVJGVGFPZNCDUCD2KFYC14J' # your Foursquare ID
CLIENT_SECRET = 'SEZNIKT3GAWHQV5P2ISRRAVBCOK3CVZJSEMJTWQE42Y0JER5' # your Foursquare Secret
VERSION = '20180724' # Foursquare API version

### Collecting geo locations for metro stations in Budapest

There is a direct relation between flat prices and proximity of metro station ([Subway proximity effect](https://www.fciq.ca/pdf/mot_economiste/me_112016_en.pdf)), thus during analysis the distance of the metro stations are also counted. For this reason geo location of metro stations are also queried using Foursquare API.

In [4]:
category='4bf58dd8d48988d1fd931735' # Metro stations
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 10000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    BP_latitude, 
    BP_longitude,
    category,
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=NSQVNDPYZB0FIKJWDSUXH3BTSDVJGVGFPZNCDUCD2KFYC14J&client_secret=SEZNIKT3GAWHQV5P2ISRRAVBCOK3CVZJSEMJTWQE42Y0JER5&v=20180724&ll=47.497912,19.040235&categoryId=4bf58dd8d48988d1fd931735&radius=10000&limit=100'

In [5]:
metrostations = requests.get(url).json()["response"]['groups'][0]['items']
#metrostations

Converting the obtained result to pandas dataframe.

In [6]:
metro_list=[]
metro_list.append([[
    v['venue']['name'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng']] for v in metrostations])
metro_list
bp_metros=pd.DataFrame(data=metro_list[0])
bp_metros.columns=['Station','Latitude','Longitude']
bp_metros

Unnamed: 0,Station,Latitude,Longitude
0,"Deák Ferenc tér (M1, M2, M3)",47.497923,19.054016
1,Vörösmarty tér (M1),47.496698,19.050395
2,Metro Retro Lángos,47.503257,19.054823
3,Kossuth Lajos tér (M2),47.505446,19.046506
4,Bajcsy-Zsilinszky út (M1),47.499895,19.055054
5,Arany János utca (M3),47.503308,19.054552
6,Opera (M1),47.502263,19.058919
7,Astoria (M2),47.494509,19.060185
8,Batthyány tér (M2),47.506563,19.038631
9,Déli pályaudvar (M2),47.500583,19.024608


Removing unnecessary rows to obtain the final list of the metro stations with geo location. The total number of metrostations in Budapest is 52, but the list is reduced to 48 because locations where you can switch to another metro is counted only once.

In [7]:
idx=np.where(bp_metros['Station'].str.contains('M1|M2|M3|M4', regex=True).to_numpy()==False)[0]
bp_metros=bp_metros.drop(idx)
bp_metros=bp_metros.drop([13]).reset_index(drop=True)
bp_metros

Unnamed: 0,Station,Latitude,Longitude
0,"Deák Ferenc tér (M1, M2, M3)",47.497923,19.054016
1,Vörösmarty tér (M1),47.496698,19.050395
2,Kossuth Lajos tér (M2),47.505446,19.046506
3,Bajcsy-Zsilinszky út (M1),47.499895,19.055054
4,Arany János utca (M3),47.503308,19.054552
5,Opera (M1),47.502263,19.058919
6,Astoria (M2),47.494509,19.060185
7,Batthyány tér (M2),47.506563,19.038631
8,Déli pályaudvar (M2),47.500583,19.024608
9,Oktogon (M1),47.505175,19.063291


### Collecting geo locations of popular monuments, landmarks, historic sites, museums and spas in the city.

From my personal experience, tourist are in favor of hotels which are in close proximity to the most of the popular sites, like monuments, landmarks, historic sites, museums and spas in the city. Spas is included here because Budapest is famous of the dozens of spas, one of the famous being the [Széchenyi thermal bath](https://en.wikipedia.org/wiki/Sz%C3%A9chenyi_thermal_bath) in the City Park.
Let's start with the momuments and landmarks category.

In [8]:
category='4bf58dd8d48988d12d941735' # Monument / Landmark category
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 15000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    BP_latitude, 
    BP_longitude,
    category,
    radius, 
    LIMIT)
landmarks_results = requests.get(url).json()["response"]['groups'][0]['items']
#landmarks_results

In [9]:
bp_list=[]
bp_list.append([[
    v['venue']['name'],
    v['venue']['categories'][0]['shortName'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng']] for v in landmarks_results])
bp_list
bp_landmarks=pd.DataFrame(data=bp_list[0])
bp_landmarks.columns=['Popular sites','Category','Latitude','Longitude']
bp_landmarks

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Bécsi Kapu,Landmark,47.505015,19.030654
1,Budavári Palota,Castle,47.496198,19.039543
2,Szabadság Szobor | Statue of Liberty (Szabadsá...,Landmark,47.486719,19.048083
3,Centenáriumi Emlékmű,Landmark,47.518103,19.044623
4,Kiskirálylány Szobor | Little Princess Statue,Landmark,47.49599,19.048202
5,Halászbástya | Fisherman's Bastion (Halászbástya),Scenic Lookout,47.502029,19.035058
6,Szent István Bazilika,Church,47.500786,19.053898
7,Hősök Tere | Heroes Square (Hősök tere),Plaza,47.514947,19.077716
8,1956-os Emlékmű,Landmark,47.511628,19.081761
9,Újpesti víztorony,Landmark,47.562316,19.106437


Repeat querry for historic sites.

In [10]:
category='4deefb944765f83613cdba6e' # Historic Site category
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 15000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    BP_latitude, 
    BP_longitude,
    category,
    radius, 
    LIMIT)
historic_results = requests.get(url).json()["response"]['groups'][0]['items']
#historic_results

In [11]:
bp_list=[]
bp_list.append([[
    v['venue']['name'],
    v['venue']['categories'][0]['shortName'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng']] for v in historic_results])
bp_list
bp_historic=pd.DataFrame(data=bp_list[0])
bp_historic.columns=['Popular sites','Category','Latitude','Longitude']
bp_historic

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Citadella,Historic Site,47.486998,19.046345
1,Halászbástya | Fisherman's Bastion (Halászbástya),Historic Site,47.502029,19.035058
2,Várkert Bazár,Historic Site,47.494343,19.042807
3,Szent István Bazilika,Historic Site,47.500786,19.053898
4,Mária Magdolna templom / Mary Magdalene Tower ...,Historic Site,47.504045,19.029335
5,Parlament,Historic Site,47.507041,19.045658
6,Bécsi Kapu,Historic Site,47.505015,19.030654
7,Cipők a Duna-parton,Historic Site,47.503843,19.044917
8,Gül Baba Türbéje,Historic Site,47.515896,19.034629
9,Hősök Tere | Heroes Square (Hősök tere),Plaza,47.514947,19.077716


Continue querry with the museum categories.

In [12]:
category='4bf58dd8d48988d181941735' # Museum category
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 15000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    BP_latitude, 
    BP_longitude,
    category,
    radius, 
    LIMIT)
museum_results = requests.get(url).json()["response"]['groups'][0]['items']
#museum_results

In [13]:
bp_list=[]
bp_list.append([[
    v['venue']['name'],
    v['venue']['categories'][0]['shortName'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng']] for v in museum_results])
bp_list
bp_museums=pd.DataFrame(data=bp_list[0])
bp_museums.columns=['Popular sites','Category','Latitude','Longitude']
bp_museums

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Budavári Palota,Castle,47.496198,19.039543
1,Magyar Nemzeti Galéria | Hungarian National Ga...,Art Museum,47.496082,19.039468
2,Flippermúzeum,Arcade,47.514703,19.054302
3,Sziklakórház (Sziklakórház és Atombunker),History Museum,47.500652,19.031667
4,1956. In Memorian Kossuth tér,History Museum,47.50637,19.046642
5,Dohány utcai zsinagóga,Synagogue,47.496031,19.060764
6,Magyar Nemzeti Múzeum,Museum,47.49116,19.062803
7,Szépművészeti Múzeum,Art Museum,47.516085,19.076493
8,Láthatatlan Kiállítás,Entertainment,47.51209,19.023817
9,Ludwig Múzeum,Art Museum,47.4695,19.070817


And finally look for thermal baths.

In [14]:
category='4bf58dd8d48988d1ed941735' # Spa category
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 10000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    BP_latitude, 
    BP_longitude,
    category,
    radius, 
    LIMIT)
spa_results = requests.get(url).json()["response"]['groups'][0]['items']
#spa_results

In [15]:
bp_list=[]
bp_list.append([[
    v['venue']['name'],
    v['venue']['categories'][0]['shortName'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng']] for v in spa_results])
bp_spa=pd.DataFrame(data=bp_list[0])
bp_spa.columns=['Popular sites','Category','Latitude','Longitude']
bp_spa

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Kempinski Hotel Corvinus Budapest,Hotel,47.497337,19.052384
1,Rudas Gyógyfürdő és Uszoda,Spa,47.489188,19.047761
2,Széchenyi Gyógyfürdő és Uszoda,Spa,47.518302,19.082394
3,Irgalmasok Veli Bej fürdője,Spa,47.519143,19.038035
4,Szent Gellért Gyógyfürdő és Uszoda,Spa,47.483917,19.052256
5,Szent Lukács Gyógyfürdő és Uszoda,Spa,47.518224,19.037189
6,Mandala Day Spa,Spa,47.520137,19.05494
7,Corinthia Hotel Budapest,Hotel,47.502754,19.066858
8,Dandár Gyógyfürdő,Spa,47.476337,19.071061
9,Király Gyógyfürdő,Spa,47.510608,19.038185


Removing sites which are not Spa, Resort or Water Park. Then selecting those which are thermal bath (see [A guide to Budapest’s thermal baths](https://www.lonelyplanet.com/hungary/budapest/travel-tips-and-articles/a-guide-to-budapests-thermal-baths/40625c8c-8a11-5710-a052-1479d2760252)). The last one added because it is on the Margaret island.

In [16]:
idx=np.where(bp_spa['Category'].str.contains('Hotel|Massage Studio|Gym / Fitness|Salon / Barbershop|Sporting Goods|Salon / Barbershop', regex=True).to_numpy()==True)[0]
bp_spa=bp_spa.drop(idx).reset_index(drop=True)
idx=[0,1,2,3,4,6,7,13,18,22,25]
bp_spa=bp_spa.iloc[idx].reset_index(drop=True)
bp_spa

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Rudas Gyógyfürdő és Uszoda,Spa,47.489188,19.047761
1,Széchenyi Gyógyfürdő és Uszoda,Spa,47.518302,19.082394
2,Irgalmasok Veli Bej fürdője,Spa,47.519143,19.038035
3,Szent Gellért Gyógyfürdő és Uszoda,Spa,47.483917,19.052256
4,Szent Lukács Gyógyfürdő és Uszoda,Spa,47.518224,19.037189
5,Dandár Gyógyfürdő,Spa,47.476337,19.071061
6,Király Gyógyfürdő,Spa,47.510608,19.038185
7,"Dagály Termálfürdő, Strandfürdő és Uszoda",Water Park,47.538782,19.061464
8,Danubius Health Spa Resort Margitsziget,Resort,47.533654,19.052578
9,Paskál Gyógy- és Strandfürdő,Water Park,47.520571,19.127469


Now, we are almost ready, only need to merge the lists and drop from the merged list the duplicated sites (which have identical geo locations).

In [17]:
bp_sites=pd.concat([bp_landmarks,bp_historic,bp_museums,bp_spa])
bp_sites=bp_sites.drop_duplicates(subset=['Latitude','Longitude'], inplace=False).reset_index(drop=True)
bp_sites

Unnamed: 0,Popular sites,Category,Latitude,Longitude
0,Bécsi Kapu,Landmark,47.505015,19.030654
1,Budavári Palota,Castle,47.496198,19.039543
2,Szabadság Szobor | Statue of Liberty (Szabadsá...,Landmark,47.486719,19.048083
3,Centenáriumi Emlékmű,Landmark,47.518103,19.044623
4,Kiskirálylány Szobor | Little Princess Statue,Landmark,47.49599,19.048202
5,Halászbástya | Fisherman's Bastion (Halászbástya),Scenic Lookout,47.502029,19.035058
6,Szent István Bazilika,Church,47.500786,19.053898
7,Hősök Tere | Heroes Square (Hősök tere),Plaza,47.514947,19.077716
8,1956-os Emlékmű,Landmark,47.511628,19.081761
9,Újpesti víztorony,Landmark,47.562316,19.106437


In [18]:
nr_sites,c=bp_sites.shape
print("Number of popular sites:",nr_sites)

Number of popular sites: 118


Show the distribution of popular sites as a heatmap using folium.

In [19]:
import folium
from folium.plugins import HeatMap

bp_sites_coords=[[bp_sites.loc[i,'Latitude'],bp_sites.loc[i,'Longitude']] for i in range(nr_sites)] 

map_bp= folium.Map(location=[BP_latitude, BP_longitude], zoom_start=11)
HeatMap(bp_sites_coords).add_to(map_bp)
map_bp

### Collecting geo locations for hotels

The idea was to use the geo location of each district for querying the hotel locations. Unfortunately, due to limitation of max. 100 returned findings of Foursquare API that was not a possible choice. Moreover, in the downtown there are more hotels than in the outer districts of Budapest. After several hours of struggling the idea was to create a fine grid around the downtown (9 geo locations - see on the folium map below) and create additional five geo locations outside of this fine grid (see the folium map below). 

In [20]:
dy=0.008993216059185 # 1000 m delta
dx=0.013311114236034 # 1000 m delta
start=np.array((47.481620, 19.037910))
raster_size=(4,3)
geo_raster=np.array([[start+np.array((i*dy,j*dx)) for j in range(raster_size[1])] for i in range(raster_size[0])])

Creating another 4 geo locations outside of the fine grid shown on the folium map above, one to north, one to south, one to west, and one to east, each at 5km from the edge of the fine raster shown above on the map. Because Budapest has an elongated shape toward its airport, then we should add another location to south-east. Some small outside portion of outskirt districts are not covered, but these locations are far away from the city center and no nearby metro stations, so we are not interested in these locations.

In [21]:
outskirts=[(47.560364, 19.052675), # 5km to north
           (47.497548, 19.141610), # 5km to east
           (47.432527, 19.053521), # 5km to south
           (47.497909, 18.961585), # 5km to west
           (47.437748, 19.192777)] # 6km to south-east

We can show the coverage on the map using folium.

In [22]:
import folium
map_bp= folium.Map(location=[BP_latitude, BP_longitude], zoom_start=11)
for i in range(raster_size[0]):
    for j in range(raster_size[1]):
        lat,lng=geo_raster[i,j]
        folium.CircleMarker(
                [lat, lng],
                radius=2,
                popup=None,
                color='blue',
                fill=True,
                fill_color='#3186cc',
                fill_opacity=0.7,
                parse_html=False).add_to(map_bp)  
        folium.Circle(
                [lat, lng],
                radius=707,
                color='blue',
                fill=False).add_to(map_bp)
for i in range(5):
    lat, lng=outskirts[i]
    folium.CircleMarker(
            [lat, lng],
            radius=2,
            popup=None,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bp)  
    folium.Circle(
            [lat, lng],
            radius=5500,
            color='blue',
            fill=False).add_to(map_bp)

map_bp

Starting querrying hotel locations based on the fine grid geo location. Only 'hotel' subcategories (using 'shortname') are saved in a list.

In [23]:
category='4bf58dd8d48988d1fa931735' # Hotel category
LIMIT = 100 # limit the number of venues returned by Foursquare API
radius = 708 # define radius as 1000/sqrt(2)

lista=[]
for i in range(raster_size[0]):
    for j in range(raster_size[1]):
        lat,lng=geo_raster[i,j]
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            category,
            radius, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        bp_list=[]
        bp_list.append([[
            v['venue']['name'],
            v['venue']['categories'][0]['shortName'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng']] for v in results])
        bp_hotel=pd.DataFrame(data=bp_list[0])
        bp_hotel.columns=['Hotel','Category','Latitude','Longitude']
        idx=np.where(bp_hotel['Category']!='Hotel')[0]
        bp_hotel=bp_hotel.drop(idx).reset_index(drop=True)
        lista.append(bp_hotel)

Merging the hotel locations from the list and removing duplicates

In [24]:
bp_hotels=pd.DataFrame(columns=['Hotel','Category','Latitude','Longitude'])
for i in range(len(lista)):
    bp_hotels=pd.concat([bp_hotels,lista[i]])
bp_hotels=bp_hotels.drop_duplicates(subset=['Latitude','Longitude'], inplace=False).reset_index(drop=True)
bp_hotels

Unnamed: 0,Hotel,Category,Latitude,Longitude
0,Danubius Hotel Flamenco,Hotel,47.477694,19.039513
1,Danubius Hotel Gellért,Hotel,47.483852,19.052578
2,Breakfast At Gellert,Hotel,47.484063,19.053157
3,Ibis Styles Budapest City,Hotel,47.479265,19.068071
4,The Three Corners Lifestyle Hotel,Hotel,47.485042,19.067252
5,Corvin Hotel Budapest,Hotel,47.483335,19.070335
6,Ramada Budapest Hotel,Hotel,47.481635,19.072666
7,Leonardo Hotel,Hotel,47.481747,19.072636
8,Waytostay,Hotel,47.4818,19.0722
9,Hotel Thomas Budapest,Hotel,47.48257,19.071333


Let's do the same for the remaining 5 geo locations.

In [25]:
category='4bf58dd8d48988d1fa931735' # Hotel category
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 5500 # define radius

lista=[]
for i in range(5):
    geo_latitude, geo_longitude=outskirts[i]
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        geo_latitude, 
        geo_longitude,
        category,
        radius, 
        LIMIT)
    results = requests.get(url).json()["response"]['groups'][0]['items']
    bp_list=[]
    bp_list.append([[
        v['venue']['name'],
        v['venue']['categories'][0]['shortName'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng']] for v in results])
    bp_hotel=pd.DataFrame(data=bp_list[0])
    bp_hotel.columns=['Hotel','Category','Latitude','Longitude']
    idx=np.where(bp_hotel['Category'].str.contains('Hotel', regex=True).to_numpy()==False)[0]
    bp_hotel=bp_hotel.drop(idx).reset_index(drop=True)
    lista.append(bp_hotel)

In [26]:
bp_hotels1=pd.DataFrame(columns=['Hotel','Category','Latitude','Longitude'])
for i in range(len(lista)):
    bp_hotels1=pd.concat([bp_hotels1,lista[i]])
bp_hotels1=bp_hotels1.drop_duplicates(subset=['Latitude','Longitude'], inplace=False).reset_index(drop=True)
bp_hotels1

Unnamed: 0,Hotel,Category,Latitude,Longitude
0,Holiday Beach Budapest,Hotel,47.587661,19.068082
1,Danubius Grand Hotel Margitsziget,Hotel,47.532457,19.052748
2,NH Budapest City,Hotel,47.512773,19.05211
3,Park Inn by Radisson Budapest,Hotel,47.553458,19.078649
4,European Youth Center,Hotel,47.514257,19.030689
5,Hilton Budapest City,Hotel,47.513068,19.057917
6,The Aquincum Hotel Budapest,Hotel,47.537698,19.046407
7,Fortuna Szálloda- és Étteremhajó,Hotel,47.518662,19.049223
8,Premium Apartment,Hotel,47.530982,19.083232
9,Korda Villa,Hotel,47.531844,19.013318


Merge the two list of hotels into one.

In [27]:
bp_hotels_final=pd.concat([bp_hotels,bp_hotels1])
bp_hotels_final=bp_hotels_final.drop_duplicates(subset=['Latitude','Longitude'], inplace=False).reset_index(drop=True)
bp_hotels_final

Unnamed: 0,Hotel,Category,Latitude,Longitude
0,Danubius Hotel Flamenco,Hotel,47.477694,19.039513
1,Danubius Hotel Gellért,Hotel,47.483852,19.052578
2,Breakfast At Gellert,Hotel,47.484063,19.053157
3,Ibis Styles Budapest City,Hotel,47.479265,19.068071
4,The Three Corners Lifestyle Hotel,Hotel,47.485042,19.067252
5,Corvin Hotel Budapest,Hotel,47.483335,19.070335
6,Ramada Budapest Hotel,Hotel,47.481635,19.072666
7,Leonardo Hotel,Hotel,47.481747,19.072636
8,Waytostay,Hotel,47.4818,19.0722
9,Hotel Thomas Budapest,Hotel,47.48257,19.071333


In [28]:
nr_hotels,c=bp_hotels_final.shape
print("Number of hotels:",nr_hotels)

Number of hotels: 313


313 hotels have been retreived from Foursquare API. Let's show them on the map with folium.

In [29]:
from folium.plugins import HeatMap

map_bp= folium.Map(location=[BP_latitude, BP_longitude], zoom_start=11)
bp_hotel_coords=[[bp_hotels_final.loc[i,'Latitude'],bp_hotels_final.loc[i,'Longitude']] for i in range(312)] 

HeatMap(bp_hotel_coords).add_to(map_bp)
map_bp

Save the hotel locations in a file.

In [30]:
bp_hotels_final.to_csv('bp_existing_hotels.csv')

## 4. Methodology <a name="Methodology"></a>

In this project we will direct our efforts on detecting areas of Budapest that have low hotel density, particularly those which have high number of popular sites (landmarks, monuments, historic sites, museums, and spa), are close to a metro station, and are not far away from the city center.

In first step we have collected the required **data: location of every hotels in Budapest, the location of popular sites, and location of metro stations.**

In the second step in our analysis will be identifying high **hotel density** and low **hotel density** areas - we will use **heatmaps** to identify visually those locations. We will use **heatmaps** to identify visually where the popular sites are concentrated.

In third step we will use a machine learning algorithm to find locations which are not so dense regarding existing hotels. 
For this purpose we choose the Density-based spatial clustering of applications with noise (DBSCAN) algorithm which ia a density-based clustering algorithm. We will look for low density locations which are represented by outliers (noise) by the DBSCAN algorithm. We will also check locations which are the borders of the high density regions (clusters defined by DBSCAN).

When performing DBSCAN clustering methodology we are not interested in the number of clusters obtained, only the outliers and borders of the clusters. Based on the requirements stated  by the stakeholders, the existing hotels can be categorized as:
1. 'core' hotel, a hotel which has at least another 2 neighboring hotels within 250 m distance
2. 'border' hotel, a hotel which has another hotel within 250 m being a 'core' hotel
3. 'outlier' hotel, a hotel which has no 'core' or ‘border’ hotel within 250 m <br>
*Note: Because we set the min. number of hotels to form a cluster to 3, two hotels being 250 m away from each other but not being neighbors of 'core' or ‘border’ hotels still represents 'outlier' hotels.*

## 5. Model creation <a name="Model"></a>

Let's build up the model using scikit-learn library.

In [31]:
from sklearn.cluster import DBSCAN
from sklearn import metrics
import time

kms_per_radian = 6371.0088
epsilon = 0.25 / kms_per_radian # distance between hotels to form a cluster is 250 m 
cluster_nr=3 # minimum number of hotels to form a cluster

coords = bp_hotels_final.as_matrix(columns=['Latitude','Longitude'])


start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=cluster_nr, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_

# get the number of clusters
num_clusters = len(set(cluster_labels))

# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(bp_hotels_final), num_clusters, 100*(1 - float(num_clusters) / len(bp_hotels_final)), time.time()-start_time))

print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(coords, cluster_labels)))
db.labels_


Clustered 313 points down to 9 clusters, for 97.1% compression in 0.04 seconds
Silhouette coefficient: -0.065


  if __name__ == '__main__':


array([-1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  3,  1,  0,  0,  2,  2,
       -1, -1,  2,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  1,  3,  3,  3,  1,  3,  3,  3,  3,  3,  1,
        3,  3,  3,  3,  3,  1,  3,  4,  4,  4,  5,  4,  4,  5,  5,  5,  5,
       -1,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3, -1, -1, -1, -1,  3,  6,  6,  3, -1,  3,  3,  6,  3,
        6,  6,  3,  3,  3,  3,  3,  3,  3,  3,  3, -1,  3, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1

Find the indices for 'outlier' and 'border' hotels and print out those hotels.

In [32]:
idx_outliers=(db.labels_<0).nonzero()[0]
idx_cores=db.core_sample_indices_
idx_clusters=(db.labels_>=0).nonzero()[0]
idx_borders=idx_clusters[np.isin(idx_clusters,idx_cores,invert=True).nonzero()]
print("Number of 'outlier' hotels:",len(idx_outliers))
print("Number of 'border' hotels:",len(idx_borders))
idx=np.concatenate((idx_borders,idx_outliers))
potential_loc=bp_hotels_final.loc[idx].reset_index(drop=True)
potential_loc

Number of 'outlier' hotels: 109
Number of 'border' hotels: 10


Unnamed: 0,Hotel,Category,Latitude,Longitude
0,Airbnb Budapest,Hotel,47.48746,19.064133
1,Gold Hotel Buda****,Hotel,47.488962,19.034231
2,Hotel Museum Budapest,Hotel,47.493722,19.063393
3,Jimi Hendrix Residence,Hotel,47.489924,19.06659
4,Lanchid 19 Design Hotel Budapest,Hotel,47.496589,19.041748
5,art'otel Budapest,Hotel,47.502888,19.039457
6,Gateway Budapest City Center,Hotel,47.502768,19.046838
7,Szinyei Merse Ház,Hotel,47.511679,19.06647
8,Castel Garden Hotel,Hotel,47.505612,19.028881
9,Hotel Mediterran,Hotel,47.485961,19.025048


We can show on a map with darkgray the 'core' hotels, with blue the 'border' hotels, and with red the 'outliers' hotels.

In [39]:
latitude, longitude=(47.572225, 18.988360)
map_bp= folium.Map(location=[BP_latitude, BP_longitude], zoom_start=12)

for i, (lat, lon) in enumerate(zip(bp_hotels_final['Latitude'], bp_hotels_final['Longitude'])):
    if i in idx_cores:
        folium.CircleMarker(
            [lat, lon],
            radius=2,
            popup=None,
            color='dimgray',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(map_bp)
    elif i in idx_borders:
        folium.CircleMarker(
            [lat, lon],
            radius=2,
            popup=None,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(map_bp)
    else:
        folium.CircleMarker(
            [lat, lon],
            radius=2,
            popup=None,
            color='red',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(map_bp)
        
map_bp

We can observe that clusters are formed only in the city center indicating a high density hotel regions. In the outskirt we can see only outliers (noise). We can see some outliers near the international airport.

Let's define the function which calculates the Haversine distance in km between two coordinates:<br>
(Source: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude/43211266#43211266)

In [34]:
import math
def haversine_dist(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371  # km

    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) * math.sin(dlon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c
    return d

Let's calculate the distances between the outliers and metro stations. Filter out hotels which are farther away than 10 min. walking distance counted as 800 m. The rest shall be considered potential new hotel locations. For those calculate the distance from the center (center is the geo location of the metro station in Deak Ferenc tér in the downtown - first row in the bp_metros dataframe). Also calculate the number of popular sites (museums, historic sites, landmarks, spas) which are within 15 min. walking distance (counted as 1200 m) and add to the table. Tourists are attracted by hotels which are close to many popular sites and close to the city center. We filter out hotels which are more than 3km away from the center.

In [35]:
geo_center=(bp_metros.loc[0,'Latitude'],bp_metros.loc[0,'Longitude'])
nr_outliers,c=potential_loc.shape
nr_metros,c=bp_metros.shape
nr_sites,c=bp_sites.shape
print((nr_outliers,nr_metros,nr_sites))

(119, 48, 118)


In [36]:
new_hotel_location=pd.DataFrame(columns=['Latitude','Longitude','Dist. from metro','Dist. from center',
                                         'Nr. of nearby popular sites','Singleton'])
for i in range(nr_outliers):
    min_dist=np.min(np.array([haversine_dist((potential_loc.loc[i,'Latitude'],potential_loc.loc[i,'Longitude']),
                                             (bp_metros.loc[j,'Latitude'],bp_metros.loc[j,'Longitude'])) for j in range(nr_metros)]))
    if min_dist<=0.8:
        dist_center=haversine_dist((potential_loc.loc[i,'Latitude'],potential_loc.loc[i,'Longitude']),geo_center)
        if dist_center<=3:
            nearby_sites=np.sum(np.array([haversine_dist((potential_loc.loc[i,'Latitude'],potential_loc.loc[i,'Longitude']),
                                                 (bp_sites.loc[j,'Latitude'],bp_sites.loc[j,'Longitude'])) for j in range(nr_sites)])<=1.2)
            new_hotel_location=new_hotel_location.append({
                                        'Latitude':potential_loc.loc[i,'Latitude'],
                                       'Longitude':potential_loc.loc[i,'Longitude'],
                                       'Dist. from metro':min_dist,
                                       'Dist. from center':dist_center,
                                       'Nr. of nearby popular sites':nearby_sites,
                                       'Singleton':'No' if i in idx_borders else 'Yes'},ignore_index=True)

Sort the hotels based on how close they are to the city center and how many popular sites are nearby. Also mark whether this is an 'outlier' hotel (only one hotel in the location - we call it singleton) or 'border' hotel (there are two hotels around the location).

In [37]:
new_hotel_location=new_hotel_location.sort_values(by=['Dist. from center','Nr. of nearby popular sites']).reset_index(drop=True)
new_hotel_location

Unnamed: 0,Latitude,Longitude,Dist. from metro,Dist. from center,Nr. of nearby popular sites,Singleton
0,47.502768,19.046838,0.29883,0.762249,38,Yes
1,47.493722,19.063393,0.256393,0.84526,10,Yes
2,47.496589,19.041748,0.649746,0.933512,42,Yes
3,47.492477,19.044258,0.657939,0.950875,32,Yes
4,47.502888,19.039457,0.413285,1.22513,39,Yes
5,47.489924,19.06659,0.356582,1.297544,7,Yes
6,47.48746,19.064133,0.280573,1.389737,10,Yes
7,47.484063,19.053157,0.169639,1.542466,15,Yes
8,47.483852,19.052578,0.17463,1.568391,15,No
9,47.50853,19.038986,0.220343,1.632684,31,Yes


In [41]:
nr_newloc,c=new_hotel_location.shape
map_bp= folium.Map(location=[BP_latitude, BP_longitude], zoom_start=14)
for i in range(nr_sites):
    lat, lng=(bp_sites.loc[i,'Latitude'],bp_sites.loc[i,'Longitude'])
    folium.CircleMarker(
            [lat, lng],
            radius=2,
            popup=None,
            color='red',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bp)  
for i in range(nr_newloc):
    lat, lng=(new_hotel_location.loc[i,'Latitude'],new_hotel_location.loc[i,'Longitude'])
    if new_hotel_location.loc[i,'Singleton']=='Yes':
        folium.Circle(
                [lat, lng],
                radius=250,
                color='blue',
                fill=False).add_to(map_bp)
    else:
        folium.Circle(
                [lat, lng],
                radius=250,
                color='yellow',
                fill=False).add_to(map_bp)

map_bp

## 6. Results and Discussion <a name="Results"></a>

Our analysis shows that there are several locations close to city center with low density hotel regions (defined by a circle with 250 m radius having only one hotel). These locations are close to metro stations (within 10 minute walking distance) which has several nearby popular sites (within 15 minutes walking distance).

More locations could be found if not only metro stations, but tram stations are also counted. Moreover, there are locations where there are not hotels at all, which are not identified by DBSCAN algorithm. However, identifying those locations are not simple, because Budapest has big areas where hotels cannot be established even those there are no hotels at all, simply because there is the river Danube, there are big islands, hills, public parks, forests, or fields without infrastructure. That is why the simplistic approach of looking for outliers of DBSCAN was chosen.

## 7. Conclusion <a name="Conclusion"></a>


Purpose of this project was to identify areas in Budapest close to city center with low number of hotels in order to aid stakeholders in narrowing down the search for optimal location for a new hotel. When looking for the optimal location the distance to the metro station and the number of nearby popular sites were also considered. 

Because Budapest has big regions, even close to the center, which are not suitable for establishing hotels, like the river Danube, big islands (the most famous being the Margaret island and the Obuda island, the letter being known from the [Sziget Festival](https://en.wikipedia.org/wiki/Sziget_Festival)), hills (like Gellért with Citadella), public parks (like City Park), forests, or fields without infrastructure, the simplest approach was chosen that we look for locations where one, and only one hotel exists.

Our analysis found 33 such locations not further than 3km from the city center. Additional 3 locations have been identified where only two hotels exists. Final decision on optimal hotel location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended location.
