# Business Case: Identify the optimum location for a new hotel business

###### _Note: Folium maps do not appear on GitHub. To visualise the results please refer to the pictures in the folder of this project_

#### _Description of problem:_
An investor wants to open a new hotel in Berlin and uses data science to identify the optimum location for it.
The investor's main concern is that the area is well supplied with restaurants, cafes and shops and is well connected by public transport.

#### _Description of data to be used in this analysis:_
To be able to visualise the optimum location for the new hotel business the Foursquare database and API will be employed. 
From the available data, the existing successful hotel businesses will be retrieved. Based on the Foursquare data, the factors that indicate a successful hotel business are:  

 - The hotel/accommodation business is in the recommended list retrieved from the API _(it is very likely to be doing well as a business)_
 - The number of recommended venues near the hotel/accommodation business, say within a radius of 400m
 - The venue is in an area that is well served by public transport, here taken to mean public transport services within a 200m radius
 - The hotel/accommodation is located in an area with high population (based on the dataset)

**Considering the above points, the new hotel will be placed in an area where there is not enough competition. This means the target area will meet all the above points and will also have the lowest number of hotel/ accommodation businesses.**

Once the hotel/accommodation businesses which meet the above criteria are identified, they will be stored in a dataframe and then clustered with K-means. The venues will then be mapped via folium to visualise where these points of interest are located. 
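
A minimal sketch of what the K-means step could look like, assuming the `KMeans` class from scikit-learn (imported in section 1) and the coordinates of the ten businesses listed in the table of section 2.2. The number of clusters (4) and the `random_state` are illustrative assumptions, not values chosen by this analysis, and latitude/longitude are treated as plain Euclidean coordinates, which is only an approximation.

```python
import numpy as np
from sklearn.cluster import KMeans

# (latitude, longitude) of the ten businesses listed in the table of section 2.2
coords = np.array([
    [52.512792, 13.304764],
    [52.513457, 13.305024],
    [52.511680, 13.312215],
    [52.509634, 13.446249],
    [52.509850, 13.447432],
    [52.487317, 13.323413],
    [52.489119, 13.316607],
    [52.549568, 13.384071],
    [52.550452, 13.391515],
    [52.550891, 13.391649],
])

# n_clusters=4 is an assumption (one cluster per locality in that table)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_  # one cluster label per business
print(labels)
```

The labels could then be used to colour the folium markers by cluster.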

Looking at Berlin, each borough is divided into localities based on the data provided by Wikipedia: https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin#Administration

Data wrangling is used to obtain a dataframe with the boroughs, the localities in each borough and the coordinates of each locality.
The coordinates will be used to explore each locality.

#### Limitations: 
Due to the free subscription with Foursquare, it is not feasible to retrieve more meaningful data, e.g. venue details and venue ratings. These details would be very useful for identifying successful businesses more accurately, and are therefore recommended for future work.

Also, due to the subscription limits, it is not feasible to explore every locality; therefore only the 10 most populated localities have been selected.



###### Table of contents:

* [1. Retrieve data, web scraping](#getdata)
* [2. Explore the Foursquare data via the API](#Fsquare)
* [3. Get the scores for each venue](#scores)
* [4. Results](#results)
* [5. Conclusion and future work](#conclusion)

# 1. Retrieve data, web scraping <a name="getdata"></a>

Berlin boroughs-localities source: https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin#Administration 




In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import geocoder
from sklearn.cluster import KMeans
import folium 
import matplotlib.cm as cm
import matplotlib.colors as colors
import Square  # local module that holds the Foursquare API credentials (see section 2.1)


In [3]:
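from io import StringIO

url = r"https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin#Administration"
r = requests.get(url)
r.raise_for_status()  # stop early if the page could not be downloaded
# pd.read_html parses every table on the page and returns a list of DataFrames;
# the twelve borough tables are source_t[2] to source_t[13] (see section 1.1)
source_t = pd.read_html(StringIO(r.text))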

### 1.1 Prepare the table to hold the boroughs and localities 

In [4]:
boroughs = ["Mitte", "Friedrichshain-Kreuzberg", "Pankow", "Charlottenburg-Wilmersdorf", "Spandau", "Steglitz-Zehlendorf", "Tempelhof-Schöneberg",
           "Neukölln", "Treptow-Köpenick", "Marzahn-Hellersdorf", "Lichtenberg", "Reinickendorf"] 

# source_t[2] to source_t[13] hold one table of localities per borough, in the order listed above
localities = pd.concat([source_t[i] for i in range(2, 14)], ignore_index=True)

# repeat each borough name once per locality in its table
boroughs_locality = []
for i in range(2, 14):
    boroughs_locality.extend([boroughs[i - 2]] * len(source_t[i]))

localities.insert(0, "Boroughs", boroughs_locality)
# drop the leading 7-character prefix from each locality name
localities["Locality"] = localities["Locality"].str[7:]
localities.rename(columns={
    localities.columns[2]: "Locality Area in km\u00b2",
    localities.columns[3]: "Locality population",
    localities.columns[4]: "Locality population density in km\u00b2",
}, inplace=True)
localities.drop(columns="Map", inplace=True)
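
As an alternative to building the borough labels in a loop, the labelling can be sketched with `pd.concat` and dictionary keys. The two small frames below are toy stand-ins for the per-borough tables returned by `pd.read_html`; only the locality names are real, the frames themselves are illustrative.

```python
import pandas as pd

# Toy stand-ins for two of the per-borough tables (illustrative data only)
tables = {
    "Mitte": pd.DataFrame({"Locality": ["Mitte", "Moabit"]}),
    "Pankow": pd.DataFrame({"Locality": ["Prenzlauer Berg"]}),
}

# The dict keys become an index level named "Boroughs",
# which is then moved into an ordinary column
labelled = (
    pd.concat(tables, names=["Boroughs"])
    .reset_index(level="Boroughs")
    .reset_index(drop=True)
)
print(labelled)
```

With this approach the borough name travels with each table, so the labels cannot drift out of step with the rows.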



### 1.2 Search and find the geo coordinates for each locality

In [5]:
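# this cell takes about 1 min

from geopy.geocoders import Nominatim

# create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="Berlin_explore")

coords = []
for B, L in zip(localities.Boroughs, localities.Locality):
    address = "Berlin, " + B + ", " + L
    location = geolocator.geocode(address)
    # geocode returns None when the address cannot be found
    if location is None:
        raise ValueError("No coordinates found for " + address)
    coords.append((location.latitude, location.longitude))
localities["Coordinates(Lat, Lon)"] = coords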
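
The geocoding loop can also be written so that it pauses between requests and tolerates addresses that are not found. The sketch below is one possible shape for that, not the method used in this notebook: `geocode_all` and `fake_geocode` are hypothetical helpers, and the fake geocoder stands in for `Nominatim.geocode` so the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_all(addresses, geocode, delay=1.0):
    """Return one (lat, lon) tuple, or None, per address.

    `geocode` is any callable that returns an object with `.latitude` and
    `.longitude`, or None when nothing is found.
    """
    coords = []
    for n, address in enumerate(addresses):
        if n > 0:
            time.sleep(delay)  # pause between consecutive requests
        location = geocode(address)
        if location is None:
            coords.append(None)
        else:
            coords.append((location.latitude, location.longitude))
    return coords

# Hypothetical stand-in for a real geocoder, with a single known address
Location = namedtuple("Location", ["latitude", "longitude"])

def fake_geocode(address):
    known = {"Berlin, Mitte, Mitte": Location(52.52, 13.405)}
    return known.get(address)

result = geocode_all(["Berlin, Mitte, Mitte", "Berlin, Nowhere"], fake_geocode, delay=0)
print(result)
```

In real use, `geocode` would be `geolocator.geocode` and `delay` would stay at about one second to keep the request rate low.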




Narrow the localities dataframe to 10 localities with the highest population (due to Foursquare subscription limitations)

In [6]:
localities.sort_values(by = "Locality population", ascending=False, inplace=True)
localities.reset_index(drop=True, inplace=True)
localities_table = localities.head(10)

# 2. Explore the Foursquare data via the API <a name="Fsquare"></a>

### 2.1 Get the list of the recommended hotel/accommodation businesses for each locality

In [7]:
# these are the credentials for the foursquare API
CLIENT_ID = Square.CLIENT_ID
CLIENT_SECRET = Square.CLIENT_SECRET 
TOKEN = Square.TOKEN
VERSION = "20201128"
LIMIT=100
radius = 500

API_counter=0 #variable to monitor how many times the API is called


In [15]:
def find_venues(localities_table):
    """Return the recommended venues within `radius` metres of each locality and the API call count."""
    global API_counter
    recommended = []
    # the last column holds the (lat, lon) tuple, column 1 holds the locality name
    for (lat, lon), Locality in zip(localities_table.iloc[:, -1], localities_table.iloc[:, 1]):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                    CLIENT_ID, 
                    CLIENT_SECRET, 
                    VERSION, 
                    lat, 
                    lon, 
                    radius, 
                    LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        API_counter += 1

        # keep the name, first category and coordinates of every recommended venue
        for entry in results:
            recommended.append((Locality,
                                entry["venue"]["name"],
                                entry["venue"]["categories"][0]["name"],
                                entry["venue"]["location"]["lat"],
                                entry["venue"]["location"]["lng"],
                                ))
    venues = pd.DataFrame(recommended, columns=["Locality", "Venue_name", "Category", "Latitude", "Longitude"])
    return (venues, API_counter)
    
    
    

In [16]:
Recom_venues, API = find_venues(localities_table)
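
The request URL in `find_venues` is assembled with `str.format`. A minimal sketch of the same URL built with `urllib.parse.urlencode` is shown here; `build_explore_url` is a hypothetical helper, and `"ID"` and `"SECRET"` are placeholders rather than real credentials. Note that `urlencode` percent-encodes the comma in `ll` as `%2C`.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Assemble a venues/explore request URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only
url = build_explore_url("ID", "SECRET", "20201128", 52.52, 13.405, 500, 100)
print(url)
```

Keeping the parameters in a dictionary makes it harder to swap two positional arguments by mistake.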

### 2.2 Filter the recommended venues for the hotel/accommodation businesses only

In [10]:
list_of_hotels = ["Hotel", "Bed & Breakfast", "Boarding House", "Hostel", "Inn", "Motel", "Resort", "Vacation Rental"]

In [11]:
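# keep only the accommodation categories; reset_index returns a new dataframe,
# so columns can be added to Hotel_venues later without touching Recom_venues
Hotel_venues = Recom_venues[Recom_venues["Category"].isin(list_of_hotels)].reset_index(drop=True)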

There are 13 recommended hotel/accommodation businesses in the selected Berlin localities.
Let's see on the map where these businesses are located.


In [12]:
Berlin_map = folium.Map(width=700, height=400,location=[52.5200, 13.4050], zoom_start=11)

for lat, lon,cat, vn in zip(Hotel_venues['Latitude'], Hotel_venues['Longitude'],Hotel_venues["Category"], Hotel_venues['Venue_name']):
    label = folium.Popup(cat + "\n" +vn, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color= "#7F00FF",
        fill=True,
        fill_color= "#7F00FF",
        fill_opacity=0.7).add_to(Berlin_map)

Berlin_map

In [13]:
Hotel_venues

| | Locality | Venue_name | Category | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Charlottenburg | Leonardo Hotel Berlin | Hotel | 52.512792 | 13.304764 |
| 1 | Charlottenburg | Smart Stay | Hostel | 52.513457 | 13.305024 |
| 2 | Charlottenburg | Ibis Styles Berlin an der Oper | Hotel | 52.51168 | 13.312215 |
| 3 | Friedrichshain | Sunflower Hostel | Hostel | 52.509634 | 13.446249 |
| 4 | Friedrichshain | Kiez Hostel Berlin | Hostel | 52.50985 | 13.447432 |
| 5 | Wilmersdorf | Hotel-Pension Gasteiner Hof | Hotel | 52.487317 | 13.323413 |
| 6 | Wilmersdorf | Ibis Berlin City West | Hotel | 52.489119 | 13.316607 |
| 7 | Gesundbrunnen | MOXY Berlin Humboldthain Park | Hotel | 52.549568 | 13.384071 |
| 8 | Gesundbrunnen | Hostelo Berlin | Hostel | 52.550452 | 13.391515 |
| 9 | Gesundbrunnen | Apartment & Hotel Ocak | Hotel | 52.550891 | 13.391649 |


Let's find out which of these are located in good business areas. The conditions set previously need to be met:
* The number of recommended venues near the hotel/accommodation business, say within a radius of 400m
* The venue is in an area that is well served by public transport, here taken to mean public transport services within a 200m radius
* The hotel/accommodation is located in an area with high population (based on the dataset)
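
To give a feel for what the 200m and 400m radii mean on the ground, here is a small sketch that computes the great-circle (haversine) distance between two points. The Foursquare API applies the radius itself, so this is only an illustration; `haversine_m` and `within_radius` are hypothetical helpers, and the coordinates are taken from the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def within_radius(origin, point, radius_m):
    """True if `point` lies within `radius_m` metres of `origin`."""
    return haversine_m(*origin, *point) <= radius_m

# Coordinates from the table above
leonardo = (52.512792, 13.304764)   # Leonardo Hotel Berlin
smart_stay = (52.513457, 13.305024)  # Smart Stay
sunflower = (52.509634, 13.446249)  # Sunflower Hostel

print(within_radius(leonardo, smart_stay, 200))  # the two Charlottenburg venues
print(within_radius(leonardo, sunflower, 400))   # Charlottenburg vs Friedrichshain
```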


### 2.3 Find the number of recommended venues within 400m of the hotel/accommodation business

In [17]:
def check_for_venues(Hotel_venues):
    """Count the recommended venues within 400m of each hotel/accommodation business."""
    global API_counter
    places = []
    # columns 3 and 4 hold the latitude and longitude of each business
    for lat, lon in zip(Hotel_venues.iloc[:, 3], Hotel_venues.iloc[:, 4]):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                    CLIENT_ID, 
                    CLIENT_SECRET, 
                    VERSION, 
                    lat, 
                    lon, 
                    400, 
                    LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        API_counter += 1

        places.append(len(results))

    return (places, API_counter)

No_of_venues, API = check_for_venues(Hotel_venues)
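
The notebook tracks API usage with a global `API_counter`. One alternative, sketched here under the assumption that all requests go through a single function, is a small decorator that counts calls. `count_calls` and `fetch` are hypothetical names, and `fetch` returns a canned response instead of calling Foursquare.

```python
import functools

def count_calls(func):
    """Wrap `func` so that the number of calls is kept in `wrapper.calls`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def fetch(url):
    """Hypothetical stand-in for requests.get(url).json(); returns a canned response."""
    return {"response": {"groups": [{"items": []}]}}

for _ in range(3):
    fetch("https://api.foursquare.com/v2/venues/explore")

print(fetch.calls)  # 3
```

The count lives on the wrapped function, so no `global` statement is needed in the code that makes the requests.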
            
      


### 2.4 Is the venue in an area with good public transport? Public transport within 200m and >=2 transport means?

In [18]:
transport_list = ["Travel & Transport", "Transportation Service", "Tram Station", "Train Station",
                  "Train", "Platform", "Taxi", "Bus Station", "Bus Line", "Bus Stop", "Cable Car"]

In [19]:
Public_transport = []
for lat, lon in zip(Hotel_venues.iloc[:, 3], Hotel_venues.iloc[:, 4]):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lon, 
                200, 
                LIMIT)

    results = requests.get(url).json()["response"]["venues"]
    API_counter += 1  # module-level counter, so no global statement is needed here

    # collect the transport categories found within 200m of this business
    temp = []
    for entry in results:
        try:
            if entry["categories"][0]["name"] in transport_list:
                temp.append(entry["categories"][0]["name"])
        except (IndexError, KeyError):
            # venue without a category: skip it
            pass
    # two or more transport venues count as adequate public transport
    if len(temp) > 1:
        Public_transport.append(temp)
    else:
        Public_transport.append("Not adequate transport")
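
The classification rule used above (two or more transport venues within 200m) can be isolated into a small function and tried on sample data. This is a sketch: `transport_nearby` is a hypothetical helper, `TRANSPORT` is a shortened version of `transport_list`, and the sample entries only mimic the shape of the `venues` items in the search response.

```python
TRANSPORT = {"Train Station", "Platform", "Bus Stop", "Bus Line", "Tram Station"}

def transport_nearby(venues, transport=TRANSPORT, minimum=2):
    """Return the transport categories found, or a marker string if fewer than `minimum`."""
    found = []
    for venue in venues:
        categories = venue.get("categories") or []
        # only the first category of each venue is considered
        if categories and categories[0].get("name") in transport:
            found.append(categories[0]["name"])
    return found if len(found) >= minimum else "Not adequate transport"

# Illustrative entries, not real API output
sample = [
    {"name": "A", "categories": [{"name": "Bus Stop"}]},
    {"name": "B", "categories": []},
    {"name": "C", "categories": [{"name": "Café"}]},
    {"name": "D", "categories": [{"name": "Bus Line"}]},
]

print(transport_nearby(sample))      # ['Bus Stop', 'Bus Line']
print(transport_nearby(sample[:3]))  # Not adequate transport
```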
        
        
 

### 2.5 Merge hotel table with retrieved data

In [20]:
Hotel_venues["No. of recom. places"] = np.array(No_of_venues)
Hotel_venues["Public transport"] = Public_transport

### 2.6 Add locality population to the table

In [21]:
# look up the population of each hotel's locality in the localities table
population_by_locality = localities_table.set_index("Locality")["Locality population"]
Hotel_venues["Locality population"] = Hotel_venues["Locality"].map(population_by_locality)


In [22]:
Hotel_venues

| | Locality | Venue_name | Category | Latitude | Longitude | No. of recom. places | Public transport | Locality population |
|---|---|---|---|---|---|---|---|---|
| 0 | Charlottenburg | Leonardo Hotel Berlin | Hotel | 52.512792 | 13.304764 | 40 | Not adequate transport | 118704 |
| 1 | Charlottenburg | Smart Stay | Hostel | 52.513457 | 13.305024 | 28 | Not adequate transport | 118704 |
| 2 | Charlottenburg | Ibis Styles Berlin an der Oper | Hotel | 52.51168 | 13.312215 | 23 | Not adequate transport | 118704 |
| 3 | Friedrichshain | Sunflower Hostel | Hostel | 52.509634 | 13.446249 | 48 | Not adequate transport | 114050 |
| 4 | Friedrichshain | Kiez Hostel Berlin | Hostel | 52.50985 | 13.447432 | 59 | Not adequate transport | 114050 |
| 5 | Wilmersdorf | Hotel-Pension Gasteiner Hof | Hotel | 52.487317 | 13.323413 | 41 | [Bus Stop, Bus Line, Bus Stop] | 92815 |
| 6 | Wilmersdorf | Ibis Berlin City West | Hotel | 52.489119 | 13.316607 | 25 | Not adequate transport | 92815 |
| 7 | Gesundbrunnen | MOXY Berlin Humboldthain Park | Hotel | 52.549568 | 13.384071 | 30 | [Bus Line, Bus Stop] | 82729 |
| 8 | Gesundbrunnen | Hostelo Berlin | Hostel | 52.550452 | 13.391515 | 48 | [Train Station, Platform, Platform, Platform, ... | 82729 |
| 9 | Gesundbrunnen | Apartment & Hotel Ocak | Hotel | 52.550891 | 13.391649 | 45 | [Train Station, Platform, Platform] | 82729 |


It seems that not all of the hotel/accommodation businesses have good public transport within 200m.

It is important that the new hotel business is located in an area with low competition, that is, an area with the fewest hotels that meet the criteria listed above.


# 3. Get the scores for each venue <a name="scores"></a>

The score for each venue is based on: (1) the number of recommended places in the vicinity, (2) the number of public transport options nearby, (3) the locality population, and (4) the number of hotel businesses in the area that meet the search criteria.

The formula used to score each venue is ( (1) x (2) x (3) ) / (4), so more nearby venues, more transport and a larger population raise the score, while more competing hotels lower it.
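
As a worked instance of the formula as stated, the sketch below wraps it in a function and evaluates it for made-up inputs. `venue_score` is a hypothetical helper, and the numbers are illustrative rather than taken from the results.

```python
def venue_score(recommended_places, transport_means, population, hotels_in_area):
    """Score = (recommended places x transport means x population) / hotels in area."""
    if hotels_in_area <= 0:
        raise ValueError("hotels_in_area must be positive")
    return round(recommended_places * transport_means * population / hotels_in_area, 2)

# Illustrative inputs: 40 venues, 2 transport options, 100,000 residents, 4 hotels
print(venue_score(40, 2, 100_000, 4))  # 2000000.0

# Halving the number of competing hotels doubles the score
print(venue_score(40, 2, 100_000, 2))  # 4000000.0
```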




### 3.1 First let's get the number of hotel business for each area and include this in the dataframe

In [23]:
# count the recommended hotel/accommodation businesses per locality
hotels_per_locality = Hotel_venues["Locality"].value_counts()

In [24]:
Hotel_venues["No. of hotels in area"] = Hotel_venues["Locality"].map(hotels_per_locality)

### 3.2 Second let's create a column in the dataframe for the number of transport means in the vicinity of each business

In [25]:
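Nearby_transport = []
for i in range(0, Hotel_venues.shape[0]):
    # column 6 is "Public transport": a list of transport venues or the "Not adequate transport" marker
    if Hotel_venues.iloc[i, 6] == "Not adequate transport":
        Nearby_transport.append(1)  # counted as 1, which keeps the score product above zero
    else:
        Nearby_transport.append(len(Hotel_venues.iloc[i, 6]))

Hotel_venues.insert(7, column="No. of transport means", value=Nearby_transport)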

### 3.3 Now let's calculate the scores for each hotel/ accommodation business as described above

In [26]:
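scores = []
for i in range(0, Hotel_venues.shape[0]):
    # columns: 5 = no. of recommended places, 7 = no. of transport means,
    # 8 = locality population, 9 = no. of hotels in area.
    # As written, this expression uses column 5 twice and a constant factor of 8;
    # it never reads the locality population in column 8.
    scores.append(round((Hotel_venues.iloc[i, 5] * Hotel_venues.iloc[i, 7] * Hotel_venues.iloc[i, 5] * 8) / Hotel_venues.iloc[i, 9], 2))

Hotel_venues["scores"] = np.asarray(scores)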


Sort the table by the score values - high to low

In [27]:
Hotel_venues.sort_values(by="scores", ascending=False,inplace=True )
Hotel_venues.reset_index(drop=True, inplace=True)

In [294]:
Hotel_venues

| | Locality | Venue_name | Category | Latitude | Longitude | No. of recom. places | Public transport | No. of transport means | Locality population | No. of hotels in area | scores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gesundbrunnen | Hostelo Berlin | Hostel | 52.550452 | 13.391515 | 48 | [Train Station, Platform, Platform, Platform, ... | 5 | 82729 | 3 | 30720.0 |
| 1 | Mitte | Radisson Blu | Hotel | 52.519561 | 13.402857 | 65 | [Bus Line, Bus Line] | 2 | 79582 | 3 | 22533.33 |
| 2 | Wilmersdorf | Hotel-Pension Gasteiner Hof | Hotel | 52.487317 | 13.323413 | 41 | [Bus Stop, Bus Line, Bus Stop] | 3 | 92815 | 2 | 20172.0 |
| 3 | Gesundbrunnen | Apartment & Hotel Ocak | Hotel | 52.550891 | 13.391649 | 45 | [Train Station, Platform, Platform] | 3 | 82729 | 3 | 16200.0 |
| 4 | Friedrichshain | Kiez Hostel Berlin | Hostel | 52.50985 | 13.447432 | 59 | Not adequate transport | 1 | 114050 | 2 | 13924.0 |
| 5 | Mitte | Capri By Fraser Berlin | Hotel | 52.513972 | 13.404902 | 49 | [Bus Stop, Bus Stop] | 2 | 79582 | 3 | 12805.33 |
| 6 | Mitte | Hotel Nikolai Residence | Hotel | 52.517283 | 13.406423 | 65 | Not adequate transport | 1 | 79582 | 3 | 11266.67 |
| 7 | Friedrichshain | Sunflower Hostel | Hostel | 52.509634 | 13.446249 | 48 | Not adequate transport | 1 | 114050 | 2 | 9216.0 |
| 8 | Gesundbrunnen | MOXY Berlin Humboldthain Park | Hotel | 52.549568 | 13.384071 | 30 | [Bus Line, Bus Stop] | 2 | 82729 | 3 | 4800.0 |
| 9 | Charlottenburg | Leonardo Hotel Berlin | Hotel | 52.512792 | 13.304764 | 40 | Not adequate transport | 1 | 118704 | 3 | 4266.67 |


### 3.4 Let's visualise these businesses on the map based on their scoring 

In [40]:
# scale every score to the 0-255 range of the red channel (the highest score maps to 255)
red = (255 * Hotel_venues["scores"] / Hotel_venues["scores"].max()).to_numpy()

Business_map = folium.Map(width=700, height=400,location=[52.5200, 13.4050], zoom_start=12)

for i, (lat, lon) in enumerate(zip(Hotel_venues['Latitude'], Hotel_venues['Longitude'])):
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        color= "#{0:02x}{1:02x}{2:02x}".format(int(red[i]),0,0),
        fill=True,
        fill_color= "#{0:02x}{1:02x}{2:02x}".format(int(red[i]),0,0) ,
        fill_opacity=0.7).add_to(Business_map)

Business_map
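
The colour of each marker comes from scaling the score to the red channel of an RGB hex string. A small sketch of that mapping as a standalone function (`score_to_hex` is a hypothetical name):

```python
def score_to_hex(score, max_score):
    """Map a score in [0, max_score] to a hex colour from black to pure red."""
    red = int(255 * score / max_score)  # truncates towards zero
    return "#{0:02x}{1:02x}{2:02x}".format(red, 0, 0)

print(score_to_hex(10, 10))  # #ff0000
print(score_to_hex(5, 10))   # #7f0000
print(score_to_hex(0, 10))   # #000000
```

Low scores map to near-black markers, which can be hard to see on a dark map tile.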






# 4. Results <a name="results"></a>

From the map above we can see the areas with the higher potential to be a good business based on the available information.

The areas with a more intense red colour are those with a higher score, and are therefore good areas to consider.

If we take a closer look we see that the competitors in the areas of interest are as follows:
* Gesundbrunnen: 2 competitors
* Wilmersdorf: 1 competitor
* Mitte: 3 competitors

Friedrichshain and Charlottenburg clearly score lower, so we disregard these areas for the new hotel business.


# 5. Conclusion and future work <a name="conclusion"></a>

The areas of interest have been identified and marked on the map using folium.
These areas of interest are Gesundbrunnen, Wilmersdorf and Mitte.

These results need to be shared with the stakeholders of the new business to receive their feedback and recommendations.
Their feedback should be used for further analysis. 

We can also propose the following studies, provided that the required data can be obtained:

 * Analyse the tourism in each area, i.e. how strong the tourist population is in each area and what the reasons are for visiting it.
 * Consider the cost of building in each area: land, permissions, agreements.
 * Analyse the demographics of each area, i.e. how the area is growing, how it grew within the last 10 years or so, and what the career prospects around the area are.












Thank you for reading through my work! I hope you found this interesting. If you have feedback, even better! Drop me some thoughts at _cn.costantinou@gmail.com_ 



