<h1 align="left">Where to Open a new Airbnb in Tokyo?</h1>


## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Imports](#imports)
* [Foursquare Credentials and Version](#foursquare)
* [Data Collection & Preparation](#dataCollAndPrep)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#resultsDiscussion)
* [Conclusion](#conclusion)

# Introduction: Business Problem <a name="introduction"></a>

Visiting Tokyo feels like taking a trip to the future. Robots serve and deliver your food. Vending machines sell everything from umbrellas to puppies. This modern world blends with old traditions to create a unique place, so it is no wonder that Tokyo attracts countless tourists every year. As a result, Tokyo has been a growing market for property owners who rent out their space to the public, known in Japanese as 民泊 (minpaku). As in many other parts of the world, Airbnb has started to take over a large part of the hotel industry through its disruptive business model.

Since Airbnb is already playing a major role for tourism in Tokyo and is likely to continue to do so in the future, I would like to perform a data-driven location analysis for possible new Airbnbs.

Tokyo is the de facto capital and most populous prefecture of Japan, located at the head of Tokyo Bay. As of 2021, the prefecture has an estimated population of 13,960,236. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.393 million residents as of 2020. Tourism to this large city has been increasing steadily year over year. It is very likely that the COVID-19 period is just an exception to this trend. I expect a rebound in this sector once COVID-19 is largely and effectively treatable or preventable. This rebound is also one reason why many people fear inflation, but that is another topic.

To figure out where it would be worthwhile to open an Airbnb in Tokyo, we need to understand the supply and demand relationship first. When supply noticeably exceeds demand, strong competition is likely to occur, often leading to price wars. At the end of this process the market equilibrium will be restored by the exit of market participants. This circumstance leads us to the first principle of this analysis:
- If a new Airbnb is to be opened, competition in the surrounding area should be at a healthy level.

Let’s take a look at the demand. In Tokyo, almost everything can be reached easily via the public transport network. Nevertheless, it can be assumed that tourists want to be surrounded by certain venues and do not want to be too far away from the city center. Tourists tend to like traditional sushi bars, authentic local ramen stalls and extraordinary coffee shops nearby. The bottom line is that it should be an area where tourists can discover a lot, which leads us to the second principle:
- If a new Airbnb is to be opened, the area should have top-rated venues that are relevant for tourists, and the location should not be too far away from the city center.

The goal is to identify places in Tokyo that fulfill both principles.

# Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
- number of existing Airbnbs in the neighborhoods of Tokyo
- most popular venues which mainly characterise a neighborhood

The following data sources are needed to extract or generate the required information:
- Airbnb data: http://insideairbnb.com/get-the-data.html
- Venues data: https://foursquare.com/

# Imports <a name="imports"></a>

In [None]:
import re
import json
import requests
import numpy as np
from bs4 import BeautifulSoup

import pandas as pd
#display all rows
pd.set_option('display.max_rows', None)
#display all columns
pd.set_option('display.max_columns', None)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

from geopy.geocoders import Nominatim

import folium

import matplotlib.image as mpimg
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

print('Libraries imported.')

# Foursquare Credentials and Version <a name="foursquare"></a>

In [None]:
CLIENT_ID = 'F0DBD3KFRW1D155EIDSD35QWHUYJA2KGUDORUHARQMAHDZQK' # your Foursquare ID
CLIENT_SECRET = 'CR2E2ZJGBG5UQZUHTLOLM4PAWAAUTHISBEQ1I3PIELIFN4AQ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
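
Hard-coded keys in a shared notebook are visible to every reader. As a minimal sketch of an alternative, the keys could be read from environment variables instead. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for illustration; they are not part of the original notebook or of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare keys from environment variables (hypothetical names)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment.')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters when a secret has to be printed."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Usage with an explicit mapping so the sketch runs without real keys
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```

Passing the mapping as a parameter keeps the function testable without touching the real environment.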

# Data Collection & Preparation <a name="dataCollAndPrep"></a>

## Let's get the Airbnb Data

In [None]:
bnbdf= pd.read_csv("http://data.insideairbnb.com/japan/kant%C5%8D/tokyo/2021-02-25/visualisations/listings.csv")

In [None]:
bnbdf.head()

The data looks pretty good so far. Next we delete all columns that are not relevant.

In [None]:
df_Airbnb = bnbdf.drop(['id', 'host_id', 'host_name', 'neighbourhood_group', 'room_type', 'minimum_nights', 'number_of_reviews', 'last_review', 'calculated_host_listings_count', 'availability_365'], axis=1)

Rename the remaining columns, for example 'name' to 'Venue'.

In [None]:
df_Airbnb.rename(columns = {'name': 'Venue', 'neighbourhood': 'Neighborhood', 'latitude':'Venue Latitude', 'longitude':'Venue Longitude', 'price': 'Price', 'reviews_per_month': 'Reviews per Month'}, inplace = True)

In [None]:
df_Airbnb.head()

Initializing a column for the venue categories.

In [None]:
df_Airbnb['Venue Category'] = 'Airbnb'

In [None]:
df_Airbnb.head()

Drop every row that has no coordinate values.

In [None]:
nan_value = float("NaN")
df_Airbnb.replace("", nan_value, inplace=True)
# drop all NaN rows when NaN is found in the column COLUMNNAME
df_Airbnb.dropna(subset = ["Venue Latitude"], inplace=True)
df_Airbnb.dropna(subset = ["Venue Longitude"], inplace=True)
df_Airbnb.head()

Now we have all neighborhoods and Airbnbs of Tokyo.

## Let's get the Coordinates of each Neighborhood

This is necessary, because we will create clusters based on the neighborhoods.

Function for getting the coordinates of a neighborhood:

In [None]:
def getCoordinatesOfLocation(address: str):
    d = dict()
    d['latitude'] = ""
    d['longitude'] = ""
    try: 
        geolocator = Nominatim(user_agent="tokyo_explorer")
        location = geolocator.geocode(address)
        d['latitude'] = location.latitude
        d['longitude'] = location.longitude
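    except Exception:
        # geocode() returns None for unknown addresses and can raise on network errors;
        # in both cases the empty default values are returned
        pass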
    return d

Test the function

In [None]:
getCoordinatesOfLocation('Sumida Ku Tokyo')

Let's define the future data frame structure.

In [None]:
df_Airbnb["Neighborhood Latitude"] = ""
df_Airbnb["Neighborhood Longitude"] = ""
df_Airbnb = df_Airbnb.reindex(['Neighborhood','Neighborhood Latitude','Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Price', 'Reviews per Month', 'Venue Category'], axis=1)
df_Airbnb.head()

Let's create a lookup list for the neighborhood coordinates. It is used to fill in the coordinates of all Airbnbs in Tokyo, so that each neighborhood has to be geocoded only once. This step speeds up the processing of the data considerably.

In [None]:
neighborhood_dic = []

Let's create a search function for this lookup list.

In [None]:
def search(nh):
    for p in neighborhood_dic:
        if p['nh'] == nh:
            return p
    return ""

In [None]:
for index, row in df_Airbnb.iterrows():
    curAddressPrefix = row['Neighborhood']
    curAddress = curAddressPrefix + ' Tokyo'
    dicSearchResult = search(curAddressPrefix) 
    if dicSearchResult:
        df_Airbnb.at[index,'Neighborhood Latitude'] = dicSearchResult['lat']
        df_Airbnb.at[index,'Neighborhood Longitude'] = dicSearchResult['long']
    else:
        curCoordinatesObject = getCoordinatesOfLocation(curAddress)
        curLatitude = curCoordinatesObject['latitude']
        curLong = curCoordinatesObject['longitude']
        df_Airbnb.at[index,'Neighborhood Latitude'] = curLatitude
        df_Airbnb.at[index,'Neighborhood Longitude'] = curLong
        # cache the raw values so cached rows get the same type as freshly geocoded rows
        neighborhood_dic.append({'nh': curAddressPrefix, 'lat': curLatitude, 'long': curLong})
    print(index, end='\r')
print()
df_Airbnb.head()
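
The caching idea behind the loop above can also be sketched with a plain dictionary, which avoids scanning a list on every row. This is a minimal sketch, not the notebook's implementation: `make_cached_geocoder` and `fake_geocode` are hypothetical names, and the stand-in geocoder returns placeholder coordinates so the example runs offline.

```python
def make_cached_geocoder(geocode_fn):
    """Wrap a geocoding function so each neighborhood is looked up only once."""
    cache = {}

    def lookup(neighborhood):
        if neighborhood not in cache:
            cache[neighborhood] = geocode_fn(neighborhood + ' Tokyo')
        return cache[neighborhood]

    return lookup

# Stand-in for getCoordinatesOfLocation (placeholder values, no network access)
calls = []

def fake_geocode(address):
    calls.append(address)
    return {'latitude': 35.0, 'longitude': 139.0}

lookup = make_cached_geocoder(fake_geocode)
for nh in ['Sumida Ku', 'Sumida Ku', 'Taito Ku']:
    coords = lookup(nh)

print(len(calls), 'geocoding calls for 3 rows')
```

In the notebook, `getCoordinatesOfLocation` could be passed in place of the stand-in.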

Let's check if every neighborhood has coordinates.

In [None]:
cols_to_excl = ['Reviews per Month']
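# Index.difference replaces the set-style `^` operator, which is no longer supported on a pandas Index
df_Airbnb.loc[df_Airbnb[df_Airbnb.columns.difference(cols_to_excl)].isnull().any(axis=1)]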

Perfect, our table of Airbnb’s master data is created.

## Let's Visualize all Airbnbs in Tokyo

In [None]:
address = 'Setagaya Ku Tokyo'

geolocator = Nominatim(user_agent="tokyo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Tokyo are {}, {}.'.format(latitude, longitude))

tokyo_map = folium.Map(location=[latitude, longitude], zoom_start=11)    


Let's display all existing Airbnbs in Tokyo. This gives an overview of the total competition.

In [None]:
for lat, lng in zip(df_Airbnb['Venue Latitude'], df_Airbnb['Venue Longitude']):
    label = '{}'.format('Airbnb')
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(tokyo_map)  

tokyo_map

In [None]:
df_Airbnb_GROUPD = df_Airbnb.groupby(['Neighborhood'], sort=True)['Venue'].count()
df_Airbnb_GROUPD.plot.bar(figsize=(18,6), title="Total Airbnbs grouped by Neighborhood")

## Let's get the Neighborhood Data from Tokyo

The following function sends an explore request for each neighborhood and returns up to 100 of the most popular places within a given radius of the neighborhood's coordinates (500 meters by default). With this information the neighborhoods can be classified. This classification is intended to identify where an Airbnb fits best. Together with the information about the competition, it should be possible to derive suitable recommendations.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            radius, 
            LIMIT)
            
        # make the GET request once and skip the neighborhood if the response is malformed
        response = requests.get(url).json()
        try:
            results = response["response"]['groups'][0]['items']
        except (KeyError, IndexError):
            print(response)
            continue
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
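
The parsing step inside `getNearbyVenues` can be looked at in isolation. The following is a minimal sketch that assumes the nested payload shape the function above relies on (`response` → `groups` → `items` → `venue`). The helper name `extract_venues`, the `'Unknown'` fallback category and the sample payload are made up for illustration.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    try:
        items = payload['response']['groups'][0]['items']
    except (KeyError, IndexError):
        return []  # malformed or error response: nothing to extract

    rows = []
    for item in items:
        venue = item['venue']
        # a venue without categories would otherwise raise an IndexError
        categories = venue.get('categories') or [{'name': 'Unknown'}]
        rows.append((
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0]['name'],
        ))
    return rows

# Hand-written sample in the assumed shape (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Ramen',
               'location': {'lat': 35.7, 'lng': 139.8},
               'categories': [{'name': 'Ramen Restaurant'}]}},
    {'venue': {'name': 'Example Spot',
               'location': {'lat': 35.71, 'lng': 139.81},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
print(rows)
```

Separating parsing from the HTTP call makes it possible to check the logic without API credentials.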

Let's get a distinct list of all neighborhoods so that the venue data can be fetched once per neighborhood.

In [None]:
disc_neighborhood_df = df_Airbnb.copy()
disc_neighborhood_df.drop(['Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category', 'Price', 'Reviews per Month'], axis=1, inplace=True)
disc_neighborhood_df.drop_duplicates(subset=['Neighborhood'], keep='first', inplace = True)
disc_neighborhood_df.head()

Let's get Tokyo's venues.

In [None]:
tokyo_venues = getNearbyVenues(names=disc_neighborhood_df['Neighborhood'], latitudes=disc_neighborhood_df['Neighborhood Latitude'], longitudes=disc_neighborhood_df['Neighborhood Longitude'])
print(tokyo_venues.shape)
tokyo_venues.head()

In [None]:
tokyo_venues.groupby(['Neighborhood'], sort=True)['Venue'].count().plot.bar(figsize=(18,6), title="Venues grouped by Neighborhood")

We can see that some neighborhoods have more popular places than others. 

# Methodology <a name="methodology"></a>

Now we have all the public data to find out which place has potential for a new Airbnb. It is time for the data analysis, which will be conducted as follows.

With the Airbnb data we will create a competition index (0–100) for all neighborhoods. Additionally, we will use a heatmap to display the distribution of all Airbnbs, a second heatmap to display the differences in rental prices and a third one to get an impression of booking utilization. With this evaluation, it should be possible to obtain a good understanding of the competitive situation.

After we have analyzed all the information about the competitive situation, we will analyze the attractiveness of each neighborhood regarding Airbnbs with tourists as target group. Here we will cluster all neighborhoods of Tokyo based on their venue structure with the k-means algorithm. Through the characteristics of the clusters, we can find out which clusters are interesting for tourists.

In the final part of the analysis, we combine the information from the competitive situation and the attractiveness of each cluster for tourists. Interesting clusters are examined further up to the neighborhood level.

# Analysis <a name="analysis"></a>

## Airbnb

The Airbnb data looks good and requires no further adjustments. In order to create a competition index, we group all Airbnbs by their neighborhoods and calculate their percentage of the total number of Airbnbs in Tokyo.

In [None]:
# share of all Tokyo listings per neighborhood, expressed in percent
df_compIndex = df_Airbnb['Neighborhood'].value_counts(normalize=True, sort=False).mul(100)
# name the index and the series so that the resulting column is called 'Competition Index'
df_compIndex.index.name, df_compIndex.name = 'Neighborhood', 'Competition Index'
df_compIndex = df_compIndex.to_frame()
df_compIndex.sort_values('Competition Index', ascending=False, inplace=True)
df_compIndex.plot.bar(figsize=(18,6), title="Competition Index (0-100) of Airbnbs grouped by Neighborhood")
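
To make the index concrete, here is a minimal sketch on a toy table. The neighborhood names `A`, `B` and `C` are placeholders, not real data. The index is simply each neighborhood's share of all listings in percent.

```python
import pandas as pd

# Toy listings table with placeholder neighborhood names
listings = pd.DataFrame({'Neighborhood': ['A', 'A', 'A', 'B', 'C']})

competition_index = (
    listings['Neighborhood']
    .value_counts(normalize=True)    # share of all listings per neighborhood
    .mul(100)                        # express the share as a percentage
    .rename('Competition Index')
    .rename_axis('Neighborhood')
    .to_frame()
)
print(competition_index)
```

With three of five listings, neighborhood `A` gets an index of 60, while `B` and `C` get 20 each. The values always sum to 100.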

Now we have an initial understanding of how Airbnbs are distributed in Tokyo. Still, it lacks a clear picture. Therefore, we generate a heatmap with the heatmap plugin from folium that displays this distribution on a map.

In [None]:
from folium.plugins import HeatMap
heatMap_Tokyo = folium.Map(location=[latitude, longitude], zoom_start=11)
data = list(zip(df_Airbnb['Venue Latitude'],df_Airbnb['Venue Longitude']))
HeatMap(data,radius=8,gradient={0.2:'blue',0.4:'purple',0.6:'orange',1.0:'red'}).add_to(heatMap_Tokyo)
display(heatMap_Tokyo)

We now have a fairly good understanding of how the competition is distributed. Next, it is interesting to see whether Airbnbs are more expensive in some areas of the city than in others. For this we create a price heatmap. Before we visualize it, we need to identify outliers, which increases the likelihood that the prices reflect the area rather than, for example, extremely luxurious furnishings. We do that by excluding all Airbnbs that are more expensive than the upper quartile of the boxplot for the rental fees.

In [None]:
boxplot_Airbnb_price = df_Airbnb.boxplot(column=['Price'])
df_Airbnb['Price'].describe()

In [None]:
df_Airbnb_noOutlier = df_Airbnb[df_Airbnb['Price'] < 12000]
df_Airbnb_noOutlier.head()
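
The threshold above is hard-coded from the `describe()` output. As a minimal sketch, the same cut-off can be derived directly from the data with `Series.quantile`. The prices below are toy values, and the IQR-based fence is shown only as a common alternative, not as the method used in this analysis.

```python
import pandas as pd

prices = pd.Series([3000, 5000, 8000, 12000, 90000], name='Price')  # toy values

# upper quartile (75th percentile), the cut-off used in this notebook
upper_quartile = prices.quantile(0.75)
kept = prices[prices < upper_quartile]

# common alternative: upper whisker of a boxplot, Q3 + 1.5 * IQR
iqr = upper_quartile - prices.quantile(0.25)
upper_fence = upper_quartile + 1.5 * iqr

print(upper_quartile, upper_fence)
print(kept.tolist())
```

Deriving the value in code keeps the filter correct when the listings file is updated.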

Next, we create a scatter plot and place the map of Tokyo behind it.

In [None]:
# import our image (https://www.openstreetmap.org/export#map=12/35.6933/139.7650)
tokyo_img = mpimg.imread('Tokyo2.png')
# plot the data
ax = df_Airbnb_noOutlier.plot(
    kind="scatter", 
    x="Venue Longitude", 
    y="Venue Latitude",
    figsize=(15*2,8*2),
    c="Price", 
    cmap='Reds',
    colorbar=True, 
    alpha=0.4,
)
# use our map with its bounding coordinates
plt.imshow(tokyo_img, extent=[139.5384, 139.9916, 35.5702, 35.8161], alpha=0.5)  
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

plt.show()

It would be perfect to know how utilized the Airbnbs are. Unfortunately, this data is not available to the public. However, the prices of Airbnbs already reveal a lot about the locations, as those are formed via supply and demand. Nevertheless, the Airbnb platform is based on mutual reviews, so the number of average monthly reviews can be an indicator of the occupancy. Let’s visualize the average monthly reviews as a heatmap, like the price heatmap.

In [None]:
df_Airbnb_avgMonthReviews_cleaned = df_Airbnb.copy()
nan_value = float("NaN")
df_Airbnb_avgMonthReviews_cleaned.replace("", nan_value, inplace=True)
# drop all NaN rows when NaN is found in the column COLUMNNAME
df_Airbnb_avgMonthReviews_cleaned.dropna(subset = ["Reviews per Month"], inplace=True)
print('Rows of the original Airbnb dataset: ', df_Airbnb.shape[0])
print('Rows of the cleaned Airbnb dataset for the average monthly reviews: ', df_Airbnb_avgMonthReviews_cleaned.shape[0])
print('Difference of both datasets: ', df_Airbnb.shape[0] - df_Airbnb_avgMonthReviews_cleaned.shape[0])

2,107 Airbnbs have no data in the relevant column. Therefore, the dataset is incomplete regarding the average monthly reviews. Nevertheless, let's create the heatmap to get an impression.

Let's get rid of the outliers to capture the average situation.

In [None]:
df_Airbnb_avgMonthReviews_cleaned['Reviews per Month'].describe()

In [None]:
df_Airbnb_avgMonthReviews_noOutlier = df_Airbnb_avgMonthReviews_cleaned[df_Airbnb_avgMonthReviews_cleaned['Reviews per Month'] < 1.610000]

In [None]:
# import our image (https://www.openstreetmap.org/export#map=12/35.6933/139.7650)
tokyo_img = mpimg.imread('Tokyo2.png')
# plot the data
ax = df_Airbnb_avgMonthReviews_noOutlier.plot(
    kind="scatter", 
    x="Venue Longitude", 
    y="Venue Latitude",
    figsize=(15*2,8*2),
    c="Reviews per Month", 
    cmap='Greens',
    colorbar=True, 
    alpha=0.4,
)
# use our map with its bounding coordinates
plt.imshow(tokyo_img, extent=[139.5384, 139.9916, 35.5702, 35.8161], alpha=0.5)  
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

plt.show()

## Venues

In [None]:
print('- There are {} relevant and unique venue categories in Tokyo.'.format(len(tokyo_venues['Venue Category'].unique())))
print('- There are {} relevant and different venues in Tokyo.'.format(tokyo_venues.shape[0]))

In [None]:
boxplot = tokyo_venues.groupby('Neighborhood').count().boxplot(column=['Venue'])
tokyo_venues.groupby(['Neighborhood'], sort=True)['Venue'].count().describe()

The venues database contains 1,451 relevant venues with 228 unique venue categories in Tokyo. With this collection of the most popular venues for each neighborhood, we have created a database that categorizes and reflects the attractiveness of each neighborhood. Looking at the data, it is noticeable that some neighborhoods have only a few popular venues.

The lower quartile of the box plot of Tokyo's venue data is 13 venues.
We only want to include areas that are attractive for tourists, so the analysis only considers neighborhoods at or above the lower quartile of the box plot. A quick Google search for the neighborhoods below the lower quartile confirmed this approach.

### Final Data Preparation of the Venue Data

Ensuring that only neighborhoods are taken into account that are relevant for tourists. 

In [None]:
tokyo_venues_grouped = tokyo_venues.groupby(['Neighborhood'], sort=False)['Venue'].count()
tokyo_venues_grouped = tokyo_venues_grouped[tokyo_venues_grouped >= 13]
tokyo_venues_grouped.plot.bar(figsize=(18,6))

In [None]:
tokyo_venues_prepared = tokyo_venues[tokyo_venues['Neighborhood'].isin(tokyo_venues_grouped.index.tolist())]
tokyo_venues_prepared.head()

Now we want to cluster all the neighborhoods in order to identify those clusters that should be interesting for tourism. For this we use the k-means clustering algorithm. After one-hot encoding the venue categories, grouping them by neighborhood and calculating the mean frequency of occurrence of each category, we get the data structure required for the k-means algorithm.

### One Hot Encoding

In [None]:
tokyo_venues_prepared_onehot = pd.get_dummies(tokyo_venues_prepared['Venue Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
tokyo_venues_prepared_onehot['Neighborhood'] = tokyo_venues_prepared['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = tokyo_venues_prepared_onehot.columns.tolist()
fixed_columns.insert(0, fixed_columns.pop(fixed_columns.index('Neighborhood')))
tokyo_venues_prepared_onehot = tokyo_venues_prepared_onehot.reindex(columns = fixed_columns)

print(tokyo_venues_prepared_onehot.shape)
tokyo_venues_prepared_onehot.head()

Let's group the data by neighborhood and calculate the mean value of the frequency of occurrence of each category.

In [None]:
tokyo_venues_prepared_onehot_grouped = tokyo_venues_prepared_onehot.groupby('Neighborhood').mean().reset_index()
print(tokyo_venues_prepared_onehot_grouped.shape)
tokyo_venues_prepared_onehot_grouped.head()
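
What the grouped mean represents is easiest to see on a toy table. The following minimal sketch uses placeholder neighborhoods and categories: after one-hot encoding, the mean of each category column per neighborhood equals the share of that neighborhood's venues that fall into the category.

```python
import pandas as pd

# Toy venue table with placeholder names
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Ramen', 'Ramen', 'Cafe', 'Sushi', 'Cafe', 'Cafe'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# mean per neighborhood = relative frequency of each category
profile = onehot.groupby('Neighborhood').mean()
print(profile)
```

Neighborhood `A` ends up with 0.5 for `Ramen` and 0.25 each for `Cafe` and `Sushi`, so every row sums to 1 regardless of how many venues a neighborhood has.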

The following function returns the top venues of each neighborhood.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Let's use the above function to obtain the 10 most common venues in each neighborhood and store it in the new pandas data frame neighborhoods_venues_sorted.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create column names according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = tokyo_venues_prepared_onehot_grouped['Neighborhood']

for ind in np.arange(tokyo_venues_prepared_onehot_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tokyo_venues_prepared_onehot_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Let's print each neighborhood along with its 10 most common venues and their frequencies to get a better understanding of the neighborhoods of Tokyo.

In [None]:
num_top_venues = 10

for hood in tokyo_venues_prepared_onehot_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = tokyo_venues_prepared_onehot_grouped[tokyo_venues_prepared_onehot_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

### Clustering the Neighborhoods with K-Means

In order to find out which neighborhood could suit an Airbnb according to the venues, the neighborhoods are grouped together. Each group should reflect what venues mainly define the neighborhoods in that group. For this, the k-means clustering algorithm is used.

Drop the 'Neighborhood' column, because k-means works on numeric features only.

In [None]:
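# use the columns keyword; the positional axis argument of drop() is no longer accepted by recent pandas versions
tokyo_venues_prepared_onehot_grouped_clustering = tokyo_venues_prepared_onehot_grouped.drop(columns='Neighborhood')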
tokyo_venues_prepared_onehot_grouped_clustering.head()

### What is the best K (Hyperparameter)?

#### Elbow Method

In [None]:
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(tokyo_venues_prepared_onehot_grouped_clustering)
    Sum_of_squared_distances.append(km.inertia_)

In [None]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method for Optimal k')
plt.show()

According to the elbow method the best k is 3.

#### Silhouette Score 

In [None]:
max_score = 10
scores = []

for kclusters in range(2, max_score):
    # Run k-means clustering
    kmeans = KMeans(n_clusters = kclusters, init = 'k-means++', random_state = 0).fit_predict(tokyo_venues_prepared_onehot_grouped_clustering)
    
    # Gets the silhouette score
    score = silhouette_score(tokyo_venues_prepared_onehot_grouped_clustering, kmeans)
    scores.append(score)

plt.figure(figsize=(20,10))
plt.plot(np.arange(2, max_score), scores, 'ro-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.xticks(np.arange(2, max_score))
plt.show()
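
Instead of reading the best value off the plot, the k with the highest silhouette score can be selected programmatically. The sketch below uses synthetic 2-D points around three well-separated centers, not the venue data, so it only illustrates the selection step. The data and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three compact, well-separated groups of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the maximum is often less pronounced, so it is worth comparing the result with the elbow plot before fixing k.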

### Run K-Means Clustering

In [None]:
# select best number of clusters
kclusters = 3

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tokyo_venues_prepared_onehot_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:38]

### Display the Clustering Results

Add the cluster labels to the neighborhoods_venues_sorted data frame.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

tokyo_venues_clustered = tokyo_venues_prepared[tokyo_venues_prepared.columns[0:3]].drop_duplicates()
tokyo_venues_clustered.reset_index(drop = True, inplace = True)

# merge tokyo_venues_clustered with neighborhoods_venues_sorted to add latitude/longitude for each neighborhood
tokyo_venues_clustered = tokyo_venues_clustered.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [None]:
tokyo_venues_clustered.head()

#### Cluster 1

In [None]:
tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 0, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]

In [None]:
cluster1 = tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 0, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]
# stack the ten "Most Common Venue" columns into one Series
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
venue_columns = [c for c in cluster1.columns if c.endswith('Most Common Venue')]
venues1 = pd.concat([cluster1[c] for c in venue_columns])

print(venues1.value_counts().head(10))

#### Cluster 2

In [None]:
tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 1, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]

In [None]:
cluster2 = tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 1, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]
# stack the ten "Most Common Venue" columns into one Series
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
venue_columns = [c for c in cluster2.columns if c.endswith('Most Common Venue')]
venues2 = pd.concat([cluster2[c] for c in venue_columns])

print(venues2.value_counts().head(10))

#### Cluster 3

In [None]:
tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 2, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]

In [None]:
cluster3 = tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 2, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]
# stack the ten "Most Common Venue" columns into one Series
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
venue_columns = [c for c in cluster3.columns if c.endswith('Most Common Venue')]
venues3 = pd.concat([cluster3[c] for c in venue_columns])

print(venues3.value_counts().head(10))

#### Display as Bar Chart

In [None]:
df_list = [venues1 ,venues2, venues3]
fig, axes = plt.subplots(3, 1)

for index, val in enumerate(df_list):
        ax = val.value_counts().head(10).plot.barh(ax = axes[index], width=0.5, figsize=(15,10))
        ax.invert_yaxis()
        axes[index].set_title('Cluster {}'.format(index+1))
        plt.sca(axes[index])
        plt.xticks(np.arange(0, 17))
        plt.xlabel('No. of Venues')

fig.tight_layout()

Given what tourists value and the number of popular venues, cluster 2 is clearly the one to favor. The third cluster can also have its advantages if competition is significantly lower and the surroundings are examined closely. The first cluster, however, seems to represent residential neighborhoods where the general population lives and is therefore not geared toward tourists.

## Airbnb and Venues<a name="results"></a>

Let’s connect the analysis for Airbnbs and the venues to get the relevant information. First, let’s summarize the results visually. For this we will combine the heatmap of the distribution of Airbnbs with the classification of the neighborhoods.

In [None]:
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tokyo_venues_clustered['Neighborhood Latitude'], tokyo_venues_clustered['Neighborhood Longitude'], tokyo_venues_clustered['Neighborhood'], tokyo_venues_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster + 1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(heatMap_Tokyo)
       
heatMap_Tokyo

The original interpretation of the clusters is confirmed by the visual representation, because Airbnbs are distributed accordingly.

There are almost no Airbnbs in the first cluster (red). Airbnbs that still belong to the first cluster are relatively cheap and have few average monthly reviews. There is almost no competition, but the owners try to attract customers with low prices. It can be stated that it is not desirable to open an Airbnb in the first cluster.

The third cluster (green) is mainly located between the city center and the first cluster. However, tourists want to explore the city and are attracted to the city center. The areas of the third cluster that are closer to the city center have a less attractive environment than the second cluster. As the price heatmap shows, the surrounding area is one of the main reasons why tourists choose an Airbnb. The focus should therefore be on uncovering market gaps in the second cluster (purple).

The following neighborhoods fall under the desirable cluster (cluster 2):

In [None]:
tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 1, tokyo_venues_clustered.columns[[0] + list(range(4, tokyo_venues_clustered.shape[1]))]]

Let’s visualize each neighborhood in the second cluster with its top 10 most common venues and their frequencies to get a better understanding of those neighborhoods.

In [None]:
targetNeighborhoods = []

for index, row in tokyo_venues_clustered.loc[tokyo_venues_clustered['Cluster Labels'] == 1, tokyo_venues_clustered.columns[[0]]].iterrows():
    targetNeighborhoods.append(row["Neighborhood"])

targetNeighborhoods

In [None]:
num_top_venues = 10
fig, axes = plt.subplots(len(targetNeighborhoods), 1)

index = 0
for hood in tokyo_venues_prepared_onehot_grouped['Neighborhood']:
    if hood in targetNeighborhoods:
        
        temp = tokyo_venues_prepared_onehot_grouped[tokyo_venues_prepared_onehot_grouped['Neighborhood'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        
        ax = temp.sort_values('freq', ascending=False).set_index('venue').head(num_top_venues).plot.barh(ax = axes[index], width=0.5, figsize=(15,10*6))
        ax.invert_yaxis()
        axes[index].set_title('Neighborhood {}'.format(hood))
        plt.sca(axes[index])
        plt.xticks([0.0, 0.1, 0.2, 0.3])
        plt.xlabel('Frequency of Venues')
        
        index += 1

fig.tight_layout()

From this selection, I would exclude less attractive neighborhoods for a new Airbnb based on the comparison of the most popular places in each neighborhood. In combination with the heatmaps, areas can be discovered where an opening of a new Airbnb could make sense.
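
Combining both principles can also be expressed as a table join. The following minimal sketch uses toy data with placeholder neighborhood names and index values, not results from this analysis. It keeps the neighborhoods of a target cluster and ranks them by competition index, lowest first.

```python
import pandas as pd

# Toy inputs standing in for the cluster labels and the competition index
clusters = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [1, 1, 0, 2],
})
competition = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Competition Index': [25.0, 5.0, 1.0, 10.0],
})

target_cluster = 1  # label of the cluster considered attractive for tourists
candidates = (
    clusters[clusters['Cluster Labels'] == target_cluster]
    .merge(competition, on='Neighborhood', how='left')
    .sort_values('Competition Index')   # least competition first
    .reset_index(drop=True)
)
print(candidates)
```

Such a ranking is only a starting point; the heatmaps are still needed to find promising spots within a neighborhood.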

# Results and Discussion <a name="resultsDiscussion"></a>

Although I have never been to Tokyo myself, I was able to get a detailed impression of the city through the data analysis. The analysis clearly shows that the distribution of Airbnbs is linked to the attractiveness of each neighborhood. This is consistent with Adam Smith's concept of the invisible hand and supports the plausibility of this data-driven analysis.

Although competition is concentrated in some neighborhoods, it appears that demand has not yet been fully satisfied. At the same time, there are some areas that indicate high occupancy (due to the average monthly reviews), but have little direct competition.

According to this analysis, Shinjuku, Shibuya, Taito, Asakusa, Shinagawa and Chiyoda are neighborhoods that are likely to contain locations where demand is high enough to support new Airbnbs for tourists.

# Conclusion <a name="conclusion"></a>

The purpose of this project was to identify neighborhoods in Tokyo that are suited for new Airbnbs. To achieve this goal, the competitive situation as well as the environment were analyzed with the help of publicly available data. My secondary research has shown that the results obtained are most likely correct. However, I do not recommend making a final decision based on this analysis alone. In most cases a meaningful analysis should consist of qualitative and quantitative data. Therefore, this analysis can be used as a filter to focus the qualitative research on the right areas.

Furthermore, it must be mentioned that with more accurate data on the occupancy of Airbnbs, levels of noise, proximity to major roads, real estate availability, prices, social and economic dynamics of every neighborhood etc., an even better result can be achieved.

A logical thought process, visualizations, the ability to interpret data and the willingness to critically question one’s own assumptions are essential for data analysis. Technology and mathematical understanding make it possible to put these skills to the best use.

It becomes clear that competitive advantages result from the insights or from the questions that arise when suitable data is analyzed. Therefore it is likely that the future belongs to organizations that collect data in large quantities and have the right people to analyze it.