# Tokyo analysis - Where to set up a new business?

## Introduction

### 1. Background and problem description:

Tokyo, officially the Tokyo Metropolis, is the capital of Japan and the most populous prefecture in the country. The prefecture, which includes the 23 special wards, holds around 13,960,236 people [[1]](https://en.wikipedia.org/wiki/Tokyo). With so many people concentrated in one area, its density reaches around 6,363 people per square kilometer [[2]](https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2021/01/28/01.html).

Densely populated areas tend to create highly diversified market demand for food and other catering services. This can easily become a double-edged sword. On the one hand, successful businesses can grow and expand at a faster pace. On the other hand, businesses face added pressure to keep up with world trends and to cater to new customer needs in order to outcompete their many rivals. New businesses also have an even harder time entering this already established ecosystem.

*Location, location, location.* **Where should one start?** 
* From a shop owner's perspective, a good starting point would be a densely populated area with, hopefully, lower land costs and, even more hopefully, less direct competition.
* From an investor's perspective, the same information could offer insight into a business's potential longevity and its challenges (competition-wise) in the short to mid term.

This project aims to provide a solution built around easy-to-read visualizations, so that both new shop owners and investors can quickly gain insight into viable new opportunities.

### 2. Data description:

**Data Plan:**

1) Initially, a ward of interest will be shortlisted based on population density and land price;

2) Next, the neighborhoods within that ward will be analysed in terms of common venues (indirect/direct competitors and potential synergies with other businesses);

3) Lastly, the land price for each neighborhood will be overlaid on a map together with the venue clustering information to further narrow down the potential areas for a new business.

The required information will be extracted from publicly available resources:
* Wards Density information; [[3]](https://en.wikipedia.org/wiki/Special_wards_of_Tokyo#List_of_special_wards)
* Tokyo Land market value list; [[4]](https://utinokati.com/en/details/land-market-value/area/Tokyo/)
* ZIP codes within Tokyo; [[5]](https://japan-postcode.810popo.net/tokyoto/)
* Location coordinates using Geocoder Python Package; [[6]](https://geocoder.readthedocs.io/)
* **Foursquare location** to extract respective borough information. [[7]](https://foursquare.com/)

## Methodology (case study - new bakery)

#### Required libraries to run the notebook:

In [None]:
import requests
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium
import geocoder
from geopy.geocoders import Nominatim

from bs4 import BeautifulSoup

### Step 1 - Web-scrape data for Tokyo's wards:
    1. Name & density
    2. Average land price (JPY per square meter)

In [None]:
# Ward name and density
url = 'https://en.wikipedia.org/wiki/Special_wards_of_Tokyo#List_of_special_wards'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'class':'wikitable sortable'})

ward_info = pd.read_html(str(table))[0]

ward_info.drop(ward_info.columns[[0,1,4,6,7]], axis = 1, inplace = True) # Remove extra columns from the original table

In [None]:
ward_info.head(5)

In [None]:
# Ward average price per land
url = 'https://utinokati.com/en/details/land-market-value/area/Tokyo/'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'region_overview'})

ward_price = pd.read_html(str(table))[0]

# Extra cleaning steps with the dataframe
ward_price.drop(ward_price.columns[[1,3]], axis = 1, inplace = True) # Remove added columns within the dataframe
ward_price.drop(ward_price.index[23:len(ward_price)], axis = 0, inplace = True) # Remove cities information
# regex = False: the unit and the thousands separator are literal text, not patterns
ward_price['Average Unit Price'] = ward_price['Average Unit Price'].str.replace('JPY/sq.m','', regex = False)
ward_price['Average Unit Price'] = ward_price['Average Unit Price'].str.replace(',','', regex = False)
ward_price = ward_price.rename(columns = {'Average Unit Price': 'Average Price(JPY/sq.m)'})

In [None]:
ward_price.head(5)
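
The string clean-up applied to the price column can be illustrated on its own. This is a minimal sketch in plain Python; `parse_price` is a hypothetical helper and the sample strings are made up, assuming the scraped values look like a number with thousands separators followed by the `JPY/sq.m` unit.

```python
def parse_price(raw: str) -> int:
    """Strip the unit and the thousands separators, then convert to an integer."""
    cleaned = raw.replace('JPY/sq.m', '').replace(',', '').strip()
    return int(cleaned)

# Made-up sample values, not taken from the scraped table
sample = ['1,234,567JPY/sq.m', '890,000 JPY/sq.m']
prices = [parse_price(s) for s in sample]
print(prices)
```

Converting to a number at cleaning time means a malformed value fails early, instead of surfacing later when the column is plotted.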

### Step 2 - Choose the most promising ward:

In [None]:
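# Join both dataframes into one
# Note: pd.concat with axis = 1 pairs rows by index, so this assumes both tables list the wards in the same order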
wards_df = pd.concat([ward_info, ward_price], axis = 1)
wards_df.drop(['Area'], axis = 1, inplace = True)
wards_df.head(5)
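
Joining by position is only safe when both tables list the wards in the same order. A key-based join avoids that dependency. The sketch below assumes the ward name is stored under `Name` in one table and under `Area` in the other (an assumption worth checking against the real column names); the ward labels and numbers are made up.

```python
import pandas as pd

# Made-up values, deliberately listed in a different order in each table
info = pd.DataFrame({'Name': ['A-ward', 'B-ward', 'C-ward'],
                     'Density(/km2)': [20000, 15000, 10000]})
price = pd.DataFrame({'Area': ['C-ward', 'A-ward', 'B-ward'],
                      'Average Price(JPY/sq.m)': [300000, 900000, 600000]})

# Match rows on the ward name; validate raises if a name appears more than once
merged = info.merge(price, left_on='Name', right_on='Area',
                    how='inner', validate='one_to_one')
merged = merged.drop(columns='Area')
print(merged)
```

With `how='inner'`, a ward whose name is spelled differently in the two sources silently drops out, so comparing `len(merged)` against the expected 23 wards is a cheap sanity check.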

In [None]:
# Check that every column has the correct type of data structure
wards_df.dtypes

In [None]:
# Change Average Price column to a numeric type
wards_df['Average Price(JPY/sq.m)'] = pd.to_numeric(wards_df['Average Price(JPY/sq.m)'])

In [None]:
fig = px.scatter(data_frame = wards_df,
                 x = 'Density(/km2)',
                 y = 'Average Price(JPY/sq.m)',
                 color = 'Name')

fig.update_traces(marker = dict(size = 15))
fig.show()

Based on this scatter plot, Toshima ward has the highest density, while Chiyoda has the lowest density but is by far the most expensive in terms of land value. Let's quickly check the density-to-price ratio:

In [None]:
wards_df['Density/Price ratio'] = wards_df['Density(/km2)']/wards_df['Average Price(JPY/sq.m)']
potential_wards = wards_df.sort_values(by = 'Density/Price ratio', ascending = False).head(10)

In [None]:
fig = px.bar(data_frame = potential_wards,
             x = 'Name',
             y = 'Density/Price ratio',
             labels = {'Name':'Ward names'})

fig.update_layout(title_text = 'Top 10 wards with the highest density per land price ratio', title_x = 0.5)

fig.show()
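
As a worked example of the ranking logic, the ratio is simply density divided by land price, so a ward scores well when many people live there relative to what the land costs. The three wards and their numbers below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical wards: (density in people/km2, average land price in JPY/sq.m)
wards = {
    'A-ward': (20000, 900000),
    'B-ward': (15000, 600000),
    'C-ward': (10000, 300000),
}

# Density per unit of land price: higher means more potential customers per yen of land
ratios = {name: density / price for name, (density, price) in wards.items()}

# Rank wards from the most to the least attractive ratio
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ranking)
```

Note that the densest ward in this toy example ranks last, because its land price grows faster than its density.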

### Step 3 - Extract information from Arakawa ward:
    1. Obtain neighborhood's names and postal codes
    2. Extract each neighborhood's coordinates
    3. Acquire the land price for each neighborhood

In [None]:
# Neighborhood's names and Postal codes
neighborhood_pc = pd.read_csv('Arakawa_ward_zipCodes.txt', sep = ',', names = ['Names', 'Postal Code'])
neighborhood_pc.head(7)

In [None]:
# Coordinates for each neighborhood using Geocoder
# Unfortunately, I could not extract them with this library: the lookups kept returning None

max_tries = 5
latitudes, longitudes = [], []

for k in range(len(neighborhood_pc)):
    lat_lng_coords = None
    tries = 0
    
    # Since Geocoder can sometimes fail, retry a limited number of times instead of looping forever
    while lat_lng_coords is None and tries < max_tries:
        g = geocoder.google('{}, {}, Tokyo'.format(neighborhood_pc['Postal Code'][k], neighborhood_pc['Names'][k]))
        lat_lng_coords = g.latlng
        tries += 1
        
    print(lat_lng_coords)
    # Collect into lists (borough_df is only created further down); None marks a failed lookup
    latitudes.append(lat_lng_coords[0] if lat_lng_coords else None)
    longitudes.append(lat_lng_coords[1] if lat_lng_coords else None)
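
Retrying a flaky lookup is easier to reason about when the retry logic lives in a small helper with an explicit limit. The sketch below is not tied to any geocoding library: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the query string and coordinates are made up.

```python
import time

def geocode_with_retry(lookup, query, max_tries=3, delay=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # be polite to the remote service between attempts
    return None  # give up instead of looping forever

# Stand-in for a real geocoder: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return [35.73, 139.78] if calls['n'] >= 3 else None

result = geocode_with_retry(flaky_lookup, '000-0000, Example, Tokyo')
never = geocode_with_retry(lambda q: None, 'nowhere', max_tries=2)
print(result, never)
```

Returning `None` after the last attempt lets the caller decide what to do with a failed lookup, for example falling back to manually collected coordinates.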

##### Neighborhood coordinates (extracted using Google Maps; search query = neighborhood name + postal code)

In [None]:
neighborhood_coord = pd.read_csv('Arakawa_coordinates.txt', sep = ',', names = ['Name', 'Latitude', 'Longitude'])
neighborhood_coord.head(7)

In [None]:
# Acquire average land price for each neighborhood
# I saved the table as .txt file since I needed to translate each neighborhood name from Japanese to English
neighborhood_price = pd.read_csv('Arakawa_price.txt', sep = ',', names = ['Name', 'Average Price(JPY/sq.m)'])
neighborhood_price.head(7)

In [None]:
# Combine all data into one dataframe for the respective borough = Arakawa
borough_df = pd.concat([neighborhood_pc, neighborhood_price, neighborhood_coord], axis = 1)
borough_df.drop(['Name'], axis = 1, inplace = True)
borough_df.head(7)

In [None]:
# Obtain the coordinates for Arakawa ward to plot all neighborhoods on a world map
geolocator = Nominatim(user_agent = 'on_explorer')
location = geolocator.geocode('Arakawa, JP')
latitude = location.latitude
longitude = location.longitude
print('Coordinates for Arakawa ward are the following: {} / {}'.format(latitude, longitude))

In [None]:
map_arakawa = folium.Map(location = [latitude, longitude], zoom_start = 14)

for lat, lng, label in zip(borough_df['Latitude'], borough_df['Longitude'], borough_df['Names']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 8,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_arakawa)
    
map_arakawa

### Step 4 - Extract venue information for each neighborhood:

**Initiate Foursquare to obtain nearby venues**

In [None]:
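import os

# Credentials are read from environment variables instead of being hard-coded in the notebook
# (the variable names below are placeholders; use whichever names you export)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20200615'
LIMIT = 200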

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius = 500):
    """
    names - neighborhood names
    latitudes/longitudes - for each respective neighborhood
    radius - search radius using Foursquare API
    """
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lng,
        radius,
        LIMIT)
        
        # create the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                            'Neighborhood Latitude',
                            'Neighborhood Longitude',
                            'Venue',
                            'Venue Latitude',
                            'Venue Longitude',
                            'Venue Category']
    
    return nearby_venues
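
The parsing step inside `getNearbyVenues` indexes straight into the JSON response, so an empty `groups` list or a venue without categories would raise an error. A more defensive version of that step could look like the sketch below. `extract_venues` is a hypothetical helper, the payload is a made-up stand-in that mirrors only the keys used above, and `'Unknown'` is an assumed placeholder label.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an explore-style response into one tuple per venue."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up payload with one categorised and one uncategorised venue
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Shop 1', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Shop 2', 'location': {'lat': 1.1, 'lng': 2.1},
               'categories': []}},
]}]}}

rows = extract_venues(payload, 'Example', 1.05, 2.05)
empty = extract_venues({'response': {'groups': []}}, 'Example', 1.05, 2.05)
print(rows, empty)
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without network access or API credentials.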

In [None]:
# Obtain all venues (limited to a maximum of 200) for each neighborhood within Arakawa ward
arakawa_venues = getNearbyVenues(names = borough_df['Names'], 
                                  latitudes = borough_df['Latitude'],
                                  longitudes = borough_df['Longitude'])

In [None]:
print(arakawa_venues.shape)
print('There are {} unique categories'.format(arakawa_venues['Venue Category'].nunique()))
arakawa_venues.head(7)

In [None]:
# Expand more on the unique venue categories found within Arakawa
arakawa_venues['Venue Category'].unique()

In [None]:
# One-hot encode the different types of venues for each neighborhood (to help with clustering later on...)
arakawa_onehot = pd.get_dummies(arakawa_venues[['Venue Category']], prefix = "", prefix_sep = "")
arakawa_onehot['Neighborhood'] = arakawa_venues['Neighborhood']

fixed_columns = [arakawa_onehot.columns[-1]] + list(arakawa_onehot.columns[:-1])
arakawa_onehot = arakawa_onehot[fixed_columns]
arakawa_onehot.head(7)

##### Understand the top 5 venues that exist for each neighborhood

In [None]:
arakawa_grouped = arakawa_onehot.groupby('Neighborhood').mean().reset_index()
arakawa_grouped
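
To see why the mean of the one-hot columns is useful, consider a toy example: after grouping, each cell holds the share of a neighborhood's venues that fall into that category. The neighborhoods and venues below are made up.

```python
import pandas as pd

# Made-up venues for two neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Venue Category': ['Cafe', 'Cafe', 'Bakery', 'Park', 'Cafe', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the fraction of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Because each row sums to 1, neighborhoods with very different venue counts become comparable, which is what the clustering step relies on.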

In [None]:
# Check the ratio for the top 5 venues for each neighborhood
num_top_venues = 5

for hood in arakawa_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = arakawa_grouped[arakawa_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop = True).head(num_top_venues))
    print('\n')

###### Examine the 10 most common venues for each neighborhood

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create new columns based on the number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        # Ranks beyond the third fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind + 1))

neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Neighborhood'] = arakawa_grouped['Neighborhood']

for ind in np.arange(arakawa_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(arakawa_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(7)
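
The column labels can also be built with a small ordinal helper instead of relying on an exception for control flow. This is a sketch only; `ordinal` is a hypothetical function name.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 11 <= n % 100 <= 13:
        suffix = 'th'  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```

Unlike the three-element suffix list, this version stays correct if `num_top_venues` is ever raised past 20 (for example, `21st`).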

### Step 5 - Cluster neighborhoods based on their venue information:

##### Perform the elbow method to choose the right cluster number

In [None]:
sum_of_squared_distances = []
kclusters = range(1,8)

# Recent pandas versions require axis to be passed as a keyword argument
arakawa_grouped_clustering = arakawa_grouped.drop('Neighborhood', axis = 1)

for k in kclusters:
    kmeans = KMeans(n_clusters = k, random_state = 0).fit(arakawa_grouped_clustering)
    sum_of_squared_distances.append(kmeans.inertia_)
    
plt.plot(kclusters, sum_of_squared_distances, 'bx-')
plt.xlabel('Cluster number')
plt.ylabel('Sum of squared distances')
plt.title('Elbow method for finding the optimal number of clusters')
plt.show()

The elbow plot is somewhat inconclusive, but k = 3 looks like a reasonable choice.
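
When the elbow is hard to read by eye, one simple heuristic is to pick the k where the curve bends the most, that is, where the second difference of the inertia values is largest. This is a swapped-in heuristic, not the method used in this notebook, and the inertia values below are made up for illustration.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve has its sharpest bend."""
    best_k, best_bend = None, float('-inf')
    # The first and last points have no neighbours on both sides, so skip them
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = list(range(1, 8))
inertias = [10.0, 6.0, 3.0, 2.6, 2.3, 2.1, 2.0]  # hypothetical values
suggested_k = elbow_by_second_difference(ks, inertias)
print(suggested_k)
```

The heuristic is sensitive to noise in the curve, so it is best treated as a second opinion alongside the plot rather than as a replacement for it.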

In [None]:
n_cluster = 3
kmeans = KMeans(n_clusters = n_cluster, random_state = 0).fit(arakawa_grouped_clustering)

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
arakawa_merged = borough_df

arakawa_merged = arakawa_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on = 'Names')
arakawa_merged.head(7)

##### Attempt to overlay price values on the choropleth map (not working as intended)

In [None]:
# Make sure that the names in the json file match the names in the dataframe for the choropleth map

import json

with open('arakawa_map.json') as f:
    idjson = json.load(f)

known_names = set(borough_df['Names'])
for x in idjson['features']:
    name = x['properties']['name']
    # Compare the values themselves (comparing types only shows that both are strings)
    print(name, '-> name match' if name in known_names else '-> NO match')

In [None]:
neighborhood_geo = 'arakawa_map.geojson'

arakawa_map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 14)

folium.Choropleth(
    geo_data = neighborhood_geo,
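    data = borough_df,  # the keyword is 'data'
    columns = ['Names','Average Price(JPY/sq.m)'],
    key_on = 'feature.properties.name',  # singular 'feature' refers to each feature in the GeoJSON
    fill_color = 'YlGn',  # palette name uses a lowercase L, not the digit 1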
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = 'Average price per land',
    highlight = True
).add_to(arakawa_map_clusters)

# set color scheme for the clusters
x = np.arange(n_cluster)
ys = [i + x + (i*x)**2 for i in range(n_cluster)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(arakawa_merged['Latitude'], arakawa_merged['Longitude'], arakawa_merged['Names'], arakawa_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[int(cluster)-1],
        fill = True,
        fill_color = rainbow[int(cluster)-1],
        fill_opacity = 0.7).add_to(arakawa_map_clusters)
     
folium.LayerControl().add_to(arakawa_map_clusters)
        
arakawa_map_clusters

### Step 6 - Choose the neighborhood with the most potential:
    1. Obtain information regarding competitors
    2. Use the price per land value as a filter

In [None]:
# Based on the unique category venues, the following could be direct competitors: Bakery, Pastry Shop, Sandwich Place
is_competitor = arakawa_venues['Venue Category'].str.contains('Bakery|Pastry Shop|Sandwich Place')
n_competitors = is_competitor.groupby(arakawa_venues['Neighborhood']).sum()

# Match counts to rows by neighborhood name rather than by position (groupby sorts its keys)
arakawa_merged['Number of competitors'] = arakawa_merged['Names'].map(n_competitors).fillna(0).astype(int)
arakawa_merged.head(7)

In [None]:
# Visualize both the number of competitors as well as price per land for each neighborhood
fig = make_subplots(rows = 1, cols = 2, subplot_titles = ('Similar businesses estimation', 'Average price per land'))

fig.add_trace(go.Bar(x = arakawa_merged['Names'], y = arakawa_merged['Number of competitors']), row = 1, col = 1)
fig.add_trace(go.Bar(x = arakawa_merged['Names'], y = arakawa_merged['Average Price(JPY/sq.m)']), row = 1, col = 2)

fig.update_yaxes(title_text = 'Number of direct competitors', row = 1, col = 1)
fig.update_yaxes(title_text = 'Price value JPY/sqm', row = 1, col = 2)

fig.update_layout(height = 400, 
                  width = 1000,
                  showlegend = False)
fig.show()

* Exclusion criteria:
    1. Higashinippori is excluded due to its high land price
    2. Higashiogu, Nishinippori and Nishiogu are also excluded because similar businesses are already established there
    
The potential neighborhoods for a new bakery would therefore be Arakawa, Machiya or Minamisenju, since they have very similar land prices and no established direct competition yet.
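
The filtering logic described above can be sketched in code. The neighborhood labels, competitor counts, prices and the price cap below are all hypothetical placeholders, not values from this analysis; `shortlist` is an illustrative helper.

```python
def shortlist(rows, max_competitors=0, price_cap=None):
    """Keep neighborhoods with few competitors and, optionally, a land price under price_cap."""
    kept = []
    for row in rows:
        if row['competitors'] > max_competitors:
            continue  # similar businesses are already established
        if price_cap is not None and row['price'] > price_cap:
            continue  # land is too expensive
        kept.append(row['name'])
    return kept

# Hypothetical neighborhoods: price in JPY/sq.m
rows = [
    {'name': 'N1', 'competitors': 0, 'price': 450000},
    {'name': 'N2', 'competitors': 2, 'price': 430000},
    {'name': 'N3', 'competitors': 0, 'price': 900000},
    {'name': 'N4', 'competitors': 0, 'price': 470000},
]

candidates = shortlist(rows, price_cap=600000)
print(candidates)
```

Making the two thresholds explicit parameters is one way to re-run the same selection for other business types with different priorities.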

## Discussion

As introduced before, Tokyo is a city with enormous potential for new businesses, but it is a high-risk, high-reward situation. When starting a new business in such a competitive, established market, location and access to target customers are key to the longevity of the business.

Through this analysis, I have used public data to narrow the options down to the neighborhood level. I am aware that a few assumptions were made for the chosen "bakery" case study, and these would need to be adapted for other businesses with different priorities.

Further research could include data on the age distribution of the population within each neighborhood, as well as commuting numbers in and out of each neighborhood. Age data matters because some services have a target age group, which was not accounted for in this analysis. Commuting data matters because some services may want to establish themselves where a high number of people converge during the working week. Restaurants, for example, may want to focus on workers looking for a meal at lunchtime, so areas with a high concentration of companies and employees could be attractive in this situation.

## Conclusion

In such competitive ecosystems, it is difficult for a newcomer to know where to start a new business. It can also be challenging for investors to gauge how well a new business could perform in the short to mid term, given the multitude of variables involved. This analysis is by no means exhaustive, but it can hopefully shed some light on factors that could help in the decision-making or evaluation process.

## References

[1] Tokyo Information -> https://en.wikipedia.org/wiki/Tokyo \
[2] Tokyo density numbers -> https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2021/01/28/01.html \
[3] Tokyo Ward list -> https://en.wikipedia.org/wiki/Special_wards_of_Tokyo#List_of_special_wards \
[4] Market value for land in Tokyo -> https://utinokati.com/en/details/land-market-value/area/Tokyo/ \
[5] Japan Postcode finder -> https://japan-postcode.810popo.net/tokyoto/ \
[6] Geocoder -> https://geocoder.readthedocs.io/ \
[7] Foursquare -> https://foursquare.com/ \
[8] Google Maps -> https://www.google.com/maps (To obtain coordinates for each neighborhood as an alternative) \
[9] Geojson file creator -> https://geojson.io/#map=20/35.75063/139.74093 (To create my own Choropleth) 

[1-5] - Accessed on 13/06/2021.