# Introduction

## Background
Market research is the best starting point for any successful restaurant opening.

In the past, market research before starting a business was carried out in traditional ways, such as visiting local communities or interviewing potential customers. Today, machine learning makes this research considerably easier.

Data analysis and machine learning techniques play a very important role in helping business owners capture market share and become profitable by opening a restaurant in a smart, data-driven way.

## Business Problem

A client seeks to establish a Chinese restaurant in Tokyo, Japan. 

This capstone report will determine the optimal and most strategic location for the restaurant by exploring the following questions:

* **Is the market saturated?** 
* **Who are the potential customers?**
* **Who are the local competitors?**
* **What are the local demographic and economic features?**


# Data Acquisition and Wrangling

## Data Sources

### Tokyo Wards Table from Wikipedia

#### I first use the [Wikipedia - Special Wards of Tokyo](https://en.wikipedia.org/wiki/Special_wards_of_Tokyo) page, scraping its table into a DataFrame: pandas.read_html reads every table on the page, and the target table is then extracted.

In [3]:
import lxml
import pandas as pd

# Set up dataframe display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)


# Read all tables from Wiki Special wards of Tokyo Page into Dataframes. 
tables = pd.read_html('https://en.wikipedia.org/wiki/Special_wards_of_Tokyo')
# Find the table of Tokyo special wards
df = tables[3]

# Rename all columns
df.columns = ['No.', 'Flag', 'Name', 'Kanji', 'Population', 'Density', 'Area', 'Major district']

# Create dataframe of Tokyo special wards without Flag and Kanji
# (.copy() makes it an independent frame, so later column assignments are safe)
df_tokyo_wards = df[['Name', 'Population', 'Density', 'Area', 'Major district']].copy()
df_tokyo_wards.head()

Unnamed: 0,Name,Population,Density,Area,Major district
0,Chiyoda,59441,5100,11.66,"Nagatachō, Kasumigaseki, Ōtemachi, Marunouchi, Akihabara, Yūrakuchō, Iidabashi, Kanda"
1,Chūō,147620,14460,10.21,"Nihonbashi, Kayabachō, Ginza, Tsukiji, Hatchōbori, Shinkawa, Tsukishima, Kachidoki, Tsukuda"
2,Minato,248071,12180,20.37,"Odaiba, Shinbashi, Hamamatsuchō, Mita, Roppongi, Toranomon, Aoyama, Azabu"
3,Shinjuku,339211,18620,18.22,"Shinjuku, Takadanobaba, Ōkubo, Kagurazaka, Ichigaya, Yotsuya"
4,Bunkyō,223389,19790,11.29,"Hongō, Yayoi, Hakusan"
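
Selecting the table by position (`tables[3]`) depends on the page layout at the time of scraping. Below is a minimal sketch of one alternative, assuming the target table can be recognized by keywords in its column labels. The helper `pick_table` and the keyword list are hypothetical, not part of the original notebook.

```python
def pick_table(tables, required_keywords):
    """Return the first table whose column labels contain every keyword.

    `tables` is any list of objects with a `.columns` attribute,
    such as the list of DataFrames returned by pandas.read_html.
    Matching is a case-insensitive substring test.
    """
    for table in tables:
        labels = [str(col).lower() for col in table.columns]
        if all(any(kw.lower() in label for label in labels)
               for kw in required_keywords):
            return table
    raise ValueError(f"no table has columns matching {required_keywords}")


# Hypothetical usage with the notebook's variables (keywords are an assumption):
# df = pick_table(tables, ["name", "population", "density"])
```

If the page gains or loses a table, this lookup either still finds the right one or fails with an explicit error instead of silently returning the wrong table.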


#### As we can see, each ward lists several major districts. Here I keep only the first one as the ward's major district.


In [2]:
# Keep the first district of the list as the ward's major district
df_tokyo_wards['Major district'] = df_tokyo_wards['Major district'].str.split(',', expand=True)[0]

In [3]:
# Replace macron vowels with plain English letters
# (assign the result back instead of calling replace(inplace=True) on a selected column)
macron_map = {'ū': 'u', 'ō': 'o', 'Ō': 'O'}
for col in ['Name', 'Major district']:
    df_tokyo_wards[col] = df_tokyo_wards[col].replace(macron_map, regex=True)
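
The replacement above covers only the three macron vowels that appear in this table. A more general sketch, using Unicode decomposition from the standard library, is shown below. The function name `strip_diacritics` is hypothetical, and this is an alternative technique rather than what the notebook does.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (such as macrons) from a string."""
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # keep only the characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical usage with the notebook's dataframe:
# df_tokyo_wards['Name'] = df_tokyo_wards['Name'].map(strip_diacritics)
```

This also handles vowels such as ā or ē, should they appear in other ward or district names.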

In [26]:
df_tokyo_wards.dropna(inplace=True)
df_tokyo_wards.tail(10)

Unnamed: 0,Name,Population,Density,Area,Major district
13,Nakano,332902,21350,15.59,Nakano
14,Suginami,570483,16750,34.06,Koenji
15,Toshima,294673,22650,13.01,Ikebukuro
16,Kita,345063,16740,20.61,Akabane
17,Arakawa,213648,21030,10.16,Arakawa
18,Itabashi,569225,17670,32.22,Itabashi
19,Nerima,726748,15120,48.08,Nerima
20,Adachi,674067,12660,53.25,Ayase
21,Katsushika,447140,12850,34.8,Tateishi
22,Edogawa,685899,13750,49.9,Kasai


### Getting the Longitudes and Latitudes of Major Districts from **Geopy Client**

#### The next objective is to get the coordinates of these 23 major districts using the Nominatim geocoder from geopy.

In [6]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='Tokyo_explorer')
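# Geocode each major district; keep None when the geocoder finds no match
df_tokyo_wards['Major_district_geo'] = df_tokyo_wards['Major district'].apply(geolocator.geocode).apply(
    lambda x: (x.latitude, x.longitude) if x is not None else None)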
df_tokyo_geo = df_tokyo_wards.copy()
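
Geocoding a bare district name can match a place with the same name elsewhere. A minimal sketch of a more specific query follows. It assumes that appending the ward and "Tokyo, Japan" to the query helps the geocoder disambiguate, which is not guaranteed. The helper `geocode_district` is hypothetical and accepts any geocoding callable, for example `geolocator.geocode`.

```python
def geocode_district(geocode, district, ward, suffix="Tokyo, Japan"):
    """Return (latitude, longitude) for a district, or None if nothing is found.

    `geocode` is any callable that takes a query string and returns either
    None or an object with `.latitude` and `.longitude` attributes.
    """
    # qualify the query with the ward and city to reduce ambiguous matches
    location = geocode(f"{district}, {ward}, {suffix}")
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Hypothetical usage with the notebook's variables:
# coords = [geocode_district(geolocator.geocode, d, w)
#           for d, w in zip(df_tokyo_wards['Major district'], df_tokyo_wards['Name'])]
```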

In [62]:
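import ast

# Drop the rows with NaN values
df_tokyo_geo.dropna(inplace=True)

# convert coordinates stored as strings (e.g. after a CSV round trip) back into tuples;
# ast.literal_eval is a safer replacement for eval, and existing tuples are left untouched
df_tokyo_geo['Major_district_geo'] = df_tokyo_geo['Major_district_geo'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# flatten the coordinate tuples into latitude and longitude columns
df_tokyo_geo[['lat', 'long']] = pd.DataFrame(
    df_tokyo_geo['Major_district_geo'].tolist(), index=df_tokyo_geo.index)

# drop the coordinate tuples
df_tokyo_geo.drop('Major_district_geo', axis=1, inplace=True)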

df_tokyo_geo

Unnamed: 0,Name,Population,Density,Area,Major district,lat,long
0,Chiyoda,59441,5100,11.66,Nagatacho,35.675618,139.743469
1,Chuo,147620,14460,10.21,Nihonbashi,35.684068,139.774503
2,Minato,248071,12180,20.37,Odaiba,35.61905,139.779364
3,Shinjuku,339211,18620,18.22,Shinjuku,35.693763,139.703632
4,Bunkyo,223389,19790,11.29,Hongo,35.175376,137.013476
5,Taito,200486,19830,10.11,Ueno,35.711759,139.777645
6,Sumida,260358,18910,13.77,Kinshicho,35.696312,139.815043
7,Koto,502579,12510,40.16,Kiba,23.013134,-80.832875
8,Shinagawa,392492,17180,22.84,Shinagawa,35.599252,139.73891
9,Meguro,280283,19110,14.67,Meguro,35.62125,139.688014


#### Some coordinates are not correct, so the data needs to be updated.

In [68]:
# Update lat, long for Kiba
df_tokyo_geo.iloc[7, -2] = 35.672200
df_tokyo_geo.iloc[7, -1] = 138.806100

# Update lat, long for Tateishi
df_tokyo_geo.iloc[21, -2] = 34.176335
df_tokyo_geo.iloc[21, -1] = 132.226020

# Update lat, long for Kasai
df_tokyo_geo.iloc[22, -2] = 35.663400
df_tokyo_geo.iloc[22, -1] = 139.873100

df_tokyo_geo

Unnamed: 0,Name,Population,Density,Area,Major district,lat,long
0,Chiyoda,59441,5100,11.66,Nagatacho,35.675618,139.743469
1,Chuo,147620,14460,10.21,Nihonbashi,35.684068,139.774503
2,Minato,248071,12180,20.37,Odaiba,35.61905,139.779364
3,Shinjuku,339211,18620,18.22,Shinjuku,35.693763,139.703632
4,Bunkyo,223389,19790,11.29,Hongo,35.175376,137.013476
5,Taito,200486,19830,10.11,Ueno,35.711759,139.777645
6,Sumida,260358,18910,13.77,Kinshicho,35.696312,139.815043
7,Koto,502579,12510,40.16,Kiba,35.6722,138.8061
8,Shinagawa,392492,17180,22.84,Shinagawa,35.599252,139.73891
9,Meguro,280283,19110,14.67,Meguro,35.62125,139.688014
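
Coordinates like these can also be checked programmatically rather than by eye. The sketch below flags rows that fall outside a rough bounding box around the 23 special wards. The box limits are an illustrative assumption, not an official boundary, and `in_tokyo_box` and `suspicious_rows` are hypothetical helper names.

```python
# Rough bounding box around Tokyo's special wards (assumed values, for illustration only)
TOKYO_LAT_RANGE = (35.5, 35.9)
TOKYO_LON_RANGE = (139.5, 140.0)


def in_tokyo_box(lat, lon):
    """Return True when the point lies inside the assumed bounding box."""
    return (TOKYO_LAT_RANGE[0] <= lat <= TOKYO_LAT_RANGE[1]
            and TOKYO_LON_RANGE[0] <= lon <= TOKYO_LON_RANGE[1])


def suspicious_rows(rows):
    """Return the names of (name, lat, lon) rows outside the bounding box."""
    return [name for name, lat, lon in rows if not in_tokyo_box(lat, lon)]


# Sample rows taken from the geocoding output above
rows = [
    ("Nagatacho", 35.675618, 139.743469),
    ("Hongo", 35.175376, 137.013476),
    ("Odaiba", 35.61905, 139.779364),
]
flagged = suspicious_rows(rows)  # names of rows that need a manual look
```

With the notebook's dataframe, the same check could be run over `zip(df_tokyo_geo['Major district'], df_tokyo_geo['lat'], df_tokyo_geo['long'])`.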


### Location Data by Foursquare API 
#### To return the popular spots in the vicinity of each neighbourhood, the Foursquare API is used, specifically its explore endpoint.

In [81]:
import requests

# define Foursquare credentials (placeholders; substitute your own keys)
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20180605'

# request definitions
LIMIT = 500
radius = 1000  # not passed to get_nearby_venues below, which uses its default of 500

# function to get a batch of venues as a dataframe
def get_nearby_venues(names, latitudes, longitudes, radius=500):
    """Query the Foursquare explore endpoint around each (lat, lng) pair."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and fail loudly on HTTP errors
        response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues
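
The function above indexes straight into the response and would raise an error if a venue has no category or the response has no groups. A minimal defensive sketch of the parsing step is shown below. It assumes the payload has the same nested shape as the one accessed above, and `extract_venues` is a hypothetical helper, not part of the Foursquare API.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples.

    Missing fields become None instead of raising KeyError or IndexError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # use the first category when present, otherwise None
        category = categories[0].get("name") if categories else None
        rows.append((venue.get("name"),
                     location.get("lat"),
                     location.get("lng"),
                     category))
    return rows
```

Rows with a `None` category can then be dropped or inspected later, rather than stopping the whole batch on the first incomplete venue.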

In [82]:
tokyo_venues = get_nearby_venues(
                        names=df_tokyo_geo['Major district'].iloc[:],
                        latitudes=df_tokyo_geo['lat'].iloc[:],
                        longitudes=df_tokyo_geo['long'].iloc[:]
                    )
tokyo_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Nagatacho,35.675618,139.743469,Nagatacho Kurosawa (永田町 黒澤),35.674699,139.741737,Japanese Restaurant
1,Nagatacho,35.675618,139.743469,The Capitol Hotel Tokyu (ザ・キャピトルホテル東急),35.673927,139.741019,Hotel
2,Nagatacho,35.675618,139.743469,Tully's Coffee,35.674594,139.743007,Coffee Shop
3,Nagatacho,35.675618,139.743469,Shinamen Hashigo (支那麺 はしご),35.672184,139.741576,Ramen Restaurant
4,Nagatacho,35.675618,139.743469,All Day Dining Origami (オールデイダイニング ORIGAMI),35.673815,139.741104,Restaurant


In [84]:
tokyo_venues.shape

(1300, 7)