## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results and Discussion](#Results)
* [Conclusion](#Conclusion)

# Introduction: Business Problem <a name="introduction"></a>

<p><u><i>Goal</i></u></p>
<p>Singapore is a vibrant and diverse country, divided into regions that each have their own specialties. The business problem is to decide which region is the best place to open a new restaurant. We will use data science methods to identify that region.</p>

# Data  <a name="Data"></a>

Based on the business problem, the following data will help us decide where to open a restaurant:
1. List of the different regions within Singapore
1. Latitude and longitude of the different regions
1. Venue data from Foursquare around the different regions

<p><u><i>Data Sources</i></u></p>
<p>Data from multiple sources will be used to identify clusters of restaurants in the different regions of Singapore.</p>
<p>1. <a href="https://en.wikipedia.org/wiki/Postal_codes_in_Singapore"> Wikipedia</a>: Different regions in Singapore</p>
<p>2. <a href="https://docs.onemap.sg/#onemap-rest-apis/"> OneMap SG</a>: API to obtain the latitude and longitude of each Singapore region</p>
<p>3. <a href="https://foursquare.com/discoversing/"> Foursquare SG</a>: Venue data used to identify clusters and determine the ideal location to open a restaurant</p>

## Step 1: Scrape postal districts and regions from Wikipedia

In [1]:
import pandas as pd
import numpy as np

# Read the wikitables on the Wikipedia page
page = 'https://en.wikipedia.org/wiki/Postal_codes_in_Singapore'
wikitable = pd.read_html(page, index_col = 0, attrs={"class":"wikitable"})

# Write the first table into a dataframe
sg_post = wikitable[0]
sg_post.reset_index(inplace=True)

sg_post2 = sg_post.rename(columns={'Postal district': 'postal_district', 'Postal sector(1st 2 digits of 6-digit postal codes)': 'postal_sector', 'General location': 'region'})
sg_post2['seq'] = sg_post2.index

# Select columns with lists (recent pandas versions reject a set as an indexer)
postal_sector = sg_post2[['seq', 'postal_sector']]
postal_location = sg_post2[['seq', 'region']]
postal_district = sg_post2[['seq', 'postal_district']]

sg_post2.head()


Unnamed: 0,postal_district,postal_sector,region,seq
0,1,"01, 02, 03, 04, 05, 06","Raffles Place, Cecil, Marina, People's Park",0
1,2,"07, 08","Anson, Tanjong Pagar",1
2,3,"14, 15, 16","Bukit Merah, Queenstown, Tiong Bahru",2
3,4,"09, 10","Telok Blangah, Harbourfront",3
4,5,"11, 12, 13","Pasir Panjang, Hong Leong Garden, Clementi New...",4


As you can see, multiple regions are grouped into one row, so we will have to split them.

In [2]:
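def pir(df, c):
    # Split the comma-separated column c into a list of items per row
    colc = df[c].str.split(',')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]

    # Repeat each original index once per item so the join lines up
    cdf = pd.DataFrame({c: np.concatenate(clst)}, df.index.repeat(lens))
    return df.drop(columns=c).join(cdf).reset_index(drop=True)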

postal_location2 = pir(postal_location, 'region')
postal_location2.head()

Unnamed: 0,seq,region
0,0,Raffles Place
1,0,Cecil
2,0,Marina
3,0,People's Park
4,1,Anson
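
As a possible alternative to `pir`, the same one-row-per-region split can be sketched with `DataFrame.explode` (available from pandas 0.25). The helper name `split_regions` and the two-row demo frame are assumptions made for illustration. Unlike the cell above, this sketch also trims the space that follows each comma.

```python
import pandas as pd

def split_regions(df, col):
    """Return one row per comma-separated item in `col`, with whitespace trimmed."""
    out = df.assign(**{col: df[col].str.split(',')}).explode(col)
    out[col] = out[col].str.strip()
    return out.reset_index(drop=True)

# Small demo frame modelled on the first two rows of the Wikipedia table
demo = pd.DataFrame({
    'seq': [0, 1],
    'region': ["Raffles Place, Cecil", "Anson, Tanjong Pagar"],
})
demo_split = split_regions(demo, 'region')
print(demo_split)
```

Trimming matters if the region names are later used as merge keys or search terms, because `' Cecil'` and `'Cecil'` are different strings.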


Merge district and location back into 1 dataframe and move the key 'seq' back to the front of the dataframe

In [3]:
final_post = postal_district.merge(postal_location2, on= 'seq', how='left', sort = True)

seq = final_post['seq']
final_post.drop(labels=['seq'], axis=1,inplace = True)
final_post.insert(0, 'seq', seq)
final_post.head()

Unnamed: 0,seq,postal_district,region
0,0,1,Raffles Place
1,0,1,Cecil
2,0,1,Marina
3,0,1,People's Park
4,1,2,Anson


We now have all the regions within Singapore

## Step 2: Obtain latitude and longitude data from OneMap SG

Import the libraries needed to call the OneMap SG API for latitude and longitude data

In [4]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import json

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a json file into a pandas dataframe
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
#import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Folium installed
Libraries imported.


Create the list of regions to be mapped to their latitude and longitude

In [5]:
sg_region = final_post['region']
print(sg_region.shape)
sg_region[0:10]

(75,)


0     Raffles Place
1             Cecil
2            Marina
3     People's Park
4             Anson
5     Tanjong Pagar
6       Bukit Merah
7        Queenstown
8       Tiong Bahru
9     Telok Blangah
Name: region, dtype: object

<i><u>Use OneMap SG API to obtain latitude and longitude information</u></i>
<p>Call the OneMap SG API to obtain the latitude and longitude for all 75 Singapore regions and append them to lists.</p>

In [6]:
LATITUDE = []
REGION = []
LONGTITUDE = []

search_url = 'https://developers.onemap.sg/commonapi/search'

for region in sg_region:
    try:
        # One request per region; take the first search hit
        params = {'searchVal': region, 'returnGeom': 'Y', 'getAddrDetails': 'Y', 'pageNum': 1}
        first_hit = requests.get(search_url, params=params).json()['results'][0]
        lat, lon = first_hit['LATITUDE'], first_hit['LONGITUDE']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Failed request or no match: keep the "NaN" placeholder that is filtered out later
        lat, lon = "NaN", "NaN"

    REGION.append(region)
    LATITUDE.append(lat)
    LONGTITUDE.append(lon)
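
The lookup cell reads `results[0]['LATITUDE']` and `results[0]['LONGITUDE']` from each response. A minimal sketch of that parsing step as a standalone function follows. The name `first_lat_lon` and the sample payloads are hypothetical; only the keys used in the cell above are assumed to exist in a real response.

```python
def first_lat_lon(payload):
    """Return (lat, lon) as floats from the first search hit, or None if unavailable."""
    results = payload.get('results') or []
    if not results:
        return None
    hit = results[0]
    try:
        return float(hit['LATITUDE']), float(hit['LONGITUDE'])
    except (KeyError, TypeError, ValueError):
        return None

# Hypothetical payloads shaped like the fields read in the loop above
sample = {'results': [{'LATITUDE': '1.279788786', 'LONGITUDE': '103.8480433'}]}
empty = {'results': []}

print(first_lat_lon(sample))
print(first_lat_lon(empty))
```

Separating the parsing from the network call makes the "no match" case easy to check without hitting the API.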

Convert the latitude and longitude lists into a dataframe so that it can be merged with the original location dataframe

In [7]:
dictionary = {'region':REGION,'latitude':LATITUDE, 'longtitude':LONGTITUDE}
df = pd.DataFrame(dictionary)
df.head()

Unnamed: 0,region,latitude,longtitude
0,Raffles Place,1.283933262,103.8514631
1,Cecil,1.279788786,103.8480433
2,Marina,1.280978571,103.8527216
3,People's Park,1.285792095,103.8438684
4,Anson,1.274098093,103.8456692


Merge with the original location dataframe and remove regions without latitude/longitude values

In [8]:
final_post2 = final_post.merge(df, on= 'region', how='left', sort = True)
# Filter and reset the index in one step to avoid modifying a slice in place
final_post3 = final_post2[final_post2.latitude != "NaN"].reset_index()
final_post3

Unnamed: 0,index,seq,postal_district,region,latitude,longtitude
0,0,14,15,Amber Road,1.301350935,103.901269
1,1,19,20,Ang Mo Kio,1.368778389,103.8402739
2,3,12,13,Braddell,1.346634724,103.8618812
3,4,22,23,Bukit Panjang,1.377762325,103.7735167
4,5,9,10,Bukit Timah,1.345125895,103.7726938
5,6,8,9,Cairnhill,1.30899157,103.8359739
6,7,0,1,Cecil,1.279788786,103.8480433
7,8,16,17,Changi,1.3446055030000001,103.96350859999998
8,9,22,23,Choa Chu Kang,1.39978928,103.7513161
9,11,20,21,Clementi Park,1.32954742,103.7659733


## Step 3: Obtain Foursquare neighbourhood data in Singapore

Explore neighbourhoods in Singapore.

In [9]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are illustrative)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20200202' # Foursquare API version

print('Credentials loaded.')

Credentials loaded.


In [10]:
def getNearbyVenues(region, latitude, longtitude, radius=100, LIMIT=50):
    
    venues_list=[]
    for region, latitude, longtitude in zip(region, latitude, longtitude):
        print(region)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longtitude, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            region, 
            latitude, 
            longtitude, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
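
The request URL in `getNearbyVenues` is assembled with a long format string. As a sketch of one alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which also escapes special characters. The function name `build_explore_url` and the `DEMO_ID` / `DEMO_SECRET` values are placeholders, not real credentials; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lon, radius=100, limit=50):
    """Build the explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),  # "lat,lng" pair expected by the endpoint
        'radius': radius,                # search radius in metres
        'limit': limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

demo_url = build_explore_url('DEMO_ID', 'DEMO_SECRET', '20200202', 1.279788786, 103.8480433)
print(demo_url)
```

Note that the default `radius=100` only covers venues within 100 metres of each region's coordinates.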

In [11]:
singapore_venues = getNearbyVenues(region=final_post3['region'],
                                   latitude=final_post3['latitude'],
                                   longtitude=final_post3['longtitude']
                                  )

 Amber Road
 Ang Mo Kio
 Braddell
 Bukit Panjang
 Bukit Timah
 Cairnhill
 Cecil
 Changi
 Choa Chu Kang
 Clementi Park
 Dairy Farm
 Eastwood
 Eunos
 Farrer Park
 Golden Mile
 Harbourfront
 Holland Road
 Hougang
 Jalan Besar
 Joo Chiat
 Kew Drive
 Lavender
 Marina
 Novena
 Pasir Ris
 People's Park
 Punggol
 Queenstown
 River Valley
 Sembawang
 Serangoon
 Springleaf
 Tampines
 Tanglin
 Tanjong Pagar
 Tengah
 Thomson
 Tiong Bahru
 Toa Payoh
 Tuas
 Ulu Pandan
 Upper East Coast
 Woodgrove
 Woodlands
Anson
Ardmore
Balestier
Bedok
Bishan
Bukit Merah
Geylang
High Street
Hillview
Jurong
Katong
Kranji
Lim Chu Kang
Little India
Loyang
Macpherson
Middle Road
Orchard
Pasir Panjang
Raffles Place
Seletar
Serangoon Garden
Simei
Telok Blangah
Upper Bukit Timah
Upper Thomson
Watten Estate
Yishun


In [12]:
print(singapore_venues.shape)
singapore_venues

(164, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Amber Road,1.301350935,103.901269,Amber Hotel,1.301522,103.901799,Hotel
1,Bukit Panjang,1.377762325,103.7735167,Bangkit Market,1.377676,103.773297,Miscellaneous Shop
2,Bukit Timah,1.345125895,103.7726938,Bus Stop 42111 (Woh Hup Bldg),1.345877,103.772663,Bus Stop
3,Cecil,1.279788786,103.8480433,Napoleon Food & Wine Bar,1.279925,103.847333,Wine Bar
4,Cecil,1.279788786,103.8480433,Park Bench Deli,1.279872,103.847287,Deli / Bodega
5,Cecil,1.279788786,103.8480433,ShuKuu Izakaya,1.280111,103.847762,Japanese Restaurant
6,Cecil,1.279788786,103.8480433,Meat Smith,1.280205,103.847410,Southern / Soul Food Restaurant
7,Cecil,1.279788786,103.8480433,Oven & Fried Chicken,1.280479,103.847522,Korean Restaurant
8,Cecil,1.279788786,103.8480433,Pantler,1.280137,103.847256,Bakery
9,Cecil,1.279788786,103.8480433,Common Man Stan,1.280339,103.847888,Coffee Shop


Let's check how many venues were returned for each neighborhood

In [13]:
singapore_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Amber Road,1,1,1,1,1,1
Bukit Panjang,1,1,1,1,1,1
Bukit Timah,1,1,1,1,1,1
Cecil,10,10,10,10,10,10
Changi,1,1,1,1,1,1
Eastwood,2,2,2,2,2,2
Farrer Park,5,5,5,5,5,5
Golden Mile,23,23,23,23,23,23
Harbourfront,4,4,4,4,4,4
Holland Road,1,1,1,1,1,1


In [14]:
## one hot encoding
singapore_onehot = pd.get_dummies(singapore_venues[['Venue Category']], prefix="", prefix_sep="")

### add neighborhood column back to dataframe
singapore_onehot['Neighborhood'] = singapore_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [singapore_onehot.columns[-1]] + list(singapore_onehot.columns[:-1])
singapore_onehot = singapore_onehot[fixed_columns]

singapore_onehot.head()


Unnamed: 0,Neighborhood,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bar,Bed & Breakfast,Bike Rental / Bike Share,Bistro,Bookstore,...,Southern / Soul Food Restaurant,Street Food Gathering,Supermarket,Sushi Restaurant,Thai Restaurant,Theme Park,Train Station,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Amber Road,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bukit Panjang,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Bukit Timah,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cecil,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,Cecil,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
#Check shape of dataframe
singapore_onehot.shape

(164, 78)

In [16]:
# Group rows by neighborhood and take the mean of each one-hot column, i.e. the share of venues in each category
singapore_grouped = singapore_onehot.groupby('Neighborhood').mean().reset_index()
singapore_grouped

Unnamed: 0,Neighborhood,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bar,Bed & Breakfast,Bike Rental / Bike Share,Bistro,Bookstore,...,Southern / Soul Food Restaurant,Street Food Gathering,Supermarket,Sushi Restaurant,Thai Restaurant,Theme Park,Train Station,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Amber Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bukit Panjang,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bukit Timah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cecil,0.0,0.1,0.0,0.1,0.0,0.0,0.0,0.0,0.0,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0
4,Changi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Eastwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Farrer Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
7,Golden Mile,0.0,0.130435,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.173913,0.0,0.0,0.0,0.0,0.0
8,Harbourfront,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Holland Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
num_top_venues = 5

for hood in singapore_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = singapore_grouped[singapore_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Amber Road----
                   venue  freq
0                  Hotel   1.0
1    Arts & Crafts Store   0.0
2         Massage Studio   0.0
3            Pizza Place   0.0
4  Performing Arts Venue   0.0


---- Bukit Panjang----
                   venue  freq
0     Miscellaneous Shop   1.0
1    Arts & Crafts Store   0.0
2         Massage Studio   0.0
3            Pizza Place   0.0
4  Performing Arts Venue   0.0


---- Bukit Timah----
                   venue  freq
0               Bus Stop   1.0
1    Arts & Crafts Store   0.0
2             Poke Place   0.0
3            Pizza Place   0.0
4  Performing Arts Venue   0.0


---- Cecil----
                 venue  freq
0          Coffee Shop   0.2
1  Japanese Restaurant   0.1
2        Deli / Bodega   0.1
3                 Café   0.1
4    Korean Restaurant   0.1


---- Changi----
                   venue  freq
0                   Pool   1.0
1         Massage Studio   0.0
2            Pizza Place   0.0
3  Performing Arts Venue   0.0
4         

Sort the venues in descending order

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each neighborhood

In [19]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = singapore_grouped['Neighborhood']

for ind in np.arange(singapore_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(singapore_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amber Road,Hotel,Yoga Studio,Dim Sum Restaurant,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dessert Shop
1,Bukit Panjang,Miscellaneous Shop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
2,Bukit Timah,Bus Stop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
3,Cecil,Coffee Shop,Deli / Bodega,Asian Restaurant,Wine Bar,Bakery,Café,Southern / Soul Food Restaurant,Korean Restaurant,Japanese Restaurant,Yoga Studio
4,Changi,Pool,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
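
In the table above, neighbourhoods with a single venue still show ten "most common" categories, because categories with a frequency of 0 are used to fill the remaining slots. A sketch of a variant that keeps only categories that actually occur is given below. The name `top_nonzero_categories` and the demo row are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def top_nonzero_categories(row, n):
    """Return up to n category names with a frequency above zero, most frequent first."""
    nonzero = row[row > 0].sort_values(ascending=False)
    return nonzero.index.tolist()[:n]

# Demo frequency row for one neighbourhood (values invented)
freq_row = pd.Series({'Yoga Studio': 0.0, 'Wine Bar': 0.1, 'Coffee Shop': 0.2})
top = top_nonzero_categories(freq_row, 10)
print(top)
```

With this variant, the result can be shorter than `n`, so it would need padding before being written into a fixed-width table.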


Our Singapore region data is now ready for further analysis!

# Methodology <a name="Methodology"></a>

In this project, we will identify areas with clusters of restaurants. The assumption is that regions with many restaurants have a correspondingly high customer flow.

First, we have collected the required data:
1. All regions from Singapore
1. Latitude and longitude data for each region obtained from OneMap SG
1. Venue data around the neighbourhood of each region from Foursquare

Second, our analysis will identify different clusters of Singapore regions using k-means clustering. We will then look at the venues in each cluster to identify the best region in which to start a new restaurant.

# Analysis <a name="Analysis"></a>

Import libraries to run k-means clustering

In [20]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Cluster Neighborhoods
Run k-means to cluster the neighborhood into 5 clusters.

In [21]:
# set number of clusters
kclusters = 5

singapore_grouped_clustering = singapore_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(singapore_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 3, 1, 1, 2, 1, 1, 1, 1], dtype=int32)
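
The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice, sketched here on toy data rather than on `singapore_grouped_clustering`, is to compare silhouette scores for several values of k. The toy matrix `X` and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy feature matrix with two obvious groups (not the notebook's data)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

The same loop could be run on the neighbourhood frequency table, although with this few venues per region the scores may not separate clearly.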

In [22]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

singapore_merged = final_post3

# join the cluster labels and top venues onto final_post3 to add latitude/longitude for each region
singapore_merged = singapore_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='region')

singapore_merged.head() # check the last columns!

Unnamed: 0,index,seq,postal_district,region,latitude,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,14,15,Amber Road,1.301350935,103.901269,1.0,Hotel,Yoga Studio,Dim Sum Restaurant,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dessert Shop
1,1,19,20,Ang Mo Kio,1.368778389,103.8402739,,,,,,,,,,,
2,3,12,13,Braddell,1.346634724,103.8618812,,,,,,,,,,,
3,4,22,23,Bukit Panjang,1.377762325,103.7735167,1.0,Miscellaneous Shop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
4,5,9,10,Bukit Timah,1.345125895,103.7726938,3.0,Bus Stop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant


Drop regions for which Foursquare returned no venues, and convert the cluster labels from float to integer

In [23]:
singapore_merged2 = singapore_merged[singapore_merged['Cluster Labels'].notnull()]
singapore_merged2 = singapore_merged2.astype({"Cluster Labels": int, "latitude": float, "longtitude": float})
singapore_merged2

Unnamed: 0,index,seq,postal_district,region,latitude,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,14,15,Amber Road,1.301351,103.901269,1,Hotel,Yoga Studio,Dim Sum Restaurant,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dessert Shop
3,4,22,23,Bukit Panjang,1.377762,103.773517,1,Miscellaneous Shop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
4,5,9,10,Bukit Timah,1.345126,103.772694,3,Bus Stop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
6,7,0,1,Cecil,1.279789,103.848043,1,Coffee Shop,Deli / Bodega,Asian Restaurant,Wine Bar,Bakery,Café,Southern / Soul Food Restaurant,Korean Restaurant,Japanese Restaurant,Yoga Studio
7,8,16,17,Changi,1.344606,103.963509,1,Pool,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
11,13,15,16,Eastwood,1.322869,103.95692,2,Playground,Park,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Dessert Shop
13,15,7,8,Farrer Park,1.313279,103.853623,1,Chinese Restaurant,Hostel,Hotel,Sushi Restaurant,Restaurant,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
14,16,6,7,Golden Mile,1.303142,103.863877,1,Thai Restaurant,Indian Restaurant,Asian Restaurant,Noodle House,Chinese Restaurant,Dessert Shop,Coffee Shop,Miscellaneous Shop,Burger Joint,Comfort Food Restaurant
15,17,3,4,Harbourfront,1.26601,103.820475,1,Coffee Shop,Furniture / Home Store,Casino,Food Court,Deli / Bodega,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop
16,18,9,10,Holland Road,1.314052,103.791269,1,Pizza Place,Yoga Studio,Dessert Shop,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega


In [24]:
#set singapore latitude and longitude
latitude= 1.3521
longitude=103.8198

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(singapore_merged2['latitude'], singapore_merged2['longtitude'], singapore_merged2['region'], singapore_merged2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Results and Discussion <a name="Results"></a>

Looking at the folium map, we can see that it is not easy to categorise the regions of Singapore into clusters. The regions appear to have very similar venues in their neighbourhoods. In other words, the neighbourhoods resemble one another in the variety of venues found around them.

After looking at the resulting clusters, the initial assumption for the business problem has to be updated. Instead of finding a location based on areas where there are a lot of other restaurants (thus ensuring customer flow), the data shows us <b>areas that we should avoid</b> when opening a restaurant, because those regions are populated mainly with public parks and infrastructure and have close to zero restaurants.

Now let's examine each cluster in detail. The clusters are numbered from 1 below, while the k-means labels in the code start at 0.

Cluster 1 (label 0): a potential region for a restaurant, depending on the type of restaurant we would like to open. This region seems to attract a young demographic and to have a vibrant nightlife.

In [25]:
singapore_merged2.loc[singapore_merged2['Cluster Labels'] == 0, singapore_merged2.columns[[1] + list(range(5, singapore_merged2.shape[1]))]]

Unnamed: 0,seq,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,7,103.853435,0,Pakistani Restaurant,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
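
The positional selection `columns[[1] + list(range(5, ...))]` returns `seq` and `longtitude` but not the region name, so the tables in this section do not show which region each row refers to. A sketch of selecting by column name instead is given below. The helper `cluster_view` and the two-row demo frame are assumptions for illustration, with values copied from the merged table shown earlier.

```python
import pandas as pd

def cluster_view(df, label):
    """Return region, cluster label and the 'Most Common Venue' columns for one cluster."""
    venue_cols = [c for c in df.columns if c.endswith('Most Common Venue')]
    return df.loc[df['Cluster Labels'] == label, ['region', 'Cluster Labels'] + venue_cols]

# Demo frame with the same column names as singapore_merged2 (only one venue column kept)
demo = pd.DataFrame({
    'seq': [14, 9],
    'region': ['Amber Road', 'Bukit Timah'],
    'latitude': [1.301351, 1.345126],
    'longtitude': [103.901269, 103.772694],
    'Cluster Labels': [1, 3],
    '1st Most Common Venue': ['Hotel', 'Bus Stop'],
})
view = cluster_view(demo, 3)
print(view)
```

Selecting by name keeps the output stable if columns are added or reordered earlier in the notebook.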


Cluster 2 (label 1): we may want to avoid opening a restaurant in these regions, given the limited number of eateries in the surrounding neighbourhoods

In [26]:
singapore_merged2.loc[singapore_merged2['Cluster Labels'] == 1, singapore_merged2.columns[[1] + list(range(5, singapore_merged2.shape[1]))]]

Unnamed: 0,seq,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,14,103.901269,1,Hotel,Yoga Studio,Dim Sum Restaurant,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dessert Shop
3,22,103.773517,1,Miscellaneous Shop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
6,0,103.848043,1,Coffee Shop,Deli / Bodega,Asian Restaurant,Wine Bar,Bakery,Café,Southern / Soul Food Restaurant,Korean Restaurant,Japanese Restaurant,Yoga Studio
7,16,103.963509,1,Pool,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
13,7,103.853623,1,Chinese Restaurant,Hostel,Hotel,Sushi Restaurant,Restaurant,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
14,6,103.863877,1,Thai Restaurant,Indian Restaurant,Asian Restaurant,Noodle House,Chinese Restaurant,Dessert Shop,Coffee Shop,Miscellaneous Shop,Burger Joint,Comfort Food Restaurant
15,3,103.820475,1,Coffee Shop,Furniture / Home Store,Casino,Food Court,Deli / Bodega,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop
16,9,103.791269,1,Pizza Place,Yoga Studio,Dessert Shop,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
17,18,103.887466,1,Snack Place,Fried Chicken Joint,Café,Restaurant,Yoga Studio,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
18,7,103.859467,1,BBQ Joint,Vietnamese Restaurant,Bar,Bed & Breakfast,Soup Place,Ice Cream Shop,Yoga Studio,Dim Sum Restaurant,Construction & Landscaping,Convenience Store


Cluster 3 (label 2): we can probably open our restaurant in any one of these regions, given the variety of eateries in the neighbourhood

In [27]:
singapore_merged2.loc[singapore_merged2['Cluster Labels'] == 2, singapore_merged2.columns[[1] + list(range(5, singapore_merged2.shape[1]))]]

Unnamed: 0,seq,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,15,103.95692,2,Playground,Park,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Dessert Shop
43,24,103.758554,2,Park,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant


Cluster 4 (label 3): we may also want to avoid opening a restaurant in this region, given the limited number of eateries in the surrounding neighbourhood

In [28]:
singapore_merged2.loc[singapore_merged2['Cluster Labels'] == 3, singapore_merged2.columns[[1] + list(range(5, singapore_merged2.shape[1]))]]

Unnamed: 0,seq,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,9,103.772694,3,Bus Stop,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant


Cluster 5 (label 4): potential regions in which to open a restaurant

In [29]:
singapore_merged2.loc[singapore_merged2['Cluster Labels'] == 4, singapore_merged2.columns[[1] + list(range(5, singapore_merged2.shape[1]))]]

Unnamed: 0,seq,longtitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
24,17,103.963619,4,Bus Station,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant
65,18,103.873219,4,Bus Station,Yoga Studio,Dessert Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Dim Sum Restaurant


# Conclusion <a name="Conclusion"></a>

The purpose of this project was to identify a region with a vibrant food scene in which to open a restaurant in Singapore. The data, however, instead shows us areas where we may want to avoid opening a restaurant (the regions in clusters 2 and 4), given that too many regions belong to the same clusters. This shows that many regions have very similar characteristics, and that we need to avoid the regions that do not have many food attractions.

There is also a lack of venue data for many of the Singapore regions, as seen in the results of *singapore_venues.groupby('Neighborhood').count()*, where many regions returned only 1 venue from Foursquare. This could be improved by using another food review data provider instead of Foursquare. Therefore, we do not have sufficient information to advise on the best region in which to open a restaurant, but we can definitely highlight the regions within clusters 2 and 4 as ones to avoid.