# Indonesia: The Capital vs The Tourist Attraction

Indonesia is a vast country, ranked 4th in the list of countries by population. According to [worldometers.info](https://www.worldometers.info/world-population/indonesia-population/), Indonesia currently has a population of about 275 million people.

![Indonesia](https://images.mapsofworld.com/earthquake/1482303515indonesia-map.gif)

Indonesia is divided into 34 provinces, spread across 5 main islands and thousands of smaller islands. One of the most popular provinces is the capital province of Jakarta, which is also one of the main financial hubs in Indonesia. Another popular province is Bali, which is famous as a tourist attraction for both domestic and international tourists, with a total contribution of almost 29% of the national foreign exchange.

This project will explore and find different clusters of neighborhoods in both Jakarta and Bali to see whether certain characteristics are associated with each province, or whether both provinces share the same characteristics, based on the various places and landmarks in their neighborhoods.

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)

# Web Scraping
import requests
from requests import get
from bs4 import BeautifulSoup

# Data

The postal code and information about each province and city will be gathered from https://postcode.id/, which will give us the following fields from the data table:

- Propinsi (Province): The name of the province
- Jenis (City Type): Type of the city: either city or regency
- Kabupaten/Kota (Regency/City): The name of the city/regency
- Kecamatan (District): The name of the district in the city/regency
- Kelurahan (Neighborhood): The name of the neighborhood
- Postcode: The postal code of the neighborhood
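Because every table row carries these six fields in a fixed order, a flat list of cell texts can be split back into columns by slicing with a stride of six. The snippet below is a minimal sketch of that idea using two rows from the Jakarta table; the `FIELDS` and `cells` names are illustrative, and the length check is an added safeguard rather than part of the original workflow.

```python
# English column names for the six fields, in table order
FIELDS = ["province", "city_type", "city", "district", "neighborhood", "postal_code"]

# Two table rows flattened into a single list of cell texts
cells = [
    "DKI Jakarta", "Kota", "Jakarta Barat", "Taman Sari", "Pinangsia", "11110",
    "DKI Jakarta", "Kota", "Jakarta Barat", "Taman Sari", "Glodok", "11120",
]

# Guard against a layout change: the cell count must be a multiple of 6
if len(cells) % len(FIELDS) != 0:
    raise ValueError("unexpected number of cells; the table layout may have changed")

# cells[k::6] takes every 6th cell starting at offset k, i.e. one column
columns = {name: cells[offset::len(FIELDS)] for offset, name in enumerate(FIELDS)}

print(columns["neighborhood"])  # ['Pinangsia', 'Glodok']
print(columns["postal_code"])   # ['11110', '11120']
```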

I will also use the `geocoder` package to get the latitude and longitude of each district/neighborhood. To get the list of places and landmarks in each area, I will use the `Foursquare API`.
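Geocoding services usually resolve a place more reliably when given a full address string rather than a bare neighborhood name. As a rough sketch, the query for each row could be composed from the scraped fields as shown below. The helper name `build_geocode_query` and the comma-separated format are assumptions made for illustration, not something the source prescribes.

```python
def build_geocode_query(neighborhood, district, city, province, country="Indonesia"):
    """Join the address parts, most specific first, skipping empty values."""
    parts = [neighborhood, district, city, province, country]
    return ", ".join(part.strip() for part in parts if part and part.strip())


query = build_geocode_query("Glodok", "Taman Sari", "Jakarta Barat", "DKI Jakarta")
print(query)  # Glodok, Taman Sari, Jakarta Barat, DKI Jakarta, Indonesia

# The string could then be handed to the geocoder package, for example
# geocoder.arcgis(query).latlng (the choice of provider is an assumption;
# the text above only names the package).
```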

With this data collection method, I will be able to extract various places and landmarks. I will use clustering analysis to find the characteristics of each city and its neighborhoods, and to see whether Jakarta and Bali have the same overall characteristics based on each province's main features. From a business perspective, the analysis can also help determine whether it is better to open a restaurant in Bali or in Jakarta. Is Bali, as a tourist attraction, more associated with restaurants than Jakarta? Is it better to set up a contractor office in Jakarta, since Jakarta is the capital and an economic center? We will find out in this project.
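To make the clustering idea concrete, here is a minimal sketch of one common approach: one-hot encode the venue categories, average them per neighborhood to get a category profile, and group the profiles with k-means. Everything in it is an assumption for illustration: the venue rows are made-up toy data, the neighborhood labels A to D are placeholders, and both k-means and the choice of two clusters stand in for whichever clustering method the analysis ends up using.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue list (hypothetical data, not real API results)
venues = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "category": [
        "Restaurant", "Restaurant", "Beach",
        "Restaurant", "Beach", "Beach",
        "Office", "Office", "Coffee Shop",
        "Office", "Coffee Shop", "Coffee Shop",
    ],
})

# One column per venue category, 1.0 where the venue belongs to it
one_hot = pd.get_dummies(venues["category"], dtype=float)

# Mean per neighborhood = share of each category among its venues
profiles = one_hot.groupby(venues["neighborhood"]).mean()

# Group neighborhoods with similar category profiles
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = pd.Series(kmeans.labels_, index=profiles.index, name="cluster")

print(profiles.round(2))
print(labels)
```

With profiles like these, neighborhoods dominated by restaurants and beaches should land in one cluster and those dominated by offices and coffee shops in the other.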

## Jakarta Province

The following code scrapes the data from https://postcode.id/dki-jakarta/ and gives me the list of districts and neighborhoods in the capital province of Jakarta. Jakarta is a big province and is divided into 5 administrative cities (Central Jakarta, South Jakarta, East Jakarta, West Jakarta, and North Jakarta) plus the Thousand Islands (Kepulauan Seribu) regency.

In [15]:
url = "https://postcode.id/dki-jakarta/"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

In [16]:
table_content = html_soup.find_all('div', class_ = "supsystic-tables-wrap")
table_content = table_content[0].find_all('td')

# Collect the text of every table cell into one flat list
raw_text = [cell.text for cell in table_content]

# Each table row has 6 cells, so a stride of 6 picks out one column
province = raw_text[::6]
city_type = raw_text[1::6]
city = raw_text[2::6]
district = raw_text[3::6]
neighborhood = raw_text[4::6]
postal_code = raw_text[5::6]

df_jakarta = pd.DataFrame({'province':province, 'city_type':city_type, 'city':city, 'district': district, 'neighborhood':neighborhood, 'postal_code':postal_code})
df_jakarta.head()

| | province | city_type | city | district | neighborhood | postal_code |
|---|---|---|---|---|---|---|
| 0 | DKI Jakarta | Kota | Jakarta Barat | Taman Sari | Pinangsia | 11110 |
| 1 | DKI Jakarta | Kota | Jakarta Barat | Taman Sari | Glodok | 11120 |
| 2 | DKI Jakarta | Kota | Jakarta Barat | Taman Sari | Keagungan | 11130 |
| 3 | DKI Jakarta | Kota | Jakarta Barat | Taman Sari | Krukut | 11140 |
| 4 | DKI Jakarta | Kota | Jakarta Barat | Taman Sari | Taman Sari | 11150 |


Let's check the number of neighborhoods in Jakarta.

In [13]:
df_jakarta.shape

(267, 6)

## Bali Province

The following code scrapes the data from https://postcode.id/bali/ and gives me the list of districts and neighborhoods in the province of Bali.

In [8]:
url = "https://postcode.id/bali/"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
table_content = html_soup.find_all('div', class_ = "supsystic-tables-wrap")
table_content = table_content[0].find_all('td')

# Collect the text of every table cell into one flat list
raw_text = [cell.text for cell in table_content]

# Each table row has 6 cells, so a stride of 6 picks out one column
province = raw_text[::6]
city_type = raw_text[1::6]
city = raw_text[2::6]
district = raw_text[3::6]
neighborhood = raw_text[4::6]
postal_code = raw_text[5::6]

df_bali = pd.DataFrame({'province':province, 'city_type':city_type, 'city':city, 'district': district, 'neighborhood':neighborhood, 'postal_code':postal_code})
df_bali.head()

| | province | city_type | city | district | neighborhood | postal_code |
|---|---|---|---|---|---|---|
| 0 | Bali | Kab. | Badung | Mengwi | Abianbase | 80351 |
| 1 | Bali | Kab. | Badung | Mengwi | Baha | 80351 |
| 2 | Bali | Kab. | Badung | Mengwi | Buduk | 80351 |
| 3 | Bali | Kab. | Badung | Mengwi | Cemagi | 80351 |
| 4 | Bali | Kab. | Badung | Mengwi | Gulingan | 80351 |
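The first five Bali rows all share the postal code 80351, which suggests that a postal code alone may not identify a single neighborhood here. A quick way to check how often that happens is to count the distinct neighborhoods per postal code. The sketch below does this on just the five rows shown; the same `groupby` could presumably be run on the full `df_bali`, though the result there is not something this sample can confirm.

```python
import pandas as pd

# The five Bali rows from the output above
df_head = pd.DataFrame({
    "neighborhood": ["Abianbase", "Baha", "Buduk", "Cemagi", "Gulingan"],
    "postal_code": ["80351"] * 5,
})

# Count distinct neighborhoods for each postal code
per_code = df_head.groupby("postal_code")["neighborhood"].nunique()

# Keep only postal codes that cover more than one neighborhood
shared = per_code[per_code > 1]
print(shared)
```

If many codes are shared, geocoding by neighborhood name and district is likely to be more precise than geocoding by postal code.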


Let's check the number of neighborhoods in Bali.

In [14]:
df_bali.shape

(714, 6)
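The Jakarta and Bali cells repeat the same parsing steps, so they could be folded into one helper. The following is a sketch under a few assumptions: the function names `parse_postcode_table` and `scrape_province` are hypothetical, the 30-second timeout is arbitrary, and `get_text(strip=True)` trims surrounding whitespace, which the original `.text` calls do not. Parsing is kept separate from downloading so it can be tried on a small HTML sample without a network request.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

COLUMNS = ["province", "city_type", "city", "district", "neighborhood", "postal_code"]


def parse_postcode_table(html):
    """Turn the first 'supsystic-tables-wrap' table in the HTML into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", class_="supsystic-tables-wrap")
    if wrapper is None:
        raise ValueError("table wrapper not found; the page layout may have changed")
    cells = [td.get_text(strip=True) for td in wrapper.find_all("td")]
    if len(cells) % len(COLUMNS) != 0:
        raise ValueError("cell count is not a multiple of 6")
    # Regroup the flat cell list into rows of six fields
    rows = [cells[i:i + len(COLUMNS)] for i in range(0, len(cells), len(COLUMNS))]
    return pd.DataFrame(rows, columns=COLUMNS)


def scrape_province(url):
    """Download a province page and parse its table (needs network access)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_postcode_table(response.text)


# Offline check with a two-row sample that mimics the page structure
sample_html = """
<div class="supsystic-tables-wrap"><table>
<tr><td>DKI Jakarta</td><td>Kota</td><td>Jakarta Barat</td><td>Taman Sari</td><td>Pinangsia</td><td>11110</td></tr>
<tr><td>DKI Jakarta</td><td>Kota</td><td>Jakarta Barat</td><td>Taman Sari</td><td>Glodok</td><td>11120</td></tr>
</table></div>
"""
df_sample = parse_postcode_table(sample_html)
print(df_sample)
```

With this in place, the two provinces could be loaded as `scrape_province("https://postcode.id/dki-jakarta/")` and `scrape_province("https://postcode.id/bali/")`.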