# Jakarta Living Area Recommendation

Table of Contents:
1. [Introduction](#1.-Introduction)
2. [Data](#2.-Data)
3. [Methodology](#3.-Methodology)
4. [Result](#4.-Result)
5. [Discussion](#5.-Discussion)
6. [Conclusion](#6.-Conclusion)

# 1. Introduction

Jakarta is the capital and largest city of Indonesia, with a population of more than 10 million as of 2019. The Greater Jakarta area, or Jabodetabek, has a population of 30 million, making it the 2nd most populated area in the world after the Greater Tokyo area. Jakarta is divided into 44 districts, 2 of which are in Kepulauan Seribu Regency, a regency of 342 islands situated 70 km north of the Jakarta core area.

It can be hard to find a good area to live in Jakarta, so this project builds a recommendation for users who want to find the best district to live in. The recommendation is based on the user's interests: the user inputs the venue categories they want to have around their potential living district.

I also use district population data as a feature. I wanted to use the land value in each district as another feature, but I could not find any readily available data for it.

District population data is available online and can be downloaded directly. For property sales value, however, I would need to build my own dataset from Jakarta Government Regulation Number 24 Year 2018, which means gathering the data from approximately 6,000 PDF pages and converting it manually to Excel. That is too much work for now, so I skip it.

I also use the area size of each district. Dividing each district's population by its area gives its population density, which is then categorized as 'Low Density', 'Medium Density', or 'High Density'.

The recommendation scores each district by the total frequency of venues that match the categories chosen by the user. For example, user A wants Japanese restaurants, Chinese restaurants, malls, and schools around their living area, with low density. District X has 3 Japanese restaurants, 6 Chinese restaurants, 2 malls, and 3 schools, but it is a high density area, so the density preference adds nothing and the total frequency is 3 + 6 + 2 + 3 = 14.

The result given to the user is the top 7 districts to live in.
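
The scoring rule in the example can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used later in this notebook: the dictionary `district_x` and the helper `total_frequency` are hypothetical names, and the 'Cafe' count is invented to show that unrequested categories are ignored.

```python
# Hypothetical venue counts for district X, taken from the example above
district_x = {
    'Japanese Restaurant': 3,
    'Chinese Restaurant': 6,
    'Mall': 2,
    'School': 3,
    'Cafe': 9,           # present in the district but not requested (invented value)
    'High Density': 1,   # district X is a high density area
}

wanted = ['Japanese Restaurant', 'Chinese Restaurant', 'Mall', 'School', 'Low Density']

def total_frequency(district, wanted_categories):
    # Sum the counts of the wanted categories; a category the district lacks counts as 0
    return sum(district.get(category, 0) for category in wanted_categories)

score = total_frequency(district_x, wanted)
print(score)  # 14
```

Because district X has no 'Low Density' entry, the density preference contributes 0 to the sum, which matches the total of 14 in the example.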

# 2. Data

Here is the data used in this project:
- Jakarta population dataset, used to make a choropleth map of the population distribution across districts. The data is taken from [Jakarta OpenData](http://data.jakarta.go.id/sr/dataset/jumlahpendudukmenurutkecamatandanjeniskelamindkijakarta).
- Jakarta district size dataset, used to calculate the density of each district and then categorize each district as 'Low Density', 'Medium Density', or 'High Density'. The data is taken from [Jakarta OpenData](http://data.jakarta.go.id/dataset/luas-wilayah-dan-kepadatan-provinsi-dki-jakarta-tahun-2015).
- Foursquare API data, used to find the venues/places available around each district.
- Jakarta GeoJSON file of the district boundaries, used for the choropleth map. The file is taken from [GIS BPBD DKI Jakarta](http://gis.bpbd.jakarta.go.id/layers/geonode%3Adki_kecamatan).

# 3. Methodology

This section describes every step, from loading the datasets to producing the top 7 districts for the user to live in.

## Explore Jakarta Population Dataset

In this part, I explore the population data for each district of Jakarta.

### Import required libraries

In [1]:
import pandas as pd

### Process Dataset

Load dataset

In [2]:
df_jakarta = pd.read_csv('jakarta_pop.csv')
df_jakarta.head()

Unnamed: 0,tahun,kab/kota,kecamatan,jenis_kelamin,jumlah
0,2014,Kepulauan Seribu,Kepulauan Seribu Selatan,laki-laki,4696
1,2014,Kepulauan Seribu,Kepulauan Seribu Selatan,perempuan,4664
2,2014,Kepulauan Seribu,Kepulauan Seribu Utara,laki-laki,6933
3,2014,Kepulauan Seribu,Kepulauan Seribu Utara,perempuan,6718
4,2014,Jakarta Selatan,Jagakarsa,laki-laki,179995


Let's rename the dataframe columns to English.

In [3]:
df_jakarta.columns = [
    'Year',
    'City',
    'District',
    'Gender',
    'Total'
]
df_jakarta.head()

Unnamed: 0,Year,City,District,Gender,Total
0,2014,Kepulauan Seribu,Kepulauan Seribu Selatan,laki-laki,4696
1,2014,Kepulauan Seribu,Kepulauan Seribu Selatan,perempuan,4664
2,2014,Kepulauan Seribu,Kepulauan Seribu Utara,laki-laki,6933
3,2014,Kepulauan Seribu,Kepulauan Seribu Utara,perempuan,6718
4,2014,Jakarta Selatan,Jagakarsa,laki-laki,179995


We don't need the gender data here, so I remove it.
The year is not needed either, because every row has '2014' as its year, so it adds nothing to our dataset.

Then I sum the population by city and district.

In [4]:
df_jakarta.drop(['Year', 'Gender'], axis = 1, inplace = True)
df_jakarta = df_jakarta.groupby(['City', 'District']).sum().sort_values(by='District').reset_index()

df_jakarta.head()

Unnamed: 0,City,District,Total
0,Jakarta Timur,Cakung,523159
1,Jakarta Pusat,Cempaka Putih,84864
2,Jakarta Barat,Cengkareng,555972
3,Jakarta Selatan,Cilandak,197524
4,Jakarta Utara,Cilincing,397467


In [5]:
df_jakarta.shape

(44, 3)

There are 44 districts in Jakarta. Let's drop the Kepulauan Seribu rows: the GeoJSON file I have doesn't include Kepulauan Seribu, and there aren't many properties sold there anyway.

In [6]:
df_jakarta = df_jakarta[df_jakarta['City'] != 'Kepulauan Seribu'].reset_index(drop = True)

df_jakarta.head()

Unnamed: 0,City,District,Total
0,Jakarta Timur,Cakung,523159
1,Jakarta Pusat,Cempaka Putih,84864
2,Jakarta Barat,Cengkareng,555972
3,Jakarta Selatan,Cilandak,197524
4,Jakarta Utara,Cilincing,397467


In [7]:
df_jakarta.shape

(42, 3)

Kepulauan Seribu removal is successful!

Let's move to the next step!

## Visualize Jakarta's District Map

In this part, I visualize the map and the population distribution using a choropleth map.

### Import required libraries

In [8]:
import geocoder
#!pip install folium
import folium
import json
import numpy as np

### Visualize the plain map

I use 'Monumen Nasional' (Monas) in Central Jakarta as our map center.

In [9]:
g = geocoder.arcgis('Monumen Nasional, Jakarta Pusat')

g.latlng

[-6.17536999999993, 106.82852000000008]

Let's see the plain map without any features added.

In [10]:
peta = folium.Map(
    location = [g.latlng[0], g.latlng[1]],
    zoom_start = 11
)

peta

### Find extra data for our choropleth map

Load Jakarta district GeoJSON data.

In [11]:
with open("jakarta_district.json") as json_file:
    jakarta_geo = json.load(json_file)

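Add an id to each row so it can be matched with the GeoJSON features.id values. This relies on the GeoJSON features being in the same order as the rows of df_jakarta.

In [12]:
# Build ids of the form 'dki_kecamatan.<n>', numbered from 1 in row order
# (a list comprehension avoids shadowing the built-in id())
df_jakarta['id'] = ['dki_kecamatan.' + str(i + 1) for i in range(df_jakarta.shape[0])]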

df_jakarta.head()

Unnamed: 0,City,District,Total,id
0,Jakarta Timur,Cakung,523159,dki_kecamatan.1
1,Jakarta Pusat,Cempaka Putih,84864,dki_kecamatan.2
2,Jakarta Barat,Cengkareng,555972,dki_kecamatan.3
3,Jakarta Selatan,Cilandak,197524,dki_kecamatan.4
4,Jakarta Utara,Cilincing,397467,dki_kecamatan.5
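
Assigning ids by row position only works if the GeoJSON features come in the same order as the dataframe rows. A more defensive option, sketched below, is to look the id up by district name. This is an assumption-heavy sketch: the helper `build_id_lookup` is hypothetical, and the property key 'KECAMATAN' is a made-up placeholder, since the actual key in the GeoJSON file is not shown in this notebook.

```python
def build_id_lookup(geojson, name_key):
    # Map each feature's district name (upper-cased for a case-insensitive match) to its id
    return {
        feature['properties'][name_key].upper(): feature['id']
        for feature in geojson['features']
    }

# Toy GeoJSON with a hypothetical 'KECAMATAN' property; the real file may use another key
toy_geo = {
    'features': [
        {'id': 'dki_kecamatan.2', 'properties': {'KECAMATAN': 'CEMPAKA PUTIH'}},
        {'id': 'dki_kecamatan.1', 'properties': {'KECAMATAN': 'CAKUNG'}},
    ]
}

lookup = build_id_lookup(toy_geo, 'KECAMATAN')
districts = ['Cakung', 'Cempaka Putih']
ids = [lookup[name.upper()] for name in districts]
print(ids)  # ['dki_kecamatan.1', 'dki_kecamatan.2']
```

With a lookup like this, the ids stay correct even when the features are stored in a different order from the dataframe rows.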


Add the latitude and longitude of each district.

I take the mean of the latitudes and longitudes of the boundary points in the GeoJSON data. If I just used the first latitude and longitude, the circle marker that I add later would appear on the border of each district. Instead, I want the circle marker to appear around the center of each district.

In [13]:
longitude = []
latitude = []
 
for i in range(df_jakarta.shape[0]):
    original_list = jakarta_geo['features'][i]['geometry']['coordinates'][0][0]
    mean_lat_long = [sum(x)/len(x) for x in zip(*original_list)] 
    longitude.append(mean_lat_long[0])
    latitude.append(mean_lat_long[1])
    
df_jakarta['Latitude'] = latitude
df_jakarta['Longitude'] = longitude

df_jakarta.head()

Unnamed: 0,City,District,Total,id,Latitude,Longitude
0,Jakarta Timur,Cakung,523159,dki_kecamatan.1,-6.190411,106.925974
1,Jakarta Pusat,Cempaka Putih,84864,dki_kecamatan.2,-6.181864,106.869431
2,Jakarta Barat,Cengkareng,555972,dki_kecamatan.3,-6.156295,106.735035
3,Jakarta Selatan,Cilandak,197524,dki_kecamatan.4,-6.294269,106.790463
4,Jakarta Utara,Cilincing,397467,dki_kecamatan.5,-6.125917,106.940374
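
The vertex-mean idea can be tried on a toy shape. The sketch below is an illustration under stated assumptions, not the notebook's code: `mean_of_vertices` is a hypothetical helper, and the rectangle is invented. It also drops the closing point of the ring, because GeoJSON polygon rings repeat their first point at the end, which would otherwise be counted twice. Note that the mean of the vertices is only an approximation of the geometric center.

```python
def mean_of_vertices(ring):
    # GeoJSON rings repeat the first point at the end; drop it so it is not counted twice
    points = ring[:-1] if ring[0] == ring[-1] else ring
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lat, lon

# Toy rectangle in GeoJSON [longitude, latitude] order (not a real district)
ring = [[106.0, -6.0], [106.4, -6.0], [106.4, -6.2], [106.0, -6.2], [106.0, -6.0]]
lat, lon = mean_of_vertices(ring)
print(round(lat, 3), round(lon, 3))  # -6.1 106.2
```

For boundaries with many points bunched along one edge, the vertex mean drifts toward that edge, so a true polygon centroid may place the marker better.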


### Visualize with choropleth map

I still use 'Monumen Nasional' as the map center.

Add the choropleth layer, then add circle markers showing the district name and its population with a thousands separator.

In [14]:
peta_baru = folium.Map(
    location = [g.latlng[0], g.latlng[1]],
    zoom_start = 11
)

folium.Choropleth(
    geo_data = jakarta_geo,
    data = df_jakarta,
    columns = ['id', 'Total'],
    key_on = 'feature.id',
    fill_color = 'YlOrRd',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = 'Population in each district'
).add_to(peta_baru)

for lat, long, dis, total in zip(df_jakarta['Latitude'], df_jakarta['Longitude'], df_jakarta['District'], df_jakarta['Total']):
    label = str(dis) + ' ' + '{:,}'.format(total)
    
    folium.CircleMarker(
        location = [lat, long],
        radius = 5,
        color = 'yellow',
        fill = True,
        popup = label,
        fill_color = 'blue',
        fill_opacity = 0.6
    ).add_to(peta_baru)


peta_baru

We can see that most of Jakarta's population is spread toward the outskirts, maybe because property there is cheaper than in the more central area, where many business and government offices are located.

## Calculate Density of Each District

### Load Dataset

In [15]:
size = pd.read_csv('jakarta_district_size.csv')

size.head()

Unnamed: 0,Kecamatan,Kelurahan,Luas Wilayah (m2)
0,Cakung,Jatinegara,6.6
1,Cakung,Rawa Terate,3.3
2,Cakung,Penggilingan,4.48
3,Cakung,Cakung Timur,9.81
4,Cakung,Pulo Gebang,6.92


### Process Dataset

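This dataset is still at the 'Kelurahan' (urban village) level, so we need to group by 'Kecamatan' (district) and sum the area sizes. Although the column header says m2, the values appear to be in km2, which is why the density is later reported per km2.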

In [16]:
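# Select the area column first, then sum per district
# (GroupBy.sum() does not accept an axis argument in recent pandas versions)
size = size.groupby('Kecamatan')['Luas Wilayah (m2)'].sum().reset_index()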

size.columns = ['District', 'Area Size (m2)']

size.head()

Unnamed: 0,District,Area Size (m2)
0,Cakung,42.27
1,Cempaka Putih,4.7
2,Cengkareng,26.55
3,Cilandak,18.16
4,Cilincing,37.7



Let's join df_jakarta dataset with size dataset.

In [18]:
df_jakarta = df_jakarta.join(size.set_index('District'), on = 'District')

### Calculate Density

Calculate the density and categorize it.

If the density is above 0 and at most 18000, then Low Density.<br>
If the density is above 18000 and at most 30000, then Medium Density.<br>
If the density is above 30000, then High Density.

In [19]:
df_jakarta['Density Amount'] = round(df_jakarta['Total'] / df_jakarta['Area Size (m2)'], 2)

density_list = []

for i in df_jakarta['Density Amount']:
    if i > 0 and i <= 18000:
        density_list.append('Low Density')
    elif i > 18000 and i <= 30000:
        density_list.append('Medium Density')
    else:
        density_list.append('High Density')
        
df_jakarta['Density'] = density_list

df_jakarta.head(10)

Unnamed: 0,City,District,Total,id,Latitude,Longitude,Area Size (m2),Density Amount,Density
0,Jakarta Timur,Cakung,523159,dki_kecamatan.1,-6.190411,106.925974,42.27,12376.6,Low Density
1,Jakarta Pusat,Cempaka Putih,84864,dki_kecamatan.2,-6.181864,106.869431,4.7,18056.17,Medium Density
2,Jakarta Barat,Cengkareng,555972,dki_kecamatan.3,-6.156295,106.735035,26.55,20940.56,Medium Density
3,Jakarta Selatan,Cilandak,197524,dki_kecamatan.4,-6.294269,106.790463,18.16,10876.87,Low Density
4,Jakarta Utara,Cilincing,397467,dki_kecamatan.5,-6.125917,106.940374,37.7,10542.89,Low Density
5,Jakarta Timur,Cipayung,260578,dki_kecamatan.6,-6.322179,106.910599,28.46,9155.94,Low Density
6,Jakarta Timur,Ciracas,267311,dki_kecamatan.7,-6.320755,106.874709,16.08,16623.82,Low Density
7,Jakarta Timur,Duren Sawit,394657,dki_kecamatan.8,-6.233446,106.915957,22.66,17416.46,Low Density
8,Jakarta Pusat,Gambir,78152,dki_kecamatan.9,-6.171775,106.820533,7.59,10296.71,Low Density
9,Jakarta Barat,Grogol Petamburan,232697,dki_kecamatan.10,-6.163629,106.786118,9.99,23292.99,Medium Density
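
The same three-way binning can also be written with `pd.cut`, which keeps the thresholds in one place. This is an alternative sketch, not the code used in this notebook, and the sample values are a few densities copied from the outputs in this notebook.

```python
import pandas as pd

# Sample density values from this notebook (people per km2)
density = pd.Series([12376.6, 18056.17, 20940.56, 44347.04])

# Bins are right-inclusive by default: (0, 18000], (18000, 30000], (30000, inf]
labels = ['Low Density', 'Medium Density', 'High Density']
category = pd.cut(density, bins=[0, 18000, 30000, float('inf')], labels=labels)

print(category.astype(str).tolist())
# ['Low Density', 'Medium Density', 'Medium Density', 'High Density']
```

One difference from the loop above: a missing or non-positive density falls outside every bin and becomes NaN here, whereas the loop's `else` branch would label it 'High Density'.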


In [20]:
print('Top 5 most densely populated district in Jakarta')

for i, dis in enumerate(df_jakarta.sort_values(by = 'Density Amount', ascending = False)['District'].head()):
    print(str(i + 1) + '. ' + dis)

Top 5 most densely populated district in Jakarta
1. Johar Baru
2. Tambora
3. Matraman
4. Kemayoran
5. Palmerah


### Visualize density on choropleth map

In [21]:
peta_baru = folium.Map(
    location = [g.latlng[0], g.latlng[1]],
    zoom_start = 11
)

folium.Choropleth(
    geo_data = jakarta_geo,
    data = df_jakarta,
    columns = ['id', 'Density Amount'],
    key_on = 'feature.id',
    fill_color = 'YlOrRd',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = 'Population density per km2 in each district'
).add_to(peta_baru)

for lat, long, dis, dense in zip(df_jakarta['Latitude'], df_jakarta['Longitude'], df_jakarta['District'], df_jakarta['Density Amount']):
    label = str(dis) + ' ' + '{:,}'.format(dense) + '/km2'
    
    folium.CircleMarker(
        location = [lat, long],
        radius = 7,
        color = 'yellow',
        fill = True,
        popup = label,
        fill_color = 'blue',
        fill_opacity = 0.6
    ).add_to(peta_baru)

peta_baru

## Load Foursquare API Data

In this part, I use Foursquare API data to find which venues are available in each district.

### Import required libraries

In [22]:
import requests

### Initialize client_id and client_secret

In [1]:
# The code was removed by Watson Studio for sharing.
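
Since the credential cell is hidden, anyone rerunning the notebook needs to define CLIENT_ID and CLIENT_SECRET themselves. One common way, sketched below, is to read them from environment variables. The helper `load_foursquare_credentials` and the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumptions made for this sketch, not something the original notebook defines.

```python
import os

def load_foursquare_credentials(env=os.environ):
    # Hypothetical variable names; any secure source of credentials would work
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

# Example with an explicit mapping instead of the real environment
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
)
```

Calling `load_foursquare_credentials()` with no argument reads the real environment, which keeps the secrets out of the shared notebook.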

### Foursquare API Call

Let's try to find some places near Monumen Nasional.

In [24]:
VERSION = '20190605'
radius = 500
limit = 200

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        g.latlng[0], 
        g.latlng[1], 
        VERSION, 
        radius, 
        limit
    )

results = requests.get(url).json()['response']['venues']

# results # Uncomment to see the result as it is too long to be shown.
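
The query string can also be built from a dictionary, which makes each parameter easier to read and handles URL encoding automatically. The sketch below uses only the standard library and the same endpoint and parameter names as the cell above. The helper `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(client_id, client_secret, lat, lng, version, radius, limit):
    # Same endpoint and parameters as the cell above, encoded from a dict
    query = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(query)

url = build_search_url('my-id', 'my-secret', -6.17537, 106.82852, '20190605', 500, 200)

# Decode the query string again to check what will be sent
params = parse_qs(urlsplit(url).query)
print(params['ll'], params['radius'])  # ['-6.17537,106.82852'] ['500']
```

Decoding the URL back with `parse_qs` is a quick way to confirm the coordinates and limits before spending an API call.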

In [25]:
for i in range(20):
    print(results[i]['name'])

Lapangan Basket Monas
Monumen Nasional (MONAS)
Jogging Track MONAS
Sniper Game Mangga Dua Square
The Art of Liu Kuo Sung
Kungkow
Monas Sirkuit Road Race
Kantin DEPAG
Lapangan Basket Monas
Blue House Tirtosari 115A Gang  Tunjungsari Tembalang Semarang
Blue House Tirtosari 115A  Gang Tunjungsari Tembalang Semarang
Blue  House Tirtosari 115A
cHaNdRa's R00m
Peron 3 Stasiun Gambir
Balai Kartini
Pura Aditya Jaya Rawamangun
GARNISUN TETAP - I JAKARTA
KIA Town - The 21st IIMS 2013
K5 Night market
zha's chamber


Let's now do it for every district.

Define a function to get the category of each place.

In [26]:
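def getCatPlaces(result):
    # Default when the venue has no category
    place_cat = 'None'

    # Keep the name of the last category listed for the venue
    for res in result:
        try:
            place_cat = res['name']
        except (KeyError, TypeError):
            place_cat = 'None'
            
    return place_cat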

In [27]:
def getNearbyPlaces(districts, latitudes, longitudes, radius = 500, limit = 200):
    
    places_list = []
    for district, lat, long in zip(districts, latitudes, longitudes):
        print(district)

        # Build the search URL; radius and limit come from the function arguments
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                lat, 
                long, 
                VERSION, 
                radius, 
                limit
            )

        results = requests.get(url).json()['response']['venues']
        
        # keep only the relevant information for each nearby venue
        places_list.append([(
            district,
            lat, 
            long, 
            res['name'], 
            res['location']['lat'], 
            res['location']['lng'],  
            getCatPlaces(res['categories'])) for res in results])

    # Flatten the per-district lists into one row per venue
    nearby_places = pd.DataFrame([item for places in places_list for item in places])
    nearby_places.columns = ['District',
                  'District Latitude', 
                  'District Longitude', 
                  'Place', 
                  'Place Latitude', 
                  'Place Longitude',
                  'Place Category']    
    
    return nearby_places

jakarta_places = getNearbyPlaces(districts = df_jakarta['District'],
                                 latitudes = df_jakarta['Latitude'],
                                 longitudes = df_jakarta['Longitude']
                                )

Cakung
Cempaka Putih
Cengkareng
Cilandak
Cilincing
Cipayung
Ciracas
Duren Sawit
Gambir
Grogol Petamburan
Jagakarsa
Jatinegara
Johar Baru
Kalideres
Kebayoran Baru
Kebayoran Lama
Kebon Jeruk
Kelapa Gading
Kemayoran
Kembangan
Koja
Kramat Jati
Makasar
Mampang Prapatan
Matraman
Menteng
Pademangan
Palmerah
Pancoran
Pasar Minggu
Pasar Rebo
Penjaringan
Pesanggrahan
Pulogadung
Sawah Besar
Senen
Setiabudi
Taman Sari
Tambora
Tanah Abang
Tanjung Priok
Tebet
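
To see how each API response turns into dataframe rows without calling the API, the row-building step can be run on a mocked response. This is a sketch under assumptions: `venue_rows` is a hypothetical helper, the mocked venues only contain the fields the notebook reads, and it takes the first listed category, whereas getCatPlaces above keeps the last one.

```python
def venue_rows(district, lat, lng, venues):
    rows = []
    for venue in venues:
        categories = venue.get('categories', [])
        # Use the first listed category, or 'None' when the venue has none
        category = categories[0]['name'] if categories else 'None'
        rows.append((district, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Mocked venues shaped like the fields the notebook reads from the API response
mock_venues = [
    {'name': 'Cuppa Coffee Inc', 'location': {'lat': -6.190733, 'lng': 106.924975},
     'categories': [{'name': 'Coffee Shop'}]},
    {'name': 'Nokia', 'location': {'lat': -6.190535, 'lng': 106.925021},
     'categories': []},
]

rows = venue_rows('Cakung', -6.190411, 106.925974, mock_venues)
print([row[-1] for row in rows])  # ['Coffee Shop', 'None']
```

Each tuple has the same seven fields, in the same order, as the columns of the dataframe returned by getNearbyPlaces.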


In [28]:
jakarta_places.head(10)

Unnamed: 0,District,District Latitude,District Longitude,Place,Place Latitude,Place Longitude,Place Category
0,Cakung,-6.190411,106.925974,Cuppa Coffee Inc,-6.190733,106.924975,Coffee Shop
1,Cakung,-6.190411,106.925974,Viva Bowling Alley,-6.190028,106.924986,Bowling Alley
2,Cakung,-6.190411,106.925974,PT Yamaha Indonesia Motor Mfg,-6.191781,106.924502,Factory
3,Cakung,-6.190411,106.925974,PT. SHARP ELECTRONIC INDONESIA,-6.189906,106.925482,Building
4,Cakung,-6.190411,106.925974,Yamaha Indonesia Motor Manufacturing,-6.192822,106.924353,Office
5,Cakung,-6.190411,106.925974,SEID Procurement Office,-6.189693,106.925442,
6,Cakung,-6.190411,106.925974,Nasi Goreng Lia,-6.191025,106.92324,
7,Cakung,-6.190411,106.925974,TDR Technology Center,-6.189243,106.924888,Building
8,Cakung,-6.190411,106.925974,Nokia,-6.190535,106.925021,
9,Cakung,-6.190411,106.925974,PT Yamaha Indonesia Motor Manufacturing (Main ...,-6.191749,106.925148,


In [29]:
jakarta_places.tail(10)

Unnamed: 0,District,District Latitude,District Longitude,Place,Place Latitude,Place Longitude,Place Category
5912,Tebet,-6.225883,106.854572,"Villa Ombak, Gili Trawangan",-6.225738,106.854998,Resort
5913,Tebet,-6.225883,106.854572,PB Taxand Auditorium,-6.226355,106.854634,
5914,Tebet,-6.225883,106.854572,Bakul Tukul,-6.225808,106.854906,Restaurant
5915,Tebet,-6.225883,106.854572,Ammar Computer Room,-6.225721,106.854932,Arcade
5916,Tebet,-6.225883,106.854572,Studio TRX,-6.226329,106.854644,
5917,Tebet,-6.225883,106.854572,Tomyam papa,-6.22541,106.854925,
5918,Tebet,-6.225883,106.854572,dr.nina s,-6.225697,106.85503,Doctor's Office
5919,Tebet,-6.225883,106.854572,Klinik gigi Tebet Dr Maya,-6.22537,106.854872,Daycare
5920,Tebet,-6.225883,106.854572,Sop buntut warung Niot,-6.225698,106.854794,Asian Restaurant
5921,Tebet,-6.225883,106.854572,Visi Ruang C,-6.225301,106.854556,College Classroom


Hmm, it seems there are many places without a category. It is better to remove them from the dataset.

In [30]:
jakarta_places = jakarta_places[jakarta_places['Place Category'] != 'None'].reset_index(drop = True)
jakarta_places.head(10)

Unnamed: 0,District,District Latitude,District Longitude,Place,Place Latitude,Place Longitude,Place Category
0,Cakung,-6.190411,106.925974,Cuppa Coffee Inc,-6.190733,106.924975,Coffee Shop
1,Cakung,-6.190411,106.925974,Viva Bowling Alley,-6.190028,106.924986,Bowling Alley
2,Cakung,-6.190411,106.925974,PT Yamaha Indonesia Motor Mfg,-6.191781,106.924502,Factory
3,Cakung,-6.190411,106.925974,PT. SHARP ELECTRONIC INDONESIA,-6.189906,106.925482,Building
4,Cakung,-6.190411,106.925974,Yamaha Indonesia Motor Manufacturing,-6.192822,106.924353,Office
5,Cakung,-6.190411,106.925974,TDR Technology Center,-6.189243,106.924888,Building
6,Cakung,-6.190411,106.925974,The Summit,-6.158803,106.908952,Residential Building (Apartment / Condo)
7,Cakung,-6.190411,106.925974,Smoking Room Yamaha Pulogadung Main Building,-6.192241,106.924825,Speakeasy
8,Cakung,-6.190411,106.925974,SEID HQ Pulo Gadung,-6.189927,106.925718,Office
9,Cakung,-6.190411,106.925974,TDR Office n Manufacture,-6.19149,106.925705,Factory


Ah, this one looks better

In [31]:
jakarta_places.shape

(3165, 7)

I removed 2,757 of the 5,922 places because they had no category, which is close to half of the whole dataset.

In [32]:
print("There are {} unique categories.".format(len(jakarta_places['Place Category'].unique())))

There are 335 unique categories.


Oh wow, that's actually a lot of categories.

### Final Dataframe Processing

Apply one-hot encoding to the place category.

In [33]:
jakarta_onehot = pd.get_dummies(jakarta_places[['Place Category']], prefix="", prefix_sep="")

jakarta_onehot['District'] = jakarta_places['District']

fixed_columns = [jakarta_onehot.columns[-1]] + list(jakarta_onehot.columns[:-1])
jakarta_onehot = jakarta_onehot[fixed_columns]

jakarta_onehot.head()

Unnamed: 0,District,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,...,Vietnamese Restaurant,Vineyard,Volleyball Court,Voting Booth,Warehouse,Water Park,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group the one-hot encoded result by district and sum it up.

In [34]:
jakarta_grouped = jakarta_onehot.groupby('District').sum().reset_index()

In [35]:
jakarta_grouped.head()

Unnamed: 0,District,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,...,Vietnamese Restaurant,Vineyard,Volleyball Court,Voting Booth,Warehouse,Water Park,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Cempaka Putih,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Cengkareng,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cilandak,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Cilincing,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
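
The one-hot encoding followed by a grouped sum produces a frequency table of categories per district. As a cross-check, `pd.crosstab` builds the same table in one call. The sketch below compares both on toy data; the rows are illustrative values, not the real dataset.

```python
import pandas as pd

# Toy data with the same two columns used above (values are illustrative)
places = pd.DataFrame({
    'District': ['Cakung', 'Cakung', 'Tebet', 'Tebet', 'Tebet'],
    'Place Category': ['Factory', 'Factory', 'Restaurant', 'Arcade', 'Restaurant'],
})

# One-hot encode, then sum per district (the approach used in this notebook)
onehot = pd.get_dummies(places['Place Category'])
onehot['District'] = places['District']
grouped = onehot.groupby('District').sum()

# pd.crosstab counts the District / Place Category pairs directly
counts = pd.crosstab(places['District'], places['Place Category'])

print(counts.loc['Tebet', 'Restaurant'])  # 2
```

Both tables hold the same counts, so either approach can feed the scoring step.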


Apply one-hot encoding to the density.

In [36]:
density_onehot = pd.get_dummies(df_jakarta[['Density']], prefix="", prefix_sep="")

density_onehot['District'] = df_jakarta['District']

density_onehot

fixed_columns = [density_onehot.columns[-1]] + list(density_onehot.columns[:-1])
density_onehot = density_onehot[fixed_columns]

density_onehot.head(5)

Unnamed: 0,District,High Density,Low Density,Medium Density
0,Cakung,0,1,0
1,Cempaka Putih,0,0,1
2,Cengkareng,0,0,1
3,Cilandak,0,1,0
4,Cilincing,0,1,0


Join jakarta_grouped and density_onehot into the same dataframe.

In [37]:
jakarta_grouped = jakarta_grouped.join(density_onehot.set_index('District'), on = 'District', how='left')
jakarta_grouped.head(5)

Unnamed: 0,District,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,...,Voting Booth,Warehouse,Water Park,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit,High Density,Low Density,Medium Density
0,Cakung,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,Cempaka Putih,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Cengkareng,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,Cilandak,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,Cilincing,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,1,0


## Giving Recommendation to User

### Initialize list which contains the user chosen category

Let's say the user wants to have Japanese Restaurant, Salon / Barbershop, High School, and Chinese Restaurant venues near the potential living district, in a low density area.

In [38]:
wanted_list = ['Japanese Restaurant', 'Salon / Barbershop', 'High School', 'Chinese Restaurant', 'Low Density']

### Initialize a dataframe with zeros

Now I initialize a dataframe filled with zeros. It has the same rows and columns as jakarta_grouped, except for the District column.

In [39]:
wanted = pd.DataFrame(np.zeros((jakarta_grouped.shape[0], jakarta_grouped.shape[1] - 1), dtype = int))
wanted.columns = jakarta_grouped.columns[1:]

wanted.head()

Unnamed: 0,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Animal Shelter,...,Voting Booth,Warehouse,Water Park,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit,High Density,Low Density,Medium Density
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Change the value for each category the user wanted.

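Set the value to 1 for each place category the user wants.

The wanted density category is also set to 1 in the cell below, so a matching density counts the same as one matching venue.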

In [40]:
wanted[wanted_list] = 1

wanted[wanted_list].head()

Unnamed: 0,Japanese Restaurant,Salon / Barbershop,High School,Chinese Restaurant,Low Density
0,1,1,1,1,1
1,1,1,1,1,1
2,1,1,1,1,1
3,1,1,1,1,1
4,1,1,1,1,1


### Element-wise multiplication

Do an element-wise multiplication between the frequency table and wanted table

In [41]:
multi_result = jakarta_grouped.iloc[:, 1:].multiply(wanted)

multi_result[wanted_list].head()

Unnamed: 0,Japanese Restaurant,Salon / Barbershop,High School,Chinese Restaurant,Low Density
0,0,0,0,0,1
1,0,2,0,0,0
2,0,0,0,0,0
3,0,1,0,0,1
4,0,0,0,0,1


None of the first 5 districts has a Japanese restaurant, high school, or Chinese restaurant. Luckily, a salon / barbershop is available in the 2nd and 4th districts.

Add the district name to each row, then sum the values of each row and label the sum 'Total Freq'.

In [42]:
multi_result['District'] = jakarta_grouped['District']

# Sum only the numeric columns so the 'District' text column is skipped
multi_result['Total Freq'] = multi_result.sum(axis = 1, numeric_only = True)

In [43]:
multi_result.head()

Unnamed: 0,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Animal Shelter,...,Water Park,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit,High Density,Low Density,Medium Density,District,Total Freq
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,Cakung,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Cempaka Putih,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Cengkareng,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,Cilandak,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,Cilincing,1
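
Because the wanted table holds only 0 and 1, multiplying by it and summing each row gives the same result as selecting the wanted columns and summing them. The sketch below checks this on toy data; the frequency values and the variable names are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Toy frequency table (illustrative values, not the real data)
grouped = pd.DataFrame({
    'District': ['A', 'B', 'C'],
    'Chinese Restaurant': [0, 6, 1],
    'High School': [1, 3, 0],
    'Warehouse': [5, 0, 2],
    'Low Density': [1, 0, 1],
})
wanted_list = ['Chinese Restaurant', 'High School', 'Low Density']

# Approach used in this notebook: multiply by a 0/1 mask, then sum each row
mask = pd.DataFrame(0, index=grouped.index, columns=grouped.columns[1:])
mask[wanted_list] = 1
via_mask = grouped.iloc[:, 1:].multiply(mask).sum(axis=1)

# Shortcut: select the wanted columns and sum them
via_selection = grouped[wanted_list].sum(axis=1)

print(via_selection.tolist())  # [2, 9, 2]
```

The mask approach becomes useful once the weights differ from 1, for example if the density preference should count for more or less than a single venue.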


### Join dataframe

Join df_jakarta dataframe and multi_result dataframe by its district

In [44]:
jakarta_joined = df_jakarta.join(multi_result.iloc[:, -2:].set_index('District'), on = 'District')

jakarta_joined.head()

Unnamed: 0,City,District,Total,id,Latitude,Longitude,Area Size (m2),Density Amount,Density,Total Freq
0,Jakarta Timur,Cakung,523159,dki_kecamatan.1,-6.190411,106.925974,42.27,12376.6,Low Density,1
1,Jakarta Pusat,Cempaka Putih,84864,dki_kecamatan.2,-6.181864,106.869431,4.7,18056.17,Medium Density,2
2,Jakarta Barat,Cengkareng,555972,dki_kecamatan.3,-6.156295,106.735035,26.55,20940.56,Medium Density,0
3,Jakarta Selatan,Cilandak,197524,dki_kecamatan.4,-6.294269,106.790463,18.16,10876.87,Low Density,2
4,Jakarta Utara,Cilincing,397467,dki_kecamatan.5,-6.125917,106.940374,37.7,10542.89,Low Density,1


### Value sorting by 'Total Freq'

Sort the value by the highest Total Freq.

In [45]:
jakarta_joined = jakarta_joined.sort_values(by = 'Total Freq', ascending = False).reset_index(drop = True)

jakarta_joined.head(7)

Unnamed: 0,City,District,Total,id,Latitude,Longitude,Area Size (m2),Density Amount,Density,Total Freq
0,Jakarta Barat,Taman Sari,110008,dki_kecamatan.38,-6.147465,106.817341,7.74,14212.92,Low Density,12
1,Jakarta Barat,Grogol Petamburan,232697,dki_kecamatan.10,-6.163629,106.786118,9.99,23292.99,Medium Density,12
2,Jakarta Pusat,Sawah Besar,100461,dki_kecamatan.35,-6.152739,106.834178,6.16,16308.6,Low Density,11
3,Jakarta Utara,Kelapa Gading,156664,dki_kecamatan.18,-6.163512,106.909511,16.12,9718.61,Low Density,8
4,Jakarta Selatan,Tebet,210356,dki_kecamatan.42,-6.225883,106.854572,9.03,23295.24,Medium Density,7
5,Jakarta Utara,Tanjung Priok,386264,dki_kecamatan.41,-6.132658,106.872981,25.12,15376.75,Low Density,6
6,Jakarta Barat,Tambora,239474,dki_kecamatan.39,-6.148,106.804356,5.4,44347.04,High Density,5


I add a column named 'Recommended': for the top 7 results the value is 1, otherwise it is 0. This column is just a binary substitute for 'Yes' and 'No'.

In [46]:
recommended = []

for i in range(jakarta_joined.shape[0]):
    if i < 7:
        recommended.append(1)
    else:
        recommended.append(0)
    
jakarta_joined['Recommended'] = recommended
    
jakarta_joined.head(10)

Unnamed: 0,City,District,Total,id,Latitude,Longitude,Area Size (m2),Density Amount,Density,Total Freq,Recommended
0,Jakarta Barat,Taman Sari,110008,dki_kecamatan.38,-6.147465,106.817341,7.74,14212.92,Low Density,12,1
1,Jakarta Barat,Grogol Petamburan,232697,dki_kecamatan.10,-6.163629,106.786118,9.99,23292.99,Medium Density,12,1
2,Jakarta Pusat,Sawah Besar,100461,dki_kecamatan.35,-6.152739,106.834178,6.16,16308.6,Low Density,11,1
3,Jakarta Utara,Kelapa Gading,156664,dki_kecamatan.18,-6.163512,106.909511,16.12,9718.61,Low Density,8,1
4,Jakarta Selatan,Tebet,210356,dki_kecamatan.42,-6.225883,106.854572,9.03,23295.24,Medium Density,7,1
5,Jakarta Utara,Tanjung Priok,386264,dki_kecamatan.41,-6.132658,106.872981,25.12,15376.75,Low Density,6,1
6,Jakarta Barat,Tambora,239474,dki_kecamatan.39,-6.148,106.804356,5.4,44347.04,High Density,5,1
7,Jakarta Utara,Penjaringan,328053,dki_kecamatan.32,-6.111963,106.765382,35.48,9246.14,Low Density,5,0
8,Jakarta Selatan,Kebayoran Baru,143098,dki_kecamatan.15,-6.246963,106.799512,12.92,11075.7,Low Density,4,0
9,Jakarta Timur,Pasar Rebo,204678,dki_kecamatan.31,-6.326931,106.84121,12.97,15780.88,Low Density,4,0
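
In the table above, Tambora (7th) and Penjaringan (8th) both have a Total Freq of 5, so which one lands in the top 7 depends only on the sort order. One possible way to handle this, sketched below, is to keep every district tied with the 7th score. The helper `top_n_with_ties` is hypothetical and is not part of the original method. The scores are copied from the table above.

```python
def top_n_with_ties(ranked, n):
    # ranked: list of (district, score) pairs sorted by score, highest first
    if len(ranked) <= n:
        return list(ranked)
    cutoff = ranked[n - 1][1]
    # Keep every district whose score is at least the n-th score
    return [(district, score) for district, score in ranked if score >= cutoff]

# 'Total Freq' values copied from the table above
ranked = [
    ('Taman Sari', 12), ('Grogol Petamburan', 12), ('Sawah Besar', 11),
    ('Kelapa Gading', 8), ('Tebet', 7), ('Tanjung Priok', 6),
    ('Tambora', 5), ('Penjaringan', 5), ('Kebayoran Baru', 4), ('Pasar Rebo', 4),
]

selected = top_n_with_ties(ranked, 7)
print(len(selected))  # 8
```

With ties included, the list grows to 8 districts for this example, which may or may not be what the user expects from a "top 7".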


## Visualize Top 7 District

### Visualize with circle marker.

In [47]:
# Map for the top 7 districts, shown with circle markers only
top7_circle_map = folium.Map(
    location = [g.latlng[0], g.latlng[1]],
    zoom_start = 11
)

i = 1

for lat, long, dis in zip(jakarta_joined['Latitude'].iloc[0:7], jakarta_joined['Longitude'].iloc[0:7], jakarta_joined['District'].iloc[0:7]):
    label = 'Num ' + str(i) + '. ' + dis
    
    i += 1
    
    folium.CircleMarker(
        location = [lat, long],
        radius = 5,
        color = 'yellow',
        fill = True,
        popup = label,
        fill_color = 'blue',
        fill_opacity = 0.6
    ).add_to(top7_circle_map)


top7_circle_map

### Visualize with choropleth and circle marker

In [48]:
top7_map = folium.Map(
    location = [g.latlng[0], g.latlng[1]],
    zoom_start = 11
)

folium.Choropleth(
    geo_data = jakarta_geo,
    data = jakarta_joined,
    columns = ['id', 'Recommended'],
    key_on = 'feature.id',
    fill_color = 'YlGn',
    fill_opacity = 0.7,
    line_opacity = 0.2,
).add_to(top7_map)

i = 1

for lat, long, dis, dense in zip(jakarta_joined['Latitude'].iloc[0:7], jakarta_joined['Longitude'].iloc[0:7], jakarta_joined['District'].iloc[0:7], jakarta_joined['Density'].iloc[0:7]):
    label = 'Num ' + str(i) + '. ' + dis + ' ' + dense
    
    i += 1
    
    folium.CircleMarker(
        location = [lat, long],
        radius = 5,
        color = 'yellow',
        fill = True,
        popup = label,
        fill_color = 'blue',
        fill_opacity = 0.6
    ).add_to(top7_map)


top7_map

# 4. Result

In [49]:
print("Top 7 districts for your chosen categories are: ")

for i, district in enumerate(jakarta_joined['District'].iloc[0:7]):
    print(str(i + 1) + '. ' + district)

Top 7 districts for your chosen categories are: 
1. Taman Sari
2. Grogol Petamburan
3. Sawah Besar
4. Kelapa Gading
5. Tebet
6. Tanjung Priok
7. Tambora


# 5. Discussion

As mentioned in the introduction, I would like to have land value as another parameter for the recommendation. It would be more interesting to include it. In the future, when the data is easier to find, anyone can add it as an extra parameter for a more accurate result.

The methodology here is only a basic calculation (the sum of the places the user is interested in). A more sophisticated method could give an even more accurate result.

# 6. Conclusion

With this recommendation, users can choose a living district that better suits their interests.

This is the end of my Data Science Capstone Final Project notebook.

Thanks for reading.