# SIDEKA: Keyword Extraction for Each "Desa"

## Background

Indonesia has thousands of villages ("desa" in Bahasa Indonesia), the lowest level of government administration. While villages are often portrayed as beautiful and comfortable places to live, poverty and lagging infrastructure development remain serious concerns. Since 2014, the Government of Indonesia has made efforts to improve social welfare and quality of life, as mandated by the Constitution. Lack of available information is one major contributor to this problem, so the Government, through the Village and Regional Empowerment Initiatives ("Badan Prakarsa Pemberdayaan Desa dan Kawasan", BP2DK), launched the "Sistem Informasi Desa dan Kawasan" (hereafter referred to as "SIDEKA").

As of May 2018, SIDEKA had been adopted by around 4,956 villages across Indonesia. It gives villages a platform for monitoring their activities. The resulting data can then be compiled and used by the local government to draw up Village Mid-Term Development Plans ("Rencana Pembangunan Jangka Menengah Desa", RPJMDes), Village Government Activity Plans ("Rencana Kegiatan Pemerintah Desa", RKPDes), and Village Revenue and Expenditure Budget Plans ("Rencana Anggaran Pendapatan dan Belanja Desa", RAPBDes).

Going forward, data from SIDEKA implementations should also be usable by higher levels of government, such as the District, Province, and Central Government, so that they can tailor policy to each village's "uniqueness".

## Problems to be Tackled

The problem is capturing this uniqueness. We gather all the information available on each village's website, specifically from its "latest news" section, compile it, and extract keywords that are unique to the village and can serve as its defining characteristics.

Example: if the defining keyword for Desa X is "Rengginang Ketan", we can point policy makers at higher levels of government to it, since it may mean that the village's economy leans heavily on producing and selling "Rengginang Ketan". Subsequent policies for this village should accommodate that fact (e.g. more incentives for villagers to produce and sell "Rengginang Ketan", more tax leeway, etc.).

## Methods

Here, we use three methods commonly used to find the most important words across text documents:

1. Word count
2. Term Frequency - Inverse Document Frequency (TF-IDF)
3. Rapid Automatic Keyword Extraction (RAKE)

The document set used here is a collection of articles scraped from Desa Pejeng (http://www.pejeng.desa.id/post/). The methods are implemented in Python.

In [1]:
# Import required packages
import glob
import nltk
import operator
import csv
import argparse
import os

# Import word tokenizer packages
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from string import punctuation
from pprint import pprint

# Import Dictionary, TfidfModel from gensim
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

# Import RAKE package
from rake_nltk import Rake

In [2]:
# Load 'scraped' article body: desa pejeng
source_dir = 'pejeng_articles/'

In [3]:
# Load the news articles, sorted by last modification time: articles
file_list = sorted(glob.glob(os.path.join(source_dir, '*.txt')), key=os.path.getmtime)
articles = []
for f in file_list:
    # Close each file after reading it
    with open(f, 'r') as fh:
        articles.append(fh.read())

In [4]:
# Preprocess articles: lowercasing and tokenizing all words
articles_lower_tokenize = [word_tokenize(t.lower())
                           for t in articles]

In [5]:
# Preprocess articles: removing 'indonesian' stopwords: articles_no_stop
stopwords_indonesian = stopwords.words('indonesian')
articles_no_stop = [[t for t in sublist if t not in stopwords_indonesian]
                    for sublist in articles_lower_tokenize]

In [6]:
# Preprocess articles: removing empty tokens, quote tokens (`` and '') and punctuation
articles_no_empty = [[t for t in sublist if t]
                     for sublist in articles_no_stop]
articles_no_empty_intermediate_1 = [[t for t in sublist if '``' not in t]
                                    for sublist in articles_no_empty]
articles_no_empty_intermediate_2 = [[t for t in sublist if '\'\'' not in t]
                                    for sublist in articles_no_empty_intermediate_1]
articles_cleaned = [[t for t in sublist if t not in punctuation]
                    for sublist in articles_no_empty_intermediate_2]
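
The filtering cells above can also be read as one pass over each token list. The block below is a minimal sketch of that idea, not the notebook's actual code: `clean_tokens`, `STOPWORDS`, `QUOTE_MARKS` and the sample tokens are hypothetical names and data introduced only for illustration, and the tiny stopword set stands in for NLTK's Indonesian list.

```python
from string import punctuation

# Hypothetical stand-ins: the notebook takes its stopwords from
# nltk's stopwords.words('indonesian').
STOPWORDS = {"yang", "di", "dan"}
QUOTE_MARKS = ("``", "''")


def clean_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords, empty strings, quote tokens and punctuation tokens."""
    cleaned = []
    for token in tokens:
        if not token or token in stopwords:
            continue
        if any(mark in token for mark in QUOTE_MARKS):
            continue
        # Substring test against string.punctuation, as in the cell above
        if token in punctuation:
            continue
        cleaned.append(token)
    return cleaned


sample = ["warga", "di", "desa", "``", "pejeng", "''", ",", "dan", "pura", "."]
print(clean_tokens(sample))
```

Doing the filtering in one function keeps the intermediate lists out of the namespace and makes the order of the checks explicit.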

### Simple Bag-of-Words Model

In [7]:
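## Looking up the 10 most common words in the corpus
# Create a counter object: counter
# Note: set(words) counts each word once per article, so each count is the
# number of articles containing the word (document frequency)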
counter = Counter([word for words in articles_cleaned for word in set(words)])
print('-----' * 8)
print("Top 10 Words according to frequency:")
print('-----' * 8)
print(counter.most_common(10), '\n')

----------------------------------------
Top 10 Words according to frequency:
----------------------------------------
[('pejeng', 88), ('dewa', 66), ('desa', 50), ('suamba', 43), ('banjar', 43), ('salah', 40), ('lapangan', 39), ('pura', 37), ('warga', 37), ('anak-anak', 36)] 
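
Because the counter is built from `set(words)`, the numbers above are per-article counts rather than raw occurrence counts. The sketch below shows the difference on a hypothetical three-article corpus (the `docs` data is made up for illustration and is not taken from the Desa Pejeng articles).

```python
from collections import Counter

# Hypothetical tokenized articles, for illustration only
docs = [
    ["pura", "pura", "pura", "warga"],
    ["warga", "desa"],
    ["desa", "warga", "lapangan"],
]

# Document frequency: each article contributes at most 1 per word
doc_freq = Counter(word for doc in docs for word in set(doc))

# Term frequency: every occurrence counts
term_freq = Counter(word for doc in docs for word in doc)

print("pura :", doc_freq["pura"], term_freq["pura"])
print("warga:", doc_freq["warga"], term_freq["warga"])
```

A word repeated many times in a single article ranks high by term frequency but low by document frequency, so the two rankings can disagree.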



### TF-IDF (Using Gensim)

In [8]:
# Create a gensim corpus and then apply Tfidf to that corpus
# Create a (gensim) dictionary object from the articles_cleaned: dictionary
dictionary = Dictionary(articles_cleaned)

In [9]:
# Create a gensim corpus
corpus = [dictionary.doc2bow(article) for article in articles_cleaned]

In [10]:
# Create a tfidf object from corpus
tfidf = TfidfModel(corpus)
print('-----' * 8)
print("TF-IDF Object from Corpus")
print('-----' * 8)
print(tfidf, '\n')

----------------------------------------
TF-IDF Object from Corpus
----------------------------------------
TfidfModel(num_docs=89, num_nnz=11160) 



In [11]:
# Get the TFIDF Weights of all terms found in corpus
#  print as list of tuples, in descending order 
# Create a container for the list of tuples: tfidf_tuples
tfidf_tuples = []

In [12]:
# Loop over the cleaned articles
# Collect the tfidf weight of every term in every article
for i, doc in enumerate(corpus):
    tfidf_weights = tfidf[doc]
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
    for term_id, weight in sorted_tfidf_weights:
        tfidf_tuples.append((dictionary.get(term_id), term_id, weight, 'corpus_{}'.format(i+1)))
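
If only the strongest terms of each article are wanted, the weights can be trimmed per article before they are pooled. The block below is a small sketch of that idea under stated assumptions: `top_terms` and `weights_per_article` are hypothetical names, and the weights are invented values in the same (term, weight) shape, not results from the corpus.

```python
import heapq

# Hypothetical (term, weight) pairs per article, for illustration only
weights_per_article = {
    "corpus_1": [("elpiji", 0.85), ("warga", 0.10), ("desa", 0.02)],
    "corpus_2": [("kulkul", 0.78), ("banjar", 0.31), ("pura", 0.29)],
}


def top_terms(weights, k=2):
    """Return the k highest-weighted (term, weight) pairs, best first."""
    return heapq.nlargest(k, weights, key=lambda pair: pair[1])


top_per_article = {name: top_terms(w) for name, w in weights_per_article.items()}
print(top_per_article)
```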

In [13]:
# Sort the tfidf_tuples by weight, descending (the first sort breaks ties by term, since list.sort is stable)
tfidf_tuples.sort(key=operator.itemgetter(0), reverse=True)
tfidf_tuples.sort(key=operator.itemgetter(2), reverse=True)
print('-----' * 8)
print('Term and Weight for entire corpora')
print('-----' * 8)

# Get the top 5 words based on TF-IDF
pprint(tfidf_tuples[:5])

----------------------------------------
Term and Weight for entire corpora
----------------------------------------
[('elpiji', 2633, 0.85133269822700652, 'corpus_46'),
 ('kulkul', 834, 0.78491690292548111, 'corpus_48'),
 ('pengungsi', 2785, 0.72284623796788605, 'corpus_51'),
 ('ogoh-ogoh', 858, 0.66108899215597017, 'corpus_8'),
 ('topeng', 125, 0.59082726368936045, 'corpus_27')]
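
To make the weights above easier to interpret, the sketch below recomputes TF-IDF by hand on a hypothetical toy corpus. It assumes what we understand to be gensim's default scheme (raw term count times log2(N / df), then scaling each document vector to unit length); it is an illustration of that formula, not gensim's implementation, and `tfidf_weights` and `docs` are names introduced here.

```python
import math
from collections import Counter

# Hypothetical tokenized articles, for illustration only
docs = [
    ["elpiji", "elpiji", "warga", "desa"],
    ["kulkul", "warga", "desa"],
    ["topeng", "desa"],
]

n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf_weights(doc):
    """Return {term: weight} using tf * log2(N / df), scaled to unit L2 norm."""
    term_freq = Counter(doc)
    raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
    # A term that occurs in every document has idf 0 and carries no weight
    raw = {t: w for t, w in raw.items() if w > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


weights = tfidf_weights(docs[0])
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

Under this scheme a weight close to 1 means that a single rare term dominates its article, which is consistent with every weight above lying between 0 and 1.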


### RAKE Approach

In [14]:
# Merge all articles into one
articles_merged = ' '.join(articles)

In [15]:
# Helper function to lowercase the text and strip stopwords and punctuation, one sentence at a time
def preprocess(text):
    
    # Import packages
    import string, re
    from nltk.tokenize import sent_tokenize, RegexpTokenizer
    
    # Create list of sentences from text
    sent_tokenize_list = sent_tokenize(text)
    
    # Pattern matching any punctuation character except '.'
    punct_pat = string.punctuation.replace('.', '')
    pattern = r"[{}]".format(re.escape(punct_pat)) # escape so '-', ']' etc. are literal
    
    # Build the tokenizer and the stopword set once, not once per word
    tokenizer = RegexpTokenizer(r'\w+')
    stopwords_set = set(stopwords.words('indonesian'))
    
    sent_tokenize_list_cleaned = []
    for sent in sent_tokenize_list:
        sent_lower = sent.lower()
        tokens = tokenizer.tokenize(sent_lower)
        filtered_words_sent = [w for w in tokens if w not in stopwords_set]
        filtered_sent = ' '.join(filtered_words_sent)
        sent_cleaned = re.sub(pattern, ' ', filtered_sent) + '.'
        sent_tokenize_list_cleaned.append(sent_cleaned)
    
    # Return the cleaned sentences joined into one string
    return ' '.join(sent_tokenize_list_cleaned)

# Get the cleaned articles now
# Note: this rebinds articles_cleaned from a list of token lists to a single string
articles_cleaned = preprocess(articles_merged)

In [21]:
# Uses Indonesian stopwords from NLTK, and all punctuation characters
r = Rake(language='indonesian')

# Extract keywords from the raw (not pre-processed) articles
r.extract_keywords_from_text(articles_merged)

# Get keyword phrases ranked highest to lowest
rake_score_precleaned = r.get_ranked_phrases_with_scores()[:5]
pprint(rake_score_precleaned)

[(87.24689240142831,
  'krama desa pakraman jero kuta pejeng melaksanakan upacara bhuta yadnya '
  'mecaru tawur kesanga bertepatan'),
 (84.2042108805315,
  'persembahyangan sekda ida bagus giri putra menghaturkan dana punia diterima '
  'langsung ngakan suardita'),
 (75.36680216802168,
  'guru nabe ida pedanda manobawa griya bitera baleran menayakan terkait '
  'motivasi'),
 (72.83458062709072,
  'anak perguruan smp santi yoga pejeng gotong royong membersihkan areal tugu '
  'pahlawan sapta dharma'),
 (70.14504452499547,
  'diterima langsung bendesa pakraman jero kuta pejeng cokorda gede putra '
  'pemayun')]
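
The scores above grow with phrase length, which follows from how RAKE scores words. The block below is a simplified sketch of the degree-to-frequency scoring, assuming candidate phrases are split at stopwords and punctuation. It is not `rake_nltk`'s implementation (its tokenization differs), and `candidate_phrases`, `rake_scores`, the tiny stopword set and the sample text are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stopword set, standing in for NLTK's Indonesian list
STOPWORDS = {"di", "dan", "yang"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords, broken at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(((s, " ".join(p)) for p, s in scored.items()), reverse=True)


text = "Warga desa pejeng melaksanakan upacara di pura. Warga dan pura."
ranked = rake_scores(text)
for score, phrase in ranked:
    print(round(score, 2), phrase)
```

Each word's degree increases with the length of the phrases it appears in, so a long uninterrupted run of non-stopwords receives a high score even when none of its words is especially distinctive.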


In [22]:
# Uses Indonesian stopwords from NLTK, and all punctuation characters
r = Rake(language='indonesian')

# Extract keywords from the pre-processed articles
r.extract_keywords_from_text(articles_cleaned)

# Get keyword phrases ranked highest to lowest
rake_score_cleaned = r.get_ranked_phrases_with_scores()[:5]
pprint(rake_score_cleaned)

[(1832.8510472913028,
  'susunan pengurus stt yowana kertha yoga massa bhakti 2016 2021 ketua a a '
  'gde juliana saputra wakil ketua i wayan agus widiana bendahara i ni kadek '
  'nilayanti bendahara ii i gede dandy saputra sekretaris i i gusti ayu '
  'manikasari sekretaris ii a a intan restika dewi serangkaian memeriahkan hut '
  '47 yowana kertha yoga digelar lomba tarik tambang panjat pinang lainya'),
 (1095.2418147651183,
  'sekretaris pdhb gianyar ida bagus kaimana sesuai surat keputusan pdhb '
  'madiksa medwijati surat pernyataan guru nabe napak ida pedanda manobawa '
  'girya bitra baleran upacara diksa diksita gelar bhiseka ida pedanda gde '
  'buruan manuaba ida bagus cahyadi ida pedanda istri buruan ida ayu adnyani'),
 (1048.3142144646847,
  'proyek rehabilitasi jalan banjar pande banjar puseh paket rehabilitasi '
  'jalan pura dalem tarukan pejeng rehabilitasi jalan lingkungan desa sanding '
  'rehabilitasi jalan lingkungan pengembungan pejeng kangin nilai anggaran rp '
