<a href="https://colab.research.google.com/github/SalikFillah/Topic-Modelling/blob/main/Latent_Dirichlet_Allocation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Instalasi Modul

In [1]:
!pip install nltk
!pip install Sastrawi
!pip install regex
!pip install unidecode
!pip install textblob
!pip install tqdm
!pip install scapy
!pip install python-crfsuite
!pip install gensim
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.7/209.7 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m235.9/235.9 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: unidecode
Successfully installed unidecode-1.3.6
Looking in indexes: https://pypi.org

## Import Modul

In [2]:
import pandas as pd
import numpy as np
import string
import nltk

# modul stopword & stemmer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# modul preprocessing
nltk.download('punkt')
import re
from unidecode import unidecode
from html import unescape
from textblob import TextBlob
from tqdm import tqdm

# modul lemma & pos-tag
import spacy
from spacy.lang.id import Indonesian
from nltk.tag import CRFTagger
ct = CRFTagger()

# modul LDA
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# modul DRT
from sklearn.manifold import TSNE

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load Data

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/SalikFillah/Topic-Modelling/main/anies_baswedan_2024.csv')
df.head()

Unnamed: 0,title,href,body,date
0,"ANIES BASWEDAN INDONESIA on Instagram: ""Imam B...",https://www.instagram.com/p/CHKVWZrnYOT/,"940 likes, 43 comments - ANIES BASWEDAN INDONE...",2023-01-02
1,"BuddyKu Headlines on Instagram: ""Buddies! Baka...",https://www.instagram.com/p/Cn0MrvFPJdB/,Bakal Calon Presiden (Bacapres) 2024 dari Part...,2023-01-02
2,"SINDOnews on Instagram: ""Meski memiliki elekta...",https://www.instagram.com/p/CeAjzwnP0wi/,"309 likes, 29 comments - SINDOnews (@sindonews...",2023-01-02
3,"ANIES BASWEDAN INDONESIA on Instagram: ""Denger...",https://www.instagram.com/p/ClEVrd9IOEe/,"1,286 likes, 70 comments - ANIES BASWEDAN INDO...",2023-01-02
4,"ANIES BASWEDAN on Instagram: ""Calon Presiden I...",https://www.instagram.com/p/Cm6c9HFr7Cu/,"9 likes, 0 comments - ANIES BASWEDAN (@aniesra...",2023-01-02


## Preprocessing Data

### Load Slang atau Singkatan
Modifikasi sesuka hati jika sekiranya masih terdapat singkatan yang perlu diubah.

In [4]:
slang = {'tdk':'tidak',
         'ketum':'ketua umum',
         'menjadi':'jadi',
         'timnas':'tim nasional',
         'membatalkan':'batal',
         'alas':'alasan', 
         'kelem':'kelemahan'}

### Load Stopword
Modifikasi sesuka hati jika sekiranya masih terdapat kata yang seharusnya dihilangkan.

In [5]:
factory = StopWordRemoverFactory()
stemmer = StemmerFactory().create_stemmer()

# stopword di modul Sastrawi
Sastrawi_StopWords_id = set(factory.get_stop_words())

# stopword tambahan
tambahan = set(['a', 'akan', 
                'b', 'bisa', 'bahwa',
                'c', 'comment', 'comments',
                'd', 'dari', 'di',
                'e', 
                'f', 
                'g', 
                'h',
                'i', 'instagram',
                'j',
                'k',
                'l', 'like', 'likes',
                'm', 'menjadi',
                'n', 
                'o', 
                'p', 
                'q', 
                'r', 
                's', 'sebagai', 'saja',
                't',
                'u', 
                'v', 
                'w', 
                'x', 
                'y',
                'z',])

Sastrawi_StopWords_id = Sastrawi_StopWords_id.union(tambahan)      
print(Sastrawi_StopWords_id)

{'untuk', 'k', 'akan', 'mereka', 'pun', 'pula', 'anda', 'sebetulnya', 'j', 'oleh', 'selain', 'supaya', 'namun', 'demikian', 'h', 'saja', 'melainkan', 'sebelum', 'b', 'anu', 'y', 'f', 'bagi', 'likes', 'ia', 'z', 'kenapa', 'begitu', 'sudah', 'kah', 'di', 'tentu', 'm', 'setiap', 'tolong', 'yaitu', 'sehingga', 'dulunya', 'maka', 'x', 'juga', 'nanti', 'v', 'setelah', 'yang', 'agak', 'adalah', 'secara', 'pasti', 'sementara', 'e', 'p', 'seraya', 'dan', 'itulah', 'mari', 'nggak', 'yakni', 'like', 'dia', 'terhadap', 'daripada', 'seolah', 'menurut', 'masih', 'jika', 'toh', 'dst', 'hal', 'seperti', 'telah', 'a', 'sedangkan', 'harus', 'sebagai', 'lain', 'dahulu', 'n', 'sambil', 'dengan', 'ok', 'kepada', 'ke', 'd', 'atau', 'r', 'ketika', 'g', 'bahwa', 'dari', 'sebab', 'c', 'bagaimanapun', 'setidaknya', 'dua', 'kembali', 'apalagi', 'bisa', 'itu', 'u', 'lagi', 'saat', 'sesudah', 'kita', 'karena', 'amat', 'kami', 'serta', 'mengapa', 'kecuali', 'dalam', 't', 'dll', 'comments', 'hanya', 'tetapi', 's', '

### NLP (Natural Language Preprocessing)
Membersihkan data teks dari karakter-karakter yang tidak diperlukan serta menangani stopword dan slang atau singkatan yang sudah di load sebelumnya dan lain sebagainya.

Note : setiap platform media sosial memiliki cara pembersihan data nya masing-masing (modifikasi sesuka hati).

In [6]:
def cleanbody(text):
    
    # menghapus url
    url_pattern = re.compile(r'(\w+:\/\/\S+)')
    text = url_pattern.sub(' ', text)
    
    # menghapus hashtag
    hashtag_pattern = re.compile(r'#\w+\b')
    text = hashtag_pattern.sub(' ', text)
    
    # menghapus nama pengguna instagram (memuat "@")
    username_pattern = re.compile(r'@\w+\b\s*')
    text = username_pattern.sub(' ', text)
    
    # menghapus angka
    text = re.sub(r'\b\d+\b|[^\w\s]', '', text)
    
    # mengahpus simbol
    symbol_pattern = re.compile(r'[^\w\s]+')
    text = symbol_pattern.sub(' ', text)
    
    # menghapus karakter yang tidak diperlukan (tergantung media sosial)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\r', ' ', text)
    text = re.sub(r'&\w+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    # menangani huruf kapital dan spasi
    text = unidecode(unescape(text.lower().strip()))
    
    # menangani slang atau singkatan 
    Tokens = TextBlob(text).words
    for i,t in enumerate(Tokens):
        if t in slang.keys():            
            Tokens[i] = slang[t]
        
    # menangani stopword
    text = ' '.join([t for t in Tokens if str(t) not in Sastrawi_StopWords_id and len(t)>2])
    
    # stemming
    text = stemmer.stem(text)
    
    
    return text

In [7]:
# aplikasikan fungsi ke dalam kolom variabel baru
df['clean_body'] = ''
for idx, post in tqdm(df.iterrows()):
    df.at[idx, 'clean_body'] = cleanbody(post.body)

34it [00:11,  2.86it/s]


In [8]:
df.head()

Unnamed: 0,title,href,body,date,clean_body
0,"ANIES BASWEDAN INDONESIA on Instagram: ""Imam B...",https://www.instagram.com/p/CHKVWZrnYOT/,"940 likes, 43 comments - ANIES BASWEDAN INDONE...",2023-01-02,anies baswedan indonesia imam besar front bela...
1,"BuddyKu Headlines on Instagram: ""Buddies! Baka...",https://www.instagram.com/p/Cn0MrvFPJdB/,Bakal Calon Presiden (Bacapres) 2024 dari Part...,2023-01-02,bakal calon presiden bacapres partai nasdem an...
2,"SINDOnews on Instagram: ""Meski memiliki elekta...",https://www.instagram.com/p/CeAjzwnP0wi/,"309 likes, 29 comments - SINDOnews (@sindonews...",2023-01-02,sindonews meski milik elektabilitas tinggi jum...
3,"ANIES BASWEDAN INDONESIA on Instagram: ""Denger...",https://www.instagram.com/p/ClEVrd9IOEe/,"1,286 likes, 70 comments - ANIES BASWEDAN INDO...",2023-01-02,anies baswedan indonesia dengerin
4,"ANIES BASWEDAN on Instagram: ""Calon Presiden I...",https://www.instagram.com/p/Cm6c9HFr7Cu/,"9 likes, 0 comments - ANIES BASWEDAN (@aniesra...",2023-01-02,anies baswedan calon presiden indonesia


### Lemmatisasi & Pos-Tag
Bagian preprocessing yang paling vital atau krusial untuk proses LDA dari yang lainnya, yaitu identifikasi kata dasar serta pengelompokkan kata berdasarkan kategori kata tersebut.

In [9]:
# lemma & pos-tag bahasa indonesia
nlp_id = Indonesian()
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/all_indo_man_tag_corpus_model.crf.tagger
ct.set_model_file('data/all_indo_man_tag_corpus_model.crf.tagger')

--2023-05-07 04:35:51--  https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/all_indo_man_tag_corpus_model.crf.tagger
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1722780 (1.6M) [application/octet-stream]
Saving to: ‘data/all_indo_man_tag_corpus_model.crf.tagger’


2023-05-07 04:35:51 (50.0 MB/s) - ‘data/all_indo_man_tag_corpus_model.crf.tagger’ saved [1722780/1722780]



In [10]:
def NLPfilter(t, filters):

  # lemmatisasi
  tokens = nlp_id(t)

  # tokenisasi
  tokens = [str(k) for k in tokens if len(k)>2]

  # pos-tag
  hasil = ct.tag_sents([tokens])
  
  return [k[0] for k in hasil[0] if k[1] in filters]

  Note :
- NN : kata benda tunggal (meja, buku, kucing, cinta, ...)
- NNP : kata benda tunggal khusus (indonesia, google, nike, tokyo, ...)
- NNS : kata benda jamak (buku-buku, meja-meja, ...)
- NNPS : kata benda jamak khusus (beatles, avengers, simpsons, ...)
- JJ : kata sifat (marah, tinggi, besar, indah, ...)

In [11]:
# ambil variabel kolom hasil preprocessing
data = df['clean_body'].values

# pilih kategori kata 
filters = set(['NN', 'NNP', 'NNS', 'NNPS', 'JJ'])

# aplikasikan fungsi ke dalam dataframe baru
data_postTag = []
for i, d in tqdm(enumerate(data)):
    data_postTag.append(NLPfilter(d,filters))

' '.join(data_postTag[0])

34it [00:00, 752.53it/s]


'anies baswedan indonesia imam besar front bela islam fpi umum indones anies baswedan indonesia imam besar front bela islam fpi umum indonesia'

In [12]:
# tokenisasi data kembali
data = [d for d in data_postTag if d]

## LDA (Latent Dirichlet Allocation)


In [13]:
# membuat representasi dictionary dari dokumen

# membuang token yang langka dan umum
dictionary_t = Dictionary(data)
dictionary_t.filter_extremes(no_below=2, no_above=0.90)

# membuat dictionary dan corpus yang diperlukan topic modelling
corpus_t = [dictionary_t.doc2bow(doc) for doc in data]
corpus_t = [t for t in corpus_t if t] # membuang corpus atau dokumen yang kosong

print('Number of unique tokens: %d' % len(dictionary_t))
print('Number of documents: %d' % len(corpus_t))
print(corpus_t[:1])

Number of unique tokens: 36
Number of documents: 30
[[(0, 2), (1, 3)]]


Pembuatan dataframe hasil keluaran dari algoritma LDA (Latent Dhiriclet Allocation).

In [14]:
def format_topics_sentences(ldamodel, corpus, texts, dates):
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = pd.concat([
                    sent_topics_df,                   
                pd.DataFrame([[int(topic_num), round(prop_topic, 4), topic_keywords]],
                             columns=["Dominant_Topic", "Perc_Contribution", "Topic_Keywords"])],
                    ignore_index=True,
                )
            else:
                break
    sent_topics_df.columns = ["Dominant_Topic", "Perc_Contribution", "Topic_Keywords"]

    # Add original text to the end of the output
    contents = pd.Series(texts)

    sent_topics_df = pd.concat([sent_topics_df, contents, pd.Series(dates)], axis=1)
    return sent_topics_df

Model DRT (Dimensional Reduction Technique).

In [15]:
def tsne_analysis(ldamodel, corpus):
    topic_weights = []
    for i, row_list in enumerate(ldamodel[corpus]):
        topic_weights.append([w for i, w in row_list])

    # Array of topic weights
    df_topics = pd.DataFrame(topic_weights).fillna(0).values

    # Keep the well separated points (optional)
    # arr = arr[np.amax(arr, axis=1) > 0.35]

    # Dominant topic number in each doc
    topic_nums = np.argmax(df_topics, axis=1)

    # tSNE Dimension Reduction
    try:
        tsne_model = TSNE(
            n_components=2, verbose=1, random_state=0, angle=0.99, init="pca"
        )
        tsne_lda = tsne_model.fit_transform(df_topics)
    except:
        print("TSNE_ANALYSIS WENT WRONG, PLEASE RE-CHECK YOUR BANK DATASET")
        return (topic_nums, None)

    return (topic_nums, tsne_lda)

Penggabungan dataframe sedemikian sehingga data ini lah yang nantinya di aplikasikan ke dalah dashboard LDA.

In [16]:
def lda_analysis(df):
    
    docs = list(df["clean_body"].values)

    punctuations = string.punctuation

    processed_docs = data
    print("Jumlah corpus atau dokumen", len(processed_docs))
    if len(processed_docs) < 11:
        print("INSUFFICIENT DOCS TO RUN LINEAR DISCRIMINANT ANALYSIS")
        return (None, None, None, None)

    print("Jumlah BoW (Bag of Words) corpus", len(corpus_t))
    print("Jumlah dictionary", len(list(dictionary_t.keys())))
    if len(list(dictionary_t.keys())) < 1:
        print("INSUFFICIENT DICTS TO RUN LINEAR DISCRIMINANT ANALYSIS")
        return (None, None, None, None)

    lda_model = LdaModel(
        corpus_t, num_topics=5, id2word=dictionary_t, passes=10
    )

    df_topic_sents_keywords = format_topics_sentences(
        ldamodel=lda_model,
        corpus=corpus_t,
        texts=docs,
        dates=list(df["date"].values),
    )
    print("Jumlah data", len(df_topic_sents_keywords))
    print("Data", df_topic_sents_keywords.head())
    df_dominant_topic = df_topic_sents_keywords.reset_index()
    df_dominant_topic.columns = [
        "Document_No",
        "Dominant_Topic",
        "Topic_Perc_Contrib",
        "Keywords",
        "Text",
        "Date",
    ]

    print("Hasil DRT")
    topic_num, tsne_lda = tsne_analysis(lda_model, corpus_t)

    return (tsne_lda, lda_model, topic_num, df_dominant_topic)

In [17]:
# aplikasikan fungsi dan simpan hasilnya ke dalam beberapa variabel berikut
tsne_lda, lda_model, topic_num, df_dominant_topic = lda_analysis(df)

Jumlah corpus atau dokumen 33
Jumlah BoW (Bag of Words) corpus 30
Jumlah dictionary 36
Jumlah data 34
Data    Dominant_Topic  Perc_Contribution  \
0             4.0             0.8659   
1             2.0             0.9420   
2             0.0             0.9190   
3             4.0             0.7313   
4             4.0             0.8374   

                                      Topic_Keywords  \
0  anies, indonesia, presiden, calon, sama, orang...   
1  partai, calon, presiden, bakal, nasdem, bacapr...   
2  gubernur, anies, jakarta, capres, kamil, barat...   
3  anies, indonesia, presiden, calon, sama, orang...   
4  anies, indonesia, presiden, calon, sama, orang...   

                                                   0           1  
0  anies baswedan indonesia imam besar front bela...  2023-01-02  
1  bakal calon presiden bacapres partai nasdem an...  2023-01-02  
2  sindonews meski milik elektabilitas tinggi jum...  2023-01-02  
3                  anies baswedan indonesia den

In [18]:
# data untuk pembuatan dashboard
df_dominant_topic.head()

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text,Date
0,0,4.0,0.8659,"anies, indonesia, presiden, calon, sama, orang...",anies baswedan indonesia imam besar front bela...,2023-01-02
1,1,2.0,0.942,"partai, calon, presiden, bakal, nasdem, bacapr...",bakal calon presiden bacapres partai nasdem an...,2023-01-02
2,2,0.0,0.919,"gubernur, anies, jakarta, capres, kamil, barat...",sindonews meski milik elektabilitas tinggi jum...,2023-01-02
3,3,4.0,0.7313,"anies, indonesia, presiden, calon, sama, orang...",anies baswedan indonesia dengerin,2023-01-02
4,4,4.0,0.8374,"anies, indonesia, presiden, calon, sama, orang...",anies baswedan calon presiden indonesia,2023-01-02


## Referensi
1.   https://taudata.blogspot.com/2022/05/nlptm-07.html
2.   https://github.com/plotly/dash-sample-apps/blob/main/apps/dash-nlp/ldacomplaints.py