# Modelling LDA

LDA (Latent Dirichlet Allocation) is an approach to modelling the themes, or topics, in a collection of documents. It is a generative probabilistic model that treats each document as a mixture of topics, which makes it possible to identify the topics that occur in a text corpus.
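To make the "generative" view more concrete, the sketch below shows how LDA assumes a single document is produced: a topic mixture is drawn from a Dirichlet prior, then each word position picks a topic and a word from that topic. The two topic-word tables, the topic names, and the helpers `sample_dirichlet` and `generate_document` are made up for illustration and are not fitted on the data; only the standard library is used.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document under the LDA generative story."""
    topics = list(topic_word)
    # 1. Document-level topic mixture (theta)
    theta = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position, then a word from that topic
        topic = rng.choices(topics, weights=theta)[0]
        vocab = list(topic_word[topic])
        weights = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topic-word probabilities (toy values, not learned from the corpus)
topic_word = {
    "finance": {"saham": 0.6, "portofolio": 0.3, "produksi": 0.1},
    "manufacturing": {"produksi": 0.7, "kualitas": 0.2, "saham": 0.1},
}

rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=0.1, n_words=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word tables.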



## Import Data

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/AderisaDyta/ppw/main/PTA_Utm.csv')
df


Unnamed: 0.1,Unnamed: 0,Judul,Penulis,Pembimbing I,Pembimbing II,Abstrak
0,0,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...,Siliwangi Fitra Rachmawanto S.T.,"Heri Awalul Ilhamsah S.T., M.T.","Retno Indriartiningtias S.T., M.T.",Portofolio adalah sekumpulan saham yang dimili...
1,1,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...,AHMAD MAS'UD,"SABARUDIN AKHMAD, S.T., M.T.","SUGENG PURWOKO, S.T., M.T.",PT. ABC merupakan perusahaan yang bergerak dib...
2,2,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...,Yulianto Fauzanta,"Fitri Agustina, S.T., M.T","Retno Indriartiningtias, S.T., M.T",Bangkalan merupakan salah satu kabupaten yang ...
3,3,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...,M Mundir Muhlisin,Mu'alim ST MT,Sugeng Purwoko ST MT,Simulasi adalah duplikasi atau abstraksi dari ...
4,4,Peningkatan Kepuasan Masyarakat Terhadap Pelay...,Muhibbin,Rahmad Hidayat,Retno Indriartiningtias,Kepuasan adalah tingkat perasaan seseorang ter...
...,...,...,...,...,...,...
705,705,PENENTUAN LOKASI PENGELOMPOKAN TANI GARAM DI ...,Ridho Mukasafa,"Dr. Kukuh Winarso, S.Si., M.T, IPM ASEAN Eng","Dr. Sabarudin Akhmad, S.T., M.T, IPM ASEAN Eng",Penelitian ini bertujuan untuk mengetahui dima...
706,706,USULAN PERBAIKAN UPAYA MINIMASI REJECT PART BO...,YOLANDA ABIGAIL LAPIAN,"Prof. Dr. Rachmad Hidayat, M.T.,IPU, Asean Eng","Imron Kuswandi S.T., M.T",Kualitas dan pengendalian kualitas memiliki pe...
707,707,PERENCANAAN KEBIJAKAN PENGENDALIAN PERSEDIAA...,Nisfu Laylatus Sabihis,"Sugeng Purwoko, S.T., M.T","Mu'alim, S.T., M.T",PT. Apie Indo Karunia merupakan suatu perusaha...
708,708,PENGEMBANGAN DAN KARAKTERISASI HASIL ASSESSMEN...,Muhammad Asri Wahyu Dianto,"Dr.Sabarudin Akhmad, S.T., M.T., IPM. Asean Eng","Anis Arendra S.T,. M.Eng",MSDs atau Musculoskeletal Disorders adalah gan...


In [None]:
# Keep only the abstract and title columns
df = df[['Abstrak', 'Judul']].copy()
df

Unnamed: 0,Abstrak,Judul
0,Portofolio adalah sekumpulan saham yang dimili...,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...
1,PT. ABC merupakan perusahaan yang bergerak dib...,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...
2,Bangkalan merupakan salah satu kabupaten yang ...,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...
3,Simulasi adalah duplikasi atau abstraksi dari ...,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...
4,Kepuasan adalah tingkat perasaan seseorang ter...,Peningkatan Kepuasan Masyarakat Terhadap Pelay...
...,...,...
705,Penelitian ini bertujuan untuk mengetahui dima...,PENENTUAN LOKASI PENGELOMPOKAN TANI GARAM DI ...
706,Kualitas dan pengendalian kualitas memiliki pe...,USULAN PERBAIKAN UPAYA MINIMASI REJECT PART BO...
707,PT. Apie Indo Karunia merupakan suatu perusaha...,PERENCANAAN KEBIJAKAN PENGENDALIAN PERSEDIAA...
708,MSDs atau Musculoskeletal Disorders adalah gan...,PENGEMBANGAN DAN KARAKTERISASI HASIL ASSESSMEN...


## **Preprocessing Data**



1. Cleaning Data: removes irrelevant or unwanted characters and text elements, such as punctuation, digits, and special characters.
2. Tokenizing: splits the text into smaller units called tokens. A token can be a word, a phrase, or even a single character, depending on the desired granularity.
3. Stopword removal: drops common words that occur frequently but carry little information, such as "dan" (and), "atau" (or), and "yang" (which).
4. Stemming: reduces words to their base form by crudely stripping word endings.
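The four steps above can be sketched end to end on a single sentence. This is a simplified stand-in, not the notebook's actual pipeline: the stopword set is a hypothetical five-word list (the notebook uses NLTK's Indonesian list), tokenisation is a plain whitespace split, and `crude_stem` is a toy suffix stripper rather than the Sastrawi stemmer.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's Indonesian list instead
STOPWORDS = {"yang", "dan", "atau", "di", "adalah"}

def clean(text):
    """Lowercase, then replace anything that is not a letter or whitespace with a space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Whitespace tokenisation (a simplification of NLTK's word_tokenize)."""
    return text.split()

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Toy suffix stripping; a real Indonesian stemmer does far more than this."""
    for suffix in ("nya", "kan", "an"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

raw = "Portofolio adalah sekumpulan saham yang dimiliki 2 investor."
tokens = remove_stopwords(tokenize(clean(raw)))
stems = [crude_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Keeping each step as its own function makes it easy to swap in the real tokenizer, stopword list, or stemmer later.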

## **Cleaning Data**

In [None]:
# Import libraries
import re, string

# Text cleaning
def cleaning(text):
    # Remove HTML tags and HTML entities
    text = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});').sub('', str(text))

    # Case folding
    text = text.lower()

    # Trim leading and trailing whitespace
    text = text.strip()

    # Replace punctuation with spaces, then collapse repeated whitespace
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)

    # Drop any remaining non-word characters
    text = re.sub(r'[^\w\s]', '', text.strip())

    # Replace digits with spaces, then collapse repeated whitespace again
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)

    return text

In [None]:
df['Abstrak'] = df['Abstrak'].apply(cleaning)
df.head()

Unnamed: 0,Abstrak,Judul
0,portofolio adalah sekumpulan saham yang dimili...,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...
1,pt abc merupakan perusahaan yang bergerak dibi...,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...
2,bangkalan merupakan salah satu kabupaten yang ...,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...
3,simulasi adalah duplikasi atau abstraksi dari ...,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...
4,kepuasan adalah tingkat perasaan seseorang ter...,Peningkatan Kepuasan Masyarakat Terhadap Pelay...


In [None]:
# Checkpoint: export the cleaned abstracts
df.to_csv('DataCleaning.csv')

## **Tokenize**

Splits the text into individual words.

In [None]:
import nltk
nltk.download('popular')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

In [None]:
# Tokenize the abstracts
df['Abstrak_Tokens'] = df['Abstrak'].apply(word_tokenize)
df[["Abstrak", "Abstrak_Tokens"]].head()

Unnamed: 0,Abstrak,Abstrak_Tokens
0,portofolio adalah sekumpulan saham yang dimili...,"[portofolio, adalah, sekumpulan, saham, yang, ..."
1,pt abc merupakan perusahaan yang bergerak dibi...,"[pt, abc, merupakan, perusahaan, yang, bergera..."
2,bangkalan merupakan salah satu kabupaten yang ...,"[bangkalan, merupakan, salah, satu, kabupaten,..."
3,simulasi adalah duplikasi atau abstraksi dari ...,"[simulasi, adalah, duplikasi, atau, abstraksi,..."
4,kepuasan adalah tingkat perasaan seseorang ter...,"[kepuasan, adalah, tingkat, perasaan, seseoran..."


## **Stopword**


Removes words that carry little meaning on their own, such as "yang", "di", "dan", and "ke".

In [None]:
# Import NLTK's stopword corpus
from nltk.corpus import stopwords
nltk.download('stopwords')
# Initialise the Indonesian stopword list
stoplist = stopwords.words('indonesian')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
print(stoplist)  # Show the stopword list

['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir', 'akhiri', 'akhirnya', 'aku', 'akulah', 'amat', 'amatlah', 'anda', 'andalah', 'antar', 'antara', 'antaranya', 'apa', 'apaan', 'apabila', 'apakah', 'apalagi', 'apatah', 'artinya', 'asal', 'asalkan', 'atas', 'atau', 'ataukah', 'ataupun', 'awal', 'awalnya', 'bagai', 'bagaikan', 'bagaimana', 'bagaimanakah', 'bagaimanapun', 'bagi', 'bagian', 'bahkan', 'bahwa', 'bahwasanya', 'baik', 'bakal', 'bakalan', 'balik', 'banyak', 'bapak', 'baru', 'bawah', 'beberapa', 'begini', 'beginian', 'beginikah', 'beginilah', 'begitu', 'begitukah', 'begitulah', 'begitupun', 'bekerja', 'belakang', 'belakangan', 'belum', 'belumlah', 'benar', 'benarkah', 'benarlah', 'berada', 'berakhir', 'berakhirlah', 'berakhirnya', 'berapa', 'berapakah', 'berapalah', 'berapapun', 'berarti', 'berawal', 'berbagai', 'berdatangan', 'beri', 'berikan', 'berikut', 'berikutnya', 'berjumlah', 'berkali-kali', 'berkata', 'berkehendak', 'berkeinginan'

In [None]:
# Remove Indonesian and English stopwords from the 'Abstrak_Tokens' column of df
from nltk.corpus import stopwords
from itertools import chain

stop_words = set(chain(stopwords.words('indonesian'), stopwords.words('english')))

df['Abstrak_Tokens'] = df['Abstrak_Tokens'].apply(lambda x: [w for w in x if w not in stop_words])

In [None]:
df[["Abstrak", "Abstrak_Tokens"]].head()

Unnamed: 0,Abstrak,Abstrak_Tokens
0,portofolio adalah sekumpulan saham yang dimili...,"[portofolio, sekumpulan, saham, dimiliki, inve..."
1,pt abc merupakan perusahaan yang bergerak dibi...,"[pt, abc, perusahaan, bergerak, dibidang, manu..."
2,bangkalan merupakan salah satu kabupaten yang ...,"[bangkalan, salah, kabupaten, memiliki, potens..."
3,simulasi adalah duplikasi atau abstraksi dari ...,"[simulasi, duplikasi, abstraksi, kehidupan, ny..."
4,kepuasan adalah tingkat perasaan seseorang ter...,"[kepuasan, tingkat, perasaan, pelayanan, memba..."


## **Stemming**


Converts affixed words into their base form. It is applied to the tokens that remain after stopword removal.

In [None]:
pip install Sastrawi




In [None]:
# Set up the Sastrawi stemmer, which reduces words to their base (root) form
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from tqdm.auto import tqdm
tqdm.pandas()

factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [None]:
# Stem the abstracts (left commented out in this run, so Abstrak_Tokens stays unstemmed)
# df['Abstrak_Tokens'] = df['Abstrak_Tokens'].progress_apply(lambda x: stemmer.stem(' '.join(x)).split(' '))

In [None]:
# Tokens after preprocessing (the stemming call is commented out, so words such as "dimiliki" keep their affixes)
df

Unnamed: 0,Abstrak,Judul,Abstrak_Tokens
0,portofolio adalah sekumpulan saham yang dimili...,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...,"[portofolio, sekumpulan, saham, dimiliki, inve..."
1,pt abc merupakan perusahaan yang bergerak dibi...,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...,"[pt, abc, perusahaan, bergerak, dibidang, manu..."
2,bangkalan merupakan salah satu kabupaten yang ...,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...,"[bangkalan, salah, kabupaten, memiliki, potens..."
3,simulasi adalah duplikasi atau abstraksi dari ...,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...,"[simulasi, duplikasi, abstraksi, kehidupan, ny..."
4,kepuasan adalah tingkat perasaan seseorang ter...,Peningkatan Kepuasan Masyarakat Terhadap Pelay...,"[kepuasan, tingkat, perasaan, pelayanan, memba..."
...,...,...,...
705,penelitian ini bertujuan untuk mengetahui dima...,PENENTUAN LOKASI PENGELOMPOKAN TANI GARAM DI ...,"[penelitian, bertujuan, dimana, lokasi, pengel..."
706,kualitas dan pengendalian kualitas memiliki pe...,USULAN PERBAIKAN UPAYA MINIMASI REJECT PART BO...,"[kualitas, pengendalian, kualitas, memiliki, p..."
707,pt apie indo karunia merupakan suatu perusahaa...,PERENCANAAN KEBIJAKAN PENGENDALIAN PERSEDIAA...,"[pt, apie, indo, karunia, perusahaan, bidang, ..."
708,msds atau musculoskeletal disorders adalah gan...,PENGEMBANGAN DAN KARAKTERISASI HASIL ASSESSMEN...,"[msds, musculoskeletal, disorders, gangguan, c..."


## **Feature Extraction**

## **One Hot Encoder**

In [None]:
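# Convert a categorical variable into a binary numeric representation.
# Here each full abstract is treated as one category, so every unique abstract gets its own column.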
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
X_encoder = encoder.fit_transform(df[['Abstrak']])

encoded_features = encoder.get_feature_names_out(input_features=["Abstrak"])
one_hot_df = pd.DataFrame(X_encoder.toarray(), columns=encoded_features)
print(one_hot_df)


     Abstrak_abstrak achmad agung ferrianto penerapan rekayasa nilai pada pembangunan gedung sdn sambongrejo bojonegoro studi kasus cv jasa karya engineering setiap tahun pembangunan gedung dilakukan oleh pemerintah kota di setiap daerah pembangunan ini dilakukan bertujuan untuk pembaruan sebuah bangunan agar dapat berdiri kokoh dan tidak terjadi kerusakan yang menyebabkan sebuah kecelakaan adapun beberapa komponen material yang menimbulkan biaya besar yaitu material rangka atap dan kusen hal ini dikarenakan harga material pada komponen tersebut sangat besar maka dari itu dibutuhkan adanya penerepan rekayasa nilai agar biaya biaya yang tidak diperlukan dapat diminimalisir dengan memunculkan kriteria dan alternatif yang kemudian dilakukan pembobotan dengan metode ahp dari penerapan rekayasa nilai didapatkan alternatif material dimana hal tersebut dapat dilihat dari alternatif terpilih yang mempunyai value nilai tertinggi pada kusen didapatkan alternatif material aluminium sebesar dan ra

## **Term Frequency**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(df['Abstrak'].tolist())

terms_count = count_vectorizer.get_feature_names_out()
df_countvect = pd.DataFrame(data = X_count.toarray(),columns = terms_count)
df_countvect

Unnamed: 0,aadalah,ab,abad,abadi,abc,abdoel,abghmnqr,abms,abnormal,abon,...,zam,zaman,zat,zeid,zero,zona,zoo,zscore,zz,µm
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
706,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
707,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
708,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Count the total number of occurrences of each word in the CountVectorizer matrix.

In [None]:
# Sum each column to get the total occurrences of every term
token_counts = df_countvect.sum(axis=0)

non_zero_token_counts = token_counts[token_counts != 0]

print("Token Counts yang Tidak Sama dengan 0:")
print(non_zero_token_counts)

Token Counts yang Tidak Sama dengan 0:
aadalah     1
ab          7
abad        1
abadi       2
abc        29
           ..
zona        3
zoo         1
zscore      3
zz          1
µm          1
Length: 9099, dtype: int64


In [None]:
df_countvect.to_csv('Dt-TermFrequensi.csv', index=False)
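As a hand-rolled illustration of the term-frequency matrix built in this section, the sketch below counts tokens in three toy documents with `collections.Counter`. The documents are hypothetical, and whitespace splitting is a simplification: `CountVectorizer` has its own tokenisation rules (by default it ignores single-character tokens), so this shows the idea rather than reproducing the class.

```python
from collections import Counter

# Toy cleaned abstracts (hypothetical, echoing words from the data)
docs = [
    "pt abc perusahaan manufaktur abc",
    "kualitas produksi perusahaan",
    "abc kualitas kualitas",
]

# Vocabulary: sorted unique tokens, one column per term
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row of raw counts per document
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocab])

# Corpus-wide totals: the column sums of the document-term matrix
totals = {term: sum(row[i] for row in rows) for i, term in enumerate(vocab)}

print(vocab)
for row in rows:
    print(row)
print(totals)
```

The `totals` dictionary plays the same role as `df_countvect.sum(axis=0)` in the notebook.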

## **Log Frequency**

A weighting scheme that applies a logarithm to rescale word frequencies within a document. The aim is to dampen large differences in frequency and give a more balanced representation.

The cell below uses scikit-learn's TfidfVectorizer to compute term frequency (TF) without inverse document frequency (IDF). Note that with `use_idf=False` and `norm=None`, and without `sublinear_tf=True`, the values it produces are raw term counts identical to the CountVectorizer output; no logarithm is actually applied.
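For comparison, here is a minimal sketch of what logarithmic scaling does to raw counts. It uses the rule `1 + ln(tf)` for non-zero counts, which is the transformation scikit-learn applies when `TfidfVectorizer` is created with `sublinear_tf=True`. The helper `log_scale` and the sample counts are illustrative assumptions, not part of the notebook.

```python
import math

def log_scale(tf):
    """Sublinear term-frequency scaling: 0 stays 0, otherwise 1 + ln(tf)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Hypothetical raw counts for a few terms
raw_counts = {"abc": 3, "perusahaan": 1, "zona": 0, "kualitas": 29}
scaled = {term: round(log_scale(tf), 3) for term, tf in raw_counts.items()}
print(scaled)
```

A term that occurs 29 times ends up only about four times heavier than a term that occurs once, which is the dampening effect the section describes.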

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
log_vectorizer = TfidfVectorizer(use_idf=False, smooth_idf=False, norm=None)
X_log = log_vectorizer.fit_transform(df['Abstrak'].tolist())
# X_log = log_vectorizer.fit_transform(df['Abstrak'])

log_terms = log_vectorizer.get_feature_names_out()
df_log = pd.DataFrame(data = X_log.toarray(),columns = log_terms)
df_log

Unnamed: 0,aadalah,ab,abad,abadi,abc,abdoel,abghmnqr,abms,abnormal,abon,...,zam,zaman,zat,zeid,zero,zona,zoo,zscore,zz,µm
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Sum the values produced by TfidfVectorizer with use_idf=False (term frequency only, no IDF) to get the total per term.

In [None]:
token_counts = df_log.sum(axis=0)

non_zero_token_counts = token_counts[token_counts != 0]

print("Token Counts yang Tidak Sama dengan 0:")
print(non_zero_token_counts)

Token Counts yang Tidak Sama dengan 0:
aadalah     1.0
ab          7.0
abad        1.0
abadi       2.0
abc        29.0
           ... 
zona        3.0
zoo         1.0
zscore      3.0
zz          1.0
µm          1.0
Length: 9099, dtype: float64


In [None]:
df_log.to_csv('Dt_LogFrekuensi.csv', index=False)

## **Binary Frequency**

A representation that records only whether a word is present in a document, ignoring how many times it occurs. The result is a binary matrix in which each element is either 0 (absent) or 1 (present).

The cell below uses scikit-learn's CountVectorizer with binary=True to build this binary term-presence representation for the documents in the 'Abstrak' column.
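A small standard-library sketch of the same idea, on three hypothetical documents: each cell is 1 if the term occurs at all, and the column sums then give the number of documents containing each term rather than the number of occurrences.

```python
# Toy cleaned abstracts (hypothetical)
docs = [
    "abc abc abc perusahaan",
    "zona kualitas",
    "abc kualitas kualitas",
]
vocab = sorted({tok for doc in docs for tok in doc.split()})

# 1 if the term occurs in the document at all, else 0
binary_rows = [[1 if term in doc.split() else 0 for term in vocab] for doc in docs]

# Column sums of a binary matrix = number of documents containing each term
doc_freq = {term: sum(row[i] for row in binary_rows) for i, term in enumerate(vocab)}

print(vocab)
print(binary_rows)
print(doc_freq)
```

This is why, in the notebook's outputs, the total for "abc" is 29 in the count matrix but 15 in the binary matrix: the word occurs 29 times spread over 15 abstracts.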

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(df['Abstrak'].tolist())

feature_names = vectorizer.get_feature_names_out()
df_vsm_binary = pd.DataFrame(data=X.toarray(), columns=feature_names)
df_vsm_binary


Unnamed: 0,aadalah,ab,abad,abadi,abc,abdoel,abghmnqr,abms,abnormal,abon,...,zam,zaman,zat,zeid,zero,zona,zoo,zscore,zz,µm
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
706,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
707,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
708,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Sum each column of the binary matrix (df_vsm_binary). Because every entry is 0 or 1, each total is the number of documents that contain the term.

In [None]:
# Number of documents in which each term appears
token_counts = df_vsm_binary.sum(axis=0)

non_zero_token_counts = token_counts[token_counts != 0]

print("Token Counts yang Tidak Sama dengan 0:")
print(non_zero_token_counts)

Token Counts yang Tidak Sama dengan 0:
aadalah     1
ab          5
abad        1
abadi       2
abc        15
           ..
zona        2
zoo         1
zscore      2
zz          1
µm          1
Length: 9099, dtype: int64


In [None]:
df_vsm_binary.to_csv('Dt_Binary.csv', index=False)

## **LDA Modelling**

Import pandas and use the warnings module to suppress warning messages.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Import the classes needed for topic modelling with Latent Dirichlet Allocation (LDA).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## LDA

Latent Dirichlet Allocation (LDA) is a probabilistic model used to identify the main topics that occur in a collection of documents or texts.

The main steps are converting the text into a document-term matrix with CountVectorizer, configuring LDA with a chosen number of topics, and fitting the LDA model to the data. The results comprise the topic proportions for each document and the word distribution for each topic. Together these help identify the main topics in the dataset, show what each document focuses on, and extract the keywords that define each topic. This kind of topic analysis is useful for revealing the thematic structure of a text collection.

In [None]:
# Convert the text into a document-term count matrix with CountVectorizer
vectorizer_lda = CountVectorizer()
X = vectorizer_lda.fit_transform(df['Abstrak'])
# Configure LDA with k=4 topics, alpha=0.1 (document-topic prior), and beta=0.2 (topic-word prior)
# No random_state is set, so the fitted topics can differ between runs
k = 4
alpha = 0.1
beta = 0.2
lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha, topic_word_prior=beta)

# Fit the LDA model to the data
lda.fit(X)

# Topic proportions per document, one column per topic
proporsi_topik_dokumen = lda.transform(X)
topic_labels = [f"Topik {i+1}" for i in range(k)]
proporsi_topik_df = pd.DataFrame(proporsi_topik_dokumen, columns=topic_labels)

# Combine the titles with the topic proportions
result = pd.concat([df['Judul'], proporsi_topik_df], axis=1)

# Word weights per topic; components_ holds unnormalised pseudo-counts, not probabilities
distribusi_kata_pada_topik = lda.components_
# Use the vocabulary of the vectorizer fitted for LDA (vectorizer_lda) as column names
distribusi_kata_df = pd.DataFrame(distribusi_kata_pada_topik, index=topic_labels, columns=vectorizer_lda.get_feature_names_out())
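The values in `lda.components_` are unnormalised pseudo-counts, so they only become a probability distribution over words after each topic's row is divided by its sum. The sketch below shows that normalisation and a top-terms ranking on four columns copied from the `distribusi_kata_df` output in this section. Restricting the sum to four terms is purely illustrative (the real normalisation runs over the full vocabulary), and the helpers `normalise` and `top_terms` are hypothetical names.

```python
# Pseudo-counts for four terms, copied from the printed word-distribution table
terms = ["ab", "abc", "zat", "zona"]
components = {
    "Topik 1": [1.267704, 0.200406, 0.201637, 1.2],
    "Topik 2": [4.392769, 6.193302, 12.495552, 2.199806],
    "Topik 3": [1.935289, 19.206294, 4.901809, 0.200194],
    "Topik 4": [0.204238, 4.199999, 0.201002, 0.2],
}

def normalise(weights):
    """Turn one topic's pseudo-counts into proportions that sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def top_terms(weights, n=2):
    """Return the n terms with the largest weight for one topic."""
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

for topic, weights in components.items():
    probs = normalise(weights)
    print(topic, [round(p, 3) for p in probs], top_terms(weights))
```

Ranking terms this way over the full vocabulary is a common way to give each topic a human-readable label.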

Display the word distribution per topic

In [None]:

print("Distribusi Kata pada Topik:\n")
distribusi_kata_df

Distribusi Kata pada Topik:



Unnamed: 0,aadalah,ab,abad,abadi,abc,abdoel,abghmnqr,abms,abnormal,abon,...,zam,zaman,zat,zeid,zero,zona,zoo,zscore,zz,µm
Topik 1,0.2,1.267704,0.2,0.200875,0.200406,0.2,0.2,2.2,0.2,0.2,...,0.2,1.118576,0.201637,1.2,0.2,1.2,0.20153,3.017125,0.2,0.2
Topik 2,0.211395,4.392769,1.1935,0.2,6.193302,0.2,0.2,0.2,0.2,0.238767,...,0.2,0.2,12.495552,0.2,1.31339,2.199806,1.195358,0.2,0.2,0.27726
Topik 3,1.188605,1.935289,0.2,2.199125,19.206294,1.195297,0.2,0.2,0.2,2.161233,...,0.2,1.281424,4.901809,0.2,2.08661,0.200194,0.203112,0.200219,1.192454,1.12274
Topik 4,0.2,0.204238,0.2065,0.2,4.199999,0.204703,1.2,0.2,4.2,0.2,...,2.2,1.2,0.201002,0.2,0.2,0.2,0.2,0.382655,0.207546,0.2


Save to a CSV file

In [None]:
# Keep the index so the topic labels are written to the file
distribusi_kata_df.to_csv('PTA_Distribusi.csv')

Display the topic proportions per document

In [None]:
print("Proporsi Topik pada Dokumen:\n")
result

# Topic proportions per document: the topic with the highest proportion is the dominant one.

Proporsi Topik pada Dokumen:



Unnamed: 0,Judul,Topik 1,Topik 2,Topik 3,Topik 4
0,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...,0.203850,0.000594,0.794962,0.000594
1,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...,0.000739,0.000739,0.997784,0.000739
2,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...,0.040535,0.000679,0.958108,0.000679
3,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...,0.000693,0.000693,0.997922,0.000693
4,Peningkatan Kepuasan Masyarakat Terhadap Pelay...,0.000456,0.000456,0.000456,0.998632
...,...,...,...,...,...
705,PENENTUAN LOKASI PENGELOMPOKAN TANI GARAM DI ...,0.997306,0.000898,0.000898,0.000898
706,USULAN PERBAIKAN UPAYA MINIMASI REJECT PART BO...,0.000811,0.000811,0.000811,0.997568
707,PERENCANAAN KEBIJAKAN PENGENDALIAN PERSEDIAA...,0.000434,0.000434,0.000434,0.998698
708,PENGEMBANGAN DAN KARAKTERISASI HASIL ASSESSMEN...,0.000448,0.000448,0.998657,0.000448


Save to a CSV file

In [None]:
proporsi_topik_df.to_csv('Proporsi_topik.csv', index=False)
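Since the topic with the highest proportion is the dominant one, picking it is an argmax over each row. The sketch below does this in plain Python on the first five rows of the topic-proportion table shown above; `dominant_topic` is a hypothetical helper. On the notebook's DataFrame, `proporsi_topik_df.idxmax(axis=1)` should give the same labels, assuming the frame holds only the topic columns.

```python
topics = ["Topik 1", "Topik 2", "Topik 3", "Topik 4"]

# First five rows of the topic-proportion table
proportions = [
    [0.203850, 0.000594, 0.794962, 0.000594],
    [0.000739, 0.000739, 0.997784, 0.000739],
    [0.040535, 0.000679, 0.958108, 0.000679],
    [0.000693, 0.000693, 0.997922, 0.000693],
    [0.000456, 0.000456, 0.000456, 0.998632],
]

def dominant_topic(row):
    """Return the label of the topic with the highest proportion."""
    best = max(range(len(row)), key=row.__getitem__)
    return topics[best]

dominant = [dominant_topic(row) for row in proportions]
print(dominant)
```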

Cluster the abstracts

In [None]:
from sklearn.cluster import KMeans

# Run K-Means on the topic proportions with a fixed seed for reproducible initialisation
random_state = 100  # any integer works
kmeans = KMeans(n_clusters=2, random_state=random_state)  # change the number of clusters as needed
clusters = kmeans.fit_predict(proporsi_topik_dokumen)

# Add the Cluster column to the topic-proportion DataFrame
proporsi_topik_df['Cluster'] = clusters

# Combine the titles with the topic proportions
con_abstrak = pd.concat([df['Judul'], proporsi_topik_df], axis=1)

# Split the data by cluster
cluster_0 = con_abstrak[con_abstrak['Cluster'] == 0]
cluster_1 = con_abstrak[con_abstrak['Cluster'] == 1]

# Display the DataFrame for each cluster
print("Cluster 0:")
display(cluster_0[['Judul','Cluster']])

print("Cluster 1:")
display(cluster_1[['Judul','Cluster']])

Cluster 0:


Unnamed: 0,Judul,Cluster
4,Peningkatan Kepuasan Masyarakat Terhadap Pelay...,0
5,Perencanaan Penjadwalan Dan Rute Terpendek Dis...,0
6,USULAN PERBAIKAN PADA PROSES PRODUKSI BOTOL KA...,0
8,Pengaruh analisa jabatan terhadap motivasi ker...,0
10,Analisis Pengendalian dan Perbaikan Kualitas d...,0
...,...,...
703,PENGARUH KUALITAS LAYANAN DAN KEPERCAYAAN TERH...,0
705,PENENTUAN LOKASI PENGELOMPOKAN TANI GARAM DI ...,0
706,USULAN PERBAIKAN UPAYA MINIMASI REJECT PART BO...,0
707,PERENCANAAN KEBIJAKAN PENGENDALIAN PERSEDIAA...,0


Cluster 1:


Unnamed: 0,Judul,Cluster
0,OPTIMASI PEMILIHAN PORTOFOLIO SAHAM PERUSAHAAN...,1
1,PERANCANGAN TATA LETAK FASILITAS LANTAI PRODUK...,1
2,PERUMUSAN STRATEGI BISNIS UD. BUDI JAYA BANGKA...,1
3,USULAN PERBAIKAN UTILITAS RESOURCES PADA LANTA...,1
7,PERENCANAAN AGREGAT PRODUKSI PLYWOOD DENGAN TE...,1
...,...,...
695,Pengaruh Kualitas Pelayanan Terhadap Kepuasan ...,1
699,PENDEKATAN BACKPROPAGATION NEURAL NETWORK UNTU...,1
701,USULAN PERBAIKAN PADA PROSES PRODUKSI PLASTIC ...,1
704,PENGEMBANGAN DAN KARAKTERISASI INSTRUMEN ESMOC...,1
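To show what K-Means does with the topic proportions, here is a bare-bones version of its assign-and-update loop on the first five documents from the topic-proportion table. It is a sketch under simplifying assumptions: the two starting centroids are simply the first and last points (scikit-learn's `KMeans` chooses its own initialisation), and `assign` and `update` are hypothetical helper names.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids

# Topic proportions of the first five documents
points = [
    [0.203850, 0.000594, 0.794962, 0.000594],
    [0.000739, 0.000739, 0.997784, 0.000739],
    [0.040535, 0.000679, 0.958108, 0.000679],
    [0.000693, 0.000693, 0.997922, 0.000693],
    [0.000456, 0.000456, 0.000456, 0.998632],
]

centroids = [points[0], points[4]]  # assumed deterministic initialisation
labels = assign(points, centroids)
for _ in range(10):
    centroids = update(points, labels, k=2)
    new_labels = assign(points, centroids)
    if new_labels == labels:
        break
    labels = new_labels

print(labels)
```

Documents 0 to 3 end up together and document 4 is on its own, which matches the grouping in the notebook's output. The cluster numbers themselves are arbitrary and can be swapped between runs or implementations.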


## **TF-IDF**


Weights each term in the documents by its TF-IDF score using scikit-learn.

The cell below uses scikit-learn's TfidfVectorizer to convert the text in the DataFrame's 'Abstrak' column into a numeric representation under the TF-IDF (Term Frequency-Inverse Document Frequency) scheme.
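The following sketch works through the TF-IDF arithmetic by hand on three hypothetical documents. It assumes the default `TfidfVectorizer` settings: smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. Whitespace tokenisation is a simplification, and the names `doc_freq`, `idf`, and `tfidf_row` are illustrative.

```python
import math
from collections import Counter

# Toy cleaned abstracts (hypothetical)
docs = [
    "abc perusahaan abc",
    "perusahaan kualitas",
    "kualitas zona",
]
tokenised = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenised for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = {t: sum(1 for toks in tokenised if t in toks) for t in vocab}

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary term, then L2-normalise the row."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(toks) for toks in tokenised]
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents receive a larger IDF, so in the first document "abc" outweighs "perusahaan" both because it occurs more often and because it is rarer across the corpus.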

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert the text into a numeric representation by computing TF-IDF on the 'Abstrak' column
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['Abstrak'].tolist())

terms = vectorizer.get_feature_names_out()
df_tfidfvect = pd.DataFrame(data = X_tfidf.toarray(),columns = terms)
df_tfidfvect

Unnamed: 0,aadalah,ab,abad,abadi,abc,abdoel,abghmnqr,abms,abnormal,abon,...,zam,zaman,zat,zeid,zero,zona,zoo,zscore,zz,µm
0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.22743,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.135905,0.0,0.0,0.0,0.0
706,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
707,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
708,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


Save the TF-IDF results

In [None]:
df_tfidfvect.to_csv('Dt_Tf-Idf.csv', index=False)