# Topic Modelling

## Crawling Data from YouTube Comments

This program retrieves the comments on a YouTube video using the YouTube Data API v3. Before running it, make sure the YouTube Data API is enabled for your project and that an API key has been generated.

If you do not have an API key yet, follow these steps to obtain one:

1. Log in to the Google Developer Console (https://console.developers.google.com/) with your Google account.
2. Create a new project and fill in the requested information.
3. On the project page, enable API services and search for YouTube Data API v3.
4. Create a credential so the API can be used: click the "Create Credential" button and complete the form.

You can view the API key on the Credentials tab of the dashboard.
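
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original program.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice; use whatever name you set.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value could then be assigned to `api_key` before calling the crawling function.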

In [38]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
# Import libraries
import pandas as pd
from googleapiclient.discovery import build
import numpy as np
from string import punctuation
import re
import nltk

In [40]:
# Define a function that crawls the comments (and replies) of a video
def video_comments(video_id):
    # Collected rows: [publishedAt, authorDisplayName, text, likeCount]
    replies = []

    # Create the YouTube API client (api_key must be defined before the call)
    youtube = build('youtube', 'v3', developerKey=api_key)

    # Request the first page of comment threads
    video_response = youtube.commentThreads().list(part='snippet,replies', videoId=video_id).execute()

    # Walk through every page of results
    while video_response:
        for item in video_response['items']:
            # Top-level comment
            top = item['snippet']['topLevelComment']['snippet']
            replies.append([top['publishedAt'], top['authorDisplayName'],
                            top['textDisplay'], top['likeCount']])

            # Replies returned with the thread, if any
            if item['snippet']['totalReplyCount'] > 0:
                for reply in item['replies']['comments']:
                    snippet = reply['snippet']
                    replies.append([snippet['publishedAt'], snippet['authorDisplayName'],
                                    snippet['textDisplay'], snippet['likeCount']])

        # Fetch the next page, or stop when there is none
        if 'nextPageToken' in video_response:
            video_response = youtube.commentThreads().list(
                part='snippet,replies',
                pageToken=video_response['nextPageToken'],
                videoId=video_id
            ).execute()
        else:
            break
    return replies
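
The row-building part of the function can be isolated and checked without calling the API. A sketch, assuming a response item shaped like the `commentThreads` resource used above; `thread_to_rows` and the sample item are hypothetical. Note that the `replies` field of a thread may contain only a subset of the replies, so a thread with many replies may be incomplete.

```python
def thread_to_rows(item):
    """Flatten one comment thread into [publishedAt, author, text, likeCount] rows."""
    def row(snippet):
        return [snippet["publishedAt"], snippet["authorDisplayName"],
                snippet["textDisplay"], snippet["likeCount"]]

    rows = [row(item["snippet"]["topLevelComment"]["snippet"])]
    # 'replies' is absent when a thread has no replies, so default to an empty list
    for reply in item.get("replies", {}).get("comments", []):
        rows.append(row(reply["snippet"]))
    return rows

# Hand-written sample item (hypothetical data, not real API output)
sample = {
    "snippet": {
        "totalReplyCount": 1,
        "topLevelComment": {"snippet": {
            "publishedAt": "2023-05-10T02:25:37Z", "authorDisplayName": "User A",
            "textDisplay": "top-level comment", "likeCount": 0}},
    },
    "replies": {"comments": [{"snippet": {
        "publishedAt": "2023-05-10T12:13:44Z", "authorDisplayName": "User B",
        "textDisplay": "a reply", "likeCount": 2}}]},
}
rows = thread_to_rows(sample)
```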


In [42]:
# Fill in your own API key
api_key = 'YOUR_API_KEY'

# Video URL: https://youtu.be/4F1xyLpwxsw
video_id = "4F1xyLpwxsw"  # the video ID taken from the URL

# Call function
comments = video_comments(video_id)

comments

[['2023-05-10T12:06:58Z',
  'Ahmad Husen',
  'Tapi elit politik seperti cak imin erlangga susah untuk memberi tikat untuk mafud karna dah busuk',
  0],
 ['2023-05-10T12:05:21Z', 'Ciung Wanara', 'Ngabalin cocok untk ganjar', 0],
 ['2023-05-10T12:05:21Z',
  'Ahmad Husen',
  'Jika alloh meridhoi kita berharap Prabowo dengan mafud MD dulu memang kita dukung jokowi tapi insa alloh dukung prabowo mafud',
  0],
 ['2023-05-10T12:04:48Z',
  'Nisa Nisa',
  'Setiap kita akan mempertanggung jawabkan pilihan kita masing2<br>Jika pilihan kita yg kita pilih amanah,jujur dan adil (bersyukurlah)<br>Tapi...jika pilihan kita,tidak amanah,tidak jujur dan tidak adil kepada Rakyat dan berdusta <br>Maka seluruh pendukungnya akan ikut menanggung dosa2 pilihannya selama dia menjabat<br>Semoga Allah SWT...menolong indonesia dari calon2 yg pendusta.Aamiin',
  0],
 ['2023-05-10T11:41:14Z',
  'Deny Saputro',
  'Siapapun Capresnya, Wakilnya tetap PAK JENDERAL FREDY NUMBERI',
  0],
 ['2023-05-10T10:46:11Z',
  'ARIEF

In [43]:
# Convert the comments to a DataFrame
df = pd.DataFrame(comments, columns=['publishedAt', 'authorDisplayName', 'text', 'likeCount'])
df

| | publishedAt | authorDisplayName | text | likeCount |
|---|---|---|---|---|
| 0 | 2023-05-10T12:06:58Z | Ahmad Husen | Tapi elit politik seperti cak imin erlangga su... | 0 |
| 1 | 2023-05-10T12:05:21Z | Ciung Wanara | Ngabalin cocok untk ganjar | 0 |
| 2 | 2023-05-10T12:05:21Z | Ahmad Husen | Jika alloh meridhoi kita berharap Prabowo deng... | 0 |
| 3 | 2023-05-10T12:04:48Z | Nisa Nisa | Setiap kita akan mempertanggung jawabkan pilih... | 0 |
| 4 | 2023-05-10T11:41:14Z | Deny Saputro | Siapapun Capresnya, Wakilnya tetap PAK JENDERA... | 0 |
| ... | ... | ... | ... | ... |
| 191 | 2023-05-10T02:25:37Z | Langit Biru | Pengaruh cawapres hanya 1-3% .. | 0 |
| 192 | 2023-05-10T12:13:44Z | Langit Biru | @James Bond `<br>`bisa jadi.. kita lihat aja sur... | 0 |
| 193 | 2023-05-10T12:09:20Z | James Bond | Klo persaingannya ketat 1suara aja berpengaruh | 0 |
| 194 | 2023-05-10T02:17:25Z | Brunei Brunei | Akal busuk presiden ; 3 periode - tunda pemilu... | 1 |


In [44]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [45]:
df.to_csv('capres2024.csv', index=False)

## Preprocessing

### 1. Symbol & Punctuation Removal, Case Folding

In this stage, symbols and punctuation are removed and case folding is applied, i.e. every letter in the data is converted to lowercase.

In [46]:
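# Remove escape sequences, non-ASCII characters (e.g. emoji), and URL schemes
def remove_text_special(text):
  # Drop literal "\t", "\n", "\u" sequences and stray backslashes
  text = text.replace('\\t',"").replace('\\n',"").replace('\\u',"").replace('\\',"")
  # Replace every non-ASCII character (emoji included) with "?"
  text = text.encode('ascii', 'replace').decode('ascii')
  # Strip the URL scheme only; the rest of the URL is left in place
  return text.replace("http://"," ").replace("https://", " ")
df['text'] = df['text'].apply(remove_text_special)
print(df['text'])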

0      Tapi elit politik seperti cak imin erlangga su...
1                             Ngabalin cocok untk ganjar
2      Jika alloh meridhoi kita berharap Prabowo deng...
3      Setiap kita akan mempertanggung jawabkan pilih...
4      Siapapun Capresnya, Wakilnya tetap PAK JENDERA...
                             ...                        
191                      Pengaruh cawapres hanya 1-3% ..
192    @James Bond <br>bisa jadi.. kita lihat aja sur...
193       Klo persaingannya ketat 1suara aja berpengaruh
194    Akal busuk presiden ; 3 periode - tunda pemilu...
195    PAKET KOMPLIT_nya MASA DEPAN INDONESIA!!! <br>...
Name: text, Length: 196, dtype: object


In [47]:
# Remove mentions, punctuation, and URLs
def remove_tanda_baca(text):
  # Raw string so that \w, \/ and \S are passed to the regex engine unchanged
  text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text)
  return text

df['text'] = df['text'].apply(remove_tanda_baca)
df['text'].head(20)

0     Tapi elit politik seperti cak imin erlangga su...
1                            Ngabalin cocok untk ganjar
2     Jika alloh meridhoi kita berharap Prabowo deng...
3     Setiap kita akan mempertanggung jawabkan pilih...
4     Siapapun Capresnya  Wakilnya tetap PAK JENDERA...
5     Siapa capres yg mengganden pak mahfud siap kit...
6     BENER JUGA PERKIRAAN KEBANYAKAN MASYARAKAT   b...
7     SINGKAT NYA GINI  JANGAN PILIH PETUGAS PARTAI ...
8                                Musiknya mantap min   
9                               ANIES  SAID AQIL SIRADJ
10    Probowo vs Mahfud MD  Utk NKRI masa depan supa...
11    NU menjaga negeri ini dari keterpecahan  Wajar...
12                    GANJAR PRABOWO  br VS br ANIS AHY
13    Siapa saja capres yg bergabung dg TGB insaallo...
14    Jg pilih dr jawa saja klu dr jawa indo ngak ma...
15    Pak Ganjar di dampingi cawapres manapun tetap ...
16    Penguasa jg jurang klu Tgb masuk cawapres akan...
17    Yang pantas cawapres adalah antara ridwan 

In [48]:
# Remove digits
def remove_numbers (text):
  return re.sub(r"\d+", "", text)
df['text'] = df['text'].apply(remove_numbers)
df['text']

0      Tapi elit politik seperti cak imin erlangga su...
1                             Ngabalin cocok untk ganjar
2      Jika alloh meridhoi kita berharap Prabowo deng...
3      Setiap kita akan mempertanggung jawabkan pilih...
4      Siapapun Capresnya  Wakilnya tetap PAK JENDERA...
                             ...                        
191                        Pengaruh cawapres hanya      
192      Bond  br bisa jadi   kita lihat aja survei k...
193        Klo persaingannya ketat suara aja berpengaruh
194    Akal busuk presiden    periode   tunda pemilu ...
195    PAKET KOMPLIT nya MASA DEPAN INDONESIA     br ...
Name: text, Length: 196, dtype: object

In [49]:
# Case folding
def casefolding(Comment):
  Comment = Comment.lower()
  return Comment
df['text'] = df['text'].apply(casefolding)
df['text']

0      tapi elit politik seperti cak imin erlangga su...
1                             ngabalin cocok untk ganjar
2      jika alloh meridhoi kita berharap prabowo deng...
3      setiap kita akan mempertanggung jawabkan pilih...
4      siapapun capresnya  wakilnya tetap pak jendera...
                             ...                        
191                        pengaruh cawapres hanya      
192      bond  br bisa jadi   kita lihat aja survei k...
193        klo persaingannya ketat suara aja berpengaruh
194    akal busuk presiden    periode   tunda pemilu ...
195    paket komplit nya masa depan indonesia     br ...
Name: text, Length: 196, dtype: object
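
The output above still contains `br` tokens, which are left over from `<br>` tags in `textDisplay`, and later outputs show `amp` left over from `&amp;`. A sketch of a single cleaning function that removes HTML tags and entities before the other steps; `clean_comment` is a hypothetical helper, and stripping tags first is an assumption about the desired behaviour, not what the cells above do.

```python
import html
import re

def clean_comment(text):
    """Strip HTML, URLs, mentions, punctuation and digits, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags such as <br>
    text = html.unescape(text)                   # entities such as &amp;
    text = re.sub(r"\w+://\S+", " ", text)       # URLs
    text = re.sub(r"@[A-Za-z0-9]+", " ", text)   # @mentions (first word only)
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # anything that is not a letter
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("GANJAR PRABOWO <br>VS<br> ANIS AHY 2024!!"))
# prints: ganjar prabowo vs anis ahy
```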

### 2. Tokenizing
This stage performs tokenizing: splitting each sentence into individual words so that the text can be analysed further.

In [50]:
# Tokenization
# from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
# def word_tokenize(text):
#   tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
#   return tokenizer.tokenize(text)

df['review_token'] = df['text'].apply(lambda sentence: nltk.word_tokenize(sentence))
df['review_token']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0      [tapi, elit, politik, seperti, cak, imin, erla...
1                        [ngabalin, cocok, untk, ganjar]
2      [jika, alloh, meridhoi, kita, berharap, prabow...
3      [setiap, kita, akan, mempertanggung, jawabkan,...
4      [siapapun, capresnya, wakilnya, tetap, pak, je...
                             ...                        
191                          [pengaruh, cawapres, hanya]
192    [bond, br, bisa, jadi, kita, lihat, aja, surve...
193    [klo, persaingannya, ketat, suara, aja, berpen...
194    [akal, busuk, presiden, periode, tunda, pemilu...
195    [paket, komplit, nya, masa, depan, indonesia, ...
Name: review_token, Length: 196, dtype: object
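
Because the cleaned text contains only lowercase letters and whitespace at this point, splitting on whitespace is a rough equivalent of the tokenizer call for this data. A sketch without the NLTK dependency; `simple_tokenize` is a hypothetical helper, and NLTK may split a few English contractions differently.

```python
def simple_tokenize(text):
    """Split a cleaned comment on any run of whitespace."""
    return text.split()

tokens = simple_tokenize("siapapun capresnya  wakilnya tetap pak jenderal")
print(tokens)
```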

In [51]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [52]:
df['review_token'].to_csv('normalisasidata.csv', index=False)

### 3. Word Normalization
This stage normalizes the data, replacing non-standard (slang) words with their standard forms.

In [53]:
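# Normalize non-standard words
normalize = pd.read_csv("/content/drive/MyDrive/prosaindata/normalisasidata.csv")

normalize_word_dict = {}

# NOTE: iterrows() yields (index, row) pairs, so row[0] is the integer row
# index and row[1] is the row itself. The file read here is the token column
# saved in the previous cell, so no token ever matches a dictionary key and
# the output below is identical to review_token.
for row in normalize.iterrows():
  if row[0] not in normalize_word_dict:
    normalize_word_dict[row[0]] = row[1]

def normalized_term(comment):
  return [normalize_word_dict[term] if term in normalize_word_dict else term for term in comment]

df['comment_normalize'] = df['review_token'].apply(normalized_term)
df['comment_normalize'].head(20)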

0     [tapi, elit, politik, seperti, cak, imin, erla...
1                       [ngabalin, cocok, untk, ganjar]
2     [jika, alloh, meridhoi, kita, berharap, prabow...
3     [setiap, kita, akan, mempertanggung, jawabkan,...
4     [siapapun, capresnya, wakilnya, tetap, pak, je...
5     [siapa, capres, yg, mengganden, pak, mahfud, s...
6     [bener, juga, perkiraan, kebanyakan, masyaraka...
7     [singkat, nya, gini, jangan, pilih, petugas, p...
8                               [musiknya, mantap, min]
9                           [anies, said, aqil, siradj]
10    [probowo, vs, mahfud, md, utk, nkri, masa, dep...
11    [nu, menjaga, negeri, ini, dari, keterpecahan,...
12             [ganjar, prabowo, br, vs, br, anis, ahy]
13    [siapa, saja, capres, yg, bergabung, dg, tgb, ...
14    [jg, pilih, dr, jawa, saja, klu, dr, jawa, ind...
15    [pak, ganjar, di, dampingi, cawapres, manapun,...
16    [penguasa, jg, jurang, klu, tgb, masuk, cawapr...
17    [yang, pantas, cawapres, adalah, antara, r
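
For the lookup to change anything, the dictionary has to map a non-standard word to its standard form. A sketch under that assumption; the three slang entries below are illustrative examples, not the contents of the project's file, and `build_normalizer` is a hypothetical helper.

```python
def build_normalizer(pairs):
    """Return a function that maps tokens through a slang -> standard dictionary."""
    mapping = dict(pairs)

    def normalize_tokens(tokens):
        # Tokens missing from the dictionary are kept unchanged
        return [mapping.get(token, token) for token in tokens]

    return normalize_tokens

# Illustrative entries only (assumed, not taken from the original data file)
normalize_tokens = build_normalizer([("yg", "yang"), ("utk", "untuk"), ("klo", "kalau")])
print(normalize_tokens(["capres", "yg", "cocok", "utk", "nkri"]))
# prints: ['capres', 'yang', 'cocok', 'untuk', 'nkri']
```

With a two-column CSV of (non-standard, standard) pairs, the `pairs` argument could be built by zipping the two columns.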

### 4. Stopwords Removal
This stage removes words that carry little meaning. Stopword removal is applied twice: first using the Indonesian stopword corpus that ships with the Python library NLTK, then using the file 'list_stopwords'.

In [54]:
#Stopword Removal
nltk.download('stopwords')
from nltk.corpus import stopwords
txt_stopwords = stopwords.words('indonesian')

def stopwords_removal(filtering) :
  filtering = [word for word in filtering if word not in txt_stopwords]
  return filtering

df['stopwords_removal'] = df['comment_normalize'].apply(stopwords_removal)
df['stopwords_removal'].head(20)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0     [elit, politik, cak, imin, erlangga, susah, ti...
1                       [ngabalin, cocok, untk, ganjar]
2     [alloh, meridhoi, berharap, prabowo, mafud, md...
3     [mempertanggung, jawabkan, pilihan, br, piliha...
4       [capresnya, wakilnya, jenderal, fredy, numberi]
5     [capres, yg, mengganden, mahfud, dukung, perju...
6     [bener, perkiraan, kebanyakan, masyarakat, br,...
7     [singkat, nya, gini, pilih, petugas, partai, y...
8                               [musiknya, mantap, min]
9                           [anies, said, aqil, siradj]
10    [probowo, vs, mahfud, md, utk, nkri, negara, a...
11    [nu, menjaga, negeri, keterpecahan, wajar, cap...
12             [ganjar, prabowo, br, vs, br, anis, ahy]
13    [capres, yg, bergabung, dg, tgb, insaalloh, su...
14    [jg, pilih, dr, jawa, klu, dr, jawa, indo, nga...
15    [ganjar, dampingi, cawapres, manapun, kalah, p...
16    [penguasa, jg, jurang, klu, tgb, masuk, cawapr...
17             [cawapres, ridwan, kamil, sandiag

In [55]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [56]:
df['stopwords_removal'].to_csv('stopwords1.csv', index=False)

In [57]:
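# Stopword removal, second pass
data_stopwords = pd.read_csv("/content/drive/MyDrive/prosaindata/stopwords1.csv")
print(data_stopwords)

# NOTE: `word not in data_stopwords` tests membership against the DataFrame's
# column labels (here only 'stopwords_removal'), not its cell values, so this
# pass leaves the tokens unchanged, as the output below shows.
def stopwords_removal2(tokens) :
  tokens = [word for word in tokens if word not in data_stopwords]
  return tokens

df['stopwords_removal_final'] = df['stopwords_removal'].apply(stopwords_removal2)
df['stopwords_removal_final'].head(20)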

                                     stopwords_removal
0    ['elit', 'politik', 'cak', 'imin', 'erlangga',...
1              ['ngabalin', 'cocok', 'untk', 'ganjar']
2    ['alloh', 'meridhoi', 'berharap', 'prabowo', '...
3    ['mempertanggung', 'jawabkan', 'pilihan', 'br'...
4    ['capresnya', 'wakilnya', 'jenderal', 'fredy',...
..                                                 ...
191                           ['pengaruh', 'cawapres']
192  ['bond', 'br', 'lihat', 'aja', 'survei', 'kred...
193  ['klo', 'persaingannya', 'ketat', 'suara', 'aj...
194  ['akal', 'busuk', 'presiden', 'periode', 'tund...
195  ['paket', 'komplit', 'nya', 'indonesia', 'br',...

[196 rows x 1 columns]


0     [elit, politik, cak, imin, erlangga, susah, ti...
1                       [ngabalin, cocok, untk, ganjar]
2     [alloh, meridhoi, berharap, prabowo, mafud, md...
3     [mempertanggung, jawabkan, pilihan, br, piliha...
4       [capresnya, wakilnya, jenderal, fredy, numberi]
5     [capres, yg, mengganden, mahfud, dukung, perju...
6     [bener, perkiraan, kebanyakan, masyarakat, br,...
7     [singkat, nya, gini, pilih, petugas, partai, y...
8                               [musiknya, mantap, min]
9                           [anies, said, aqil, siradj]
10    [probowo, vs, mahfud, md, utk, nkri, negara, a...
11    [nu, menjaga, negeri, keterpecahan, wajar, cap...
12             [ganjar, prabowo, br, vs, br, anis, ahy]
13    [capres, yg, bergabung, dg, tgb, insaalloh, su...
14    [jg, pilih, dr, jawa, klu, dr, jawa, indo, nga...
15    [ganjar, dampingi, cawapres, manapun, kalah, p...
16    [penguasa, jg, jurang, klu, tgb, masuk, cawapr...
17             [cawapres, ridwan, kamil, sandiag
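
A second stopword pass needs the words themselves, for example as a set, rather than a DataFrame. A sketch assuming the custom list is available as plain strings; the words in `extra_stopwords` are illustrative, not the project's list, and `remove_extra_stopwords` is a hypothetical helper.

```python
# Illustrative custom stopwords (assumed, not the contents of 'list_stopwords')
extra_stopwords = {"yg", "nya", "br", "aja"}

def remove_extra_stopwords(tokens, stopwords=extra_stopwords):
    """Drop every token that appears in the custom stopword set."""
    return [word for word in tokens if word not in stopwords]

print(remove_extra_stopwords(["singkat", "nya", "gini", "pilih", "petugas", "partai", "yg"]))
# prints: ['singkat', 'gini', 'pilih', 'petugas', 'partai']
```

A set is used because membership tests on it do not slow down as the list grows.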

### 5. Stemming
This stage performs stemming: reducing each word to its base (root) form by stripping its affixes.

In [58]:
pip install sastrawi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [59]:
pip install swifter

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [60]:
# Stemming
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import string
import swifter
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming(term):
  return stemmer.stem(term)

# Collect every unique term so that each one is stemmed only once
term_dict = {}
for document in df['stopwords_removal_final']:
  for term in document:
    if term not in term_dict:
      term_dict[term] = ''


In [61]:
print(len(term_dict))
print("-----------------------------")

1238
-----------------------------


In [62]:
for term in term_dict:
  term_dict[term] = stemming(term)
  print(term,":",term_dict[term])

print(term_dict)
print("-----------------------------")

elit : elit
politik : politik
cak : cak
imin : imin
erlangga : erlangga
susah : susah
tikat : tikat
mafud : mafud
karna : karna
dah : dah
busuk : busuk
ngabalin : ngabalin
cocok : cocok
untk : untk
ganjar : ganjar
alloh : alloh
meridhoi : meridhoi
berharap : harap
prabowo : prabowo
md : md
dukung : dukung
jokowi : jokowi
insa : insa
mempertanggung : tanggung
jawabkan : jawab
pilihan : pilih
br : br
yg : yg
pilih : pilih
amanah : amanah
jujur : jujur
adil : adil
bersyukurlah : syukur
rakyat : rakyat
berdusta : dusta
pendukungnya : dukung
menanggung : tanggung
dosa : dosa
pilihannya : pilih
menjabat : jabat
semoga : moga
allah : allah
swt : swt
menolong : tolong
indonesia : indonesia
calon : calon
pendusta : dusta
aamiin : aamiin
capresnya : capresnya
wakilnya : wakil
jenderal : jenderal
fredy : fredy
numberi : numberi
capres : capres
mengganden : ganden
mahfud : mahfud
perjuangkan : juang
bener : bener
perkiraan : kira
kebanyakan : banyak
masyarakat : masyarakat
nyinyirin : nyinyirin
ju

In [63]:
def get_stemming(document):
  return [term_dict[term] for term in document]

In [64]:
df['stemming'] = df['stopwords_removal_final'].swifter.apply(get_stemming)

Pandas Apply:   0%|          | 0/196 [00:00<?, ?it/s]

In [65]:
print(df['stemming'])

0      [elit, politik, cak, imin, erlangga, susah, ti...
1                        [ngabalin, cocok, untk, ganjar]
2      [alloh, meridhoi, harap, prabowo, mafud, md, d...
3      [tanggung, jawab, pilih, br, pilih, yg, pilih,...
4           [capresnya, wakil, jenderal, fredy, numberi]
                             ...                        
191                                 [pengaruh, cawapres]
192      [bond, br, lihat, aja, survei, kredibel, depan]
193            [klo, saing, ketat, suara, aja, pengaruh]
194    [akal, busuk, presiden, periode, tunda, milu, ...
195    [paket, komplit, nya, indonesia, br, br, terus...
Name: stemming, Length: 196, dtype: object
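
The cells above stem each unique term once and then look the results up in `term_dict`. The same idea can be sketched with a small cache around any stemming function; `toy_stem` below is a deliberately simplistic stand-in so the example runs without Sastrawi, and it is not the Sastrawi algorithm.

```python
def make_cached_stemmer(stem):
    """Wrap a stem(word) callable so each distinct word is stemmed only once."""
    cache = {}

    def stem_tokens(tokens):
        for token in tokens:
            if token not in cache:
                cache[token] = stem(token)
        return [cache[token] for token in tokens]

    return stem_tokens, cache

def toy_stem(word):
    # Stand-in only: strips the suffix "nya"; a real run would pass stemmer.stem
    return word[:-3] if word.endswith("nya") and len(word) > 5 else word

stem_tokens, cache = make_cached_stemmer(toy_stem)
print(stem_tokens(["wakilnya", "pilih", "wakilnya", "musiknya"]))
# prints: ['wakil', 'pilih', 'wakil', 'musik']
```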


In [66]:
df.head(20)

Unnamed: 0,publishedAt,authorDisplayName,text,likeCount,review_token,comment_normalize,stopwords_removal,stopwords_removal_final,stemming
0,2023-05-10T12:06:58Z,Ahmad Husen,tapi elit politik seperti cak imin erlangga su...,0,"[tapi, elit, politik, seperti, cak, imin, erla...","[tapi, elit, politik, seperti, cak, imin, erla...","[elit, politik, cak, imin, erlangga, susah, ti...","[elit, politik, cak, imin, erlangga, susah, ti...","[elit, politik, cak, imin, erlangga, susah, ti..."
1,2023-05-10T12:05:21Z,Ciung Wanara,ngabalin cocok untk ganjar,0,"[ngabalin, cocok, untk, ganjar]","[ngabalin, cocok, untk, ganjar]","[ngabalin, cocok, untk, ganjar]","[ngabalin, cocok, untk, ganjar]","[ngabalin, cocok, untk, ganjar]"
2,2023-05-10T12:05:21Z,Ahmad Husen,jika alloh meridhoi kita berharap prabowo deng...,0,"[jika, alloh, meridhoi, kita, berharap, prabow...","[jika, alloh, meridhoi, kita, berharap, prabow...","[alloh, meridhoi, berharap, prabowo, mafud, md...","[alloh, meridhoi, berharap, prabowo, mafud, md...","[alloh, meridhoi, harap, prabowo, mafud, md, d..."
3,2023-05-10T12:04:48Z,Nisa Nisa,setiap kita akan mempertanggung jawabkan pilih...,0,"[setiap, kita, akan, mempertanggung, jawabkan,...","[setiap, kita, akan, mempertanggung, jawabkan,...","[mempertanggung, jawabkan, pilihan, br, piliha...","[mempertanggung, jawabkan, pilihan, br, piliha...","[tanggung, jawab, pilih, br, pilih, yg, pilih,..."
4,2023-05-10T11:41:14Z,Deny Saputro,siapapun capresnya wakilnya tetap pak jendera...,0,"[siapapun, capresnya, wakilnya, tetap, pak, je...","[siapapun, capresnya, wakilnya, tetap, pak, je...","[capresnya, wakilnya, jenderal, fredy, numberi]","[capresnya, wakilnya, jenderal, fredy, numberi]","[capresnya, wakil, jenderal, fredy, numberi]"
5,2023-05-10T10:46:11Z,ARIEF PRASETYAWAN,siapa capres yg mengganden pak mahfud siap kit...,0,"[siapa, capres, yg, mengganden, pak, mahfud, s...","[siapa, capres, yg, mengganden, pak, mahfud, s...","[capres, yg, mengganden, mahfud, dukung, perju...","[capres, yg, mengganden, mahfud, dukung, perju...","[capres, yg, ganden, mahfud, dukung, juang]"
6,2023-05-10T10:41:24Z,shasa flower,bener juga perkiraan kebanyakan masyarakat b...,0,"[bener, juga, perkiraan, kebanyakan, masyaraka...","[bener, juga, perkiraan, kebanyakan, masyaraka...","[bener, perkiraan, kebanyakan, masyarakat, br,...","[bener, perkiraan, kebanyakan, masyarakat, br,...","[bener, kira, banyak, masyarakat, br, nyinyiri..."
7,2023-05-10T10:27:36Z,Pantang menyerah,singkat nya gini jangan pilih petugas partai ...,0,"[singkat, nya, gini, jangan, pilih, petugas, p...","[singkat, nya, gini, jangan, pilih, petugas, p...","[singkat, nya, gini, pilih, petugas, partai, y...","[singkat, nya, gini, pilih, petugas, partai, y...","[singkat, nya, gin, pilih, tugas, partai, yg, ..."
8,2023-05-10T10:03:47Z,Inisial A,musiknya mantap min,0,"[musiknya, mantap, min]","[musiknya, mantap, min]","[musiknya, mantap, min]","[musiknya, mantap, min]","[musik, mantap, min]"
9,2023-05-10T09:57:03Z,Orna 13,anies said aqil siradj,0,"[anies, said, aqil, siradj]","[anies, said, aqil, siradj]","[anies, said, aqil, siradj]","[anies, said, aqil, siradj]","[anies, said, aqil, siradj]"


## Feature Extraction (TF-IDF)

In [67]:
# Join the stemmed tokens back into one space-separated string per comment
def joinkata(data):
  kalimat = ""
  for i in data:
    kalimat += i
    kalimat += " "
  return kalimat

text = df['stemming'].swifter.apply(joinkata)
text

Pandas Apply:   0%|          | 0/196 [00:00<?, ?it/s]

0      elit politik cak imin erlangga susah tikat maf...
1                            ngabalin cocok untk ganjar 
2      alloh meridhoi harap prabowo mafud md dukung j...
3      tanggung jawab pilih br pilih yg pilih amanah ...
4                capresnya wakil jenderal fredy numberi 
                             ...                        
191                                   pengaruh cawapres 
192             bond br lihat aja survei kredibel depan 
193                  klo saing ketat suara aja pengaruh 
194    akal busuk presiden periode tunda milu begal d...
195    paket komplit nya indonesia br br terus amp ti...
Name: stemming, Length: 196, dtype: object

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

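# Vectorize the documents using TF-IDF
# NOTE: stop_words='english' applies scikit-learn's built-in English list;
# the Indonesian stopwords were already removed during preprocessing.
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1)
                        )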

# Fit and Transform the documents
X = tfidf.fit_transform(text)

In [69]:
print(X)

  (0, 146)	0.31621329452572544
  (0, 187)	0.29866728401409676
  (0, 433)	0.29866728401409676
  (0, 556)	0.31621329452572544
  (0, 993)	0.3409430073935891
  (0, 940)	0.29866728401409676
  (0, 259)	0.3409430073935891
  (0, 347)	0.28505754092655167
  (0, 151)	0.28505754092655167
  (0, 768)	0.2369687566133016
  (0, 249)	0.28505754092655167
  (1, 280)	0.344285341571776
  (1, 1040)	0.6029064204677796
  (1, 179)	0.39303345900350933
  (1, 657)	0.6029064204677796
  (2, 355)	0.24023269072015174
  (2, 401)	0.21277904671477038
  (2, 238)	0.41245652125917354
  (2, 597)	0.1382194004484567
  (2, 778)	0.3006597354908991
  (2, 324)	0.22928570932336564
  (2, 612)	0.2742371208107168
  (2, 31)	0.4804653814403035
  (2, 556)	0.5086916086986842
  (3, 1)	0.14594049703574266
  :	:
  (195, 1022)	0.06673052477371308
  (195, 780)	0.06673052477371308
  (195, 25)	0.06673052477371308
  (195, 18)	0.06673052477371308
  (195, 841)	0.06673052477371308
  (195, 562)	0.06673052477371308
  (195, 629)	0.06673052477371308
  (

In [70]:
df_tfidf = pd.DataFrame(
    X.toarray().T, columns=[f'D{i+1}' for i in range(len(text))], index=tfidf.get_feature_names_out()
)
df_tfidf

| | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | ... | D187 | D188 | D189 | D190 | D191 | D192 | D193 | D194 | D195 | D196 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aah | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| aamiin | 0.0 | 0.0 | 0.0 | 0.145940 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| abadi | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| abu | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| adil | 0.0 | 0.0 | 0.0 | 0.263123 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.055792 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| yg | 0.0 | 0.0 | 0.0 | 0.140689 | 0.0 | 0.249535 | 0.0 | 0.135608 | 0.0 | 0.0 | ... | 0.0 | 0.10772 | 0.0 | 0.0 | 0.106456 | 0.0 | 0.0 | 0.0 | 0.107657 | 0.000000 |
| yohanes | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| yudikatif | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.061890 |
| yusuf | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.00000 | 0.0 | 0.0 | 0.238131 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
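
With its default settings, `TfidfVectorizer` weights raw term counts by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales each document vector to unit L2 length. A standard-library sketch of that weighting on three toy documents; `tfidf_rows` is a hypothetical helper that splits on whitespace and skips the vectorizer's own tokenisation and stop-word handling.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(["ganjar cocok", "prabowo mahfud cocok", "anies ahy"])
print(rows[0])
```

In the first toy document, `ganjar` receives a higher weight than `cocok` because `cocok` also appears in a second document.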


## Latent Semantic Analysis (LSA)

In [71]:
from sklearn.decomposition import TruncatedSVD

In [72]:
# Define the number of topics or components
num_components=10

# Create SVD object
lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42)

# Fit SVD model on data
lsa.fit_transform(X)

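# Get the singular values and components.
# components_ has shape (num_components, n_terms), i.e. it is already V^T,
# so its transpose below is the terms-by-topics matrix V.
Sigma = lsa.singular_values_
V_transpose = lsa.components_.T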
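
For reference, the decomposition that `TruncatedSVD` approximates can be written as follows, with $X$ the documents-by-terms TF-IDF matrix ($n = 196$ documents, $m$ vocabulary terms) and $k$ equal to `num_components`. This is the textbook truncated SVD; the symbols are chosen here and are not taken from the notebook.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{n \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k^{\top} \in \mathbb{R}^{k \times m}
```

In scikit-learn terms, `lsa.singular_values_` holds the $\sigma_i$, `lsa.components_` holds $V_k^{\top}$ (one row per topic), and `fit_transform` returns the document coordinates $U_k \Sigma_k$.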

In [73]:
# Print the topics with their terms
terms = tfidf.get_feature_names_out()

for index, component in enumerate(lsa.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:5]
    top_terms_list=list(dict(top_terms_key).keys())
    print("Topic "+str(index)+": ",top_terms_list)

Topic 0:  ['mahfud', 'md', 'ahy', 'anis', 'prabowo']
Topic 1:  ['ahy', 'anis', 'anies', 'br', 'mantap']
Topic 2:  ['br', 'yg', 'partai', 'tugas', 'pilih']
Topic 3:  ['anies', 'juara', 'ahok', 'menang', 'ideal']
Topic 4:  ['prabowo', 'mahfudz', 'anies', 'airlangga', 'best']
Topic 5:  ['br', 'ganjar', 'pranowo', 'md', 'vs']
Topic 6:  ['menang', 'ganjar', 'pranowo', 'wakil', 'ahok']
Topic 7:  ['mahmud', 'md', 'menang', 'pilih', 'partai']
Topic 8:  ['ganjar', 'partai', 'pranowo', 'cocok', 'tugas']
Topic 9:  ['cocok', 'mahmud', 'pilih', 'presiden', 'gatot']
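
The loop above ranks terms by signed weight. Because the sign of an SVD component is arbitrary, one variant is to rank by absolute weight so that strongly negative terms also surface. A sketch on made-up numbers; `top_terms` and the weights are hypothetical and do not come from the fitted model.

```python
def top_terms(terms, weights, n=5, by_abs=False):
    """Return the n terms with the largest (optionally absolute) weights."""
    key = (lambda pair: abs(pair[1])) if by_abs else (lambda pair: pair[1])
    ranked = sorted(zip(terms, weights), key=key, reverse=True)
    return [term for term, _ in ranked[:n]]

terms = ["ganjar", "prabowo", "anies", "mahfud"]
weights = [0.10, 0.45, -0.60, 0.30]  # made-up component weights

print(top_terms(terms, weights, n=2))               # ['prabowo', 'mahfud']
print(top_terms(terms, weights, n=2, by_abs=True))  # ['anies', 'prabowo']
```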
