# Topic Modelling

## Crawling Data from YouTube Comments

This program retrieves the comments on a YouTube video using the YouTube Data API v3. Before running it, make sure the YouTube Data API service has been enabled and an API key has been generated.

If you do not have an API key yet, follow these steps to obtain one:

1.   Log in to the Google Developer Console (https://console.developers.google.com/) with your Google account.
2.   Create a new project and fill in the requested information.
3.   On the project page, open the API services, search for YouTube Data API v3, and enable it.
4.   Create credentials so that the API can be used. Click the "Create Credential" button and complete the form.

You can view the API key on the Credentials tab of the dashboard.
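
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The following is a minimal sketch, assuming the key is stored under the hypothetical variable name `YOUTUBE_API_KEY`; neither the name nor the helper `load_api_key` is part of the original program.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value could then be assigned to `api_key` before the crawling function is called.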

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# import libraries
import pandas as pd
from googleapiclient.discovery import build
import numpy as np
from string import punctuation
import re
import nltk

In [3]:
# Define a function that crawls the comment data
def video_comments(video_id):
	# list that collects every comment and reply row
	replies = []

	# creating youtube resource object
	youtube = build('youtube', 'v3', developerKey=api_key)

	# retrieve youtube video results
	video_response = youtube.commentThreads().list(part='snippet,replies', videoId=video_id).execute()

	# iterate video response
	while video_response:
		
		# extracting required info
		# from each result object
		for item in video_response['items']:
			
			# Extract publication time and author name
			published = item['snippet']['topLevelComment']['snippet']['publishedAt']
			user = item['snippet']['topLevelComment']['snippet']['authorDisplayName']

			# Extracting comments
			comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
			likeCount = item['snippet']['topLevelComment']['snippet']['likeCount']

			replies.append([published, user, comment, likeCount])
			
			# number of replies to this comment
			replycount = item['snippet']['totalReplyCount']

			# if the comment has replies
			if replycount>0:
				# iterate through the replies returned with the thread
				for reply in item['replies']['comments']:
					
					# Extract reply
					published = reply['snippet']['publishedAt']
					user = reply['snippet']['authorDisplayName']
					repl = reply['snippet']['textDisplay']
					likeCount = reply['snippet']['likeCount']
					
					# Store the reply in the list
					replies.append([published, user, repl, likeCount])

			# print comment with list of reply
			#print(comment, replies, end = '\n\n')

			# empty reply list
			#replies = []

		# Fetch the next page, if there is one
		if 'nextPageToken' in video_response:
			video_response = youtube.commentThreads().list(
					part = 'snippet,replies',
					pageToken = video_response['nextPageToken'], 
					videoId = video_id
				).execute()
		else:
			break
	#endwhile
	return replies
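
The row-building step inside the loop can be isolated and checked without calling the API. The following is a minimal sketch, assuming a response item shaped like the `commentThreads` items read above; the helper name `thread_to_rows` and the sample item are hypothetical and exist only for illustration.

```python
def thread_to_rows(item):
    """Turn one commentThreads item into [publishedAt, author, text, likes] rows."""
    top = item['snippet']['topLevelComment']['snippet']
    rows = [[top['publishedAt'], top['authorDisplayName'],
             top['textDisplay'], top['likeCount']]]
    # use .get() so a thread without a 'replies' key yields no reply rows
    for reply in item.get('replies', {}).get('comments', []):
        s = reply['snippet']
        rows.append([s['publishedAt'], s['authorDisplayName'],
                     s['textDisplay'], s['likeCount']])
    return rows

# Hypothetical item, for illustration only
sample = {
    'snippet': {
        'totalReplyCount': 1,
        'topLevelComment': {'snippet': {
            'publishedAt': '2023-05-10T00:00:00Z',
            'authorDisplayName': 'user_a',
            'textDisplay': 'first comment',
            'likeCount': 2}},
    },
    'replies': {'comments': [{'snippet': {
        'publishedAt': '2023-05-10T01:00:00Z',
        'authorDisplayName': 'user_b',
        'textDisplay': 'a reply',
        'likeCount': 0}}]},
}
rows = thread_to_rows(sample)
print(len(rows))  # 2: one top-level comment plus one reply
```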


In [4]:
# fill in your own API key
api_key = 'YOUR_API_KEY'

# video URL: https://youtu.be/lcpdHr51a2s
video_id = "lcpdHr51a2s"  # fill in the video code / ID

# Call function
comments = video_comments(video_id)

comments

[['2023-05-10T12:08:32Z',
  'Putra Putri',
  'Ora pengaruh bro,,bah jokowi endorse bah cawe cawe. <br>ingat 2024 joe biden tidak tinggal diam.!! Catat.',
  0],
 ['2023-05-10T08:55:05Z',
  'Taufik Hasan',
  'Jelas dong JK &amp; Paloh ketar ketir,, kalau sampe ANIS kalah hilang kesempatan mereka ngeruk <b>DUIT</b> Rakyat,, melakui Proyek2 BUMN, karna jaman JokoWi mereka kekunci,, mangkanya <b>BOSOWA</b> Oleng,, Karna itu Keluarga JK turun bela ANIS,,,, 😅🤣',
  1],
 ['2023-05-10T08:04:32Z', 'a vian', 'jusufkala anda itu orqngtua loh', 0],
 ['2023-05-10T07:12:57Z',
  'ilham bobotoh',
  'MURTAD YOHANES BUNGLON  PENGHIANAT BANGSA CAPRES KO HINA NENARA TAMU',
  0],
 ['2023-05-10T07:12:25Z',
  'ilham bobotoh',
  'MURTAD YOHANES BINGLON TAKUT KALAH MAKANYA DIA TAKUT KALAH',
  1],
 ['2023-05-10T07:11:39Z',
  'ilham bobotoh',
  'JK ITU PENGHIANAT BANGSA SI BUNGLON DARI YAMAN  SMUA KADRUN BADJINGAN KABEH PENGEN KORUPSI',
  0],
 ['2023-05-10T05:05:43Z', 'Is Kandar', 'Barangkali ada modal yg masuk', 

In [5]:
# convert the rows to a DataFrame
df = pd.DataFrame(comments, columns=['publishedAt', 'authorDisplayName', 'text', 'likeCount'])
df

Unnamed: 0,publishedAt,authorDisplayName,text,likeCount
0,2023-05-10T12:08:32Z,Putra Putri,"Ora pengaruh bro,,bah jokowi endorse bah cawe ...",0
1,2023-05-10T08:55:05Z,Taufik Hasan,"Jelas dong JK &amp; Paloh ketar ketir,, kalau ...",1
2,2023-05-10T08:04:32Z,a vian,jusufkala anda itu orqngtua loh,0
3,2023-05-10T07:12:57Z,ilham bobotoh,MURTAD YOHANES BUNGLON PENGHIANAT BANGSA CAPR...,0
4,2023-05-10T07:12:25Z,ilham bobotoh,MURTAD YOHANES BINGLON TAKUT KALAH MAKANYA DIA...,1
...,...,...,...,...
102,2023-05-09T14:42:19Z,Muhamad satim Da'wa,Beritakan jg dgn ayat ini dlm buku bibel yehez...,0
103,2023-05-09T10:04:15Z,Agus S Singa terbangun,Lll,0
104,2023-05-09T10:03:51Z,Agus S Singa terbangun,NAMA NYA AJA AGAMA TERORIS BAJINGAN YG RATUSAN...,0
105,2023-05-08T08:52:27Z,vertikal cell,biiasanya calon yg suka saling serang orang la...,0


In [6]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [7]:
df.to_csv('capres2024.csv', index=False)

## Preprocessing

### 1. Symbol & Punctuation Removal, Case Folding

In this stage, preprocessing removes symbols and punctuation and applies case folding, that is, converting every letter in the data to lowercase.
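
As a compact illustration of this stage, the sketch below chains the cleaning steps into one function. It also decodes HTML entities and strips tags such as `<br>` first, which is an addition that the notebook's own cells do not perform; the function name `clean_text` is hypothetical.

```python
import html
import re

def clean_text(text):
    """Strip HTML, URLs, mentions, symbols and digits, then lowercase."""
    text = html.unescape(text)                   # '&amp;' -> '&'
    text = re.sub(r"<[^>]+>", " ", text)         # drop tags such as <br>, <b>
    text = re.sub(r"\w+://\S+", " ", text)       # drop URLs
    text = re.sub(r"@[A-Za-z0-9]+", " ", text)   # drop @mentions
    text = re.sub(r"[^A-Za-z \t]", " ", text)    # drop punctuation, digits, emoji
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()                          # case folding

print(clean_text("Jelas dong JK &amp; Paloh,, <b>DUIT</b> 2024 😅"))
# jelas dong jk paloh duit
```

Decoding entities and removing tags before the punctuation step keeps fragments such as `amp`, `br`, and `b` out of the token list.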

In [8]:
# remove escape sequences, non-ASCII symbols, and emoji
def remove_text_special(text):
  text = text.replace('\\t',"").replace('\\n',"").replace('\\u',"").replace('\\',"")
  # non-ASCII characters (including emoji) are replaced with '?'
  text = text.encode('ascii', 'replace').decode('ascii')
  return text.replace("http://"," ").replace("https://", " ")
df['text'] = df['text'].apply(remove_text_special)
print(df['text'])

0      Ora pengaruh bro,,bah jokowi endorse bah cawe ...
1      Jelas dong JK &amp; Paloh ketar ketir,, kalau ...
2                        jusufkala anda itu orqngtua loh
3      MURTAD YOHANES BUNGLON  PENGHIANAT BANGSA CAPR...
4      MURTAD YOHANES BINGLON TAKUT KALAH MAKANYA DIA...
                             ...                        
102    Beritakan jg dgn ayat ini dlm buku bibel yehez...
103                                                  Lll
104    NAMA NYA AJA AGAMA TERORIS BAJINGAN YG RATUSAN...
105    biiasanya calon yg suka saling serang orang la...
106    Kalau calon ga serang menyerang?yang kepanasan...
Name: text, Length: 107, dtype: object


In [9]:
# remove mentions, punctuation, and URLs
def remove_tanda_baca(text):
  text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text)
  return text

df['text'] = df['text'].apply(remove_tanda_baca)
df['text'].head(20)

0     Ora pengaruh bro  bah jokowi endorse bah cawe ...
1     Jelas dong JK  amp  Paloh ketar ketir   kalau ...
2                       jusufkala anda itu orqngtua loh
3     MURTAD YOHANES BUNGLON  PENGHIANAT BANGSA CAPR...
4     MURTAD YOHANES BINGLON TAKUT KALAH MAKANYA DIA...
5     JK ITU PENGHIANAT BANGSA SI BUNGLON DARI YAMAN...
6                         Barangkali ada modal yg masuk
7                  Saya tidak pro Jokowi dari awal yaa 
8     Semua ingin jadi presiden  meskipun Tdk jadi p...
9     Biarlah urusan politik jgn di kaitkan dg agama...
10    Kumis tipis masih belum puas karena pa di era ...
11     Si caplin jk ketar ketir  Maju terus pak jokowi 
12     Jokowi  petugas partai bukan negarawan sejati   
13                             rekam jejak Anis benar  
14              Awas JK udh terpilih cmn duduk di kursi
15                Chaplin nih tiap pilpres pasti nongol
16    Marilah dukung channel TANPARAGI untuk pendidi...
17    Marilah dukung channel TANPARAGI untuk pen

In [10]:
# remove digits
def remove_numbers(text):
  return re.sub(r"\d+", "", text)
df['text'] = df['text'].apply(remove_numbers)
df['text']

0      Ora pengaruh bro  bah jokowi endorse bah cawe ...
1      Jelas dong JK  amp  Paloh ketar ketir   kalau ...
2                        jusufkala anda itu orqngtua loh
3      MURTAD YOHANES BUNGLON  PENGHIANAT BANGSA CAPR...
4      MURTAD YOHANES BINGLON TAKUT KALAH MAKANYA DIA...
                             ...                        
102    Beritakan jg dgn ayat ini dlm buku bibel yehez...
103                                                  Lll
104    NAMA NYA AJA AGAMA TERORIS BAJINGAN YG RATUSAN...
105    biiasanya calon yg suka saling serang orang la...
106    Kalau calon ga serang menyerang yang kepanasan...
Name: text, Length: 107, dtype: object

In [11]:
# case folding: convert the text to lowercase
def casefolding(Comment):
  Comment = Comment.lower()
  return Comment
df['text'] = df['text'].apply(casefolding)
df['text']

0      ora pengaruh bro  bah jokowi endorse bah cawe ...
1      jelas dong jk  amp  paloh ketar ketir   kalau ...
2                        jusufkala anda itu orqngtua loh
3      murtad yohanes bunglon  penghianat bangsa capr...
4      murtad yohanes binglon takut kalah makanya dia...
                             ...                        
102    beritakan jg dgn ayat ini dlm buku bibel yehez...
103                                                  lll
104    nama nya aja agama teroris bajingan yg ratusan...
105    biiasanya calon yg suka saling serang orang la...
106    kalau calon ga serang menyerang yang kepanasan...
Name: text, Length: 107, dtype: object

### 2. Tokenizing
In this stage the text is tokenized. Tokenizing splits a sentence into its individual words so that the text can be analyzed further.
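
To make the idea concrete: once the text has been cleaned and lowercased, tokenizing amounts to splitting on whitespace. The sketch below is a simplified stand-in for `nltk.word_tokenize`, not a replacement for it, and the function name `simple_tokenize` is hypothetical.

```python
def simple_tokenize(sentence):
    """Split an already-cleaned sentence into word tokens on whitespace."""
    return sentence.split()

print(simple_tokenize("jusufkala anda itu orqngtua loh"))
# ['jusufkala', 'anda', 'itu', 'orqngtua', 'loh']
```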

In [12]:
# tokenization
# from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
# def word_tokenize(text):
#   tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
#   return tokenizer.tokenize(text)

df['review_token'] = df['text'].apply(lambda sentence: nltk.word_tokenize(sentence))
df['review_token']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0      [ora, pengaruh, bro, bah, jokowi, endorse, bah...
1      [jelas, dong, jk, amp, paloh, ketar, ketir, ka...
2                  [jusufkala, anda, itu, orqngtua, loh]
3      [murtad, yohanes, bunglon, penghianat, bangsa,...
4      [murtad, yohanes, binglon, takut, kalah, makan...
                             ...                        
102    [beritakan, jg, dgn, ayat, ini, dlm, buku, bib...
103                                                [lll]
104    [nama, nya, aja, agama, teroris, bajingan, yg,...
105    [biiasanya, calon, yg, suka, saling, serang, o...
106    [kalau, calon, ga, serang, menyerang, yang, ke...
Name: review_token, Length: 107, dtype: object

In [13]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [14]:
df['review_token'].to_csv('normalisasidata.csv', index=False)

### 3. Word Normalization
In this stage the data is normalized, that is, non-standard (informal) words are converted to their standard forms.
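
A minimal sketch of dictionary-based normalization follows. The mapping below is a small hypothetical example of informal spellings and their standard forms, not the lookup file used in this notebook, and `normalize_tokens` is an illustrative name.

```python
# Hypothetical mapping from informal spellings to standard Indonesian words
normalize_map = {
    "yg": "yang",
    "tdk": "tidak",
    "jgn": "jangan",
    "dgn": "dengan",
    "udh": "sudah",
}

def normalize_tokens(tokens, mapping=normalize_map):
    """Replace each token found in the mapping; keep the others unchanged."""
    return [mapping.get(token, token) for token in tokens]

print(normalize_tokens(["barangkali", "ada", "modal", "yg", "masuk"]))
# ['barangkali', 'ada', 'modal', 'yang', 'masuk']
```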

In [15]:
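# normalize non-standard words
normalize = pd.read_csv("/content/drive/MyDrive/prosaindata/normalisasidata.csv")

normalize_word_dict = {}

# iterrows() yields (index, row) pairs; the mapping needs two columns:
# the non-standard word and its standard form. A file with fewer columns
# (such as the one-column token export saved above) adds no entries,
# so the tokens pass through unchanged.
for _, row in normalize.iterrows():
  if len(row) >= 2 and row.iloc[0] not in normalize_word_dict:
    normalize_word_dict[row.iloc[0]] = row.iloc[1]

def normalized_term(comment):
  return [normalize_word_dict.get(term, term) for term in comment]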

df['comment_normalize'] = df['review_token'].apply(normalized_term)
df['comment_normalize'].head(20)

0     [ora, pengaruh, bro, bah, jokowi, endorse, bah...
1     [jelas, dong, jk, amp, paloh, ketar, ketir, ka...
2                 [jusufkala, anda, itu, orqngtua, loh]
3     [murtad, yohanes, bunglon, penghianat, bangsa,...
4     [murtad, yohanes, binglon, takut, kalah, makan...
5     [jk, itu, penghianat, bangsa, si, bunglon, dar...
6                   [barangkali, ada, modal, yg, masuk]
7           [saya, tidak, pro, jokowi, dari, awal, yaa]
8     [semua, ingin, jadi, presiden, meskipun, tdk, ...
9     [biarlah, urusan, politik, jgn, di, kaitkan, d...
10    [kumis, tipis, masih, belum, puas, karena, pa,...
11    [si, caplin, jk, ketar, ketir, maju, terus, pa...
12    [jokowi, petugas, partai, bukan, negarawan, se...
13                          [rekam, jejak, anis, benar]
14     [awas, jk, udh, terpilih, cmn, duduk, di, kursi]
15         [chaplin, nih, tiap, pilpres, pasti, nongol]
16    [marilah, dukung, channel, tanparagi, untuk, p...
17    [marilah, dukung, channel, tanparagi, untu

### 4. Stopwords Removal
In this stage preprocessing removes words that carry little meaning. Stopword removal is applied twice: first using the corpus that ships with the Python library nltk, and then using the file 'list_stopwords'.
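
The two passes can be sketched as follows. Both stopword sets are small hypothetical stand-ins for the nltk corpus and the custom file, and `remove_stopwords` is an illustrative name; sets are used so that each membership test is fast.

```python
# Hypothetical stand-ins for the two stopword sources
corpus_stopwords = {"saya", "tidak", "dari", "awal", "itu", "anda"}
custom_stopwords = {"yaa", "loh", "nih"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["saya", "tidak", "pro", "jokowi", "dari", "awal", "yaa"]
first_pass = remove_stopwords(tokens, corpus_stopwords)
second_pass = remove_stopwords(first_pass, custom_stopwords)
print(first_pass)   # ['pro', 'jokowi', 'yaa']
print(second_pass)  # ['pro', 'jokowi']
```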

In [16]:
#Stopword Removal
nltk.download('stopwords')
from nltk.corpus import stopwords
txt_stopwords = stopwords.words('indonesian')

def stopwords_removal(filtering) :
  filtering = [word for word in filtering if word not in txt_stopwords]
  return filtering

df['stopwords_removal'] = df['comment_normalize'].apply(stopwords_removal)
df['stopwords_removal'].head(20)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0     [ora, pengaruh, bro, bah, jokowi, endorse, bah...
1     [jk, amp, paloh, ketar, ketir, sampe, anis, ka...
2                            [jusufkala, orqngtua, loh]
3     [murtad, yohanes, bunglon, penghianat, bangsa,...
4     [murtad, yohanes, binglon, takut, kalah, takut...
5     [jk, penghianat, bangsa, si, bunglon, yaman, s...
6                        [barangkali, modal, yg, masuk]
7                                    [pro, jokowi, yaa]
8     [presiden, tdk, presiden, negara, presiden, ru...
9     [biarlah, urusan, politik, jgn, kaitkan, dg, a...
10    [kumis, tipis, puas, pa, era, jokowi, bisnis, ...
11         [si, caplin, jk, ketar, ketir, maju, jokowi]
12         [jokowi, petugas, partai, negarawan, sejati]
13                                 [rekam, jejak, anis]
14         [awas, jk, udh, terpilih, cmn, duduk, kursi]
15                      [chaplin, nih, pilpres, nongol]
16    [marilah, dukung, channel, tanparagi, pendidik...
17    [marilah, dukung, channel, tanparagi, pend

In [17]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [18]:
df['stopwords_removal'].to_csv('stopwords1.csv', index=False)

In [19]:
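# second stopword removal pass
data_stopwords = pd.read_csv("/content/drive/MyDrive/prosaindata/stopwords1.csv")
print(data_stopwords)

# `word in DataFrame` only tests the column labels, so compare against the
# values of the first column instead (expected: one stopword per row)
extra_stopwords = set(data_stopwords.iloc[:, 0].astype(str))

def stopwords_removal2(tokens):
  return [word for word in tokens if word not in extra_stopwords]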

df['stopwords_removal_final'] = df['stopwords_removal'].apply(stopwords_removal2)
df['stopwords_removal_final'].head(20)

                                     stopwords_removal
0    ['ora', 'pengaruh', 'bro', 'bah', 'jokowi', 'e...
1    ['jk', 'amp', 'paloh', 'ketar', 'ketir', 'samp...
2                     ['jusufkala', 'orqngtua', 'loh']
3    ['murtad', 'yohanes', 'bunglon', 'penghianat',...
4    ['murtad', 'yohanes', 'binglon', 'takut', 'kal...
..                                                 ...
102  ['beritakan', 'jg', 'dgn', 'ayat', 'dlm', 'buk...
103                                            ['lll']
104  ['nama', 'nya', 'aja', 'agama', 'teroris', 'ba...
105  ['biiasanya', 'calon', 'yg', 'suka', 'serang',...
106  ['calon', 'ga', 'serang', 'menyerang', 'kepana...

[107 rows x 1 columns]


0     [ora, pengaruh, bro, bah, jokowi, endorse, bah...
1     [jk, amp, paloh, ketar, ketir, sampe, anis, ka...
2                            [jusufkala, orqngtua, loh]
3     [murtad, yohanes, bunglon, penghianat, bangsa,...
4     [murtad, yohanes, binglon, takut, kalah, takut...
5     [jk, penghianat, bangsa, si, bunglon, yaman, s...
6                        [barangkali, modal, yg, masuk]
7                                    [pro, jokowi, yaa]
8     [presiden, tdk, presiden, negara, presiden, ru...
9     [biarlah, urusan, politik, jgn, kaitkan, dg, a...
10    [kumis, tipis, puas, pa, era, jokowi, bisnis, ...
11         [si, caplin, jk, ketar, ketir, maju, jokowi]
12         [jokowi, petugas, partai, negarawan, sejati]
13                                 [rekam, jejak, anis]
14         [awas, jk, udh, terpilih, cmn, duduk, kursi]
15                      [chaplin, nih, pilpres, nongol]
16    [marilah, dukung, channel, tanparagi, pendidik...
17    [marilah, dukung, channel, tanparagi, pend

### 5. Stemming
In this stage the tokens are stemmed. Stemming maps each word to its base (root) form by stripping its affixes.
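
Because stemming is comparatively slow, it helps to stem each distinct term once and reuse the result. The sketch below illustrates that lookup-table pattern; `toy_stem` is a deliberately crude, hypothetical stand-in for a real stemmer and only strips a trailing `kan` or `an`.

```python
def build_stem_dict(documents, stem):
    """Stem each distinct term once and return a term -> stem lookup."""
    term_dict = {}
    for document in documents:
        for term in document:
            if term not in term_dict:
                term_dict[term] = stem(term)
    return term_dict

def toy_stem(term):
    """Toy stand-in for a real stemmer: strips a trailing 'kan' or 'an'."""
    for suffix in ("kan", "an"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

docs = [["beritakan", "ayat"], ["kaitkan", "ayat", "agama"]]
stems = build_stem_dict(docs, toy_stem)
print([[stems[t] for t in doc] for doc in docs])
# [['berita', 'ayat'], ['kait', 'ayat', 'agama']]
```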

In [20]:
pip install sastrawi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [21]:
pip install swifter

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [22]:
# stemming
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import string
import swifter
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming (term):
  return stemmer.stem(term)

term_dict = {}
for document in df['stopwords_removal_final']:
  for term in document:
    if term not in term_dict:
      term_dict[term] = ''


In [23]:
print(len(term_dict))
print("-----------------------------")

749
-----------------------------


In [24]:
for term in term_dict:
  term_dict[term] = stemming(term)
  print(term,":",term_dict[term])

print(term_dict)
print("-----------------------------")

ora : ora
pengaruh : pengaruh
bro : bro
bah : bah
jokowi : jokowi
endorse : endorse
cawe : cawe
br : br
joe : joe
biden : biden
tinggal : tinggal
diam : diam
catat : catat
jk : jk
amp : amp
paloh : paloh
ketar : ketar
ketir : tir
sampe : sampe
anis : anis
kalah : kalah
hilang : hilang
kesempatan : sempat
ngeruk : ngeruk
b : b
duit : duit
rakyat : rakyat
melakui : laku
proyek : proyek
bumn : bumn
karna : karna
jaman : jaman
kekunci : kunci
mangkanya : mangkanya
bosowa : bosowa
oleng : oleng
keluarga : keluarga
turun : turun
bela : bela
jusufkala : jusufkala
orqngtua : orqngtua
loh : loh
murtad : murtad
yohanes : yohanes
bunglon : bunglon
penghianat : penghianat
bangsa : bangsa
capres : capres
ko : ko
hina : hina
nenara : nenara
tamu : tamu
binglon : binglon
takut : takut
si : si
yaman : yaman
smua : smua
kadrun : kadrun
badjingan : badjingan
kabeh : kabeh
pengen : ken
korupsi : korupsi
barangkali : barangkali
modal : modal
yg : yg
masuk : masuk
pro : pro
yaa : yaa
presiden : presiden
td

In [25]:
def get_stemming(document):
  return [term_dict[term] for term in document]

In [26]:
df['stemming'] = df['stopwords_removal_final'].swifter.apply(get_stemming)


In [27]:
print(df['stemming'])

0      [ora, pengaruh, bro, bah, jokowi, endorse, bah...
1      [jk, amp, paloh, ketar, tir, sampe, anis, kala...
2                             [jusufkala, orqngtua, loh]
3      [murtad, yohanes, bunglon, penghianat, bangsa,...
4      [murtad, yohanes, binglon, takut, kalah, takut...
                             ...                        
102    [berita, jg, dgn, ayat, dlm, buku, bibel, yehe...
103                                                [lll]
104    [nama, nya, aja, agama, teroris, bajing, yg, r...
105    [biiasanya, calon, yg, suka, serang, orang, jl...
106    [calon, ga, serang, serang, panas, dukung, bos...
Name: stemming, Length: 107, dtype: object


In [28]:
df.head(20)

Unnamed: 0,publishedAt,authorDisplayName,text,likeCount,review_token,comment_normalize,stopwords_removal,stopwords_removal_final,stemming
0,2023-05-10T12:08:32Z,Putra Putri,ora pengaruh bro bah jokowi endorse bah cawe ...,0,"[ora, pengaruh, bro, bah, jokowi, endorse, bah...","[ora, pengaruh, bro, bah, jokowi, endorse, bah...","[ora, pengaruh, bro, bah, jokowi, endorse, bah...","[ora, pengaruh, bro, bah, jokowi, endorse, bah...","[ora, pengaruh, bro, bah, jokowi, endorse, bah..."
1,2023-05-10T08:55:05Z,Taufik Hasan,jelas dong jk amp paloh ketar ketir kalau ...,1,"[jelas, dong, jk, amp, paloh, ketar, ketir, ka...","[jelas, dong, jk, amp, paloh, ketar, ketir, ka...","[jk, amp, paloh, ketar, ketir, sampe, anis, ka...","[jk, amp, paloh, ketar, ketir, sampe, anis, ka...","[jk, amp, paloh, ketar, tir, sampe, anis, kala..."
2,2023-05-10T08:04:32Z,a vian,jusufkala anda itu orqngtua loh,0,"[jusufkala, anda, itu, orqngtua, loh]","[jusufkala, anda, itu, orqngtua, loh]","[jusufkala, orqngtua, loh]","[jusufkala, orqngtua, loh]","[jusufkala, orqngtua, loh]"
3,2023-05-10T07:12:57Z,ilham bobotoh,murtad yohanes bunglon penghianat bangsa capr...,0,"[murtad, yohanes, bunglon, penghianat, bangsa,...","[murtad, yohanes, bunglon, penghianat, bangsa,...","[murtad, yohanes, bunglon, penghianat, bangsa,...","[murtad, yohanes, bunglon, penghianat, bangsa,...","[murtad, yohanes, bunglon, penghianat, bangsa,..."
4,2023-05-10T07:12:25Z,ilham bobotoh,murtad yohanes binglon takut kalah makanya dia...,1,"[murtad, yohanes, binglon, takut, kalah, makan...","[murtad, yohanes, binglon, takut, kalah, makan...","[murtad, yohanes, binglon, takut, kalah, takut...","[murtad, yohanes, binglon, takut, kalah, takut...","[murtad, yohanes, binglon, takut, kalah, takut..."
5,2023-05-10T07:11:39Z,ilham bobotoh,jk itu penghianat bangsa si bunglon dari yaman...,0,"[jk, itu, penghianat, bangsa, si, bunglon, dar...","[jk, itu, penghianat, bangsa, si, bunglon, dar...","[jk, penghianat, bangsa, si, bunglon, yaman, s...","[jk, penghianat, bangsa, si, bunglon, yaman, s...","[jk, penghianat, bangsa, si, bunglon, yaman, s..."
6,2023-05-10T05:05:43Z,Is Kandar,barangkali ada modal yg masuk,1,"[barangkali, ada, modal, yg, masuk]","[barangkali, ada, modal, yg, masuk]","[barangkali, modal, yg, masuk]","[barangkali, modal, yg, masuk]","[barangkali, modal, yg, masuk]"
7,2023-05-10T04:55:47Z,RDS88 official,saya tidak pro jokowi dari awal yaa,1,"[saya, tidak, pro, jokowi, dari, awal, yaa]","[saya, tidak, pro, jokowi, dari, awal, yaa]","[pro, jokowi, yaa]","[pro, jokowi, yaa]","[pro, jokowi, yaa]"
8,2023-05-10T04:45:58Z,Aki Genit 79,semua ingin jadi presiden meskipun tdk jadi p...,1,"[semua, ingin, jadi, presiden, meskipun, tdk, ...","[semua, ingin, jadi, presiden, meskipun, tdk, ...","[presiden, tdk, presiden, negara, presiden, ru...","[presiden, tdk, presiden, negara, presiden, ru...","[presiden, tdk, presiden, negara, presiden, ru..."
9,2023-05-10T03:56:32Z,Sahi Rudin,biarlah urusan politik jgn di kaitkan dg agama...,0,"[biarlah, urusan, politik, jgn, di, kaitkan, d...","[biarlah, urusan, politik, jgn, di, kaitkan, d...","[biarlah, urusan, politik, jgn, kaitkan, dg, a...","[biarlah, urusan, politik, jgn, kaitkan, dg, a...","[biar, urus, politik, jgn, kait, dg, agama, bi..."


## Feature Extraction (TF-IDF)

In [29]:
def joinkata(data):
  kalimat = ""
  for i in data:
    kalimat += i
    kalimat += " "
  return kalimat

text = df['stemming'].swifter.apply(joinkata)
text


0      ora pengaruh bro bah jokowi endorse bah cawe c...
1      jk amp paloh ketar tir sampe anis kalah hilang...
2                                jusufkala orqngtua loh 
3      murtad yohanes bunglon penghianat bangsa capre...
4        murtad yohanes binglon takut kalah takut kalah 
                             ...                        
102    berita jg dgn ayat dlm buku bibel yehezkiel a ...
103                                                 lll 
104    nama nya aja agama teroris bajing yg ratus bun...
105         biiasanya calon yg suka serang orang jla tk 
106       calon ga serang serang panas dukung bos yaman 
Name: stemming, Length: 107, dtype: object

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

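# Vectorize the documents using TF-IDF
# (stop_words='english' filters English words only; the Indonesian
# stop words were already removed during preprocessing)
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1)
                        )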

# Fit and Transform the documents
X = tfidf.fit_transform(text)
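
To show what the weights mean, the sketch below recomputes TF-IDF by hand using the scheme that `TfidfVectorizer` applies with its default settings: raw term counts, the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each document vector. It splits on whitespace only, so it does not reproduce the vectorizer's tokenization or stop-word filtering, and the function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF weights with smoothed IDF."""
    n = len(documents)
    tokenized = [doc.split() for doc in documents]
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(["rekam jejak anis", "pro jokowi", "jokowi petugas partai"])
print(rows[1])
```

In the second document, `jokowi` receives a lower weight than `pro` because it also occurs in another document.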

In [31]:
print(X)

  (0, 122)	0.2509770726148333
  (0, 150)	0.2509770726148333
  (0, 628)	0.2509770726148333
  (0, 80)	0.2509770726148333
  (0, 265)	0.2509770726148333
  (0, 97)	0.1879552670570942
  (0, 123)	0.39142001901273193
  (0, 174)	0.20488192310072553
  (0, 268)	0.15308563254945073
  (0, 44)	0.5019541452296666
  (0, 99)	0.23057964401400946
  (0, 472)	0.2509770726148333
  (0, 447)	0.2509770726148333
  (1, 67)	0.1934840655358963
  (1, 646)	0.1934840655358963
  (1, 294)	0.15794824213755035
  (1, 444)	0.1934840655358963
  (1, 95)	0.1934840655358963
  (1, 361)	0.17775921317768006
  (1, 325)	0.17775921317768006
  (1, 256)	0.1934840655358963
  (1, 287)	0.3869681310717926
  (1, 108)	0.1934840655358963
  (1, 512)	0.1666022528746911
  (1, 327)	0.17775921317768006
  :	:
  (104, 8)	0.05957632240178965
  (104, 144)	0.2206832822883817
  (104, 184)	0.05517082057209542
  (104, 405)	0.062368355791993686
  (104, 438)	0.13088316042295325
  (104, 246)	0.10673423262777654
  (104, 3)	0.10350729574879614
  (104, 148)	0.

In [32]:
df_tfidf = pd.DataFrame(
    X.toarray().T, columns=[f'D{i+1}' for i in range(len(text))], index=tfidf.get_feature_names_out()
)
df_tfidf

Unnamed: 0,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,...,D98,D99,D100,D101,D102,D103,D104,D105,D106,D107
aamiiin,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
abu,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
abud,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
agama,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.241344,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.103507,0.0,0.0
agus,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.057216,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
yoh,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.028979,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
yohanes,0.0,0.0,0.0,0.294951,0.304746,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
youtube,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.049905,0.0,0.0,0.182345,0.0,0.000000,0.0,0.0
zombi,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.076400,0.0,0.0


## Latent Semantic Analysis (LSA)

In [33]:
from sklearn.decomposition import TruncatedSVD

In [34]:
# Define the number of topics or components
num_components=10

# Create SVD object
lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42)

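# Fit the SVD model and keep the document-topic matrix
doc_topic = lsa.fit_transform(X)

# Get the singular values and components
# (components_ is already V^T, with shape n_components x n_terms)
Sigma = lsa.singular_values_
V_transpose = lsa.components_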
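
For reference, the decomposition behind this step can be written out as follows. This is a summary of the standard truncated SVD, with $X$ the document-by-term TF-IDF matrix and $k$ the number of components (10 here); the symbols are introduced for this note and do not appear in the code.

```latex
X \approx U_k \, \Sigma_k \, V_k^{\top},
\qquad
X \in \mathbb{R}^{n_{\text{docs}} \times n_{\text{terms}}},\;
U_k \in \mathbb{R}^{n_{\text{docs}} \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k^{\top} \in \mathbb{R}^{k \times n_{\text{terms}}}
```

In scikit-learn's `TruncatedSVD`, `singular_values_` holds the diagonal of $\Sigma_k$, `components_` holds $V_k^{\top}$ (one row per topic), and `fit_transform` returns $U_k \Sigma_k$ (one row per document).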

In [35]:
# Print the topics with their terms
terms = tfidf.get_feature_names_out()

for index, component in enumerate(lsa.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:5]
    top_terms_list=list(dict(top_terms_key).keys())
    print("Topic "+str(index)+": ",top_terms_list)

Topic 0:  ['bajing', 'teroris', 'tembak', 'mati', 'dasar']
Topic 1:  ['ll', 'sarak', 'kemiri', 'lll', 'yana']
Topic 2:  ['anies', 'yg', 'maju', 'jk', 'anis']
Topic 3:  ['lll', 'sarak', 'yana', 'yg', 'anies']
Topic 4:  ['maju', 'anies', 'channel', 'didik', 'filosof']
Topic 5:  ['anies', 'presiden', 'banget', 'jd', 'negara']
Topic 6:  ['anis', 'presiden', 'ri', 'mantap', 'bos']
Topic 7:  ['presiden', 'yg', 'channel', 'didik', 'filosof']
Topic 8:  ['cawe', 'jokowi', 'presiden', 'br', 'si']
Topic 9:  ['cawe', 'br', 'mantap', 'ri', 'anis']
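
The term-ranking step can be checked in isolation. The following is a minimal sketch with hypothetical weights, not values taken from the fitted model; `top_terms` is an illustrative name. It ranks by signed weight, as the loop above does, so terms with strongly negative weights are never listed.

```python
def top_terms(terms, component, k=5):
    """Return the k terms with the largest weights in one component."""
    ranked = sorted(zip(terms, component), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:k]]

# Hypothetical weights, for illustration only
terms = ["anies", "cawe", "jokowi", "presiden", "yg"]
component = [0.10, 0.62, 0.48, 0.31, -0.05]
print(top_terms(terms, component, k=3))
# ['cawe', 'jokowi', 'presiden']
```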
