# TF-IDF & Vector Space Model


## What is the VSM (Vector Space Model)?


The Vector Space Model (VSM) is a mathematical framework used in information retrieval and natural language processing (NLP) to represent and analyze text data. The VSM is central to text mining, document search, and text-based machine learning tasks such as document classification, information retrieval, and text similarity analysis.


![image.png](https://i0.wp.com/spotintelligence.com/wp-content/uploads/2023/09/vector-space-model.jpg?resize=960%2C540&ssl=1)


Each dimension corresponds to a unique term, while documents and queries are represented as vectors in that space.
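
To make the idea of documents and queries as vectors concrete, here is a minimal sketch. The three-term vocabulary, the count vectors, and the use of cosine similarity as the comparison measure are illustrative assumptions, not taken from the notebook's data.

```python
import math

# Hypothetical three-term vocabulary: each position is one dimension of the space
vocabulary = ["israel", "kpk", "rudal"]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = [2, 0, 1]   # mentions "israel" twice and "rudal" once
doc_b = [0, 3, 0]   # mentions only "kpk"
query = [1, 0, 1]   # a query about "israel" and "rudal"

print(cosine_similarity(query, doc_a))  # close to 1: query and doc_a point in similar directions
print(cosine_similarity(query, doc_b))  # 0.0: no shared terms
```

A document that shares no terms with the query is orthogonal to it, so its similarity is zero regardless of its length.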


## Term Frequency-Inverse Document Frequency (TF-IDF)


TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a text within a collection, or corpus. The relevance score grows with the number of times the word appears in the text, but is offset by how frequently the word occurs across the corpus.


## Term Frequency


In a document d, the term frequency is the number of times the term t occurs. A term is therefore treated as more relevant to a text the more often it appears in it, and dividing by the document length makes this measure relative. Because the order of terms is not taken into account, a text can be described as a vector in a bag-of-terms model: every distinct term in the text has an entry whose value is its term frequency.


**tf(t,d) = count of t in d / number of words in d**
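
The formula above can be sketched in a few lines of plain Python. The function name `term_frequency` and the sample tokens are hypothetical and only serve to illustrate the arithmetic.

```python
from collections import Counter

def term_frequency(tokens):
    """Map each term to count(term) / len(tokens), following the formula above."""
    total = len(tokens)
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

tokens = "rudal israel hantam masjid rudal".split()  # hypothetical preprocessed document
tf = term_frequency(tokens)
print(tf["rudal"])   # 2 occurrences out of 5 tokens
print(tf["israel"])  # 1 occurrence out of 5 tokens
```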


## Inverse Document Frequency


Inverse document frequency measures how informative a term is across the corpus. The main goal of a search is to find the documents that match the query. Because tf treats every term as equally important, term frequency alone cannot be used to weight a term within a document. First, find the document frequency of a term t by counting the number of documents that contain it; the inverse document frequency is then defined as:


**idf(t) = log(N/ df(t))**


Explanation:<br>
**df(t)** = **N(t)**<br>
where<br>
**N** = **Total number of documents in the corpus**<br>
**df(t)** = **Document frequency of a term t**<br>
**N(t)** = **Number of documents containing the term t**<br>
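
As a small worked example of idf(t) = log(N / df(t)), the sketch below counts document frequencies over a hypothetical four-document corpus. The natural logarithm is an assumption here, since the formula does not state a base.

```python
import math

def inverse_document_frequency(documents):
    """Return idf(t) = log(N / df(t)) for every term in a list of token lists."""
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

documents = [
    ["rudal", "israel"],
    ["israel", "gaza"],
    ["kpk", "korupsi"],
    ["israel", "lebanon"],
]
idf = inverse_document_frequency(documents)
print(idf["israel"])  # appears in 3 of 4 documents, so the weight is low
print(idf["kpk"])     # appears in 1 of 4 documents, so the weight is high
```

A term that occurs in every document gets log(1) = 0, which is how common words are pushed toward zero weight.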


## TF-IDF (Term Frequency-Inverse Document Frequency)


**tf-idf(t, d) = tf(t, d) \* idf(t)**
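
The sketch below multiplies the two quantities for a hypothetical three-document corpus. It also includes a second variant, because scikit-learn's `TfidfVectorizer` with default settings does not, to the best of my understanding of its documentation, use the textbook formula exactly: it uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df(t))) + 1, and then L2-normalises each document vector. Both functions are illustrative re-implementations, not the library's code.

```python
import math
from collections import Counter

def tfidf_textbook(documents):
    """tf-idf(t, d) = (count / len(d)) * log(N / df(t)) for each document."""
    n_docs = len(documents)
    df = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def tfidf_smoothed_l2(documents):
    """Variant with raw counts, idf = ln((1 + N) / (1 + df)) + 1, and L2-normalised rows."""
    n_docs = len(documents)
    df = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(value * value for value in raw.values()))
        weights.append({term: value / norm for term, value in raw.items()})
    return weights

documents = [
    ["rudal", "israel", "rudal"],
    ["israel", "gaza"],
    ["kpk", "korupsi"],
]
textbook = tfidf_textbook(documents)
smoothed = tfidf_smoothed_l2(documents)
print(textbook[0])
print(smoothed[0])
```

In both variants "rudal" outweighs "israel" in the first document, since it occurs more often there and in fewer documents overall; only the absolute values differ.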


## TF-IDF workflow: turning the corpus into a VSM (Vector Space Model)


### Import the required libraries/tools


In [15]:
# Libraries for data manipulation
import pandas as pd
from tqdm import tqdm
import re
import string

# Libraries for text preprocessing
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# nltk.download('stopwords')
# nltk.download('punkt_tab')

# Library for text vectorization/TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Library for saving the model
import pickle

- **pandas** builds the dataframes so the data is easy to read.
- **tqdm** shows a progress bar while the program runs.
- **re** (regular expressions) matches patterns in words and sentences.
- **string** supplies the punctuation characters stripped during cleaning.
- **nltk** (Natural Language Toolkit) handles language-related (text) processing, here tokenization and stopwords.
- **sklearn** supports data processing for machine learning and data science. This assignment uses TfidfVectorizer to compute TF-IDF.
- **sastrawi** reduces affixed words to their base form (stemming for Indonesian).
- **pickle** saves the model.


### Import the news data from CSV


In [16]:
data = pd.read_csv("../tugas1/data_100new.csv")
data.columns = data.columns.str.strip()
data

Unnamed: 0,judul,tanggal,isi,kategori
0,Putin Ungkap Tetap Bikin Ukraina Target Eksper...,"Minggu, 24 Nov 2024 13:50 WIB",Presiden Rusia Vladimir Putin menyebut akan te...,internasional
1,VIDEO: Detik-detik Rudal Israel Hantam Masjid ...,"Minggu, 24 Nov 2024 13:40 WIB",Rekaman menunjukkan detik-detik rudal Israel m...,internasional
2,RI-Australia Sepakat Tukar Tahanan Pelaku Jari...,"Minggu, 24 Nov 2024 13:20 WIB",Australia menyebut Indonesia telah setuju untu...,internasional
3,Daftar Negara Anggota ICC yang Wajib Tangkap N...,"Minggu, 24 Nov 2024 12:22 WIB",Mahkamah Kriminal Internasional (International...,internasional
4,"RUDAL: Beda Zionisme, Yahudi, dan Antisemit ya...","Minggu, 24 Nov 2024 10:45 WIB",Agresi brutal Israel ke Jalur Gaza Palestina m...,internasional
...,...,...,...,...
95,Ahok Bersyukur Anak Abah-Ahoker Kompak Dukung ...,"Sabtu, 23 Nov 2024 17:20 WIB",Mantan Gubernur DKI Jakarta Basuki Tjahaja Pur...,nasional
96,"Ahok Hadiri Kampanye, Pramono Janji Tuntaskan ...","Sabtu, 23 Nov 2024 17:14 WIB",Calon Gubernur DKI Jakarta nomor urut 3 Pramon...,nasional
97,"Sebelum Kampanye Terakhir, Cagub Ahmad Luthfi ...","Sabtu, 23 Nov 2024 17:08 WIB",Calon gubernur (cagub) Ahmad Luthfi berkesempa...,nasional
98,"Kampanye Akbar RK-Suswono, Elite PKB dan NasDe...","Sabtu, 23 Nov 2024 16:54 WIB",Elite Partai NasDem dan PKB tak tampak hadir k...,nasional


Import the previously saved news CSV data<br>
and strip whitespace from the column headers.


### Shuffling the data


In [17]:
data = data.sample(frac = 1, ignore_index=True)

Shuffle the data, which was previously ordered by category in a 50:50 split, into random order.


### The clean_text() function


In [18]:
def clean_text(text):
	text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text) # Remove https* and www* links
	text = re.sub(r'@[^\s]+', ' ', text) # Remove usernames
	text = re.sub(r'[\s]+', ' ', text) # Collapse repeated whitespace
	text = re.sub(r'#([^\s]+)', ' ', text) # Remove hashtags
	text = re.sub(r'rt', ' ', text) # Remove retweet markers (unanchored: also strips "rt" inside words, e.g. "partner" -> "pa ner")
	text = text.translate(str.maketrans("","",string.punctuation)) # Remove punctuation
	text = re.sub(r'\d', ' ', text) # Remove digits
	text = text.lower()
	text = text.encode('ascii','ignore').decode('utf-8') # Remove non-ASCII characters
	text = re.sub(r'[^\x00-\x7f]',r'', text)
	text = text.replace('\n','') # Remove newlines
	text = text.strip()
	return text

This function cleans the text, for example by removing hashtags, non-ASCII characters, and so on.
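
One side effect of the retweet rule in `clean_text()` is visible in the `cleaned_text` output further down ("pa ner", "depa emen", "pe ahanan"): the pattern `r'rt'` matches those two letters anywhere, including inside ordinary words. The sketch below contrasts it with a word-boundary pattern, `r'\brt\b'`, which is one possible fix and not what the notebook actually ran; the sample string is made up.

```python
import re

text = "rt partner departemen pertahanan"

# Unanchored pattern: also removes "rt" inside longer words
unanchored = re.sub(r'rt', ' ', text)

# Word-boundary pattern: removes "rt" only when it stands alone as a token
anchored = re.sub(r'\brt\b', ' ', text)

print(unanchored.split())  # ['pa', 'ner', 'depa', 'emen', 'pe', 'ahanan']
print(anchored.split())    # ['partner', 'departemen', 'pertahanan']
```

Since the corpus here is news articles rather than tweets, dropping the retweet rule altogether would be another reasonable option.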


### The stemming_indo() function


In [19]:
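# Build the Sastrawi stemmer once instead of on every call
stemmer = StemmerFactory().create_stemmer()

def stemming_indo(tokens):
	# Stem each token, then join the stems back into one space-separated string
	return ' '.join(stemmer.stem(word) for word in tokens)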

This function stems words, reducing them to their root form, for example:

katanya = kata<br>
menggunakan = guna


### The clean_stopword() function


In [20]:
def clean_stopword(tokens):
	listStopword =  set(stopwords.words('indonesian'))
	removed = []
	for t in tokens:
		if t not in listStopword:
			removed.append(t)
	return removed

This function discards words that carry little meaning (stopwords), such as:

di, dan, etc.
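
The filtering step can be tried without downloading the NLTK data by substituting a tiny stopword set. The four words below are a hypothetical stand-in for NLTK's full Indonesian list, and `remove_stopwords` is an illustrative name.

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's full Indonesian list
STOPWORDS = {"di", "dan", "yang", "untuk"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

tokens = ["serangan", "udara", "di", "beirut", "dan", "gaza"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['serangan', 'udara', 'beirut', 'gaza']
```

Using a set rather than a list keeps each membership check constant-time, which matters once the stopword list has hundreds of entries.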


### Preprocessing the text of every document


In [21]:
def preprocess_text(content):
	result = []
	for text in tqdm(content):
		cleaned_text = clean_text(text)
		tokens = word_tokenize(cleaned_text)
		cleaned_stopword = clean_stopword(tokens)
		stemmed_text = stemming_indo(cleaned_stopword)
		result.append(stemmed_text)
	return result

data['cleaned_text'] = preprocess_text(data['isi'])

100%|██████████| 100/100 [00:02<00:00, 39.41it/s]


This step prepares the text of each document using the functions defined above; the result is then passed to TF-IDF and the VSM.


### TF-IDF computation and VSM construction


#### Split data


In [22]:
data_train = data[:80]
data_test = data[80:]
data_train

Unnamed: 0,judul,tanggal,isi,kategori,cleaned_text
0,Perkembangan Terkini Pengusutan Kasus Firli Ba...,"Minggu, 24 Nov 2024 06:16 WIB",Kasus yang menjerat eks Ketua KPK Komjen Pol (...,nasional,jerat eks ketua kpk komjen pol purn firli bahu...
1,"Israel Gempur Beirut Pakai Rudal, 11 Orang Tew...","Sabtu, 23 Nov 2024 19:29 WIB",Serangan udara Israel di jantung ibukota Leban...,internasional,serang udara israel jantung ibukota lebanon be...
2,FOTO: Gunung Berapi Islandia Meletus 7 Kali Us...,"Sabtu, 23 Nov 2024 12:20 WIB","Gunung berapi di Semenanjung Reykjanes Lyings,...",internasional,gunung rap semenanjung reykjanes lyings island...
3,RI-Australia Sepakat Tukar Tahanan Pelaku Jari...,"Minggu, 24 Nov 2024 13:20 WIB",Australia menyebut Indonesia telah setuju untu...,internasional,australia sebut indonesia tuju be ukar tahan p...
4,"ICC Rilis Surat Penangkapan, Netanyahu Jadi Bu...","Jumat, 22 Nov 2024 16:20 WIB",Perdana Menteri Israel Benjamin Netanyahu menj...,internasional,perdana menteri israel benjamin netanyahu buro...
...,...,...,...,...,...
75,FOTO: Wajah AKP Dadang di Kasus Polisi Tembak ...,"Sabtu, 23 Nov 2024 18:40 WIB",Polda Sumbar menetapkan Kabag Ops Polres Solok...,nasional,polda sumbar tetap kabag ops polres solok sela...
76,Tujuh Orang Ditangkap Terkait OTT di Bengkulu,"Minggu, 24 Nov 2024 08:30 WIB",Komisi Pemberantasan Korupsi (KPK) sedikitnya ...,nasional,komisi berantas korupsi kpk aman tujuh orang o...
77,AS Sebut Tentara Korut Kumpul di Rusia untuk S...,"Sabtu, 23 Nov 2024 09:54 WIB",Kepala Departemen Pertahanan Amerika Serikat a...,internasional,kepala depa emen pe ahanan amerika serikat pen...
78,Beda Keuntungan Negara Partner dan Negara Angg...,"Sabtu, 23 Nov 2024 10:10 WIB",Indonesia telah resmi menjadi negara partner a...,internasional,indonesia resmi negara pa ner mitra forum ekon...


Split the 100 documents into 80 for training and 20 for testing.


#### TF-IDF & VSM


In [23]:
def tfidf_vsm(data, kategori):
	tfidf = TfidfVectorizer()
	tfidf_matrix = tfidf.fit_transform(data)
	feature_names = tfidf.get_feature_names_out()
	
	df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
	df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

	return tfidf, df_tfidf

tfidf_model, df_tfidf = tfidf_vsm(data_train['cleaned_text'], data_train['kategori'])

In [24]:
def model_tf_idf(data, model, kategori):
	tfidf_matrix = model.transform(data)
	feature_names = model.get_feature_names_out()
	
	df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
	df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

	return df_tfidf

df_tfidf_test = model_tf_idf(data_test['cleaned_text'], tfidf_model, data_test['kategori'])

In [25]:
# df_tfidf_test.head()

In [26]:
df_tfidf

Unnamed: 0,Kategori Berita,abad,abah,abai,abat,abdi,abdul,abdulla,abdullah,abdulloh,...,zaki,zaman,zambia,zayed,zelenskiy,zelensky,zikir,zionis,zulhas,zulkifli
0,nasional,0.000000,0.0,0.026348,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,internasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,internasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,internasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,internasional,0.046459,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.046459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,nasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76,nasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
77,internasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
78,internasional,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This step computes TF-IDF and builds the VSM as a dataframe.
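
Note that the test split is passed through `model.transform` rather than `fit_transform`, so it is projected onto the vocabulary learned from the training split: the columns line up, and terms that never appeared in training are ignored. The pure-Python sketch below illustrates that idea with raw counts only; the helper names and the sample tokens are hypothetical, and this is not how scikit-learn is implemented internally.

```python
from collections import Counter

def fit_vocabulary(train_documents):
    """Assign a column index to every distinct term seen in training, in sorted order."""
    terms = sorted({term for tokens in train_documents for term in tokens})
    return {term: index for index, term in enumerate(terms)}

def to_count_vector(tokens, vocabulary):
    """Count only the terms present in the training vocabulary; unseen terms are skipped."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokens).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector

train = [["israel", "rudal"], ["kpk", "korupsi"]]
vocabulary = fit_vocabulary(train)

# "zelensky" was never seen in training, so it has no column and is dropped
test_vector = to_count_vector(["israel", "israel", "zelensky"], vocabulary)
print(vocabulary)   # {'israel': 0, 'korupsi': 1, 'kpk': 2, 'rudal': 3}
print(test_vector)  # [2, 0, 0, 0]
```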


## Save Dataset & Model


In [27]:
df_tfidf.to_csv("data_train_vsm.csv", index=False)
df_tfidf_test.to_csv("data_test_vsm.csv", index=False)

In [28]:
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(tfidf_model, f)
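
To reuse the saved model later, the file is read back with `pickle.load` in binary mode. The sketch below shows the round trip with a plain dictionary as a stand-in for the fitted vectorizer and a temporary directory instead of the working directory; both are assumptions made so the example runs on its own. Loading the real `tfidf_model.pkl` additionally requires scikit-learn to be installed, preferably in the same version used for saving.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted vectorizer; any picklable object round-trips the same way
model = {"vocabulary": {"israel": 0, "rudal": 1}}

path = os.path.join(tempfile.mkdtemp(), "tfidf_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload in binary read mode; only unpickle files from a trusted source
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```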