# TF-IDF & Vector Space Model

## Apa itu VSM (Vector Space Model)

Vector Space Model (VSM) adalah kerangka kerja matriks yang digunakan dalam temu kembali informasi dan pemrosesan bahasa alami (NLP) untuk merepresentasikan dan menganalisis data visual. VSM sangat penting dalam penggalian teks, pencarian dokumen, dan tugas-tugas pembelajaran mesin berbasis teks seperti klasifikasi dokumen, pencarian informasi, dan analisis kemiripan teks.

![image.png](https://i0.wp.com/spotintelligence.com/wp-content/uploads/2023/09/vector-space-model.jpg?resize=960%2C540&ssl=1)

Setiap dimensi berhubungan dengan term yang unik, sementara dokumen dan query dapat direpresentasikan sebagai vektor di dalam ruang tersebut.

## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF adalah singkatan dari Term Frequency Inverse Document Frequency. Hal ini dapat didefinisikan sebagai perhitungan seberapa relevan sebuah kata dalam kumpulan atau corpus terhadap sebuah teks. Nilai relevansi meningkat secara relatif terhadap berapa kali sebuah kata muncul di dalam teks, namun dikompensasi oleh frekuensi kata di dalam corpus (kumpulan data).

## Term Frequency

Pada dokumen d, frekuensi merepresentasikan jumlah kemunculan kata t. Oleh karena itu, kita dapat melihat bahwa frekuensi akan menjadi lebih relevan ketika sebuah kata muncul dalam teks, yang mana hal ini bersifat relatif. Karena urutan istilah tidak signifikan, kita dapat menggunakan vektor untuk mendeskripsikan teks dalam kumpulan model term. Untuk setiap istilah tertentu dalam teks, ada sebuah entri dengan nilai yang merupakan frekuensi term.

**tf(t,d) = count of t in d / number of words in d**

## Inverse Document Frequency

Pada dasarnya, ini menguji seberapa relevan kata tersebut. Tujuan utama dari pencarian ini adalah untuk menemukan dokumen yang sesuai dengan pencarian. Karena tf menganggap semua istilah sama pentingnya, maka frekuensi term tidak hanya dapat digunakan untuk mengukur bobot term dalam dokumen. Pertama, cari frekuensi dokumen dari suatu istilah t dengan menghitung jumlah dokumen yang mengandung term tersebut:

**idf(t) = log(N/ df(t))**

Penjelasan:<br>
**df(t)** = **N(t)**<br>
dimana<br>
**df(t)** = **Document frequency of a term t**<br>
**N(t)** = **Number of documents containing the term t**<br>

## TF-IDF (Term Frequency-Inverse Document Frequency)

**tf-idf(t, d) = tf(t, d) * idf(t)**

## Proses TF-IDF dan menjadikannya VSM (Vector Space Model)

### Import Library/Tool yang dibutuhkan

In [1]:
# Library untuk data manipulation
import pandas as pd
from tqdm import tqdm
import re
import string

# Library untuk text preprocessing
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

# Library untuk text vectorization/TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to C:\Users\LAB
[nltk_data]     SISTER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\LAB
[nltk_data]     SISTER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


* **pandas** digunakan untuk membuat dataframe agar mudah dibaca.
* **tqdm** Untuk mentracking proses program.
* **re** (regular expression) digunakan untuk mengenali pola kata/kalimat.
* **nltk** (natural language toolkit) berfungsi untuk melakukan proses yang berkaitan dengan bahasa (teks).
* **sklearn** digunakan dalam pemrosesan data untuk kebutuhan machine learning atau data science. Dalam tugas ini, menggunakan TfidfTransformer untuk menghitung TF-IDF.
* **sastrawi** berfungsi untuk membersihkan mengurangi kata-kata imbuhan menjadi kata basic (sastrawi untuk stemming bahasa Indonesia).

### Import data berita CSV

In [3]:
data = pd.read_csv("../tugas1/data_100.csv")
data.columns = data.columns.str.strip()
data

Unnamed: 0,judul,isi,tanggal,kategori
0,FOTO: Momen Rudal Yaman Hantam Israel Tengah,Sebuah rudal yang ditembakkan dari Yaman jatuh...,"Minggu, 15 Sep 2024 15:34 WIB",Internasional
1,Ribuan Orang Kepung Gedung Pemerintahan Israel...,Ribuan massa anti-pemerintah Israel berkumpul ...,"Minggu, 15 Sep 2024 14:38 WIB",Internasional
2,Israel Klaim Tak Ada Korban Jiwa Imbas Seranga...,MiliterIsrael mengatakan sebuah rudal yang dit...,"Minggu, 15 Sep 2024 13:25 WIB",Internasional
3,"FOTO: Banjir Terjang Eropa Tengah, 4 Warga Rum...",Badai menghantam Eropa pada Sabtu waktu setemp...,"Minggu, 15 Sep 2024 13:00 WIB",Internasional
4,"VIDEO: Banjir Hantam Rumania Timur, 4 Warga Di...",Setidaknya 4 orang tewas akibat banjir yang me...,"Minggu, 15 Sep 2024 12:09 WIB",Internasional
...,...,...,...,...
95,"Adu Kuat Cak Lontong, Riza Patria & Siti Fadil...",Tiga pasangan calon di Pemilihan Gubernur (Pil...,"Sabtu, 14 Sep 2024 12:45 WIB",Nasional
96,Mahyeldi-Vasko dan Epyardi-Ekos Dinyatakan KPU...,KPUSumatera Barat menetapkan dua pasangan calo...,"Sabtu, 14 Sep 2024 12:34 WIB",Nasional
97,Polisi Sulit Tangkap Terduga Pelaku Pembunuhan...,"Memasuki hari ketujuh, polisi belum berhasil m...","Sabtu, 14 Sep 2024 12:05 WIB",Nasional
98,Polisi: Nikita Laporkan Vadel di Kasus Persetu...,Polisi menyebut artis Nikita Mirzanimelaporkan...,"Sabtu, 14 Sep 2024 11:29 WIB",Nasional


Import data berita csv yang telah disimpan sebelumnya,<br>
dan membersihkan whitespace pada col header.

### Mengacak data

In [4]:
data = data.sample(frac = 1, ignore_index=True)

Mengacak data yang sebelumnya berdasarkan kategori 50:50, menjadi acak-acak.

### Fungsi clean_text()

In [5]:
def clean_text(text):
	text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text) # Menghapus https* and www*
	text = re.sub(r'@[^\s]+', ' ', text) # Menghapus username
	text = re.sub(r'[\s]+', ' ', text) # Menghapus tambahan spasi
	text = re.sub(r'#([^\s]+)', ' ', text) # Menghapus hashtags
	text = re.sub(r'rt', ' ', text) # Menghapus retweet
	text = text.translate(str.maketrans("","",string.punctuation)) # Menghapus tanda baca
	text = re.sub(r'\d', ' ', text) # Menghapus angka
	text = text.lower()
	text = text.encode('ascii','ignore').decode('utf-8') #Menghapus ASCII dan unicode
	text = re.sub(r'[^\x00-\x7f]',r'', text)
	text = text.replace('\n','') #Menghapus baris baru
	text = text.strip()
	return text

Fungsi ini untuk membersihkan text, lebih tepatnya membersihkan teks seperti; menghapus hastag, unicode, dsb.

### Fungsi stemming_indo()

In [6]:
def stemming_indo(text):
	factory = StemmerFactory()
	stemmer = factory.create_stemmer()
	text = ' '.join(stemmer.stem(word) for word in text)
	return text

Fungsi ini digunakan untuk menstemming atau membersihkan kata seperti:

katanya = kata<br>
menggunakan = guna

### Fungsi clean_stopword()

In [7]:
def clean_stopword(tokens):
	listStopword =  set(stopwords.words('indonesian'))
	removed = []
	for t in tokens:
		if t not in listStopword:
			removed.append(t)
	return removed

Fungsi untuk membuang kata yang tidak digunakan seperti:

di, dan, dsb.

### Proses preprosesing text setiap dokumen

In [8]:
def preprocess_text(content):
	result = []
	for text in tqdm(content):
		cleaned_text = clean_text(text)
		tokens = nltk.tokenize.word_tokenize(cleaned_text)
		cleaned_stopword = clean_stopword(tokens)
		stemmed_text = stemming_indo(cleaned_stopword)
		result.append(stemmed_text)
	return result

data['cleaned_text'] = preprocess_text(data['isi'])

100%|██████████| 100/100 [00:02<00:00, 38.70it/s]


Proses untuk mempersiapkan teks pada setiap dokumen yang diproses menggunakan fungsi-fungsi yang sudah dibuat sebelumnya, yang selanjutnya akan di tf-idf dan vsm.

### Proses TF-IDF dan pembuatan VSM

In [9]:
def tfidf_vsm(data, kategori):
	tfidf = TfidfVectorizer()
	tfidf_matrix = tfidf.fit_transform(data)
	feature_names = tfidf.get_feature_names_out()
	
	df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
	df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

	return tfidf, df_tfidf

tfidf, df_tfidf = tfidf_vsm(data['cleaned_text'], data['kategori'])

In [10]:
df_tfidf

Unnamed: 0,Kategori Berita,aa,abah,abai,abar,abasuki,abbas,abdelmadjid,abdullah,abdulrahman,...,zaman,zat,zelenskiy,zelensky,ziarah,zimbabwe,zimbabweakan,zimparks,zionis,zona
0,Nasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.043998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,Internasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.052613,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048279,0.0
2,Internasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,Nasional,0.0,0.226969,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,Internasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Nasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
96,Internasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
97,Internasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
98,Nasional,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


Proses untuk membuat TF-IDF dan membentuk VSM dalam dataframe.

## Save Dataset

In [12]:
df_tfidf.to_csv("data_vsm.csv", index=False)