# Vector Space Model

The Vector Space Model represents each document as a vector of terms in a shared space. In this model, every word in a document becomes one feature (dimension) of the vector, and the weight of each word is determined by TF-IDF.
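As a minimal sketch of this idea, the snippet below maps two documents onto count vectors over a shared vocabulary. The two-document corpus and the helper name `to_vector` are hypothetical and exist only for illustration; raw counts stand in for the TF-IDF weights used later in the notebook.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
docs = ["fast train", "train ticket price rises"]

# The vocabulary defines the axes (features) of the vector space.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def to_vector(doc):
    """Map a document to a vector of raw term counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['fast', 'price', 'rises', 'ticket', 'train']
print(vectors)     # [[1, 0, 0, 0, 1], [0, 1, 1, 1, 1]]
```

Every document gets a vector of the same length, one position per vocabulary term, which is what lets documents be compared or passed to a classifier.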

## Term Frequency (TF)

Term Frequency (TF) measures how many times a term occurs in a document. It gives a higher weight to terms that are relevant to that document: the more often a term appears, the more important it is considered to be for the document.
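The count described above can be sketched in a few lines. The function name `term_frequency` and the sample sentence are assumptions made for this example, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter

def term_frequency(text):
    """Count how many times each whitespace-separated term occurs."""
    return Counter(text.lower().split())

# Hypothetical sample sentence.
tf = term_frequency("Train delays hit the train network")
print(tf["train"])    # 2
print(tf["network"])  # 1
```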

## Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how informative a term is across the document collection. The higher the IDF value, the more rarely the term t appears in the collection.
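One common textbook form is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The sketch below uses that form on a hypothetical three-document corpus; it is an illustration of the definition, not necessarily the exact variant a given library applies.

```python
import math

# Hypothetical corpus, used only for illustration.
docs = [
    "train ticket price",
    "train schedule",
    "football match schedule",
]
tokenized = [set(doc.split()) for doc in docs]

def idf(term):
    """Classic IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(term in doc for doc in tokenized)
    return math.log(len(tokenized) / df)

# "football" appears in 1 of 3 documents, "train" in 2 of 3,
# so "football" receives the higher IDF.
print(idf("football") > idf("train"))  # True
```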

## Program Code

### Library

In [4]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Read Data

In [5]:
# read the CSV file
df = pd.read_csv('../SindoNews.csv')

In [6]:
# show the first 5 rows of the dataframe
df.head()

| | Judul | Isi Berita | Tanggal Berita | Kategori |
|---|---|---|---|---|
| 0 | LRT Jabodebek Dukung Konferensi Kereta Api se-... | Indonesia menjadi tuan rumah Asean Railway CEO... | Senin, 02 September 2024 - 23:51 WIB | EKONOMI BISNIS |
| 1 | Kementerian PUPR Sebut Capaian 10 Tahun Infras... | Jubir Kementerian PUPR, Endra S Atmawidjaja me... | Senin, 02 September 2024 - 23:28 WIB | NASIONAL |
| 2 | Tingkatkan Produktivitas, Kementan Tanam Bersa... | Pemerintah terus berupaya meningkatkan produks... | Senin, 02 September 2024 - 23:14 WIB | DAERAH |
| 3 | 6 Fakta Pernikahan Putri Norwegia Martha Louis... | Fakta pernikahan Putri Norwegia Martha Louise ... | Senin, 02 September 2024 - 23:40 WIB | LIFESTYLE |
| 4 | 6 Tersangka Pengeroyokan Tahanan Rutan Depok h... | Kapolres Depok, Kombes Pol Arya Perdana mengat... | Senin, 02 September 2024 - 23:54 WIB | METRO |


In [7]:
# check the dimensions of the dataframe
df.shape

(100, 4)

The dataset contains 100 rows and 4 columns.

### Preprocessing

In [8]:
# Replace \t, \n, \r characters and digit sequences in the 'Isi Berita' column with a space
df['Isi Berita'] = df['Isi Berita'].replace(r'[\t\n\r]|\d+', ' ', regex=True)

# Collapse repeated whitespace in the 'Isi Berita' column into a single space
df['Isi Berita'] = df['Isi Berita'].apply(lambda x: re.sub(r'\s+', ' ', x))

# Take the text from the 'Isi Berita' column of the DataFrame
isi = df['Isi Berita']

Preprocessing replaces newline, tab, and carriage-return characters, as well as digit sequences, in the `Isi Berita` column with spaces, then collapses repeated whitespace into a single space.
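To see what the two regular expressions do, the sketch below applies the same patterns to a single string with `re.sub`. The helper name `clean_text` and the sample string are hypothetical, and the sketch assumes the element-wise pandas replacement behaves like `re.sub` on each string.

```python
import re

def clean_text(text):
    """Apply the two cleaning steps from the cell above to one string."""
    # Step 1: replace tabs, newlines, carriage returns, and digit runs with a space.
    text = re.sub(r'[\t\n\r]|\d+', ' ', text)
    # Step 2: collapse any run of whitespace into a single space.
    return re.sub(r'\s+', ' ', text)

sample = "Harga naik\t10 persen\n\rpada 2024"  # hypothetical input
print(repr(clean_text(sample)))  # 'Harga naik persen pada '
```

Note the trailing space in the result: the patterns collapse whitespace but do not trim it, so calling `.strip()` afterwards would be needed if leading or trailing spaces matter.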

# TF-IDF

### Vector Space Model

In [67]:
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Compute TF-IDF
tfidf_matrix = tfidf_vectorizer.fit_transform(isi)

# List of terms (feature names)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense array that is easier to read
tfidf_values = tfidf_matrix.toarray()

# Build a DataFrame to hold the TF-IDF values
df_tfidf = pd.DataFrame(tfidf_values, columns=feature_names)

# Display the DataFrame
df_tfidf

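| | abdullah | acara | ada | adalah | adanya | addin | address | adik | afc | agama | ... | yaqut | yon | yong | yudha | yuk | zainal | zakat | zaman | zarnubi | zona |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 1 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 2 | 0.000000 | 0.0 | 0.0 | 0.164877 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 3 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 4 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.200976 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.18442 |
| 96 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.230636 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 97 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |
| 98 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 |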


The resulting `df_tfidf` has 100 rows (one per document) and 1270 features (one per term).
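To make the rows-by-features shape concrete, the sketch below builds a small TF-IDF matrix by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row). The corpus, the whitespace tokenization, and the helper name `tfidf_vector` are simplifications introduced for this example.

```python
import math
from collections import Counter

# Hypothetical corpus; tokenization is simplified to whitespace splitting.
docs = ["train ticket price", "train schedule", "football match schedule"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = {t: sum(t in doc for doc in tokenized) for t in vocabulary}
# Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
print(len(matrix), len(vocabulary))  # 3 6
```

Here three documents and six distinct terms give a 3 x 6 matrix, in the same way that 100 news articles and 1270 distinct terms give the 100 x 1270 DataFrame above.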

### Sampling and Splitting Data

In [68]:
# Draw 16 random samples from the feature data (X) and align them with their labels (Y)
sampled_data = df_tfidf.sample(n=16, random_state=42)  # 16 random rows from df_tfidf
X_sampled = sampled_data  # Randomly sampled feature data
Y_sampled = df.loc[sampled_data.index, 'Kategori']  # Labels matching the sampled rows

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_sampled, Y_sampled, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(12, 1270)
(4, 1270)
(12,)
(4,)
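The 12/4 split follows from simple arithmetic on the 16 sampled rows. The sketch below assumes that `train_test_split` rounds a fractional test size up to a whole number of samples and assigns the remainder to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 16   # rows drawn by df_tfidf.sample(n=16, ...)
test_size = 0.2  # fraction passed to train_test_split

# 0.2 * 16 = 3.2, rounded up to 4 test samples (assumed rounding rule).
n_test = math.ceil(test_size * n_samples)
# The remaining samples form the training set.
n_train = n_samples - n_test

print(n_train, n_test)  # 12 4
```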


### Logistic Regression Modeling

In [79]:
# Initialize logistic regression and train the model
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

# Predict on the test data
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 0.0
Confusion Matrix:
 [[0 0 0 1]
 [0 0 0 1]
 [0 0 0 2]
 [0 0 0 0]]
Classification Report:
                 precision    recall  f1-score   support

        DAERAH       0.00      0.00      0.00       1.0
EKONOMI BISNIS       0.00      0.00      0.00       1.0
     LIFESTYLE       0.00      0.00      0.00       2.0
        SPORTS       0.00      0.00      0.00       0.0

      accuracy                           0.00       4.0
     macro avg       0.00      0.00      0.00       4.0
  weighted avg       0.00      0.00      0.00       4.0
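The accuracy of 0.0 can be read directly from the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the class order shown in the report.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
conf_matrix = [
    [0, 0, 0, 1],  # DAERAH
    [0, 0, 0, 1],  # EKONOMI BISNIS
    [0, 0, 0, 2],  # LIFESTYLE
    [0, 0, 0, 0],  # SPORTS
]

# Correct predictions lie on the diagonal.
correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
total = sum(sum(row) for row in conf_matrix)
print(correct / total)  # 0.0

# Column sums show how often each class was predicted.
predicted_per_class = [sum(col) for col in zip(*conf_matrix)]
print(predicted_per_class)  # [0, 0, 0, 4]
```

All four test documents were predicted as SPORTS, a class that does not occur among the four true test labels, so no prediction falls on the diagonal.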



