# **Continuation After Crawling**

# **Import Libraries**

Modules and libraries imported:

1. **`sklearn.metrics`**: Functions for evaluating model performance, such as `accuracy_score`, `classification_report`, and `confusion_matrix`.

2. **`sklearn.feature_extraction.text.TfidfVectorizer`**: Converts text into a numeric representation using the TF-IDF scheme.

3. **`sklearn.decomposition.LatentDirichletAllocation`**: Performs topic modelling on text with Latent Dirichlet Allocation (LDA).

4. **`sklearn.model_selection.train_test_split`**: Splits the dataset into training and test subsets.

5. **`sklearn.naive_bayes.GaussianNB`**: The Gaussian Naive Bayes model used for classification.

6. **`sklearn.svm.SVC`**: The Support Vector Machine (SVM) model for classification.

7. **`pandas`**: Used for data manipulation and analysis.

8. **`warnings`**: Manages warning messages raised while the code runs.

9. **`joblib`**: Saves and loads trained models.

10. **`nltk`**: A text-processing library, including `word_tokenize` for splitting text into words, `stopwords` for common uninformative words, and `nltk.download` for fetching the required data packages.

11. **`re`**: Python's regular-expression module, used in the cleaning step.

In [31]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.svm import SVC

import pandas as pd
import warnings
import joblib
import nltk
import re

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Load Dataset**

Read the dataset that was crawled in the previous Colab notebook.

In [32]:
df = pd.read_csv('/content/drive/MyDrive/prosaindata/tugas1/Berita3Kategori.csv')
df

| | Judul | Isi | Label |
|---|---|---|---|
| 0 | Kapasitas EBT Terus Meningkat, Lamhot Optimist... | KOMPAS.com - Anggota Komisi VII Dewan Perwakil... | Politik |
| 1 | Jubir TKN Sebut Prabowo-Gibran Fokus Dorong Ke... | KOMPAS.com- Juru Bicara Tim Kampanye Nasional ... | Politik |
| 2 | Tingkatkan Pertahanan Negara, Prabowo Serahkan... | KOMPAS.com- Menteri Pertahanan (Menhan) Prabow... | Politik |
| 3 | Prabowo Sebut Pesawat Produksi Indonesia Dimin... | KOMPAS.com - Menteri Pertahanan (Menhan) Prabo... | Politik |
| 4 | Tema Debat Dinilai Terlalu Banyak, Fahira Idri... | KOMPAS.com - Anggota Dewan Perwakilan Daerah (... | Politik |
| ... | ... | ... | ... |
| 5665 | Resep Tempe Mendoan Daun Jeruk, Sajikan dengan... | KOMPAS.com - Tempe mendoan salah satu gorengan... | Food |
| 5666 | Resep Bubur Tahu Buncis untuk Makan Siang Buah... | KOMPAS.com - Bubur MPASI saat usia anak delapa... | Food |
| 5667 | Resep Brownies Pisang, Aromanya Harum | KOMPAS.com - Brownies pisang mirip dengan pemb... | Food |
| 5668 | Resep Jamu Asam Urat, Racikan Herbal untuk Min... | KOMPAS.com - Asam urat merupakan sisa metaboli... | Food |


# **Preprocessing**

## Checking for Null Values

The call **`df.isnull().sum()`** counts the null (NaN) values in each column of the DataFrame `df`. This check is an important part of cleaning and preparing the data before further analysis or modeling.

In [33]:
df.isnull().sum()

Judul    0
Isi      0
Label    0
dtype: int64

## Cleaning

The `cleaning` function removes unwanted characters from the text, keeping only letters (a-z, A-Z) and whitespace. Digits and punctuation are dropped, and leading and trailing whitespace is trimmed.

In [34]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Isi'].apply(cleaning)
df['Cleaning']

0       KOMPAScom  Anggota Komisi VII Dewan Perwakilan...
1       KOMPAScom Juru Bicara Tim Kampanye Nasional TK...
2       KOMPAScom Menteri Pertahanan Menhan Prabowo Su...
3       KOMPAScom  Menteri Pertahanan Menhan Prabowo S...
4       KOMPAScom  Anggota Dewan Perwakilan Daerah DPD...
                              ...                        
5665    KOMPAScom  Tempe mendoan salah satu gorengan k...
5666    KOMPAScom  Bubur MPASI saat usia anak delapan ...
5667    KOMPAScom  Brownies pisang mirip dengan pembua...
5668    KOMPAScom  Asam urat merupakan sisa metabolism...
5669    KOMPAScom  Wonton merupakan makanan tradisiona...
Name: Cleaning, Length: 5670, dtype: object
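
The effect of the regular expression is easier to see on a single string. The sketch below reuses the same pattern on a made-up sentence (the sample text is an assumption, not a row from the dataset). It also shows why `KOMPAS.com -` turns into `KOMPAScom` followed by two spaces in the output above: the dot and the dash are deleted, but the spaces around the dash stay.

```python
import re

def cleaning(text):
    # Keep only ASCII letters and whitespace, then trim both ends
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()

sample = "KOMPAS.com - Harga naik 12,5%!"  # hypothetical input, not from the dataset
cleaned = cleaning(sample)
print(repr(cleaned))    # 'KOMPAScom  Harga naik'

# Optional extra step (not in the notebook): collapse runs of whitespace
collapsed = re.sub(r'\s+', ' ', cleaned)
print(repr(collapsed))  # 'KOMPAScom Harga naik'
```

Collapsing whitespace is not required here, because tokenization later splits on any amount of whitespace.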

## Tokenization

The `tokenizer` function takes a string `text` as input and performs two steps:

- Converts all letters to lowercase with `lower()` for consistency.
- Tokenizes the text with `word_tokenize` from NLTK (Natural Language Toolkit), splitting it into words.

In [35]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0       [kompascom, anggota, komisi, vii, dewan, perwa...
1       [kompascom, juru, bicara, tim, kampanye, nasio...
2       [kompascom, menteri, pertahanan, menhan, prabo...
3       [kompascom, menteri, pertahanan, menhan, prabo...
4       [kompascom, anggota, dewan, perwakilan, daerah...
                              ...                        
5665    [kompascom, tempe, mendoan, salah, satu, goren...
5666    [kompascom, bubur, mpasi, saat, usia, anak, de...
5667    [kompascom, brownies, pisang, mirip, dengan, p...
5668    [kompascom, asam, urat, merupakan, sisa, metab...
5669    [kompascom, wonton, merupakan, makanan, tradis...
Name: Tokenizing, Length: 5670, dtype: object

## Stopword Removal

This step removes common, uninformative words from the token lists, such as conjunctions and other very frequent words. It is a standard text preprocessing step intended to improve the quality of the downstream analysis.

In [36]:
corpus = stopwords.words('indonesian')

def stopwordText(words):
 return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0       kompascom anggota komisi vii dewan perwakilan ...
1       kompascom juru bicara tim kampanye nasional tk...
2       kompascom menteri pertahanan menhan prabowo su...
3       kompascom menteri pertahanan menhan prabowo su...
4       kompascom anggota dewan perwakilan daerah dpd ...
                              ...                        
5665    kompascom tempe mendoan salah gorengan khas ba...
5666    kompascom bubur mpasi usia anak delapan menikm...
5667    kompascom brownies pisang pembuatan brownies b...
5668    kompascom asam urat sisa metabolisme zat purin...
5669    kompascom wonton makanan tradisional khas tion...
Name: Full Text, Length: 5670, dtype: object
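
`stopwords.words('indonesian')` returns a Python list, so every `word not in corpus` check scans that list. A common optimisation is to convert the list to a `set` once. The sketch below illustrates the idea with a tiny hand-written stopword set (an assumption for illustration; the notebook uses NLTK's full Indonesian list) and a made-up token list.

```python
# Hypothetical miniature stopword set; the notebook uses NLTK's Indonesian list
stopword_set = {"yang", "dan", "di", "dengan", "satu"}

def remove_stopwords(tokens, stopword_set):
    # Set membership is a constant-time lookup on average
    return [t for t in tokens if t not in stopword_set]

tokens = ["tempe", "mendoan", "salah", "satu", "gorengan", "yang", "khas"]
filtered = remove_stopwords(tokens, stopword_set)

# Join the remaining tokens back into one string, as the notebook does
full_text = " ".join(filtered)
print(full_text)  # tempe mendoan salah gorengan khas
```

In the notebook this would amount to `corpus = set(stopwords.words('indonesian'))`; the filtered result is the same, only the lookup is faster.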

## TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a method used in text processing and information retrieval to measure how important a word (term) is to a document within a corpus (a collection of documents).
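
To make the weighting concrete, here is a minimal pure-Python sketch on a three-document toy corpus (the documents are invented for illustration). It assumes the scheme that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector.

```python
import math
from collections import Counter

docs = [
    "resep tempe mendoan",
    "resep brownies pisang",
    "menteri pertahanan prabowo",
]  # toy corpus, not taken from the dataset

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t in vocab]
    # L2-normalise so every document vector has length 1
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in tokenized]

for term, value in zip(vocab, matrix[0]):
    print(f"{term:12s} {value:.3f}")
```

In the first document, `resep` occurs in two of the three documents and therefore receives a lower weight than `tempe` or `mendoan`, which occur in only one.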

In [37]:
def tfidf(dokumen, category):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)
  final_tfidf.insert(len(final_tfidf.columns),'Label', category)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'], df['Label'])
final_tfidf

Unnamed: 0,Dokumen,abdullah,abdulrachman,ac,acara,aceh,achmad,ad,adasaid,adat,...,yusril,zahrim,zaitun,zaman,zat,zhafirah,zikrillah,zoni,zoom,Label
0,kompascom anggota komisi vii dewan perwakilan ...,0.0,0.000000,0.046752,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Politik
1,kompascom juru bicara tim kampanye nasional tk...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Politik
2,kompascom menteri pertahanan menhan prabowo su...,0.0,0.049097,0.000000,0.000000,0.0,0.0,0.043789,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Politik
3,kompascom menteri pertahanan menhan prabowo su...,0.0,0.052855,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Politik
4,kompascom anggota dewan perwakilan daerah dpd ...,0.0,0.000000,0.000000,0.026416,0.0,0.0,0.000000,0.0,0.049818,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Politik
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5665,kompascom tempe mendoan salah gorengan khas ba...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Food
5666,kompascom bubur mpasi usia anak delapan menikm...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Food
5667,kompascom brownies pisang pembuatan brownies b...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Food
5668,kompascom asam urat sisa metabolisme zat purin...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.063463,0.0,0.0,0.0,0.0,Food


## Declaring X (features) and y (target)

This step builds the feature matrix from **`final_tfidf`**. The pandas **`drop`** function removes the 'Dokumen' and 'Label' columns, leaving only the TF-IDF term weights as the features `X`. The target `y` is the 'Label' column of `df`.

In [38]:
X = final_tfidf.drop(['Dokumen', 'Label'], axis=1)
X

Unnamed: 0,abdullah,abdulrachman,ac,acara,aceh,achmad,ad,adasaid,adat,adi,...,yusminkemudian,yusril,zahrim,zaitun,zaman,zat,zhafirah,zikrillah,zoni,zoom
0,0.0,0.000000,0.046752,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.049097,0.000000,0.000000,0.0,0.0,0.043789,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.052855,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.026416,0.0,0.0,0.000000,0.0,0.049818,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5665,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
5666,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
5667,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
5668,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.063463,0.0,0.0,0.0,0.0


In [39]:
y = df['Label']
y

0       Politik
1       Politik
2       Politik
3       Politik
4       Politik
         ...   
5665       Food
5666       Food
5667       Food
5668       Food
5669       Food
Name: Label, Length: 5670, dtype: object

## Splitting

**`train_test_split`** is a **`scikit-learn`** function that splits a dataset into two parts: a training set and a test set. Its key parameters are:

- **`X`**: The feature variables used to make predictions.
- **`y`**: The target variable to be predicted.
- **`test_size`**: The size of the test set to hold out. A float between 0.0 and 1.0 is treated as a proportion of the dataset; an integer is treated as an absolute number of samples.
- **`random_state`**: Sets the seed for the random number generator, so the split into training and test data is the same every time the function is run.

With **`test_size=0.3`**, 70% of the data is used for training and 30% for testing; **`random_state=42`** makes the split reproducible.

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
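
The role of `random_state` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual algorithm (it uses a different random number generator, so it selects different rows); the helper name `seeded_split` is hypothetical. It only shows that a fixed seed yields the same shuffled split on every run, and that 30% of 5670 rows gives the 1701 test rows reported later in the classification output.

```python
import random

def seeded_split(n_rows, test_size, seed):
    """Return (train_indices, test_indices) for a reproducible shuffled split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_a, test_a = seeded_split(5670, 0.3, seed=42)
train_b, test_b = seeded_split(5670, 0.3, seed=42)

print(len(train_a), len(test_a))  # 3969 1701
print(test_a == test_b)           # True
```

Because the labels in this dataset are imbalanced, `train_test_split` also accepts `stratify=y` to keep the class proportions equal in both subsets; the notebook does not use it.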

# **Modeling with LDA**

## Finding the Best Parameters for LDA


In [41]:
def find_best_lda(Xtrain, Xtest, n_components, alpha, beta):
  looping = 1
  best = {'k' : 0, 'alpha' : 0, 'beta' : 0, 'accuracy' : 0, 'model': '', 'lda' : '', 'lda_Xtrain' : '', 'lda_Xtest' : ''}
  history = pd.DataFrame(columns=["Pengujian Ke", "K", "Alpha", "Beta", "Accuracy"])

  # Grid search: try every combination of k, alpha, and beta
  for k in n_components:
    for a in alpha:
      for b in beta:
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=a, topic_word_prior=b)
        lda_Xtrain = lda.fit_transform(Xtrain)
        lda_Xtest = lda.transform(Xtest)

        # Create the Naive Bayes model
        model = GaussianNB()

        # Train the model on the training data
        model.fit(lda_Xtrain, y_train)

        # Predict on the test data
        y_pred = model.predict(lda_Xtest)

        # Compute accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Jumlah Topik: {k}, Alpha: {a}, Beta: {b}, Accuracy: {accuracy}")

        if accuracy > best['accuracy']:
          best['accuracy'] = accuracy
          best['k'] = k
          best['alpha'] = a
          best['beta'] = b
          best['model'] = model
          best['lda'] = lda
          best['lda_Xtrain'] = lda_Xtrain
          best['lda_Xtest'] = lda_Xtest

        history.loc[len(history)] = [f"Pengujian Ke- {looping}", k, a, b, accuracy]
        looping += 1

  return (best, history)

k = [3, 4, 5]
alpha = [0.3, 0.4]
beta = [0.1, 0.2]
best_param, history = find_best_lda(X_train, X_test, k, alpha, beta)

Jumlah Topik: 3, Alpha: 0.3, Beta: 0.1, Accuracy: 0.7430922986478542
Jumlah Topik: 3, Alpha: 0.3, Beta: 0.2, Accuracy: 0.7289829512051734
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.1, Accuracy: 0.892416225749559
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.2, Accuracy: 0.5549676660787772
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.1, Accuracy: 0.8265726043503822
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.2, Accuracy: 0.7095825984714874
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.1, Accuracy: 0.7666078777189889
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.2, Accuracy: 0.6807760141093474
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.1, Accuracy: 0.702527924750147
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.2, Accuracy: 0.5784832451499118
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.1, Accuracy: 0.6513815402704292
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.2, Accuracy: 0.7871840094062317
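
The three nested loops enumerate every combination of the candidate values, which is 3 x 2 x 2 = 12 runs. As a sketch, the same enumeration can be written with `itertools.product`, which visits the combinations in the same order as the nested loops (the variable names below are illustrative).

```python
from itertools import product

k_values = [3, 4, 5]
alpha_values = [0.3, 0.4]
beta_values = [0.1, 0.2]

# Cartesian product: the last list varies fastest, like the innermost loop
grid = list(product(k_values, alpha_values, beta_values))
print(len(grid))  # 12

for run, (k, a, b) in enumerate(grid, start=1):
    print(f"Pengujian Ke- {run}: k={k}, alpha={a}, beta={b}")
```

The LDA models in the search are created without a `random_state`, so the accuracies can change between reruns; passing a fixed `random_state` to `LatentDirichletAllocation` would make the search repeatable.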


In [42]:
best_param

{'k': 3,
 'alpha': 0.4,
 'beta': 0.1,
 'accuracy': 0.892416225749559,
 'model': GaussianNB(),
 'lda': LatentDirichletAllocation(doc_topic_prior=0.4, n_components=3,
                           topic_word_prior=0.1),
 'lda_Xtrain': array([[0.80410206, 0.06354546, 0.13235248],
        [0.90241069, 0.04889945, 0.04868986],
        [0.90175824, 0.04899576, 0.049246  ],
        ...,
        [0.91799062, 0.03995905, 0.04205034],
        [0.03793743, 0.9243766 , 0.03768597],
        [0.03625706, 0.03485289, 0.92889004]]),
 'lda_Xtest': array([[0.046813  , 0.05532642, 0.89786059],
        [0.91159218, 0.0440892 , 0.04431862],
        [0.90241069, 0.04889945, 0.04868986],
        ...,
        [0.91159218, 0.0440892 , 0.04431862],
        [0.04033822, 0.03835997, 0.92130181],
        [0.89693084, 0.05139429, 0.05167487]])}

In [43]:
history

| | Pengujian Ke | K | Alpha | Beta | Accuracy |
|---|---|---|---|---|---|
| 0 | Pengujian Ke- 1 | 3 | 0.3 | 0.1 | 0.743092 |
| 1 | Pengujian Ke- 2 | 3 | 0.3 | 0.2 | 0.728983 |
| 2 | Pengujian Ke- 3 | 3 | 0.4 | 0.1 | 0.892416 |
| 3 | Pengujian Ke- 4 | 3 | 0.4 | 0.2 | 0.554968 |
| 4 | Pengujian Ke- 5 | 4 | 0.3 | 0.1 | 0.826573 |
| 5 | Pengujian Ke- 6 | 4 | 0.3 | 0.2 | 0.709583 |
| 6 | Pengujian Ke- 7 | 4 | 0.4 | 0.1 | 0.766608 |
| 7 | Pengujian Ke- 8 | 4 | 0.4 | 0.2 | 0.680776 |
| 8 | Pengujian Ke- 9 | 5 | 0.3 | 0.1 | 0.702528 |
| 9 | Pengujian Ke- 10 | 5 | 0.3 | 0.2 | 0.578483 |


In [44]:
history.to_csv("historypengujian.csv", index=False)

## Retrieving the Best LDA Model (K, Alpha, and Beta)

In [45]:
lda = best_param['lda']
lda_x_train = best_param['lda_Xtrain']
lda_x_test = best_param['lda_Xtest']

## Dimensionality Reduction Results

In [46]:
topik_columns = [f"Topik {i}" for i in range(1, best_param['k']+1)]
dokumen = final_tfidf['Dokumen']
output_proporsi_TD = pd.DataFrame(lda_x_train, columns=topik_columns)
output_proporsi_TD.insert(0,'Dokumen', dokumen)
output_proporsi_TD.insert(len(output_proporsi_TD.columns),'Label', final_tfidf['Label'])
output_proporsi_TD

| | Dokumen | Topik 1 | Topik 2 | Topik 3 | Label |
|---|---|---|---|---|---|
| 0 | kompascom anggota komisi vii dewan perwakilan ... | 0.804102 | 0.063545 | 0.132352 | Politik |
| 1 | kompascom juru bicara tim kampanye nasional tk... | 0.902411 | 0.048899 | 0.048690 | Politik |
| 2 | kompascom menteri pertahanan menhan prabowo su... | 0.901758 | 0.048996 | 0.049246 | Politik |
| 3 | kompascom menteri pertahanan menhan prabowo su... | 0.902411 | 0.048899 | 0.048690 | Politik |
| 4 | kompascom anggota dewan perwakilan daerah dpd ... | 0.783918 | 0.057330 | 0.158752 | Politik |
| ... | ... | ... | ... | ... | ... |
| 3964 | jakarta kompascom imparsial menilai solusi dit... | 0.037937 | 0.924377 | 0.037686 | Nasional |
| 3965 | jakarta kompascom salah panelis debat perdana ... | 0.035354 | 0.034639 | 0.930007 | Nasional |
| 3966 | jakarta kompascom pembina perkumpulan pemilu d... | 0.917991 | 0.039959 | 0.042050 | Nasional |
| 3967 | jakarta kompascom dinas perhubungan dishub dki... | 0.037937 | 0.924377 | 0.037686 | Nasional |
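
One point to watch when reading this table: `lda_x_train` is a plain array whose rows follow the shuffled order of `X_train`, while `final_tfidf['Dokumen']` and `final_tfidf['Label']` are in the original row order. Inserting them into a freshly indexed DataFrame pairs them by position 0, 1, 2, ..., so the document and label shown next to a topic row may not be the ones that produced it. The sketch below shows the idea of aligning by the training indices, using toy lists (all values are invented for illustration).

```python
# Toy stand-ins: five documents in original order
documents = ["doc A", "doc B", "doc C", "doc D", "doc E"]
labels = ["Politik", "Politik", "Nasional", "Food", "Food"]

# Hypothetical order of training rows after a shuffled split
train_index = [3, 0, 4]

# One topic-proportion row per training document, in train_index order
topic_rows = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.05, 0.90]]

# Look up each document and label by its original index, not by position
aligned = [
    {"Dokumen": documents[i], "Label": labels[i], "Topik": row}
    for i, row in zip(train_index, topic_rows)
]

for record in aligned:
    print(record)
```

With pandas, the equivalent lookup would be along the lines of `final_tfidf.loc[X_train.index, 'Dokumen'].to_numpy()` and `y_train.to_numpy()`, passed to `insert` in place of the full columns.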


In [47]:
output_proporsi_TD.to_csv('Reduksi Dimensi Berita.csv', index=False)

## Word Proportions for Each Topic

In [48]:
# Output the word distribution for each topic
distribusi_kata_topik = pd.DataFrame(lda.components_)
distribusi_kata_topik

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3428,3429,3430,3431,3432,3433,3434,3435,3436,3437
0,3.279699,0.100002,0.100002,24.71338,4.948582,0.100003,4.971052,1.009595,0.100003,0.1,...,0.1,2.296119,0.100001,0.58975,5.178234,0.100002,0.100001,0.100003,97.914885,0.1
1,0.100014,0.100003,0.10001,2.509258,0.100002,0.100005,0.100005,0.10001,1.544721,0.1,...,0.1,0.100005,0.100002,0.100004,0.100008,0.100003,0.100001,1.56943,0.100001,0.100001
2,0.100128,2.848942,1.409047,0.100003,0.100005,1.494293,0.983277,0.692094,0.100005,14.824091,...,14.824091,0.100016,2.744169,0.100117,0.100053,0.607696,5.388341,0.100001,0.1,14.662433
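
The values in `lda.components_` are pseudo-counts rather than probabilities, which is why they are not bounded by 1. Dividing each row by its sum turns it into a word distribution for that topic, and sorting a row reveals the topic's most characteristic words. The sketch below demonstrates both steps on a small invented matrix; the vocabulary and counts are assumptions for illustration.

```python
# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary
vocab = ["resep", "tempe", "menteri", "pemilu"]
components = [
    [24.7, 4.9, 0.1, 0.1],
    [0.1, 0.1, 14.8, 5.4],
]

def normalise(row):
    # Divide by the row total so the weights sum to 1
    total = sum(row)
    return [v / total for v in row]

topic_word = [normalise(row) for row in components]

def top_words(dist, vocab, n=2):
    # Rank the vocabulary by weight, highest first
    ranked = sorted(zip(vocab, dist), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]

for i, dist in enumerate(topic_word, start=1):
    print(f"Topik {i}:", top_words(dist, vocab))
```

In the notebook, the column positions of `lda.components_` correspond to the terms returned by `tfidf_vectorizer.get_feature_names_out()`, so those terms could serve as `vocab`.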


## Naive Bayes Model with LDA

In [49]:
# Retrieve the Naive Bayes model trained in the best run
model = best_param['model']

# Predict on the test data
y_pred = model.predict(lda_x_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi:", accuracy)

# Display the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred))

# Display the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.892416225749559
Laporan Klasifikasi:
              precision    recall  f1-score   support

        Food       0.00      0.00      0.00        42
    Nasional       0.89      1.00      0.94      1518
     Politik       0.00      0.00      0.00       141

    accuracy                           0.89      1701
   macro avg       0.30      0.33      0.31      1701
weighted avg       0.80      0.89      0.84      1701

Confusion Matriks:
[[   0   42    0]
 [   0 1518    0]
 [   0  141    0]]
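
The confusion matrix shows that every test document was predicted as `Nasional`. The short calculation below, using only the support counts from the report above, confirms that always predicting the majority class yields exactly the reported accuracy, so the 0.89 figure reflects the class imbalance rather than an ability to separate the three categories.

```python
# Support counts copied from the classification report above
support = {"Food": 42, "Nasional": 1518, "Politik": 141}
total = sum(support.values())

# A baseline that always predicts the most frequent class
majority = max(support, key=support.get)
baseline_accuracy = support[majority] / total

# Recall is 1 for the majority class and 0 for every other class
recalls = {label: (1.0 if label == majority else 0.0) for label in support}
macro_recall = sum(recalls.values()) / len(recalls)

print(majority, baseline_accuracy, round(macro_recall, 2))
```

The macro-averaged recall of about 0.33 matches the `macro avg` row of the report and is a more informative summary than accuracy for this model.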


## Prediction 1

In [50]:
data = ["Anggota Komisi VII Dewan Perwakilan Rakyat (DPR) Republik Indonesia (RI) Lamhot Sinaga mengatakan, kapasitas energi baru terbarukan (EBT) di Indonesia terus mengalami peningkatan selama satu dekade terakhir.Hal itu, sebut dia, tercatat dalam laporan tahunan Badan Energi Terbarukan Internasional atau International Renewable Energy Agency (Irena) bertajuk Renewable Energy Statistics 2023. Laporan ini menyebutkan bahwa kapasitas EBT di Indonesia mencapai 12.603 Megawatt (MW) pada 2022."]
a = tfidf_vectorizer.transform(data).toarray()
b = lda.transform(a)
model.predict(b)

array(['Nasional'], dtype='<U8')
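
The training documents were cleaned, lowercased, and stripped of stopwords before the vectorizer was fitted, whereas the text above is passed in raw. Applying the same steps at prediction time keeps new input consistent with what the vectorizer saw. The sketch below chains those steps in one helper; the function name `preprocess`, the small stopword set, and the use of `str.split` in place of `word_tokenize` are all assumptions made to keep the example free of downloads.

```python
import re

# Hypothetical subset of stopwords; the notebook uses NLTK's Indonesian list
STOPWORDS = {"di", "itu", "dia", "ini", "bahwa", "pada"}

def preprocess(text, stopword_set):
    # Same cleaning rule as the notebook: keep letters and whitespace only
    text = re.sub(r'[^a-zA-Z\s]', '', text).strip().lower()
    # Whitespace split as a stand-in for word_tokenize
    tokens = text.split()
    return " ".join(t for t in tokens if t not in stopword_set)

raw = ("Laporan ini menyebutkan bahwa kapasitas EBT di Indonesia "
       "mencapai 12.603 Megawatt (MW) pada 2022.")
prepared = preprocess(raw, STOPWORDS)
print(prepared)
```

The prepared string could then be passed as `tfidf_vectorizer.transform([prepared])` before the LDA transform and the classifier.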

In [51]:
joblib.dump(lda, "lda.pkl")
joblib.dump(model, "naive bayes.pkl")

['naive bayes.pkl']

# **Modeling without LDA**

In [52]:
# Create the Naive Bayes model
nb = GaussianNB()

# Train the model on the training data (raw TF-IDF features)
nb.fit(X_train, y_train)

# Predict on the test data
y_pred_nb = nb.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred_nb)
print("Akurasi:", accuracy)

# Display the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred_nb))

# Display the confusion matrix
confusion = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matriks:")
print(confusion)

Akurasi: 1.0
Laporan Klasifikasi:
              precision    recall  f1-score   support

        Food       1.00      1.00      1.00        42
    Nasional       1.00      1.00      1.00      1518
     Politik       1.00      1.00      1.00       141

    accuracy                           1.00      1701
   macro avg       1.00      1.00      1.00      1701
weighted avg       1.00      1.00      1.00      1701

Confusion Matriks:
[[  42    0    0]
 [   0 1518    0]
 [   0    0  141]]


## Prediction 2

In [53]:
data = ["Anggota Komisi VII Dewan Perwakilan Rakyat (DPR) Republik Indonesia (RI) Lamhot Sinaga mengatakan, kapasitas energi baru terbarukan (EBT) di Indonesia terus mengalami peningkatan selama satu dekade terakhir.Hal itu, sebut dia, tercatat dalam laporan tahunan Badan Energi Terbarukan Internasional atau International Renewable Energy Agency (Irena) bertajuk Renewable Energy Statistics 2023. Laporan ini menyebutkan bahwa kapasitas EBT di Indonesia mencapai 12.603 Megawatt (MW) pada 2022."]
tfidf_matrix = tfidf_vectorizer.transform(data).toarray()
nb.predict(tfidf_matrix)

array(['Politik'], dtype='<U8')

In [54]:
joblib.dump(nb, "Naive Bayes (Asli).pkl")

['Naive Bayes (Asli).pkl']

In [55]:
joblib.dump(tfidf_vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']