# **UAS**

## Deployment

[Link Streamlit](https://uasppw2023.streamlit.app)

## Import Library

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.svm import SVC

import pandas as pd
import warnings
import joblib
import nltk
import re

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


The code above imports the libraries and modules needed for text analysis with Naive Bayes, including Latent Dirichlet Allocation (LDA) for dimensionality reduction and text feature extraction.

## Load Dataset (transform to Term Frequency)

The code below uses `pandas` to read a CSV file from the given URL.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Feb11F/dataset/main/beritabanten%20(1).csv')
df

| | Judul | Isi | Tanggal | Kategori |
|---|---|---|---|---|
| 0 | Sekretaris KONI Kota Serang Jadi Bacalon Tungg... | SERANG– Satu kandidat calon Ketua Umum (Ketum)... | Rabu 9 Agu 2023, 19:02 WIB | Sport |
| 1 | Prediksi Irak vs Timnas Indonesia Kualifikasi ... | SERANG– Pertandingan Grup F Kualifikasi Piala ... | Kamis 16 Nov 2023, 04:17 WIB | Sport |
| 2 | Dua Pengedar Obat Tanpa Izin Edar Ditangkap Po... | PANDEGLANG– Satuan Reserse Narkoba (Satresnark... | Sabtu 16 Sep 2023, 11:46 WIB | hukum |
| 3 | FIFA Cabut Status Indonesia Sebagai Tuan Rumah... | JAKARTA– Indonesia batal menggelar Piala Dunia... | Kamis 30 Mar 2023, 00:12 WIB | Sport |
| 4 | Tiga Pengurus Panahan Banten Dilirik Pusat | SERANG– Ketua Umum Pengurus Provinsi Persatuan... | Minggu 12 Feb 2023, 12:08 WIB | Sport |
| ... | ... | ... | ... | ... |
| 1195 | Sosialisasi E-Sport di Cilegon Merambah ke Ber... | CILEGON– Demam Electronic Sport atau E-Sport k... | Senin 17 Apr 2023, 12:07 WIB | Sport |
| 1196 | Kontingen Atlet Basket U-15 Banten Bersaing di... | KAB. TANGERANG– Ketua Umum Perbasi Banten, Ahm... | Jumat 6 Jan 2023, 05:07 WIB | Sport |
| 1197 | Rencana Pertemuan SBY dan Megawati, Buka Pelua... | SERANG– Ketua Majelis Tinggi Partai Demokrat S... | Selasa 5 Sep 2023, 03:10 WIB | Politik |
| 1198 | 10 Kejutan Piala Dunia Terbesar Teratas | SERANG– Setelah kemenangan yang mengejutkan sa... | Kamis 24 Nov 2022, 07:13 WIB | Sport |


## Checking for Null Data

`df.isnull().sum()` counts the missing values in the DataFrame `df`. It returns a Series with one entry per column, giving the number of missing values in that column.

In [None]:
df.isnull().sum()

Judul       0
Isi         0
Tanggal     0
Kategori    0
dtype: int64
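The dataset above has no missing values, so the check is easier to see on data that does. The following is a minimal sketch on a hypothetical toy DataFrame (the rows are invented for illustration; only the column names mirror the dataset). It also shows one possible way to drop rows whose `Isi` is missing, which the notebook itself did not need to do.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Judul": ["A", "B", None],
    "Isi": ["text one", None, None],
    "Kategori": ["Sport", "hukum", "Politik"],
})

# isnull() marks missing cells as True; sum() counts them column by column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows without 'Isi' cannot be cleaned or tokenized, so one option is to drop them.
toy_clean = toy.dropna(subset=["Isi"]).reset_index(drop=True)
print(len(toy_clean))
```
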

## Cleaning

In [None]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Isi'].apply(cleaning)
df['Cleaning']

0       SERANG Satu kandidat calon Ketua Umum Ketum KO...
1       SERANG Pertandingan Grup F Kualifikasi Piala D...
2       PANDEGLANG Satuan Reserse Narkoba Satresnarkob...
3       JAKARTA Indonesia batal menggelar Piala Dunia ...
4       SERANG Ketua Umum Pengurus Provinsi Persatuan ...
                              ...                        
1195    CILEGON Demam Electronic Sport atau ESport kia...
1196    KAB TANGERANG Ketua Umum Perbasi Banten Ahmed ...
1197    SERANG Ketua Majelis Tinggi Partai Demokrat Su...
1198    SERANG Setelah kemenangan yang mengejutkan saa...
1199    JAKARTA Timnas Indonesia berusaha bangkit sete...
Name: Cleaning, Length: 1200, dtype: object
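To see what the regular expression keeps and drops, the sketch below runs the same `cleaning` function on one sample string. Because removed characters are replaced with an empty string, a hyphenated word such as `E-Sport` is fused into `ESport`, as the output above shows. The second function, `cleaning_spaced`, is a hypothetical variant (not used in this notebook) that substitutes a space instead, so the two halves stay separate tokens.

```python
import re

def cleaning(text):
    # Same pattern as the notebook: drop everything except ASCII letters and whitespace.
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()

def cleaning_spaced(text):
    # Hypothetical variant: replace removed characters with a space,
    # then collapse runs of whitespace into a single space.
    spaced = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return re.sub(r'\s+', ' ', spaced).strip()

sample = "SERANG– Satu kandidat (Ketum) E-Sport 2023"
print(cleaning(sample))         # SERANG Satu kandidat Ketum ESport
print(cleaning_spaced(sample))  # SERANG Satu kandidat Ketum E Sport
```
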

## Tokenizing

In [None]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0       [serang, satu, kandidat, calon, ketua, umum, k...
1       [serang, pertandingan, grup, f, kualifikasi, p...
2       [pandeglang, satuan, reserse, narkoba, satresn...
3       [jakarta, indonesia, batal, menggelar, piala, ...
4       [serang, ketua, umum, pengurus, provinsi, pers...
                              ...                        
1195    [cilegon, demam, electronic, sport, atau, espo...
1196    [kab, tangerang, ketua, umum, perbasi, banten,...
1197    [serang, ketua, majelis, tinggi, partai, demok...
1198    [serang, setelah, kemenangan, yang, mengejutka...
1199    [jakarta, timnas, indonesia, berusaha, bangkit...
Name: Tokenizing, Length: 1200, dtype: object

## Stopword Removal

In [None]:
corpus = stopwords.words('indonesian')

def stopwordText(words):
 return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0       serang kandidat calon ketua ketum koni kota se...
1       serang pertandingan grup f kualifikasi piala d...
2       pandeglang satuan reserse narkoba satresnarkob...
3       jakarta indonesia batal menggelar piala dunia ...
4       serang ketua pengurus provinsi persatuan panah...
                              ...                        
1195    cilegon demam electronic sport esport kian mer...
1196    kab tangerang ketua perbasi banten ahmed zaki ...
1197    serang ketua majelis partai demokrat susilo ba...
1198    serang kemenangan mengejutkan arab saudi menga...
1199    jakarta timnas indonesia berusaha bangkit mene...
Name: Full Text, Length: 1200, dtype: object
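`stopwords.words('indonesian')` returns a Python list, so every `word not in corpus` test scans the list. Converting the list to a `set` is a common optimization that leaves the result unchanged. The sketch below illustrates the idea with a hypothetical five-word stopword list in place of the NLTK corpus, so that it runs without downloading anything.

```python
# Hypothetical miniature stopword list; the notebook uses stopwords.words('indonesian').
stopword_list = ["yang", "dan", "di", "satu", "umum"]
stopword_set = set(stopword_list)  # hashing gives constant-time membership tests on average

def remove_stopwords(tokens, stopword_lookup):
    # Keep only the tokens that are not stopwords, preserving their order.
    return [token for token in tokens if token not in stopword_lookup]

tokens = ["serang", "satu", "kandidat", "calon", "ketua", "umum"]
filtered = remove_stopwords(tokens, stopword_set)
print(filtered)
print(" ".join(filtered))  # tokens joined back into one string, like 'Full Text'
```
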

## TFIDF

In [None]:
def tfidf(dokumen, category):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)
  final_tfidf.insert(len(final_tfidf.columns),'Category', category)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'], df['Kategori'])
final_tfidf

Unnamed: 0,Dokumen,aa,aafi,aal,aamiin,aan,aang,aansementara,aap,aardianodia,...,zulhas,zulhasdan,zulhasred,zulkarnain,zulkfli,zulkifli,zullfan,zurich,zvezda,Category
0,serang kandidat calon ketua ketum koni kota se...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
1,serang pertandingan grup f kualifikasi piala d...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
2,pandeglang satuan reserse narkoba satresnarkob...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,hukum
3,jakarta indonesia batal menggelar piala dunia ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
4,serang ketua pengurus provinsi persatuan panah...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,cilegon demam electronic sport esport kian mer...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
1196,kab tangerang ketua perbasi banten ahmed zaki ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
1197,serang ketua majelis partai demokrat susilo ba...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Politik
1198,serang kemenangan mengejutkan arab saudi menga...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sport
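The weights in the table come from `TfidfVectorizer` with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of every row. The following pure-Python sketch reproduces that calculation on a hypothetical three-document corpus so the arithmetic is visible. It splits on whitespace only, whereas the real vectorizer also discards single-character tokens by default, so treat it as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

# Hypothetical miniature corpus, already cleaned and lowercased.
docs = [
    "piala dunia indonesia",
    "partai demokrat indonesia",
    "piala dunia piala",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    # Raw count times idf for every vocabulary term, then L2-normalize the row.
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

Terms that occur in fewer documents receive a larger idf, which is why rare words dominate a document's vector even when their raw counts are small.
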


## Declaring X and y by Dropping the Document and Label Columns

This step separates the features (X) from the target (y) for modeling:

- `X = final_tfidf.drop(['Dokumen', 'Category'], axis=1)`: builds the feature DataFrame `X`. The `drop` method with `axis=1` removes the 'Dokumen' and 'Category' columns from `final_tfidf`, so `X` keeps only the TF-IDF weight of every term. These columns are the model's input features.

- `y = df['Kategori']`: builds the Series `y` containing the label to predict. It takes the 'Kategori' column from `df`, which holds the same values as the 'Category' column of `final_tfidf`. This is the target variable for classification.

In [None]:
X = final_tfidf.drop(['Dokumen', 'Category'], axis=1)
X

Unnamed: 0,aa,aafi,aal,aamiin,aan,aang,aansementara,aap,aardianodia,aaron,...,zulfan,zulhas,zulhasdan,zulhasred,zulkarnain,zulkfli,zulkifli,zullfan,zurich,zvezda
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1198,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
y = df['Kategori']
y

0         Sport
1         Sport
2         hukum
3         Sport
4         Sport
         ...   
1195      Sport
1196      Sport
1197    Politik
1198      Sport
1199      Sport
Name: Kategori, Length: 1200, dtype: object

## Splitting Data

This step uses the `train_test_split` function from scikit-learn to split the dataset into a training subset and a testing subset:

- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`: splits the data into four subsets: `X_train` (training features), `X_test` (testing features), `y_train` (training targets), and `y_test` (testing targets).

    - `X` is the DataFrame of features.
    - `y` is the Series of targets (labels).
    - `test_size=0.3` assigns 30% of the data to the test set and the remaining 70% to the training set. With 1,200 rows, that is 360 test rows and 840 training rows.
    - `random_state=42` fixes the random seed so the split is reproducible. Any other integer works as well.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
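The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses hypothetical toy data with ten samples and repeats the split twice with the same seed. It also passes `stratify`, which is an addition not used in this notebook: it keeps the class proportions similar in both subsets, which may help when categories are unevenly represented.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, two balanced classes.
X_toy = [[i] for i in range(10)]
y_toy = ["Sport"] * 5 + ["hukum"] * 5

# Same test_size and random_state as the notebook; stratify is an optional extra.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
print(Counter(y_te))

# Repeating the call with the same seed returns the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_te == X_te2)
```
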

## **Modeling With LDA**

### Finding the Best Parameters for LDA

In [None]:
def find_best_lda(Xtrain, Xtest, n_components, alpha, beta):
  looping = 1
  best = {'k' : 0, 'alpha' : 0, 'beta' : 0, 'accuracy' : 0, 'model': '', 'lda' : '', 'lda_Xtrain' : '', 'lda_Xtest' : ''}
  history = pd.DataFrame(columns=["Pengujian Ke", "K", "Alpha", "Beta", "Accuracy"])

  # Grid search over every (k, alpha, beta) combination
  for k in n_components:
    for a in alpha:
      for b in beta:
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=a, topic_word_prior=b)
        lda_Xtrain = lda.fit_transform(Xtrain)
        lda_Xtest = lda.transform(Xtest)

        # Create the Naive Bayes model
        model = GaussianNB()

        # Train the model on the training data (y_train is the global label Series)
        model.fit(lda_Xtrain, y_train)

        # Predict on the test data
        y_pred = model.predict(lda_Xtest)

        # Compute the accuracy (y_test is the global label Series)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Jumlah Topik: {k}, Alpha: {a}, Beta: {b}, Accuracy: {accuracy}")

        if accuracy > best['accuracy']:
          best['accuracy'] = accuracy
          best['k'] = k
          best['alpha'] = a
          best['beta'] = b
          best['model'] = model
          best['lda'] = lda
          best['lda_Xtrain'] = lda_Xtrain
          best['lda_Xtest'] = lda_Xtest

        history.loc[len(history)] = [f"Pengujian Ke- {looping}", k, a, b, accuracy]
        looping += 1

  return (best, history)

k = [3, 4, 5]
alpha = [0.3, 0.4]
beta = [0.1, 0.2]
best_param, history = find_best_lda(X_train, X_test, k, alpha, beta)

Jumlah Topik: 3, Alpha: 0.3, Beta: 0.1, Accuracy: 0.575
Jumlah Topik: 3, Alpha: 0.3, Beta: 0.2, Accuracy: 0.6416666666666667
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.1, Accuracy: 0.6222222222222222
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.2, Accuracy: 0.3888888888888889
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.1, Accuracy: 0.9
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.2, Accuracy: 0.3638888888888889
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.1, Accuracy: 0.6805555555555556
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.2, Accuracy: 0.6333333333333333
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.1, Accuracy: 0.6222222222222222
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.2, Accuracy: 0.6194444444444445
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.1, Accuracy: 0.7583333333333333
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.2, Accuracy: 0.5611111111111111
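The three nested loops visit every combination of `k`, `alpha`, and `beta`. As a sketch of an equivalent formulation, `itertools.product` yields the same combinations in the same order, and `max` with a key function selects the best run. The accuracies are copied from the log above; the variable names are chosen here for illustration.

```python
from itertools import product

k_values = [3, 4, 5]
alpha_values = [0.3, 0.4]
beta_values = [0.1, 0.2]

# product() yields (k, alpha, beta) tuples in the same order as the nested loops.
grid = list(product(k_values, alpha_values, beta_values))
print(len(grid))

# Accuracies copied from the log above, in the order they were printed.
accuracies = [
    0.575, 0.6416666666666667, 0.6222222222222222, 0.3888888888888889,
    0.9, 0.3638888888888889, 0.6805555555555556, 0.6333333333333333,
    0.6222222222222222, 0.6194444444444445, 0.7583333333333333, 0.5611111111111111,
]

# Pair every parameter tuple with its accuracy and keep the highest-scoring one.
results = list(zip(grid, accuracies))
best_params, best_accuracy = max(results, key=lambda item: item[1])
print(best_params, best_accuracy)
```
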


In [None]:
best_param

{'k': 4,
 'alpha': 0.3,
 'beta': 0.1,
 'accuracy': 0.9,
 'model': GaussianNB(),
 'lda': LatentDirichletAllocation(doc_topic_prior=0.3, n_components=4,
                           topic_word_prior=0.1),
 'lda_Xtrain': array([[0.11167298, 0.06938836, 0.35566841, 0.46327024],
        [0.04811573, 0.45940694, 0.39720573, 0.0952716 ],
        [0.35758938, 0.0342325 , 0.03366905, 0.57450907],
        ...,
        [0.31523774, 0.5495225 , 0.05842989, 0.07680986],
        [0.53616335, 0.27920354, 0.05365378, 0.13097932],
        [0.04651574, 0.85347376, 0.04237216, 0.05763834]]),
 'lda_Xtest': array([[0.07353487, 0.72326061, 0.06398067, 0.13922385],
        [0.07545652, 0.42960799, 0.06121119, 0.4337243 ],
        [0.30581655, 0.15764694, 0.39108264, 0.14545388],
        ...,
        [0.42690611, 0.2252452 , 0.09115185, 0.25669683],
        [0.85255759, 0.04900407, 0.03131822, 0.06712012],
        [0.08617082, 0.21358072, 0.0532875 , 0.64696096]])}

In [None]:
history

| | Pengujian Ke | K | Alpha | Beta | Accuracy |
|---|---|---|---|---|---|
| 0 | Pengujian Ke- 1 | 3 | 0.3 | 0.1 | 0.575 |
| 1 | Pengujian Ke- 2 | 3 | 0.3 | 0.2 | 0.641667 |
| 2 | Pengujian Ke- 3 | 3 | 0.4 | 0.1 | 0.622222 |
| 3 | Pengujian Ke- 4 | 3 | 0.4 | 0.2 | 0.388889 |
| 4 | Pengujian Ke- 5 | 4 | 0.3 | 0.1 | 0.9 |
| 5 | Pengujian Ke- 6 | 4 | 0.3 | 0.2 | 0.363889 |
| 6 | Pengujian Ke- 7 | 4 | 0.4 | 0.1 | 0.680556 |
| 7 | Pengujian Ke- 8 | 4 | 0.4 | 0.2 | 0.633333 |
| 8 | Pengujian Ke- 9 | 5 | 0.3 | 0.1 | 0.622222 |
| 9 | Pengujian Ke- 10 | 5 | 0.3 | 0.2 | 0.619444 |


In [None]:
history.to_csv("history.csv", index=False)

### Declaring K, Alpha, and Beta

### LDA

The next step uses Latent Dirichlet Allocation (LDA) to reduce the dimensionality of the text data. LDA identifies the main topics that occur across the documents. The result is a document-topic representation: each document is described as a mixture of the topics, so its feature vector shrinks from one column per term to one column per topic.
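The sketch below shows this document-topic representation on a hypothetical 4 x 6 count matrix. Each output row is a topic mixture that sums to 1. It also sets `random_state`, which the search above does not: LDA starts from a random initialization, so a fixed seed is one way to make repeated runs return the same topic proportions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term count matrix: 4 documents, 6 terms.
counts = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 3],
])

def fit_lda(matrix, seed):
    # Same prior arguments as the notebook, plus a fixed seed for repeatability.
    lda_toy = LatentDirichletAllocation(
        n_components=2, doc_topic_prior=0.3, topic_word_prior=0.1, random_state=seed
    )
    return lda_toy.fit_transform(matrix)

doc_topic = fit_lda(counts, seed=0)
print(doc_topic.shape)        # one row per document, one column per topic
print(doc_topic.sum(axis=1))  # each row is a mixture that sums to 1

# Fitting again with the same seed reproduces the same proportions.
doc_topic_again = fit_lda(counts, seed=0)
print(np.allclose(doc_topic, doc_topic_again))
```
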



In [None]:
lda = best_param['lda']
lda_x_train = best_param['lda_Xtrain']
lda_x_test = best_param['lda_Xtest']

### Viewing the Dimensionality Reduction Result

In [None]:
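topik_columns = [f"Topik {i}" for i in range(1, best_param['k']+1)]
dokumen = final_tfidf['Dokumen']
# lda_x_train follows the shuffled row order of X_train, so reuse its index;
# insert() then aligns 'Dokumen' and 'Category' with the matching training rows.
output_proporsi_TD = pd.DataFrame(lda_x_train, columns=topik_columns, index=X_train.index)
output_proporsi_TD.insert(0,'Dokumen', dokumen)
output_proporsi_TD.insert(len(output_proporsi_TD.columns),'Category', final_tfidf['Category'])
output_proporsi_TD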

Unnamed: 0,Dokumen,Topik 1,Topik 2,Topik 3,Topik 4,Category
0,serang kandidat calon ketua ketum koni kota se...,0.111673,0.069388,0.355668,0.463270,Sport
1,serang pertandingan grup f kualifikasi piala d...,0.048116,0.459407,0.397206,0.095272,Sport
2,pandeglang satuan reserse narkoba satresnarkob...,0.357589,0.034232,0.033669,0.574509,hukum
3,jakarta indonesia batal menggelar piala dunia ...,0.034414,0.073872,0.356195,0.535519,Sport
4,serang ketua pengurus provinsi persatuan panah...,0.072029,0.540037,0.238715,0.149218,Sport
...,...,...,...,...,...,...
835,tangerang turnamen sepakbola walikota cup diik...,0.045815,0.044038,0.049847,0.860300,Sport
836,lebak satuan reserse narkoba satresnarkoba pol...,0.052314,0.043971,0.046504,0.857210,hukum
837,pandeglang polres pandeglang menyita botol min...,0.315238,0.549523,0.058430,0.076810,hukum
838,serang perburuan satwa liar ilegal jenis badak...,0.536163,0.279204,0.053654,0.130979,hukum


### Saving the Dimensionality Reduction Result

In [None]:
output_proporsi_TD.to_csv('reduksi dimensi.csv', index=False)

### Viewing the Word Distribution for Each Topic

In [None]:
# Topic-word weights: one row per topic, one column per term
distribusi_kata_topik = pd.DataFrame(lda.components_)
distribusi_kata_topik

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27804,27805,27806,27807,27808,27809,27810,27811,27812,27813
0,0.101821,0.100009,0.177232,0.100192,0.100008,0.100006,0.100007,0.100453,0.100125,0.153549,...,0.1,0.100001,0.1,0.100003,0.1,0.1,0.100001,0.1,0.100005,0.1
1,0.267228,0.222475,0.1738,0.104155,0.100025,0.178326,0.10006,0.180356,0.175531,0.100206,...,0.1,0.100002,0.1,0.100011,0.1,0.1,0.100001,0.1,0.100128,0.1
2,0.17075,0.100004,0.100012,0.133437,0.100037,0.100007,0.100016,0.100007,0.100008,0.100219,...,0.1,0.100024,0.1,0.100013,0.1,0.1,0.100012,0.1,0.223033,0.1
3,0.100242,0.100016,0.100765,0.100191,0.545968,0.100565,0.182447,0.100104,0.100701,0.100051,...,0.1,0.476789,0.1,0.196963,0.1,0.1,0.540343,0.1,0.100025,0.1
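The values in `lda.components_` are unnormalized topic-word weights, which is why most entries sit near the `topic_word_prior` of 0.1. Dividing each row by its sum turns it into a probability distribution over terms, and sorting a row gives the top words of that topic. The sketch below illustrates this with a hypothetical 2 x 4 array and invented term names standing in for `lda.components_` and `tfidf_vectorizer.get_feature_names_out()`.

```python
import numpy as np

# Hypothetical stand-ins for lda.components_ and the vectorizer's feature names.
components = np.array([
    [0.1, 2.1, 0.1, 5.1],
    [4.1, 0.1, 3.1, 0.1],
])
terms = np.array(["hakim", "gol", "polisi", "piala"])

# Normalize each row so it sums to 1: P(term | topic).
word_given_topic = components / components.sum(axis=1, keepdims=True)

# For every topic, take the indices of the largest probabilities.
top_n = 2
top_terms = [
    terms[np.argsort(row)[::-1][:top_n]].tolist()
    for row in word_given_topic
]
for topic_id, words in enumerate(top_terms, start=1):
    print(f"Topik {topic_id}:", words)
```
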


### Model Naive Bayes With LDA

In [None]:
# Retrieve the best Naive Bayes model found by the search
model = best_param['model']

# Predict on the test data
y_pred = model.predict(lda_x_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi:", accuracy)

# Print the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred))

# Print the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.9
Laporan Klasifikasi:
              precision    recall  f1-score   support

     Politik       0.84      0.98      0.90       123
       Sport       0.95      0.83      0.89       125
       hukum       0.93      0.89      0.91       112

    accuracy                           0.90       360
   macro avg       0.91      0.90      0.90       360
weighted avg       0.91      0.90      0.90       360

Confusion Matriks:
[[120   0   3]
 [ 16 104   5]
 [  7   5 100]]
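The figures in the classification report can be derived directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. The following sketch recomputes accuracy, precision, and recall from the matrix printed above using plain Python; the variable names are chosen here for illustration.

```python
labels = ["Politik", "Sport", "hukum"]

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
confusion = [
    [120, 0, 3],
    [16, 104, 5],
    [7, 5, 100],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    true_count = sum(confusion[i])                      # row sum: samples of this class
    predicted_count = sum(row[i] for row in confusion)  # column sum: predictions of this class
    metrics[label] = {
        "precision": confusion[i][i] / predicted_count,
        "recall": confusion[i][i] / true_count,
    }

print(f"accuracy = {accuracy:.2f}")
for label, values in metrics.items():
    print(label, round(values["precision"], 2), round(values["recall"], 2))
```

Reading the matrix this way also shows where the errors concentrate: 16 `Sport` articles were predicted as `Politik`, which accounts for the lower `Sport` recall of 0.83.
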


### Predict

In [None]:
data = ["Penelitian ini menggabungkan konsep kecerdasan buatan dengan algoritma penjadwalan dalam upaya meningkatkan efisiensi produksi dalam lingkungan manufaktur. Kami memperkenalkan pendekatan yang memanfaatkan kecerdasan komputasional, yaitu algoritma optimasi berbasis swarm intelligence, seperti algoritma PSO (Particle Swarm Optimization) dan algoritma ACO (Ant Colony Optimization). Tujuan utama penelitian ini adalah untuk mengoptimalkan jadwal produksi dengan meminimalkan waktu produksi dan biaya, sambil mempertimbangkan berbagai kendala produksi seperti kapasitas mesin, waktu pemrosesan, dan persyaratan bahan baku. Melalui eksperimen dan simulasi, kami membandingkan hasil dari algoritma swarm intelligence dengan pendekatan konvensional. Hasilnya menunjukkan bahwa algoritma PSO dan ACO dapat menghasilkan jadwal produksi yang lebih efisien, dengan waktu produksi yang lebih pendek dan biaya yang lebih rendah. Selain itu, algoritma ini mampu beradaptasi dengan perubahan dalam lingkungan produksi dan menghasilkan jadwal yang optimal bahkan dalam situasi yang kompleks. Penelitian ini menunjukkan potensi besar dari penggunaan kecerdasan komputasional dalam meningkatkan efisiensi dan produktivitas dalam industri manufaktur. Hasil penelitian ini dapat digunakan sebagai dasar untuk mengembangkan sistem penjadwalan cerdas yang dapat diterapkan dalam berbagai industri."]
a = tfidf_vectorizer.transform(data).toarray()
b = lda.transform(a)
model.predict(b)

array(['Tech'], dtype=object)

### Save Model

In [None]:
joblib.dump(lda, "lda.pkl")
joblib.dump(model, "naive bayes.pkl")

['naive bayes.pkl']

## Modeling Without LDA

### Training the Model on the Original Dataset

In [None]:
# Create the Naive Bayes model
nb = GaussianNB()

# Train the model on the training data
nb.fit(X_train, y_train)

# Predict on the test data
y_pred_nb = nb.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred_nb)
print("Akurasi:", accuracy)

# Print the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred_nb))

# Print the confusion matrix
confusion = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.9583333333333334
Laporan Klasifikasi:
              precision    recall  f1-score   support

     Politik       0.96      0.94      0.95       123
       Sport       0.98      0.98      0.98       125
       hukum       0.93      0.96      0.94       112

    accuracy                           0.96       360
   macro avg       0.96      0.96      0.96       360
weighted avg       0.96      0.96      0.96       360

Confusion Matriks:
[[116   0   7]
 [  2 122   1]
 [  3   2 107]]


### Predict

In [None]:
data = ["Penelitian ini menggabungkan konsep kecerdasan buatan dengan algoritma penjadwalan dalam upaya meningkatkan efisiensi produksi dalam lingkungan manufaktur. Kami memperkenalkan pendekatan yang memanfaatkan kecerdasan komputasional, yaitu algoritma optimasi berbasis swarm intelligence, seperti algoritma PSO (Particle Swarm Optimization) dan algoritma ACO (Ant Colony Optimization). Tujuan utama penelitian ini adalah untuk mengoptimalkan jadwal produksi dengan meminimalkan waktu produksi dan biaya, sambil mempertimbangkan berbagai kendala produksi seperti kapasitas mesin, waktu pemrosesan, dan persyaratan bahan baku. Melalui eksperimen dan simulasi, kami membandingkan hasil dari algoritma swarm intelligence dengan pendekatan konvensional. Hasilnya menunjukkan bahwa algoritma PSO dan ACO dapat menghasilkan jadwal produksi yang lebih efisien, dengan waktu produksi yang lebih pendek dan biaya yang lebih rendah. Selain itu, algoritma ini mampu beradaptasi dengan perubahan dalam lingkungan produksi dan menghasilkan jadwal yang optimal bahkan dalam situasi yang kompleks. Penelitian ini menunjukkan potensi besar dari penggunaan kecerdasan komputasional dalam meningkatkan efisiensi dan produktivitas dalam industri manufaktur. Hasil penelitian ini dapat digunakan sebagai dasar untuk mengembangkan sistem penjadwalan cerdas yang dapat diterapkan dalam berbagai industri."]
tfidf_matrix = tfidf_vectorizer.transform(data).toarray()
nb.predict(tfidf_matrix)

array(['Politik'], dtype='<U7')

### Save Model

In [None]:
joblib.dump(nb, "Naive Bayes (Asli).pkl")

['Naive Bayes (Asli).pkl']

## Save Vectorizer

In [None]:
joblib.dump(tfidf_vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']
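The saved files can later be restored with `joblib.load`, for example in the deployed app, and applied in the same order as in the Predict sections: vectorizer, then LDA, then classifier. The sketch below only demonstrates the dump and load round trip. It uses a hypothetical toy `GaussianNB` and a temporary directory, so it does not depend on the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy model standing in for the fitted objects saved above.
X_toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_toy = np.array(["Sport", "Sport", "hukum", "hukum"])
toy_model = GaussianNB().fit(X_toy, y_toy)

# Write the model to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "naive bayes.pkl")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model predicts exactly like the original one.
new_sample = np.array([[0.85, 0.15]])
print(restored.predict(new_sample))
```
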