EUGENIUS KRISWINAR ADI CAHYA/235314099

Naive Bayes Implementation in Python #1

In [1]:
from collections import Counter
import math

In [2]:
# Training data
spam_docs = ["Promo besar hari ini diskon 50%!", "Gratis hadiah untuk pelanggan setia!", "Segera klaim hadiah spesial kamu!"]
not_spam_docs = ["Halo, bagaimana kabarmu hari ini?", "Jangan lupa meeting besok pukul 10.00", "Dokumen penting sudah dikirim ke email"]

In [3]:
# Tokenization: lowercase, strip "!" and "%", split on whitespace
def tokenize(text):
    return text.lower().replace("!", "").replace("%", "").split()
spam_words = [word for doc in spam_docs for word in tokenize(doc)]
not_spam_words = [word for doc in not_spam_docs for word in tokenize(doc)]

In [4]:
# Compute the probabilities for Naive Bayes
spam_counts = Counter(spam_words)
not_spam_counts = Counter(not_spam_words)

# Vocabulary size and total word count per category
V = len(set(spam_words + not_spam_words))
total_spam = len(spam_words)
total_not_spam = len(not_spam_words)

def get_prob(word, category_counts, total_count):
    """
    Compute a word's probability with Laplace smoothing.
    Args:
        word: the word whose probability is computed
        category_counts: Counter holding the word frequencies of the category
        total_count: total number of words in the category
    Returns:
        float: the smoothed probability
    """
    return (category_counts.get(word, 0) + 1) / (total_count + V)

In [5]:
# Predict a new email
email = "Hadiah besar gratis untuk kamu!"
words = tokenize(email)

spam_prob = math.prod([get_prob(word, spam_counts, total_spam) for word in words])
not_spam_prob = math.prod([get_prob(word, not_spam_counts, total_not_spam) for word in words])

print("Spam Probability:", spam_prob)
print("Not Spam Probability:", not_spam_prob)
print("Prediction:", "Spam" if spam_prob > not_spam_prob else "Not Spam")

Spam Probability: 2.0929167208772063e-07
Not Spam Probability: 3.9245856642232505e-09
Prediction: Spam


The code above implements the **Naive Bayes** algorithm to detect spam email.

Steps:
1. Collect training data (spam and not spam).
2. Tokenize the text.
3. Count word frequencies with Counter.
4. Use Laplace smoothing to compute word probabilities, so a word never seen in a category still gets a non-zero probability.
5. Multiply the word probabilities to score each category for the new email. Class priors are left out, which does not change the result here because both categories have three training emails.
6. Pick the category with the highest score.
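
Multiplying many small probabilities can underflow to zero on longer emails. A minimal sketch of the same classifier scored in log space is shown below; it reuses the training texts and tokenizer rule from above, while the helper names train and log_score and the explicit log prior are assumptions added for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Same rule as above: lowercase, drop "!" and "%", split on whitespace.
    return text.lower().replace("!", "").replace("%", "").split()

spam_docs = ["Promo besar hari ini diskon 50%!", "Gratis hadiah untuk pelanggan setia!", "Segera klaim hadiah spesial kamu!"]
not_spam_docs = ["Halo, bagaimana kabarmu hari ini?", "Jangan lupa meeting besok pukul 10.00", "Dokumen penting sudah dikirim ke email"]

def train(docs_by_class):
    """Return word counts, token totals, log priors, and the vocabulary size."""
    counts = {c: Counter(w for d in docs for w in tokenize(d)) for c, docs in docs_by_class.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    log_priors = {c: math.log(len(docs) / n_docs) for c, docs in docs_by_class.items()}
    vocab_size = len(set().union(*counts.values()))
    return counts, totals, log_priors, vocab_size

def log_score(words, c, counts, totals, log_priors, vocab_size):
    # log P(c) + sum of log P(word | c), with Laplace (add-one) smoothing
    return log_priors[c] + sum(
        math.log((counts[c][w] + 1) / (totals[c] + vocab_size)) for w in words
    )

model = train({"Spam": spam_docs, "Not Spam": not_spam_docs})
words = tokenize("Hadiah besar gratis untuk kamu!")
scores = {c: log_score(words, c, *model) for c in ("Spam", "Not Spam")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because the logarithm is monotonic, the larger log score belongs to the same category that wins the product comparison. Exponentiating a score gives the product from the cell above multiplied by the prior of 0.5.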

Naive Bayes Implementation in Python #2

In [6]:
# Dataset
texts = [
    "Promo besar hari ini diskon 50%!",          # spam
    "Gratis hadiah untuk pelanggan setia!",      # spam 
    "Segera klaim hadiah spesial kamu!",         # spam
    "Halo, bagaimana kabarmu hari ini?",         # not spam
    "Jangan lupa meeting besok pukul 10.00",     # not spam
    "Dokumen penting sudah dikirim ke email"     # not spam
]

labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

In [7]:
# 1. Tokenization and stopword removal
stopwords = ["ke"]

def preprocess(text):
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]

processed_texts = [preprocess(text) for text in texts]

In [8]:
# 2. Build the vocabulary
vocab = {}
index = 0
for text in processed_texts:
    for token in text:
        if token not in vocab:
            vocab[token] = index
            index += 1

In [9]:
# 3. Convert to vectors (bag of words)
def text_to_vector(tokens, vocab):
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1
    return vector

X = [text_to_vector(tokens, vocab) for tokens in processed_texts]
y = [1 if label == "not spam" else 0 for label in labels]

In [10]:
import math
import numpy as np
# 4. Naive Bayes implementation
class NaiveBayes:
    def __init__(self, alpha=1):
        self.alpha = alpha  # Laplace smoothing

    def fit(self, X, y):
        y = np.asarray(y)  # makes y == c an explicit element-wise comparison
        n_samples, n_features = len(X), len(X[0])
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        
        # Prior probabilities: fraction of training samples in each class
        self.priors = np.zeros(n_classes)
        for idx, c in enumerate(self.classes):
            self.priors[idx] = np.sum(y == c) / n_samples
        
        # Likelihoods: smoothed relative frequency of each word within a class
        self.likelihoods = np.zeros((n_classes, n_features))
        for idx, c in enumerate(self.classes):
            X_c = [X[i] for i in range(n_samples) if y[i] == c]
            total_words_c = sum(sum(x) for x in X_c)
            for j in range(n_features):
                count_j = sum(x[j] for x in X_c)
                self.likelihoods[idx][j] = (count_j + self.alpha) / (total_words_c + self.alpha * n_features)
        print(self.likelihoods)

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = []
            for idx in range(len(self.classes)):
                prior = self.priors[idx]
                # Raise each likelihood to the word count so repeated words are counted each time
                likelihood = math.prod(self.likelihoods[idx][j] ** val for j, val in enumerate(x) if val > 0)
                posteriors.append(prior * likelihood)
            predictions.append(self.classes[np.argmax(posteriors)])
            print(posteriors)
        return predictions

In [11]:
# 5. Example usage
nb = NaiveBayes(alpha=1)
nb.fit(X, y)
test_text = preprocess("Hadiah besar gratis untuk kamu!")
test_vector = text_to_vector(test_text, vocab)
print(nb.predict([test_vector]))

[[0.04347826 0.04347826 0.04347826 0.04347826 0.04347826 0.04347826
  0.04347826 0.06521739 0.04347826 0.04347826 0.04347826 0.04347826
  0.04347826 0.04347826 0.04347826 0.02173913 0.02173913 0.02173913
  0.02173913 0.02173913 0.02173913 0.02173913 0.02173913 0.02173913
  0.02173913 0.02173913 0.02173913 0.02173913 0.02173913 0.02173913]
 [0.02173913 0.02173913 0.04347826 0.02173913 0.02173913 0.02173913
  0.02173913 0.02173913 0.02173913 0.02173913 0.02173913 0.02173913
  0.02173913 0.02173913 0.02173913 0.04347826 0.04347826 0.04347826
  0.04347826 0.04347826 0.04347826 0.04347826 0.04347826 0.04347826
  0.04347826 0.04347826 0.04347826 0.04347826 0.04347826 0.04347826]]
[np.float64(1.1652579733553663e-07), np.float64(2.4276207778236796e-09)]
[np.int64(0)]


1. **Dataset & Labeling:**  
   Contains example emails labeled “spam” or “not spam”. In y, spam is encoded as 0 and not spam as 1, so the printed prediction 0 means spam.

2. **Preprocessing:**  
   Texts are lowercased, split on whitespace, and filtered against the stopword list (here only “ke”). Punctuation is not stripped, so “kamu!” and “kamu” would be different tokens.

3. **Bag of Words:**  
   Each text is represented as a numeric vector of word frequencies over the vocabulary.

4. **Naive Bayes Model:**  
   - **Prior:** The probability of each class (spam and not spam).
   - **Likelihood:** The probability of each word appearing in a given class, computed with **Laplace smoothing**.
   - **Posterior:** The product of prior and likelihood, used to pick the most likely class. The values are unnormalized, so they do not sum to 1.

5. **Prediction:**  
   The model computes the score of each class and returns the class with the highest one.
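
The two printed posteriors can be reproduced by hand and normalized into probabilities. The sketch below uses exact fractions; the counts are read off the dataset above (vocabulary of 30 tokens, 16 tokens per class), and the variable names are chosen only for this illustration.

```python
import math
from fractions import Fraction

V = 30               # vocabulary size after preprocessing
total_spam = 16      # tokens in the three spam texts
total_not_spam = 16  # tokens in the three not-spam texts ("ke" removed)
alpha = 1            # Laplace smoothing constant
prior = Fraction(1, 2)

def likelihood(count, total):
    # Smoothed probability of a word that occurs `count` times in a class
    return Fraction(count + alpha, total + alpha * V)

# Test tokens: hadiah, besar, gratis, untuk, kamu!
spam_counts = [2, 1, 1, 1, 1]      # "hadiah" occurs twice in the spam texts
not_spam_counts = [0, 0, 0, 0, 0]  # none of the tokens occur in the not-spam texts

joint_spam = prior * math.prod(likelihood(c, total_spam) for c in spam_counts)
joint_not_spam = prior * math.prod(likelihood(c, total_not_spam) for c in not_spam_counts)

# Dividing by the sum turns the two scores into probabilities that add up to 1
p_spam = joint_spam / (joint_spam + joint_not_spam)
print(float(joint_spam), float(joint_not_spam), p_spam)
```

The normalized value is 48/49, roughly 0.98, which is easier to interpret than the raw scores of order 1e-07 and 1e-09.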

SVM Implementation in Python

In [12]:
# 1. Data Preparation
import numpy as np

texts_train = [
    "Promo besar hari ini diskon 50%!",          # spam
    "Gratis hadiah untuk pelanggan setia!",      # spam
    "Segera klaim hadiah spesial kamu!",         # spam
    "Halo, bagaimana kabarmu hari ini?",         # not spam
    "Jangan lupa meeting besok pukul 10.00",     # not spam
    "Dokumen penting sudah dikirim ke email"     # not spam
]

labels_train = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]
texts_test = [
    "Kamu mendapatkan hadiah besar!",
    "Besok ada ujian penting!"
]

In [13]:
# 2. Text Preprocessing
stopwords = ["ke", "di", "dari", "untuk", "pada"]

def preprocess(text):
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]

processed_train = [preprocess(text) for text in texts_train]
processed_test = [preprocess(text) for text in texts_test]

In [14]:
# 3. Vocabulary Building
vocab_train = {}
index = 0
for text in processed_train:
    for token in text:
        if token not in vocab_train:
            vocab_train[token] = index
            index += 1
print(f"Training vocabulary: {vocab_train}")

Training vocabulary: {'promo': 0, 'besar': 1, 'hari': 2, 'ini': 3, 'diskon': 4, '50%!': 5, 'gratis': 6, 'hadiah': 7, 'pelanggan': 8, 'setia!': 9, 'segera': 10, 'klaim': 11, 'spesial': 12, 'kamu!': 13, 'halo,': 14, 'bagaimana': 15, 'kabarmu': 16, 'ini?': 17, 'jangan': 18, 'lupa': 19, 'meeting': 20, 'besok': 21, 'pukul': 22, '10.00': 23, 'dokumen': 24, 'penting': 25, 'sudah': 26, 'dikirim': 27, 'email': 28}


In [15]:
# 4. Feature Extraction (Bag of Words)
def text_to_vector(tokens, vocab):
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1
    return vector

X_train = np.array([text_to_vector(tokens, vocab_train) for tokens in processed_train])
X_test = np.array([text_to_vector(tokens, vocab_train) for tokens in processed_test])
y = np.array([1 if label == "not spam" else -1 for label in labels_train])

print("\nX_train (BoW of the training data):")
print(X_train)
print("\nX_test (BoW of the test data):")
print(X_test)


X_train (BoW of the training data):
[[1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]]

X_test (BoW of the test data):
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]


In [16]:
# 5. SVM Training
w = np.zeros(X_train.shape[1])
b = 0
eta = 1  # Learning rate
epochs = 5

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}")
    for i in range(len(X_train)):
        if y[i] * (np.dot(w, X_train[i]) + b) < 1:
            w = w + eta * y[i] * X_train[i]
            b = b + eta * y[i]
            print(f" ➜ Update: w = {w}, b = {b}")
        else:
            print(f" ➜ No update for sample {i+1}")

print("\nSVM model training finished!")
print(f"Final weights: {w}")
print(f"Final bias: {b}")


Epoch 1
 ➜ Update: w = [-1. -1. -1. -1. -1. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.], b = -1
 ➜ No update for sample 2
 ➜ No update for sample 3
 ➜ Update: w = [-1. -1.  0. -1. -1. -1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.], b = 0
 ➜ Update: w = [-1. -1.  0. -1. -1. -1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.], b = 1
 ➜ No update for sample 6

Epoch 2
 ➜ No update for sample 1
 ➜ Update: w = [-1. -1.  0. -1. -1. -1. -1. -1. -1. -1.  0.  0.  0.  0.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.], b = 0
 ➜ No update for sample 3
 ➜ No update for sample 4
 ➜ No update for sample 5
 ➜ Update: w = [-1. -1.  0. -1. -1. -1. -1. -1. -1. -1.  0.  0.  0.  0.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.], b = 1

Epoch 3
 

In [17]:
# 6. Prediction
def predict(X_test):
    return np.sign(np.dot(X_test, w) + b)

predictions = predict(X_test)
print("\nPredictions for the test samples:")
print(predictions)
for i in range(len(predictions)):
    label = "Not Spam" if predictions[i] == 1 else "Spam"
    print(f" - Sample {i+1}: {label}")


Predictions for the test samples:
[-1.  1.]
 - Sample 1: Spam
 - Sample 2: Not Spam


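The code above is a **simplified implementation of the Support Vector Machine (SVM) algorithm** for classifying text as *spam* or *not spam* using a **Bag of Words (BoW)** representation.

### Steps:
1. **Data Preparation:**  
   Split the data into *training* and *testing* sets. Labels are encoded as -1 for spam and +1 for not spam.

2. **Text Preprocessing:**  
   Tokenize the text and remove *stopwords*. Punctuation is kept, so tokens such as “kamu!” and “50%!” end up in the vocabulary.

3. **Vocabulary Building:**  
   Collect the unique words of the training data and use them as features.

4. **Feature Extraction (BoW):**  
   Turn each text into a numeric vector of word frequencies. Test words that are not in the training vocabulary are ignored.

5. **SVM Training:**  
   Update the **weights (w)** and **bias (b)** by hand: whenever a sample's margin y(w·x + b) is below 1, add eta·y·x to w and eta·y to b. This is a sub-gradient step on the hinge loss without a regularization term.

6. **Prediction:**  
   Compute sign(w·x + b) for each test vector: +1 means *Not Spam*, -1 means *Spam*.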
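
The printed vocabulary keeps punctuation attached to words ('kamu!', 'setia!', '50%!'), so the test token "besar!" does not match the training token "besar". One possible remedy, sketched here with a regular expression, is to keep only runs of word characters; this tokenizer is an illustrative alternative and not the one used in the cells above.

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
    return re.findall(r"\w+", text.lower())

print(tokenize("Kamu mendapatkan hadiah besar!"))
# ['kamu', 'mendapatkan', 'hadiah', 'besar']
print(tokenize("Jangan lupa meeting besok pukul 10.00"))
# ['jangan', 'lupa', 'meeting', 'besok', 'pukul', '10', '00']
```

The second call shows the trade-off: the time "10.00" is split into two tokens, so a pattern that keeps digits and dots together may be preferable for texts with many numbers.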

### Conclusion:
This shows the basic idea of a **linear SVM** for text classification with the training loop written by hand: NumPy supplies the array arithmetic, but no machine-learning library is used.
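
A soft-margin linear SVM usually adds an L2 penalty on the weights to the hinge loss. The sketch below extends the update rule from the training loop with that penalty in plain Python; the function names, the regularization strength lam, and the toy data are assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """SGD on lam/2 * ||w||^2 + hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
            # Gradient of the L2 penalty: shrink the weights a little on every step
            w = [wj * (1 - eta * lam) for wj in w]
            if margin < 1:
                # Hinge-loss sub-gradient: the same update as in the training loop above
                w = [wj + eta * y_i * xj for wj, xj in zip(w, x_i)]
                b += eta * y_i
    return w, b

def predict(x, w, b):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (made up): the class is +1 when the second feature dominates
X_toy = [[2, 0], [3, 1], [0, 2], [1, 3]]
y_toy = [-1, -1, 1, 1]
w, b = train_linear_svm(X_toy, y_toy)
print([predict(x, w, b) for x in X_toy])
```

With lam set to 0 the shrink step disappears and the function reduces to the unregularized loop used in the notebook.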


Random Forest Implementation in Python

In [18]:
import random

# Dataset
data = [
    {
        "hadiah": 1, "besar": 1, "gratis": 1, "untukmu": 1, "hari": 1, "ini": 1,
        "besok": 0, "kita": 0, "ada": 0, "pertemuan": 0, "kampus": 0, 
        "selamat": 0, "anda": 0, "memenangkan": 0, "jangan": 0, "lupa": 0,
        "tugas": 0, "harus": 0, "dikumpulkan": 0, "terpilih": 0, "spesial": 0,
        "label": 1
    },
    {
        "hadiah": 0, "besar": 0, "gratis": 0, "untukmu": 0, "hari": 0, "ini": 0,
        "besok": 1, "kita": 1, "ada": 1, "pertemuan": 1, "kampus": 1, 
        "selamat": 0, "anda": 0, "memenangkan": 0, "jangan": 0, "lupa": 0,
        "tugas": 0, "harus": 0, "dikumpulkan": 0, "terpilih": 0, "spesial": 0,
        "label": -1
    },
    {
        "hadiah": 1, "besar": 0, "gratis": 1, "untukmu": 0, "hari": 0, "ini": 0,
        "besok": 0, "kita": 0, "ada": 0, "pertemuan": 0, "kampus": 0,
        "selamat": 0, "anda": 1, "memenangkan": 1, "jangan": 0, "lupa": 0,
        "tugas": 0, "harus": 0, "dikumpulkan": 0, "terpilih": 0, "spesial": 0,
        "label": 1
    }
]

def gini_impurity(data):
    total = len(data)
    if total == 0:
        return 0
    spam_count = sum(1 for row in data if row["label"] == 1)
    not_spam_count = total - spam_count
    p_spam = spam_count / total
    p_not_spam = not_spam_count / total
    return 1 - (p_spam ** 2 + p_not_spam ** 2)

def split_data(data, feature):
    left = [row for row in data if row[feature] == 0]
    right = [row for row in data if row[feature] == 1]
    return left, right

def best_split(data):
    best_feature = None
    best_gini = 1
    best_left, best_right = None, None
    
    for feature in data[0].keys():
        if feature == "label":
            continue
        left, right = split_data(data, feature)
        gini_left = gini_impurity(left)
        gini_right = gini_impurity(right)
        gini_split = (len(left) / len(data)) * gini_left + (len(right) / len(data)) * gini_right
        
        if gini_split < best_gini:
            best_gini = gini_split
            best_feature = feature
            best_left, best_right = left, right
            
    return best_feature, best_left, best_right

class DecisionTree:
    def __init__(self, depth=2):
        self.depth = depth
        self.tree = None
        
    def build_tree(self, data, depth=0):
        if len(set(row["label"] for row in data)) == 1 or depth >= self.depth:
            return {"prediction": max(set(row["label"] for row in data), 
                                key=[row["label"] for row in data].count)}
        
        feature, left, right = best_split(data)
        if not left or not right:
            return {"prediction": max(set(row["label"] for row in data), 
                                key=[row["label"] for row in data].count)}
            
        return {
            "feature": feature,
            "left": self.build_tree(left, depth + 1),
            "right": self.build_tree(right, depth + 1)
        }
        
    def fit(self, data):
        self.tree = self.build_tree(data)
        
    def predict_one(self, row, node):
        if "prediction" in node:
            return node["prediction"]
        if row[node["feature"]] == 0:
            return self.predict_one(row, node["left"])
        else:
            return self.predict_one(row, node["right"])
            
    def predict(self, data):
        return [self.predict_one(row, self.tree) for row in data]

class RandomForest:
    def __init__(self, n_trees=3, sample_size=0.8, depth=2):
        self.n_trees = n_trees
        self.sample_size = sample_size
        self.depth = depth
        self.trees = []
        
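    def bootstrap_sample(self, data):
        # random.sample draws without replacement, so each tree gets a random subset
        # of the data rather than a classical bootstrap sample (drawn with replacement)
        return random.sample(data, int(len(data) * self.sample_size))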
        
    def fit(self, data):
        for _ in range(self.n_trees):
            sample = self.bootstrap_sample(data)
            tree = DecisionTree(depth=self.depth)
            tree.fit(sample)
            self.trees.append(tree)
            
    def predict_one(self, row):
        predictions = [tree.predict_one(row, tree.tree) for tree in self.trees]
        return max(set(predictions), key=predictions.count)
        
    def predict(self, data):
        return [self.predict_one(row) for row in data]

In [19]:
# Training and Prediction
rf = RandomForest(n_trees=3, depth=2)
rf.fit(data)

In [20]:
# Test data
test_sms = {
    "hadiah": 1, "besar": 0, "gratis": 1, "untukmu": 0, "hari": 0, "ini": 0,
    "besok": 0, "kita": 0, "ada": 0, "pertemuan": 0, "kampus": 0, 
    "selamat": 0, "anda": 1, "memenangkan": 1, "jangan": 0, "lupa": 0,
    "tugas": 0, "harus": 0, "dikumpulkan": 0, "terpilih": 0, "spesial": 0
}

In [21]:
# Predict using Random Forest
prediction = rf.predict_one(test_sms)
print("Prediction:", "Spam" if prediction == 1 else "Not Spam")

Prediction: Spam


### Steps:
1. **Dataset:**  
   Each message is represented in binary form (0/1) according to the words it contains, with label 1 for spam and -1 for not spam.

2. **Gini Impurity:**  
   Measures how mixed the labels are at a tree node.  
   The smaller the Gini value, the cleaner the split.

3. **Decision Tree:**  
   - Each tree is built from a subset of the data.  
   - The tree picks the feature whose split has the **smallest weighted Gini value**.  
   - The maximum tree depth is set with the depth parameter.

4. **Random Forest:**  
   - Builds several *decision trees*.  
   - Each tree is trained on a random sample of the original data. The code draws 80% of the rows without replacement, whereas a classical *bootstrap sample* is drawn with replacement.  
   - The final prediction is the **majority vote** over all trees.

5. **Prediction:**  
   The input is evaluated on its features. If most trees predict 1 the result is *Spam*; if most predict -1 it is *Not Spam*.

### Conclusion:
The **Random Forest** algorithm combines the votes of many *decision trees*, which typically improves accuracy and reduces *overfitting* compared with a single tree.
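
A classical bootstrap sample draws as many rows as the dataset has, with replacement, so some rows repeat and others are left out. A minimal sketch using random.choices follows; the helper names, the fixed seed, and the toy rows are assumptions for illustration.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (classical bootstrap)."""
    return rng.choices(rows, k=len(rows))

def out_of_bag(rows, sample):
    """Rows that were never drawn; they can serve as a small validation set."""
    drawn = {id(r) for r in sample}
    return [r for r in rows if id(r) not in drawn]

rng = random.Random(0)                 # fixed seed so repeated runs give the same draw
rows = [{"id": i} for i in range(10)]  # toy rows standing in for the message dictionaries
sample = bootstrap_sample(rows, rng)
oob = out_of_bag(rows, sample)
print(len(sample), len(oob))
```

Using a dedicated random.Random instance keeps the draw reproducible without changing the global random state.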


Logistic Regression Implementation in Python

In [22]:
# Dataset (features and labels)
X = [
    [1, 0],  # Hadiah besar menanti kamu! → Spam
    [0, 1],  # Tugas harus dikumpulkan! → Not Spam
    [1, 0],  # Gratis hadiah untukmu! → Spam
    [0, 1],  # Jangan lupa tugas kuliah! → Not Spam
    [1, 0],  # Selamat! Kamu mendapat hadiah! → Spam
]
y = [1, 0, 1, 0, 1]  # 1 = Spam, 0 = Not Spam

In [23]:
import math

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

In [24]:
# Prediction function: returns the spam probability of each row
def predict(X, w, b):
    preds = []
    for x in X:
        z = sum(w[i] * x[i] for i in range(len(x))) + b
        preds.append(sigmoid(z))
    return preds

In [25]:
# Weight updates with stochastic gradient descent (one update per sample)
def train_logistic_regression(X, y, lr=0.1, epochs=100):
    w = [0.0 for _ in range(len(X[0]))]  # initialize the weights
    b = 0.0  # bias
    
    for epoch in range(epochs):
        for i in range(len(X)):
            z = sum(w[j] * X[i][j] for j in range(len(X[0]))) + b
            pred = sigmoid(z)
            error = y[i] - pred
            
            # update the weights and bias
            for j in range(len(X[0])):
                w[j] += lr * error * X[i][j]
            b += lr * error
            
    return w, b

In [26]:
# Training
weights, bias = train_logistic_regression(X, y, lr=0.1, epochs=100)

In [27]:
# Predict a new example
test_input = [0, 1]  # Text: "Tugas penting menanti kamu"
z = sum(weights[i] * test_input[i] for i in range(len(test_input))) + bias
prob = sigmoid(z)
print("Spam probability:", round(prob, 4))
print("Prediction:", "Spam" if prob >= 0.5 else "Not Spam")

Spam probability: 0.0521
Prediction: Not Spam


### Steps:
1. **Dataset:**  
   The input data (X) has two binary features:
   - Feature 1 = the text contains a word that suggests spam (e.g. “hadiah”, “gratis”)  
   - Feature 2 = the text contains a word that suggests a normal message (e.g. “tugas”, “kuliah”)  
   The label y gives the class: 1 = Spam, 0 = Not Spam.

2. **Sigmoid Function:**  
   sigmoid(z) maps the linear output z to a probability between 0 and 1.  

3. **Training (Gradient Descent):**  
   The weights (w) and bias (b) are updated after every sample, in proportion to the *error* between the true label and the prediction (stochastic gradient descent).

4. **Prediction:**  
   After training, a new input is evaluated to obtain its spam probability.  
   If the probability is ≥ 0.5 → *Spam*, if < 0.5 → *Not Spam*.
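
The update in step 3 follows from the log-likelihood of a single sample. A short derivation, writing $\eta$ for the learning rate (lr in the code) and $p$ for the predicted probability (pred in the code):

$$
z = \sum_j w_j x_j + b, \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\ell = y \log p + (1 - y) \log (1 - p), \qquad \frac{\partial \ell}{\partial z} = y - p
$$

$$
w_j \leftarrow w_j + \eta \, (y - p) \, x_j, \qquad b \leftarrow b + \eta \, (y - p)
$$

The factor $y - p$ is the error variable in the training loop, so each update moves the parameters in the direction that increases the log-likelihood of the current sample.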

### Conclusion:
**Logistic Regression** learns the relationship between features and labels with the sigmoid function and *gradient descent*, producing a binary classifier suited to tasks such as *spam* detection.
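
The sigmoid defined above calls math.exp(-z), which raises OverflowError for very negative z (below roughly -709). That does not happen with the small weights in this example, but a numerically safer variant is easy to write. The name stable_sigmoid is chosen here for illustration.

```python
import math

def stable_sigmoid(z):
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) is below 1
    e = math.exp(z)
    return e / (1.0 + e)

print(stable_sigmoid(0), stable_sigmoid(-1000), stable_sigmoid(1000))
```

Both branches compute the same function; only the form that keeps the exponent non-positive is evaluated.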


BERT with Logistic Regression Implementation in Python

In [28]:
import torch
from transformers import AutoTokenizer, AutoModel
import math

# 1. Load IndoBERT (without a classifier head)
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
bert = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")



In [29]:
# 2. Dataset: texts and labels
texts = [
    "Hadiah besar menanti kamu!",     # Spam
    "Tugas harus dikumpulkan!",       # Not Spam
    "Gratis hadiah untukmu!",         # Spam
    "Jangan lupa tugas kuliah!",      # Not Spam
    "Selamat! Kamu mendapat hadiah!"  # Spam
]
labels = [1, 0, 1, 0, 1]

In [30]:
# 3. Function: take the [CLS] embedding from BERT
def get_cls_embedding(text):
    # max_length=512 assumes the usual BERT-base position limit; passing it explicitly
    # avoids the tokenizer warning that no maximum length is provided for truncation
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()  # shape: (768,)
    return cls_embedding

In [31]:
# 4. Collect all embeddings
X = [get_cls_embedding(text) for text in texts]
y = labels


In [32]:
# 5. Sigmoid function
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

In [33]:
# 6. Manual logistic regression
def train_logistic_regression(X, y, lr=0.01, epochs=10):
    w = [0.0] * len(X[0])  # 768 dimensions
    b = 0.0
    
    for epoch in range(epochs):
        for i in range(len(X)):
            z = sum(w[j] * X[i][j] for j in range(len(w))) + b
            pred = sigmoid(z)
            error = y[i] - pred
            
            for j in range(len(w)):
                w[j] += lr * error * X[i][j]
            b += lr * error
            
    return w, b

In [34]:
# 7. Train the model
weights, bias = train_logistic_regression(X, y)

In [35]:
# 8. Predict a new input
def predict(text):
    x = get_cls_embedding(text)
    z = sum(weights[i] * x[i] for i in range(len(x))) + bias
    prob = sigmoid(z)
    return prob, "Spam" if prob >= 0.5 else "Not Spam"

In [36]:
# 9. Test the prediction
test_text = "Tugas penting menanti kamu!"
prob, label = predict(test_text)
print(f"Spam probability: {prob:.4f} → Prediction: {label}")

Spam probability: 0.4866 → Prediction: Not Spam


The code above combines **BERT** with **Logistic Regression** to classify text as *spam* or *not spam*.

### Main Steps:

1. **Load the IndoBERT Model**
   - Uses indobenchmark/indobert-base-p1, a BERT model for the Indonesian language.
   - AutoTokenizer and AutoModel handle tokenization and embedding extraction.

2. **Take the (CLS) Embedding**
   - For each text, the (CLS) representation vector is taken from BERT (768 dimensions).
   - This vector is the input feature for the logistic regression model.

3. **Logistic Regression (Manual)**
   - The weights (w) and bias (b) are updated with **gradient descent** based on the difference between the actual label and the prediction.

4. **Prediction**
   - The model computes the spam probability of a new test text from its BERT embedding.
   - If the probability is ≥ 0.5 → *Spam*, if < 0.5 → *Not Spam*. The test output of 0.4866 lies close to this threshold, so with only five training texts the decision is a narrow one.
