# 📘 Chapter 17: Transformers and Pretrained Language Models

This chapter covers the recent breakthroughs in modern NLP: the **Transformer architecture** and **pretrained** models such as **BERT**, which have revolutionized how we handle NLP tasks.

---

## 🎯 Learning Objectives

- Understand the basic structure of the Transformer architecture
- Learn the key components: Attention, Positional Encoding, Layer Normalization
- Use a pretrained model (BERT) for text classification
- Fine-tune the model on the IMDb dataset

---

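## 📦 Dataset: IMDb Movie Reviews

- The dataset contains 50,000 movie reviews (labels: positive/negative)
- The notebook cell below uses a small subset (1,000 training and 500 test reviews) for fast fine-tuning
- The cell loads the Keras copy with `keras.datasets.imdb.load_data(num_words=10000)` and decodes it back to text; the same reviews are also available through `datasets.load_dataset("imdb")` from Hugging Face Datasets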
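
A small subset is only useful if both labels stay represented in it. Some copies of the dataset store reviews grouped by label, in which case taking the first few thousand rows would yield a single class. The sketch below shows one way to draw a balanced, shuffled subset from any label array. The helper name `balanced_subset_indices` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def balanced_subset_indices(labels, n_per_class, seed=0):
    """Return shuffled indices holding n_per_class examples of every label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        if len(cls_idx) < n_per_class:
            raise ValueError(f"class {cls} has only {len(cls_idx)} examples")
        # Sample without replacement so no review appears twice
        picked.append(rng.choice(cls_idx, size=n_per_class, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)  # mix the classes so batches are not single-label
    return idx

# Toy labels stored grouped by class (hypothetical data for illustration)
toy_labels = np.array([0] * 50 + [1] * 50)
idx = balanced_subset_indices(toy_labels, n_per_class=10, seed=42)

print(np.bincount(toy_labels[:20]))   # head slice sees one class only: [20]
print(np.bincount(toy_labels[idx]))   # balanced subset: [10 10]
```

The returned indices can be used to select both the texts and the labels, so the two stay aligned.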

---

## 🧠 Transformer Architecture

The Transformer is an encoder-decoder model built on **self-attention**. It differs from RNNs/CNNs in that it:
- Does not process tokens one at a time, so all positions can be computed in parallel
- Uses **positional encoding** to capture token order
- Is built from two key units: **Multi-Head Self-Attention** and a **Feedforward Layer**
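
Because all positions are processed in parallel, token order has to be injected explicitly. The sketch below shows the fixed sinusoidal scheme from the original Transformer, written in NumPy. It assumes an even `d_model`, and the function name is illustrative. BERT itself learns its position embeddings instead of using this formula.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # shape (1, d_model / 2)
    # Each pair of dimensions uses its own wavelength
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
print(pe[0])     # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is added to the token embeddings, so two identical words at different positions get different input vectors.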

Models such as BERT use only the **encoder part** of the Transformer.
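
In an encoder, self-attention is unmasked: every token attends to every other token, to its left and to its right, which is where the "bidirectional" in BERT comes from. The following is a minimal single-head sketch in NumPy. The random projection matrices and the sizes are placeholders chosen for illustration, and a real encoder layer adds multiple heads, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention without a mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)              # (4, 8)
print(weights.sum(axis=-1))   # every row sums to 1
```

A decoder would add a causal mask to `scores` so that a token cannot attend to later positions; the encoder used by BERT has no such mask.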

---

## 🧱 Model: BERT (Bidirectional Encoder Representations from Transformers)

### ⚙️ Main Components:
- **Tokenizer**: Converts text into token IDs with `AutoTokenizer`
- **Pretrained BERT**: `bert-base-uncased`, loaded with `TFAutoModelForSequenceClassification`
- **Input**: `input_ids`, `attention_mask`
- **Loss**: `SparseCategoricalCrossentropy(from_logits=True)`
- **Output**: Logits → Softmax → Predicted class (0 = negative, 1 = positive)
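
The last two items can be made concrete with a few lines of NumPy. This sketch mirrors what a sparse categorical cross-entropy computed from logits does, using made-up logits rather than real model output. It also shows a useful reference point: a two-class model that outputs equal logits for both classes has a loss of ln 2 ≈ 0.693.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, logits):
    """Mean negative log-probability of the true class, computed from logits."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# Made-up logits for two reviews (columns: negative, positive)
logits = np.array([[2.0, -1.0],
                   [0.3,  1.5]])
labels = np.array([0, 1])

probs = softmax(logits)
preds = probs.argmax(axis=-1)
print(preds)  # [0 1]

# Equal logits mean a 50/50 guess, which gives a loss of ln 2
uninformed_loss = sparse_categorical_crossentropy(labels, np.zeros((2, 2)))
print(round(uninformed_loss, 4))  # 0.6931
```

Softmax does not change the ordering of the logits, so taking `argmax` directly on the logits gives the same predicted class.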

### 🔧 Fine-Tuning
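The pretrained model is trained briefly (fine-tuned) on the IMDb dataset so that it adapts to the classification task. With a compiled Keras model this is a single call, where `train_ds` is the batched `tf.data.Dataset` built in the notebook cell below:

```python
model.fit(train_ds, epochs=2)
```

The cell itself skips `compile`/`fit` and runs the equivalent loop by hand with `tf.GradientTape`:

```python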
# 📦 Install huggingface transformers
!pip install -q transformers

# 📌 Import libraries
import tensorflow as tf
from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import numpy as np

# ✅ Step 1: Load the IMDb dataset from Keras (small version)
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

# Take a subset to save memory
X_train, y_train = X_train[:1000], y_train[:1000]
X_test, y_test = X_test[:500], y_test[:500]

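# ✅ Step 2: Decode integer sequences back to text
# Keras reserves indices 0-2 (padding, start, unknown), so word indices are shifted by 3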
word_index = keras.datasets.imdb.get_word_index()
index_word = {index + 3: word for word, index in word_index.items()}
index_word[0], index_word[1], index_word[2] = "<PAD>", "<START>", "<UNK>"

def decode_review(encoded):
    return ' '.join(index_word.get(i, '?') for i in encoded)

train_texts = [decode_review(x) for x in X_train]
test_texts = [decode_review(x) for x in X_test]

# ✅ Step 3: Load the tokenizer of the BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ✅ Step 4: Dataset generator (saves RAM by tokenizing one review at a time)
def gen_dataset(texts, labels):
    for text, label in zip(texts, labels):
        enc = tokenizer(text, truncation=True, padding='max_length', max_length=200, return_tensors="tf")
        yield {
            "input_ids": enc["input_ids"][0],
            "attention_mask": enc["attention_mask"][0]
        }, label

# ✅ Step 5: Build a tf.data.Dataset from the generator
def make_dataset(texts, labels, batch_size=8, shuffle=False):
    ds = tf.data.Dataset.from_generator(
        lambda: gen_dataset(texts, labels),
        output_signature=(
            {
                "input_ids": tf.TensorSpec(shape=(200,), dtype=tf.int32),
                "attention_mask": tf.TensorSpec(shape=(200,), dtype=tf.int32)
            },
            tf.TensorSpec(shape=(), dtype=tf.int64)
        )
    )
    if shuffle:
        ds = ds.shuffle(1000)
    return ds.batch(batch_size)

train_ds = make_dataset(train_texts, y_train, batch_size=8, shuffle=True)
test_ds = make_dataset(test_texts, y_test, batch_size=8)

# ✅ Step 6: Load pretrained BERT with a classification head
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# ✅ Step 7: Train manually with GradientTape (without compile)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

# Training loop
for epoch in range(2):
    print(f"\n🌀 Epoch {epoch+1}")
    for step, (x_batch, y_batch) in enumerate(train_ds):
        with tf.GradientTape() as tape:
            logits = model(x_batch, training=True).logits
            loss = loss_fn(y_batch, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if step % 20 == 0:
            print(f"Step {step}, Loss: {loss.numpy():.4f}")

# ✅ Step 8: Manual evaluation
from sklearn.metrics import accuracy_score

y_preds = []
y_trues = []

for x_batch, y_batch in test_ds:
    logits = model(x_batch, training=False).logits
    preds = tf.argmax(logits, axis=1)
    y_preds.extend(preds.numpy())
    y_trues.extend(y_batch.numpy())

acc = accuracy_score(y_trues, y_preds)
print(f"\n✅ Final Test Accuracy: {acc:.4f}")
```

Output:

```text
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

🌀 Epoch 1
Step 0, Loss: 0.6534
Step 20, Loss: 0.7205
Step 40, Loss: 0.6647
Step 60, Loss: 0.6693
Step 80, Loss: 0.6738
Step 100, Loss: 0.7203
Step 120, Loss: 0.7244

🌀 Epoch 2
Step 0, Loss: 0.7517
Step 20, Loss: 0.7239
Step 40, Loss: 0.7107
Step 60, Loss: 0.7327
Step 80, Loss: 0.6631
Step 100, Loss: 0.6294
Step 120, Loss: 0.7048

✅ Final Test Accuracy: 0.4760
```

An accuracy of 0.4760 on a two-class test set is roughly chance level, and the logged loss stays near ln 2 ≈ 0.693 through both epochs, so in this run the model did not learn to separate positive from negative reviews.