# CHAPTER 13: Transformers

## Summary

This chapter covers the **Transformer architecture**, the state-of-the-art model that revolutionized NLP by relying purely on attention, with no recurrence. Unlike LSTMs and GRUs, which process tokens sequentially, a Transformer sees the entire sequence at once, which improves both parallelization and language understanding. The chapter covers implementing a full Transformer from scratch with all its components (token embeddings, positional embeddings, multi-head attention, layer normalization, residual connections), spam classification with a pretrained BERT from TensorFlow Hub, and question answering with the Hugging Face transformers library. BERT is an encoder-only Transformer pretrained on a large corpus with masked language modeling and next-sentence prediction, which enables transfer learning for a wide range of downstream NLP tasks.

---

## Part 1: Transformer Architecture Detail

### Advantages of Transformers over RNNs

**RNN Limitations**: LSTMs and GRUs are sequential models: they process one timestep at a time while maintaining a limited state vector (memory). As a result:
- **Sequential bottleneck**: Timesteps cannot be processed in parallel because timestep t depends on t-1
- **Long-term dependencies**: It is hard to remember information from the start of the sequence in long texts
- **Context limitation**: The attention mechanism from Chapter 12 helps, but it is still limited by the sequential nature of the model

**Transformer Solution**: A Transformer processes the entire sequence simultaneously using self-attention. Key advantages:
- **Parallelization**: All timesteps are computed at once, drastically reducing training time
- **Full context**: Every token can attend to every other token, with no limit imposed by sequential processing
- **Scalability**: The parallel-friendly architecture makes it easier to scale to very large models

### Self-Attention: Query, Key, Value

Self-attention is the core mechanism that lets a token "look at" other tokens to enrich its own representation. For each token, the model generates three vectors through learned weight matrices:

**Query (Q)**: Represents "what am I looking for?" For the current token, the query vector determines what information is needed from the other tokens.

**Key (K)**: Represents "what do I offer?" For each candidate token, the key vector indicates what information it can provide.

**Value (V)**: Represents the actual content: the hidden representation of the token that is aggregated according to the attention weights.

**Computation Process**:
1. **Score calculation**: For each query, compute its similarity with all keys: \( score = Q K^T / \sqrt{d_k} \) (scaled dot product; dividing by \(\sqrt{d_k}\) keeps the scores from growing with the key dimension, so the softmax does not saturate and gradients stay stable)
2. **Attention weights**: Normalize the scores with a softmax over the keys: \( \alpha = softmax(score) \)
3. **Weighted sum**: Combine the values, weighted by attention: \( output = \alpha V \)

Full formula: \( Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V \)
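
The three steps can be traced in a few lines of NumPy. This is a minimal single-head sketch without batching or masking; the helper names `softmax` and `scaled_dot_product_attention` and the toy shapes are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # step 1: scaled similarity scores
    weights = softmax(scores, axis=-1)   # step 2: each row sums to 1
    output = weights @ V                 # step 3: weighted sum of the values
    return output, weights

# Toy inputs (assumed sizes): 4 tokens, d_k = 8, d_v = 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 16) (4, 4)
```

Each row of `attn` is the attention distribution of one query token over all key tokens.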

**Multi-Head Attention**: Instead of a single attention operation, the Transformer uses multiple "heads" (typically 8 or 12). Each head learns different aspects of the relationships between tokens. The outputs of all heads are concatenated and projected back to the original dimension.

### Token and Positional Embeddings

**Token Embeddings**: As with standard word embeddings, each token (word or subword) is mapped to a dense vector. The embedding matrix has size \( V \times d_{model} \), where \( V \) is the vocabulary size and \( d_{model} \) is the embedding dimension (typically 512 or 768). The embeddings are learned jointly with the task.

**Positional Embeddings**: These are critical because the Transformer is not inherently sequential. Without positional information, self-attention treats the input as an unordered set of tokens, so "John loves Mary" and "Mary loves John" would be indistinguishable. The original paper uses sinusoidal functions:
- Even dimensions: \( PE(pos, 2i) = sin(pos / 10000^{2i/d_{model}}) \)
- Odd dimensions: \( PE(pos, 2i+1) = cos(pos / 10000^{2i/d_{model}}) \)

The frequency decreases as the dimension index \( i \) within the embedding vector increases, so each position gets a unique signature across the dimensions. Alternative: learned positional embeddings (the model learns the position vectors during training).

**Final embedding**: Token embedding + positional embedding (element-wise addition) = input to the Transformer.

### Residual Connections and Layer Normalization

**Residual Connections**: Add the sublayer's input to its output: \( output = Sublayer(x) + x \). Benefits:
- **Gradient flow**: Gradients get a direct path backward, which counters vanishing gradients
- **Identity mapping**: The model can learn the identity function by setting the sublayer's weights to zero
- **Easier optimization**: Layers learn a residual (a difference) rather than the full transformation

**Layer Normalization**: Normalizes activations within each sample (not across the batch, as batch normalization does). For a hidden layer with units \( h_1, h_2, ..., h_n \):
- Compute mean: \( \mu = \frac{1}{n} \sum h_i \)
- Compute variance: \( \sigma^2 = \frac{1}{n} \sum (h_i - \mu)^2 \)
- Normalize: \( \hat{h}_i = (h_i - \mu) / \sqrt{\sigma^2 + \epsilon} \)
- Scale and shift: \( output = \gamma \hat{h}_i + \beta \) (learnable parameters)

Layer normalization does not depend on the batch size, so it is more stable with variable batch sizes and in RNN-like architectures.
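
The four layer normalization steps map directly onto NumPy operations. The sketch below assumes a 2-D input of shape (samples, units) and uses an illustrative function name, `layer_norm`; it is not the Keras implementation.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-6):
    """Normalize each row (sample) of h over its own units."""
    mu = h.mean(axis=-1, keepdims=True)     # per-sample mean
    var = h.var(axis=-1, keepdims=True)     # per-sample variance (1/n form)
    h_hat = (h - mu) / np.sqrt(var + eps)   # normalize
    return gamma * h_hat + beta             # learnable scale and shift

# Two samples whose units differ only by a factor of 10
h = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
out = layer_norm(h, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(out, 3))
```

Both rows come out almost identical, because each sample is normalized with its own statistics and nothing is shared across the batch.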

### Encoder and Decoder Structure

**Encoder**: A stack of N layers (typically 6 or 12), each containing:
1. **Multi-head self-attention**: Tokens attend to all tokens in the input sequence
2. **Add & Norm**: Residual connection + layer normalization
3. **Feed-forward network**: Two dense layers with a ReLU in between: \( FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2 \)
4. **Add & Norm**: Residual connection + layer normalization

**Decoder**: A stack of N layers, each containing:
1. **Masked multi-head self-attention**: Each token attends only to itself and earlier positions (prevents looking ahead)
2. **Add & Norm**
3. **Encoder-decoder attention**: Queries from the decoder attend to keys and values from the encoder output
4. **Add & Norm**
5. **Feed-forward network**
6. **Add & Norm**

Masking ensures decoder cannot "cheat" by seeing future tokens during training.
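
A look-ahead mask can be sketched in NumPy as follows. The convention that 1 marks a hidden position, applied by adding a large negative number to the logits, matches the masking line in Program 2 of Part 5; the helper names `causal_mask` and `masked_softmax` are illustrative assumptions.

```python
import numpy as np

def causal_mask(seq_len):
    # 1 above the diagonal marks a future position that must be hidden.
    return np.triu(np.ones((seq_len, seq_len)), k=1)

def masked_softmax(scores, mask):
    scores = scores + mask * -1e9   # hidden positions get a huge negative logit
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# With all-zero scores, every visible position receives equal weight.
weights = masked_softmax(np.zeros((4, 4)), mask)
print(np.round(weights, 2))
```

Row i of `weights` spreads its attention over positions 0 to i only, so the first token attends to itself alone and the last token to the whole sequence.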

---

## Part 2: BERT (Bidirectional Encoder Representations from Transformers)

### BERT Overview

BERT is an **encoder-only** Transformer (it has no decoder). Key innovation: **bidirectional pretraining**. The model sees the full context to the left and right of a token simultaneously, unlike GPT, which is unidirectional (left-to-right).

**Architecture**:
- **BERT-Base**: 12 encoder layers, 768 hidden size, 12 attention heads, 110M parameters
- **BERT-Large**: 24 encoder layers, 1024 hidden size, 16 attention heads, 340M parameters

**Three Embedding Spaces**:
1. **Token embeddings**: Standard word/subword vectors
2. **Positional embeddings**: Learned positions (not sinusoidal)
3. **Segment embeddings**: Distinguish sequence A from sequence B (for tasks with two input sequences)

Final embedding = the sum of all three embeddings.

### Pretraining Tasks

BERT is pretrained on a massive corpus (Wikipedia + BookCorpus) using two self-supervised tasks:

**Masked Language Modeling (MLM)**: Randomly select 15% of the tokens; the model predicts the original token at each selected position. Each selected token is handled as follows:
- 80% probability: Replace with the [MASK] token
- 10% probability: Replace with a random token
- 10% probability: Keep the original token

This reduces the pretraining/fine-tuning discrepancy, because [MASK] never appears in real downstream data.
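
The 80/10/10 rule can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it picks each token independently with probability 15% instead of selecting a fixed 15% of positions, and the function name `mask_tokens`, the toy vocabulary, and the choice to skip special tokens are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and tokens not selected for prediction.
        if tok in special or rng.random() >= mask_prob:
            continue
        labels[i] = tok                    # the model must recover the original
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Positions with a non-`None` label are the only ones that contribute to the MLM loss, including those where the token was left unchanged.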

**Next Sentence Prediction (NSP)**: Given two sentences A and B, predict whether B is the actual next sentence after A. Training data:
- 50% positive examples: B follows A
- 50% negative examples: B is a random sentence

NSP helps the model understand sentence-level relationships.

### Special Tokens

**[CLS]**: Classification token, always the first token. The hidden representation at the [CLS] position is used for sequence-level classification tasks.

**[SEP]**: Separator token that marks the boundaries between sequences. Example: `[CLS] Question [SEP] Context [SEP]` for question answering.

**[PAD]**: Padding token (ID=0) used to bring sequences to the same length.

**[MASK]**: Masking token for the MLM task.

### Four Task Categories

BERT is designed to solve:

1. **Sequence Classification**: Single sequence → single label (sentiment analysis, spam detection). Use [CLS] representation + classification head.

2. **Token Classification**: Single sequence → label per token (NER, POS tagging). Use each token's representation + classification head.

3. **Question Answering**: Two sequences (question + context) → predict the start/end of the answer span in the context. Two classification heads predict the start and end positions.

4. **Multiple Choice**: Multiple sequences (question + N candidates) → predict the correct choice. Process each candidate separately and compare the [CLS] representations.

### Transfer Learning with BERT

Standard workflow:
1. **Download pretrained BERT**: The model has already been trained on a massive corpus and has a rich understanding of language
2. **Add a task-specific head**: Simple classification layer(s) on top
3. **Fine-tune end-to-end**: Train BERT + head jointly on labeled task data
4. **Benefit**: Requires far less data and training time because BERT already understands language

---

## Part 3: Spam Classification with BERT

### Dataset and Class Imbalance

SMS Spam Collection: 5,574 messages (4,827 ham, 747 spam). The classes are significantly imbalanced, so the splits need to be balanced.

**Strategy**:
1. Randomly undersample to create a balanced test set (100 spam, 100 ham) and a balanced validation set (100 each)
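2. Use the **NearMiss algorithm** to undersample the training set. NearMiss chooses which majority-class (ham) samples to keep by their distance to minority-class (spam) samples; for example, the NearMiss-1 variant keeps the ham messages whose average distance to their nearest spam neighbors is smallest, so the retained ham messages are the ones hardest to tell apart from spam.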

A bag-of-words representation is used to compute the distances in NearMiss (the algorithm needs numeric vectors to measure distance).

**Final splits**: Train (547 spam, 547 ham), Valid (100 each), Test (100 each), all balanced.
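
Step 1 of the strategy (carving out balanced test and validation sets) can be sketched with NumPy alone. The function name `balanced_splits`, the label encoding (0 = ham, 1 = spam), and the fixed seed are assumptions for illustration; the chapter's own code may differ. The leftover pool is still imbalanced and is what NearMiss then reduces to 547 per class.

```python
import numpy as np

def balanced_splits(labels, n_eval_per_class=100, seed=0):
    """Return index arrays (train_pool, valid, test); valid and test are balanced."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, valid, test = [], [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then slice off the eval sets.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        test.extend(idx[:n_eval_per_class])
        valid.extend(idx[n_eval_per_class:2 * n_eval_per_class])
        train.extend(idx[2 * n_eval_per_class:])
    return np.array(train), np.array(valid), np.array(test)

# Label vector with the class counts quoted above (0 = ham, 1 = spam)
labels = np.array([0] * 4827 + [1] * 747)
train_idx, valid_idx, test_idx = balanced_splits(labels)
print("test:", np.bincount(labels[test_idx]).tolist())         # [100, 100]
print("valid:", np.bincount(labels[valid_idx]).tolist())       # [100, 100]
print("train pool:", np.bincount(labels[train_idx]).tolist())  # [4627, 547]
```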

### Tokenization with WordPiece

BERT uses **WordPiece tokenization**, which breaks words into frequent subwords. Example: "seashells" → ["seas", "##hell", "##s"], where ## marks a piece that continues the previous one.

**Advantages**:
- **Smaller vocabulary**: "walk", "walked", "walking" can share the pieces ["walk", "##ed", "##ing"]
- **Handles OOV words**: Unseen words can be represented by combining known subwords
- **Better generalization**: The model learns morphological patterns

**Process**:
1. The tokenizer splits the input string into subword tokens
2. Add [CLS] at the start and [SEP] at the end
3. Convert tokens to IDs via vocabulary lookup
4. Create the attention mask (1 for real tokens, 0 for padding)
5. Create the segment IDs (all 0s for a single-sequence task)
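
The five steps can be walked through with a toy tokenizer. The sketch below uses greedy longest-match-first splitting, the usual way WordPiece is applied at inference time; the tiny vocabulary, its IDs (apart from [PAD] = 0), and the function names `wordpiece` and `encode` are made up for illustration and are not BERT's real vocabulary or API.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand      # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]                # no known piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_len=10):
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len (truncation not shown)")
    n_pad = max_len - len(tokens)
    input_word_ids = [vocab[t] for t in tokens] + [vocab["[PAD]"]] * n_pad
    input_mask = [1] * len(tokens) + [0] * n_pad   # 1 = real token, 0 = padding
    input_type_ids = [0] * max_len                 # single sequence: all zeros
    return tokens, input_word_ids, input_mask, input_type_ids

# Hypothetical toy vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "seas": 4, "##hell": 5, "##s": 6, "she": 7, "sells": 8}
tokens, ids, mask, type_ids = encode("She sells seashells", vocab)
print(tokens)  # ['[CLS]', 'she', 'sells', 'seas', '##hell', '##s', '[SEP]']
print(ids)     # [2, 7, 8, 4, 5, 6, 3, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The three returned lists correspond to the `input_word_ids`, `input_mask`, and `input_type_ids` inputs that the BERT encoder expects.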

### Model Architecture

**Components**:
1. **Input layers**: `input_word_ids`, `input_mask`, `input_type_ids`
2. **BERT encoder**: Download from TensorFlow Hub (`bert_en_uncased_L-12_H-768_A-12`)
3. **BERT outputs**: `sequence_output` (all tokens) and `pooled_output` ([CLS] representation)
4. **Classification head**: `BertClassifier` automatically adds a Dense layer on top for binary classification

**Compilation**:
- **Optimizer**: AdamW with polynomial learning rate decay and warmup
- **Loss**: Sparse categorical cross-entropy
- **Metric**: Accuracy

**Warmup Schedule**: The learning rate increases linearly from 0 to init_lr over warmup_steps, then decays polynomially to 0 over the remaining steps. This stabilizes training.
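
The shape of this schedule is easy to see in a standalone function. This is a plain-Python sketch of the idea, not the optimizer utility the chapter uses; the function name `warmup_polynomial_lr`, the `power=1.0` (linear) decay, and the example step counts are assumptions.

```python
def warmup_polynomial_lr(step, init_lr, total_steps, warmup_steps,
                         power=1.0, end_lr=0.0):
    """Learning rate at a given step: linear warmup, then polynomial decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to init_lr.
        return init_lr * step / warmup_steps
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    decay_steps = total_steps - warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (init_lr - end_lr) * (1.0 - progress) ** power + end_lr

# Assumed example: 60 training steps in total, the first 6 used for warmup
for step in (0, 3, 6, 33, 60):
    print(step, warmup_polynomial_lr(step, init_lr=3e-5,
                                     total_steps=60, warmup_steps=6))
```

The rate peaks at `init_lr` exactly when warmup ends, which keeps the first updates to the pretrained weights small.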

### Training and Results

Train for 3 epochs with batch size 56. Results:
- Epoch 1: 45% train accuracy, 51% validation accuracy (model learning)
- Epoch 2: 65% train, 81% validation (rapid improvement)
- Epoch 3: 76% train, 85% validation (convergence)
- **Test accuracy: 79.5%**

A strong result for minimal effort: only 3 epochs, no hyperparameter tuning, and a small dataset. BERT's pretrained knowledge makes this possible.

---

## Part 4: Question Answering with Hugging Face

### SQuAD Dataset

**Stanford Question Answering Dataset (SQuAD v1)**: 87,599 training examples, 10,570 validation examples. Each example contains:
- **Question**: "What color is the ball?"
- **Context**: "Tippy is a dog. She loves to play with her red ball."
- **Answer**: {"text": "red", "answer_start": 43}

**Data Issues**: Some examples have character offset errors (answer_start does not align with the actual answer position). Solution: a correction function that tries offsets -2, -1, and 0 and selects the one that aligns correctly.
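
A correction function of this kind might look like the sketch below. The name `fix_answer_start`, the order in which shifts are tried, and returning `None` when nothing lines up are assumptions for illustration; the example sentence is borrowed from Program 5.

```python
def fix_answer_start(context, answer_text, answer_start, shifts=(0, -1, -2)):
    """Return the first shifted offset at which the answer text really starts."""
    for shift in shifts:
        start = answer_start + shift
        if start >= 0 and context[start:start + len(answer_text)] == answer_text:
            return start
    return None  # no candidate offset matched; the caller decides what to do

context = "Paris is the capital and largest city of France."
print(fix_answer_start(context, "capital", 13))  # 13 (already correct)
print(fix_answer_start(context, "capital", 15))  # 13 (off by two, corrected)
print(fix_answer_start(context, "capital", 16))  # None (outside the tried shifts)
```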

### DistilBERT

**DistilBERT** is a smaller, faster version of BERT:
- **Size**: 66M parameters (vs. 110M for BERT-Base)
- **Speed**: 60% faster
- **Performance**: Retains 97% of BERT's performance
- **Training**: Knowledge distillation (a student model learns from the teacher BERT)

Ideal for production scenarios where speed and size matter.

### Tokenization for QA

**Input Format**: `[CLS] Question tokens [SEP] Context tokens [SEP]`

**Key Steps**:
1. Tokenize the question and the context
2. Concatenate them with the special tokens
3. Convert the character-based answer positions to token-based positions (the crucial step)
4. Handle truncation: if the answer was truncated away, set the start/end positions to the last token index

**Token-based Positions**: The model predicts token indices (not character indices), so the character-level answer_start/answer_end must be mapped to the corresponding token positions using `encodings.char_to_token()`.
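
The mapping itself is simple once each token carries its character span. The sketch below is a library-free illustration of the idea behind `char_to_token()`: a regex word splitter stands in for the real WordPiece offsets, the question and special tokens are ignored, and the helper names are assumptions.

```python
import re

def char_to_token(offsets, char_index):
    """offsets: one (start, end) character span per token, end exclusive."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # the character is whitespace or lies outside every token

def answer_token_span(offsets, answer_start, answer_text):
    end_char = answer_start + len(answer_text)          # exclusive end
    start_token = char_to_token(offsets, answer_start)
    end_token = char_to_token(offsets, end_char - 1)    # last answer character
    return start_token, end_token

context = "She loves her red ball."
# Stand-in tokenizer: words and punctuation marks with their character spans
offsets = [m.span() for m in re.finditer(r"\w+|[^\w\s]", context)]
span = answer_token_span(offsets, context.index("red ball"), "red ball")
print(span)  # (3, 4): the tokens "red" and "ball"
```

Looking up `end_char - 1` rather than `end_char` matters, because the exclusive end often falls on a space or on the next token.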

### Model Architecture

**TFDistilBertForQuestionAnswering**:
- Input: Token IDs + attention mask
- DistilBERT encoding: Generates hidden representations for all tokens
- **Two classification heads**:
  - Start position classifier: Predicts a probability distribution over all token positions for the answer start
  - End position classifier: Predicts a probability distribution over all token positions for the answer end
- Loss: Sum of the cross-entropy losses for the start and end predictions

**Training**:
- Optimizer: AdamW with learning rate 5e-5
- Batch size: 8
- Epochs: 2
- Training time: ~2 hours on GPU

### Inference and Evaluation

**Inference Process**:
1. Tokenize question + context
2. Model predicts start/end probability distributions
3. Select the argmax position from each distribution
4. Extract the tokens between the start and end positions
5. Convert the token IDs back to text

**Metrics**:
- **Exact Match (EM)**: Percentage of predictions that exactly match the reference answer
- **F1 Score**: Token-level F1 between the predicted and reference answers (more forgiving)
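
Both metrics fit in a few lines of Python. The normalization below (lowercasing, stripping punctuation and the articles a/an/the) is modeled on the convention of the SQuAD evaluation script, but this is a simplified sketch for a single reference answer, and the function names are assumptions.

```python
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(pred, ref):
    return float(normalize(pred) == normalize(ref))

def f1_score(pred, ref):
    p, r = normalize(pred).split(), normalize(ref).split()
    if not p or not r:
        return float(p == r)       # both empty counts as a match
    # Multiset intersection counts the tokens shared by both answers.
    n_same = sum((Counter(p) & Counter(r)).values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(p)
    recall = n_same / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The golden anniversary.", "golden anniversary"))  # 1.0
print(exact_match("golden", "golden anniversary"))                   # 0.0
print(round(f1_score("golden", "golden anniversary"), 3))            # 0.667
```

The last two lines show why F1 is more forgiving: a partially correct answer scores 0 on EM but still earns credit for the overlapping token.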

**Example Result**:
- Question: "What was the theme of Super Bowl 50?"
- Context: "...league emphasized the 'golden anniversary'..."
- Predicted: "golden anniversary" ✓
- True answer: "golden anniversary"

The model successfully extracts the correct answer from a long context paragraph.

---

## Part 5: Implementation Programs

### Program 1: Positional Embeddings

```python
import tensorflow as tf
import numpy as np

def get_positional_encoding(n_steps, d_model):
    """Generate sinusoidal positional encodings"""
    position = np.arange(n_steps)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    # Create empty positional encoding matrix
    pos_encoding = np.zeros((n_steps, d_model))
    
    # Apply sin to even indices
    pos_encoding[:, 0::2] = np.sin(position * div_term)
    
    # Apply cos to odd indices
    pos_encoding[:, 1::2] = np.cos(position * div_term)
    
    return tf.constant(pos_encoding, dtype=tf.float32)

# Usage
n_steps = 100  # Sequence length
d_model = 512  # Embedding dimension

pos_encoding = get_positional_encoding(n_steps, d_model)
print(f"Positional encoding shape: {pos_encoding.shape}")  # (100, 512)

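# Combine with token embeddings
# (vocab_size and input_ids are illustrative placeholders)
vocab_size = 10000
input_ids = tf.random.uniform((32, n_steps), maxval=vocab_size, dtype=tf.int32)
token_embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(input_ids)
# pos_encoding (n_steps, d_model) broadcasts across the batch dimension
final_embeddings = token_embeddings + pos_encoding[:n_steps, :]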
```

**Explanation**: The positional encoding uses sine and cosine functions at different frequencies. Even dimensions use sine and odd dimensions use cosine. The frequency decreases as the dimension index increases, which gives each position a unique signature. The final embeddings are the sum of the token embeddings and the positional encodings.

---

### Program 2: Multi-Head Attention Layer

```python
import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        
        # Dense layers for the Q, K, V projections
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        
        # Final dense layer
        self.dense = tf.keras.layers.Dense(d_model)
    
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth) and move heads to axis 1"""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]
        
        # Linear projections
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)
        
        # Split heads
        q = self.split_heads(q, batch_size)  # (batch, heads, seq_len, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # Scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        
        # Scale
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        
        # Apply mask (if provided); convention: 1 = position to hide, 0 = keep
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        
        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        
        # Weighted sum
        output = tf.matmul(attention_weights, v)
        
        # Concatenate heads
        output = tf.transpose(output, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(output, (batch_size, -1, self.d_model))
        
        # Final linear projection
        output = self.dense(concat_attention)
        
        return output, attention_weights

# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = tf.random.uniform((32, 100, 512))  # (batch, seq_len, d_model)
output, attn_weights = mha(x, x, x)
print(f"Output shape: {output.shape}")  # (32, 100, 512)
```

**Explanation**: Multi-head attention splits the d_model dimension across multiple heads, each of which computes attention independently. Queries, keys, and values are projected into different subspaces, attention is computed per head, and the results are concatenated and projected back. Multiple heads let the model learn different types of relationships simultaneously.

---

### Program 3: Transformer Encoder Layer

```python
# Reuses MultiHeadAttention (Program 2) and get_positional_encoding (Program 1)
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        
        self.mha = MultiHeadAttention(d_model, num_heads)
        
        # Feed-forward network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        
        # Layer normalization
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        # Dropout
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
    
    def call(self, x, training=False, mask=None):
        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual + LayerNorm
        
        # Feed-forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual + LayerNorm
        
        return out2

# Complete encoder with multiple layers
class TransformerEncoder(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, vocab_size, max_len, dropout_rate=0.1):
        super(TransformerEncoder, self).__init__()
        
        self.d_model = d_model
        
        # Embeddings
        self.token_embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = get_positional_encoding(max_len, d_model)
        
        # Encoder layers
        self.enc_layers = [
            TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]
        
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
    
    def call(self, x, training=False, mask=None):
        seq_len = tf.shape(x)[1]
        
        # Token + positional embeddings
        x = self.token_embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # Scale embeddings
        x += self.pos_encoding[:seq_len, :]
        
        x = self.dropout(x, training=training)
        
        # Pass through encoder layers
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training=training, mask=mask)
        
        return x

# Usage
encoder = TransformerEncoder(
    num_layers=6, d_model=512, num_heads=8,
    dff=2048, vocab_size=10000, max_len=5000
)
input_ids = tf.random.uniform((32, 100), maxval=10000, dtype=tf.int32)
output = encoder(input_ids, training=True)
print(f"Encoder output shape: {output.shape}")  # (32, 100, 512)
```

**Explanation**: An encoder layer consists of multi-head attention + residual connection + layer norm, followed by a feed-forward network + residual connection + layer norm. The full encoder stacks multiple layers (typically 6 or 12). The token embeddings are scaled by \(\sqrt{d_{model}}\) before the positional encodings are added, as in the original Transformer paper.

---

### Program 4: BERT Spam Classification

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_models as tfm
from tensorflow.keras.layers import Input

# Download BERT from TensorFlow Hub
bert_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
max_seq_length = 128

# Define inputs
input_word_ids = Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
input_type_ids = Input(shape=(max_seq_length,), dtype=tf.int32, name="input_type_ids")

# Load BERT encoder
bert_layer = hub.KerasLayer(bert_url, trainable=True)
bert_outputs = bert_layer({
    "input_word_ids": input_word_ids,
    "input_mask": input_mask,
    "input_type_ids": input_type_ids
})

# Create encoder model
encoder = tf.keras.Model(
    inputs={
        "input_word_ids": input_word_ids,
        "input_mask": input_mask,
        "input_type_ids": input_type_ids
    },
    outputs={
        "sequence_output": bert_outputs["sequence_output"],
        "pooled_output": bert_outputs["pooled_output"]
    }
)

# Add classification head
classifier = tfm.nlp.models.BertClassifier(
    network=encoder,
    num_classes=2  # Binary classification: spam vs ham
)

# Compile
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=['accuracy']
)

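# Train (train_inputs, valid_inputs, and test_inputs are assumed to be dicts of
# tokenized arrays keyed like the Input layers above; labels are 0/1 integers)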
history = classifier.fit(
    train_inputs,
    train_labels,
    validation_data=(valid_inputs, valid_labels),
    epochs=3,
    batch_size=32
)

# Evaluate
test_loss, test_acc = classifier.evaluate(test_inputs, test_labels)
print(f"Test accuracy: {test_acc:.4f}")
```

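**Explanation**: BERT is downloaded as a KerasLayer from TensorFlow Hub. The encoder outputs the pooled representation of the [CLS] token. `BertClassifier` automatically adds a Dense layer on top for classification. Training uses a small learning rate (3e-5) because the pretrained weights are already a good starting point, and the model reaches good accuracy after only a few epochs. Note that this listing is simplified: it uses plain Adam and batch size 32 rather than the AdamW warmup schedule and batch size 56 described in Part 3.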

---

### Program 5: Question Answering with Hugging Face

```python
from transformers import DistilBertTokenizerFast, TFDistilBertForQuestionAnswering
from datasets import load_dataset
import tensorflow as tf

# Load dataset
dataset = load_dataset("squad")
train_contexts = dataset["train"]["context"]
train_questions = dataset["train"]["question"]
train_answers = dataset["train"]["answers"]

# Initialize tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

# Tokenize training data
def tokenize_and_add_positions(contexts, questions, answers):
    # Tokenize (question, context) pairs
    encodings = tokenizer(
        questions,
        contexts,
        truncation=True,
        padding=True,
        max_length=384,
        return_offsets_mapping=True
    )
    # Padded length shared by every example (at most max_length)
    seq_len = len(encodings['input_ids'][0])

    # Convert char positions to token positions
    start_positions = []
    end_positions = []

    for i, answer in enumerate(answers):
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0])

        # Find token positions; sequence_index=1 selects the context,
        # because the question is passed first to the tokenizer
        start_token = encodings.char_to_token(i, start_char, sequence_index=1)
        end_token = encodings.char_to_token(i, end_char - 1, sequence_index=1)

        # Handle truncation: fall back to the last valid token index
        if start_token is None:
            start_token = seq_len - 1
        if end_token is None:
            end_token = seq_len - 1

        start_positions.append(start_token)
        end_positions.append(end_token)

    encodings['start_positions'] = start_positions
    encodings['end_positions'] = end_positions

    return encodings

train_encodings = tokenize_and_add_positions(train_contexts, train_questions, train_answers)

# Create tf.data dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    {
        'input_ids': train_encodings['input_ids'],
        'attention_mask': train_encodings['attention_mask']
    },
    {
        'start_positions': train_encodings['start_positions'],
        'end_positions': train_encodings['end_positions']
    }
)).batch(8)

# Compile
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)

# Train
model.fit(train_dataset, epochs=2)

# Inference function
def answer_question(question, context):
    inputs = tokenizer(
        question,
        context,
        return_tensors='tf',
        truncation=True,
        padding=True
    )
    
    outputs = model(inputs)
    
    # Get start and end positions
    start_idx = tf.argmax(outputs.start_logits, axis=1).numpy()[0]
    end_idx = tf.argmax(outputs.end_logits, axis=1).numpy()[0]
    
    # Convert token IDs to text
    answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]
    answer = tokenizer.decode(answer_tokens)
    
    return answer

# Test
question = "What is the capital of France?"
context = "Paris is the capital and largest city of France."
answer = answer_question(question, context)
print(f"Answer: {answer}")  # e.g. paris, once fine-tuning has succeeded (the model is uncased)
```

**Explanation**: The DistilBERT question answering model has two classification heads (start position and end position). The tokenizer converts question + context to token IDs and keeps an offset mapping so that character positions can be converted to token positions. The model predicts a distribution over all positions, and argmax selects the most likely start and end. The tokenizer's `decode()` method converts the token IDs back to readable text. This listing uses plain Adam rather than the AdamW optimizer listed in Part 4.

---

## Conclusion

Transformers represent a paradigm shift in NLP, from sequential processing (RNNs) to parallel, attention-based processing. Key innovations: self-attention lets the model see the full context simultaneously, positional encodings preserve order information, multi-head attention captures multiple types of relationships, and residual connections plus layer normalization stabilize deep networks. BERT demonstrates the power of transfer learning: pretrain a large model on a massive corpus, then fine-tune it for specific tasks with minimal data. The Hugging Face transformers library democratizes access to state-of-the-art models, making powerful NLP accessible through simple APIs. The Transformer architecture has since expanded beyond NLP to computer vision (ViT), multimodal learning (CLIP), and nearly every domain of AI.
