# CHAPTER 12: Sequence-to-Sequence Learning with Attention

## Summary

This chapter continues building the machine translator from Chapter 11 by adding the **Bahdanau attention mechanism**, which improves accuracy significantly. Without attention, the decoder only has access to the encoder's final context vector, which is a severe bottleneck. With attention, the decoder can "look at" the encoder output at every timestep and dynamically choose which parts are most relevant at each decoding step. The implementation uses a custom `DecoderRNNAttentionWrapper` layer built with the Keras subclassing API. The chapter also shows how visualizing the attention weights provides interpretability: the heatmaps reveal word alignment patterns between English and German that make linguistic sense.

---

## The Attention Mechanism Concept

### The Problem Without Attention

In the standard seq2seq model (Chapter 11), the encoder turns the English sentence into a single context vector, a compact representation of the entire input. The decoder must then produce the entire German sentence using only this context vector. This is a massive bottleneck, especially for long sentences: a single vector cannot practically summarize all the information in a sentence of 20 or more words. As a result, the model struggles on long sequences and cannot keep track of word correspondences between the two languages.

### Bahdanau Attention Solution

**Bahdanau attention** solves this by giving the decoder access to **all** encoder outputs, not just the final context vector. At each decoding step, the attention mechanism computes **attention weights** that represent how much the decoder should "focus" on each encoder position. These weights are used to compute a weighted sum of the encoder outputs, producing a dynamic context vector specific to the current decoding step. In this way the decoder adapts its attention to whatever it is currently translating.

### Attention Computation

For decoding step i, with previous decoder state s_{i-1} and encoder outputs h_j, attention is computed in three steps:

1. **Energy calculation**: e_{ij} = v^T tanh(W·s_{i-1} + U·h_j) - measures the "relevance" of encoder position j to decoding step i
2. **Normalization**: α_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}) - converts the energies into a probability distribution over encoder positions
3. **Context aggregation**: c_i = Σ_j α_{ij}·h_j - weighted sum of the encoder outputs

W and U are learnable matrices and v is a learnable vector; the model adjusts all three during training. The result is a context vector c_i focused on the relevant parts of the encoder input. It is concatenated with the decoder input and fed to the GRU cell at that step.
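
The three steps above can be sketched in NumPy for a single example without a batch dimension. This is a minimal illustration, not the chapter's Keras layer: the function name `bahdanau_attention_step`, the toy dimensions, and the random weights are all assumptions made for the example.

```python
import numpy as np

def bahdanau_attention_step(s_prev, h, W, U, v):
    """One attention step for a single example (no batch dimension).

    s_prev: (dec_dim,)          previous decoder state s_{i-1}
    h:      (src_len, enc_dim)  encoder outputs h_1..h_T
    W: (dec_dim, attn_dim), U: (enc_dim, attn_dim), v: (attn_dim,)
    """
    # 1. Energies: e_ij = v^T tanh(W s_{i-1} + U h_j), one score per source position
    energies = np.tanh(s_prev @ W + h @ U) @ v      # (src_len,)
    # 2. Softmax over source positions (shifted by the max for numerical stability)
    exp_e = np.exp(energies - energies.max())
    alphas = exp_e / exp_e.sum()                    # (src_len,)
    # 3. Context vector: attention-weighted sum of the encoder outputs
    context = alphas @ h                            # (enc_dim,)
    return alphas, context

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, attn_dim = 5, 4, 3, 6    # toy sizes, chosen arbitrarily
h = rng.normal(size=(src_len, enc_dim))
s_prev = rng.normal(size=(dec_dim,))
W = rng.normal(size=(dec_dim, attn_dim))
U = rng.normal(size=(enc_dim, attn_dim))
v = rng.normal(size=(attn_dim,))

alphas, context = bahdanau_attention_step(s_prev, h, W, U, v)
print(alphas.shape, context.shape)   # (5,) (4,)
```

The weights `alphas` always sum to 1, so the context vector is a convex combination of the encoder outputs.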

---

## Custom Attention Layer Implementation

### DecoderRNNAttentionWrapper Architecture

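Keras does ship an `AdditiveAttention` layer (Bahdanau-style scoring), but it does not feed the context vector back into a recurrent decoder cell at every timestep, which is what this seq2seq decoder needs. We therefore implement a custom layer with the Keras subclassing API. It has three key methods: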

**`__init__`**: Initializes the layer with a GRUCell (the decoder cell function) and `units` (the attention hidden dimension).

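**`build`**: Declares three weight matrices:
- W_a: shape [encoder_hidden, attention_hidden]
- U_a: shape [decoder_hidden, attention_hidden]
- V_a: shape [attention_hidden, 1]

All three are initialized uniformly and are trainable. Note that the code in Program 1 takes the attention hidden size from the encoder output size (and from the decoder cell size for U_a) rather than from `units`, so the encoder output size and the decoder cell size must be equal for the sum inside `tanh` to work.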

**`call`**: The main computation, which iterates over the decoder timesteps using the `K.rnn()` backend function. At each timestep, the `_step` function computes attention weights from the encoder outputs and the current decoder state, builds the attention-weighted context vector, concatenates it with the decoder input, and feeds the result to the GRUCell to produce the output and the next state.
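
To make the role of `K.rnn()` concrete, here is a simplified NumPy analogue of the loop it performs. It is only a sketch: `simple_rnn_loop` and `toy_step` are hypothetical names, masking and other options are ignored, and the real `K.rnn()` returns three values (last output, all outputs, new states) instead of two. What it does illustrate is that the step function receives the current states followed by the constants, which is why `_step` can read the encoder outputs from `states[-1]`.

```python
import numpy as np

def simple_rnn_loop(step_function, inputs, initial_states, constants):
    """Simplified analogue of K.rnn: iterate over the time axis of `inputs`.

    inputs has shape (batch, time, features). At every step the step function
    receives the current states followed by the constants.
    """
    states = list(initial_states)
    outputs = []
    for t in range(inputs.shape[1]):
        output, states = step_function(inputs[:, t], states + list(constants))
        outputs.append(output)
    # Stack the per-step outputs along a new time axis: (batch, time, ...)
    return np.stack(outputs, axis=1), states

def toy_step(x_t, states):
    # states[0] is the recurrent state; states[-1] is the constant (encoder outputs)
    prev_state, memory = states[0], states[-1]
    new_state = np.tanh(prev_state + x_t + memory.mean(axis=1))
    return new_state, [new_state]

batch, dec_len, enc_len, dim = 2, 4, 6, 3           # toy sizes
decoder_inputs = np.ones((batch, dec_len, dim))
encoder_outputs = np.zeros((batch, enc_len, dim))   # passed unchanged to every step
outputs, final_states = simple_rnn_loop(
    toy_step, decoder_inputs, [np.zeros((batch, dim))], [encoder_outputs]
)
print(outputs.shape)   # (2, 4, 3)
```

Because the outputs are stacked along the decoder time axis, anything returned per step (such as attention weights over encoder positions) ends up with shape (batch, decoder_steps, ...).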

### GRUCell vs GRU Layer

A GRUCell is the primitive building block that computes a single timestep: `output, next_state = GRUCell(input, state)`. A GRU layer is the full implementation that processes entire sequences and adds convenience features. We use a GRUCell inside the attention layer because the attention computation needs fine-grained control over individual timesteps.
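
The distinction can be illustrated with a small NumPy sketch in which a "layer" is nothing more than a "cell" applied in a loop over time. This is not the Keras implementation: biases are omitted, the gate equations follow one common GRU formulation, and the names `gru_cell_step`, `gru_layer` and the parameter dictionary `p` are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_step(x_t, h_prev, p):
    """Single GRU timestep (biases omitted), the job of a GRUCell."""
    z = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"])              # update gate
    r = sigmoid(x_t @ p["Wr"] + h_prev @ p["Ur"])              # reset gate
    h_cand = np.tanh(x_t @ p["Wh"] + (r * h_prev) @ p["Uh"])   # candidate state
    h_new = z * h_prev + (1.0 - z) * h_cand
    return h_new, h_new   # in a GRU the output equals the new state

def gru_layer(inputs, h0, p):
    """Whole-sequence processing, the job of a GRU layer: loop the cell over time."""
    h, outputs = h0, []
    for t in range(inputs.shape[1]):
        out, h = gru_cell_step(inputs[:, t], h, p)
        outputs.append(out)
    return np.stack(outputs, axis=1), h

rng = np.random.default_rng(0)
batch, steps, in_dim, hid = 2, 5, 3, 4   # toy sizes
# W* matrices map the input, U* matrices map the previous hidden state
p = {name: rng.normal(scale=0.1, size=(in_dim if name[0] == "W" else hid, hid))
     for name in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
x = rng.normal(size=(batch, steps, in_dim))
seq_out, h_final = gru_layer(x, np.zeros((batch, hid)), p)
print(seq_out.shape, h_final.shape)   # (2, 5, 4) (2, 4)
```

Working at the cell level is what lets the attention layer change the input of each step (by appending a fresh context vector) before the recurrence is applied.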

---

## Model with Attention

### Encoder (Unchanged)

The encoder is identical to Chapter 11: string input → TextVectorization → Embedding → Bidirectional GRU. Bidirectional processing matters because it captures context from both directions. The encoder returns its final states (the context vector) plus the full sequence of encoder hidden states, which attention needs.

### Decoder with Attention

The decoder changes significantly:

1. **Input**: The German sequence is vectorized and embedded (as before)
2. **Attention layer**: `DecoderRNNAttentionWrapper` receives the encoder output sequence and the decoder embeddings, and returns the attention-based decoder outputs plus the attention weights matrix
3. **Concatenation**: The context vector is concatenated with the decoder input inside the layer, so the layer's outputs are passed directly to a Dense hidden layer
4. **Output**: The final Dense layer outputs a probability distribution over the German vocabulary

Key improvement: the decoder can now access ALL of the encoder's information, not just the final context vector.

---

## Training and Evaluation

### Performance Improvement

The model with attention reaches a BLEU score roughly **2x higher** than the model without attention:
- Without attention: ~0.10 BLEU
- With attention: ~0.20 BLEU on the validation set

This is a significant improvement and suggests that the attention mechanism does relieve the bottleneck. Accuracy also rises from ~73% to ~83%, and training takes about 5 minutes for 5 epochs.

### BLEU Score Interpretation

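A BLEU score of 0.20 does not mean that exactly 20% of n-grams match. BLEU is the geometric mean of clipped n-gram precisions (typically for 1-grams up to 4-grams) against the reference translations, scaled by a brevity penalty, so 0.20 is best read as a modest but real degree of n-gram overlap. Compared with the state-of-the-art figure of roughly 0.35 cited for the WMT dataset, our score is decent given the model's simplicity and short training time. The gap reflects: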
- Smaller dataset (50K vs millions)
- Simpler architecture (no transformers, layer normalization, etc.)
- No special training tricks (curriculum learning, back-translation, etc.)
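
To make the BLEU definition more tangible, the following is a minimal, unsmoothed, single-reference sketch in pure Python. It is an illustration under stated assumptions, not the chapter's `BLEUMetric` class (whose internals may differ); the function names and the toy English sentences are invented for the example.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for two token lists."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if matches == 0 or total == 0:
            return 0.0   # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish candidates that are shorter than the reference
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat today".split()   # toy sentences
candidate = "the cat sat on the rug today".split()
score = sentence_bleu(candidate, reference)
perfect = sentence_bleu(reference, reference)
disjoint = sentence_bleu("a b c d e".split(), reference)
print(round(score, 3), perfect, disjoint)
```

A single wrong word lowers every n-gram order at once, which is why BLEU drops quickly even for translations that look mostly right.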

---

## Attention Visualization for Interpretability

### Why Visualization Matters

One of the major advantages of attention is **interpretability**. The attention weights matrix (α) has one entry for every encoder-decoder timestep pair. It can be visualized as a heatmap in which lighter colors indicate higher attention. This reveals word alignment patterns: if the German word "und" (and) attends strongly to the English word "and", the model has learned a genuine linguistic correspondence.

### Visualization Process

1. **Load the trained model** and extract all intermediate layers
2. **Create a visualizer model** that outputs the predictions PLUS the attention weights matrix PLUS the vectorized tokens
3. **For a sample English sentence**: predict the German translation and get the attention weights
4. **Create a heatmap** with:
   - X-axis: predicted German words
   - Y-axis: input English words
   - Cell values: attention weights (lighter = more attention, matching the default colormap used in Program 5)
5. **Analyze the patterns**: observe the word alignments and check whether they are sensible

### Example Patterns

For the input "Tom and Mary haven't heard from John in a long time" and the prediction "Tom und Maria [UNK] schon seit [UNK] [UNK] [UNK] Johannes gestohlen":
- "und" attends heavily to "and"
- "Maria" attends to "Mary"
- "gestohlen" attends to the surrounding context
- The attention pattern is roughly diagonal (natural for closely related languages)
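
Patterns like these can also be read off programmatically. The sketch below uses a small, hand-written attention matrix (hypothetical values, not output from the trained model) and picks, for each German word, the English word with the highest weight. The variable names are assumptions for the example.

```python
import numpy as np

english = ["tom", "and", "mary"]   # encoder (input) tokens
german = ["tom", "und", "maria"]   # decoder (predicted) tokens

# Hypothetical attention weights: rows = German decoder steps,
# columns = English encoder positions; each row sums to 1.
attn = np.array([
    [0.80, 0.10, 0.10],
    [0.15, 0.70, 0.15],
    [0.05, 0.15, 0.80],
])

# For every German word, take the English position with the highest weight
aligned = attn.argmax(axis=1)
alignment = [(german[i], english[j]) for i, j in enumerate(aligned)]
print(alignment)   # [('tom', 'tom'), ('und', 'and'), ('maria', 'mary')]

# A roughly diagonal heatmap corresponds to non-decreasing aligned positions
is_monotonic = bool(np.all(np.diff(aligned) >= 0))
print(is_monotonic)   # True
```

Taking the argmax discards the soft distribution, so it is only a quick sanity check; the heatmap remains the better tool for spotting diffuse attention such as the "gestohlen" case.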

---

## Implementation Programs

### Program 1: Custom Attention Layer

```python
from tensorflow import keras
import tensorflow.keras.backend as K

class DecoderRNNAttentionWrapper(keras.layers.Layer):
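    def __init__(self, cell_fn, units, **kwargs):
        # Call the base constructor first so Keras can track the attributes set below
        super(DecoderRNNAttentionWrapper, self).__init__(**kwargs)
        self._cell_fn = cell_fn
        self.units = units  # stored, but build() below does not use it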
    
    def build(self, input_shape):
        # W_a: [encoder_hidden, attention_hidden]
        self.W_a = self.add_weight(
            name='W_a',
            shape=(input_shape[0][2], input_shape[0][2]),
            initializer='uniform',
            trainable=True
        )
        # U_a: [decoder_hidden, attention_hidden]
        self.U_a = self.add_weight(
            name='U_a',
            shape=(self._cell_fn.units, self._cell_fn.units),
            initializer='uniform',
            trainable=True
        )
        # V_a: [attention_hidden, 1]
        self.V_a = self.add_weight(
            name='V_a',
            shape=(input_shape[0][2], 1),
            initializer='uniform',
            trainable=True
        )
        super(DecoderRNNAttentionWrapper, self).build(input_shape)
    
    def call(self, inputs, initial_state, training=False):
        def _step(inputs, states):
            encoder_outputs = states[-1]  # K.rnn appends the constants after the states

            # Energy, in this code's naming: e_ij = V_a^T tanh(W_a·h_j + U_a·s_{i-1})
            W_a_dot_h = K.dot(encoder_outputs, self.W_a)
            U_a_dot_s = K.expand_dims(K.dot(states[0], self.U_a), 1)
            Wh_plus_Us = K.tanh(W_a_dot_h + U_a_dot_s)
            e_i = K.squeeze(K.dot(Wh_plus_Us, self.V_a), axis=-1)

            # Normalize the energies into attention weights over encoder positions
            a_i = K.softmax(e_i)

            # Context vector: attention-weighted sum of the encoder outputs
            c_i = K.sum(encoder_outputs * K.expand_dims(a_i, -1), axis=1)

            # Concatenate the context vector with the decoder input
            # and feed the result to the GRUCell
            s, states = self._cell_fn(
                K.concatenate([inputs, c_i], axis=-1), states
            )

            return (s, a_i), states
        
        encoder_outputs, decoder_inputs = inputs
        
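        # K.rnn applies _step at every decoder timestep
        _, attn_outputs, _ = K.rnn(
            step_function=_step,
            inputs=decoder_inputs,
            initial_states=[initial_state],
            constants=[encoder_outputs]
        )

        # attn_out: decoder outputs per timestep
        # attn_energy: softmax-normalized attention weights,
        #              shape (batch, decoder_steps, encoder_steps)
        attn_out, attn_energy = attn_outputs
        return attn_out, attn_energy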
```

**Explanation**: The custom layer implements Bahdanau attention. `K.rnn()` is a Keras backend function that applies the `_step` function at every decoder timestep, passing the encoder outputs as constants. The attention weights are computed as the softmax of the energy scores and are then used to weight the encoder outputs.

---

### Program 2: Final Model with Attention

```python
import tensorflow as tf

def get_final_seq2seq_model_with_attention(n_vocab, encoder, de_vectorizer):
    # Encoder input and encoder outputs (final states plus the full output sequence)
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')
    fwd_state, bwd_state, en_states = encoder(e_inp)

    # Decoder input
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    d_vectorized = de_vectorizer(d_inp)
    d_emb = tf.keras.layers.Embedding(
        n_vocab+2, 128, mask_zero=True, name='d_embedding'
    )(d_vectorized)

    # Initial state: concatenate the forward and backward encoder states
    d_init_state = tf.keras.layers.Concatenate(axis=-1)([fwd_state, bwd_state])

    # Attention layer. Layers are named so Program 4 can fetch them with get_layer();
    # the encoder ('encoder') and vectorizers ('e_vectorizer', 'd_vectorizer') are
    # assumed to have been named where they were created.
    gru_cell = tf.keras.layers.GRUCell(256)
    attn_out, _ = DecoderRNNAttentionWrapper(
        cell_fn=gru_cell, units=512, name='d_attention'
    )([en_states, d_emb], initial_state=d_init_state)

    # Dense layers
    d_dense1 = tf.keras.layers.Dense(
        512, activation='relu', name='d_dense_1'
    )(attn_out)
    d_final = tf.keras.layers.Dense(
        n_vocab+2, activation='softmax', name='d_dense_final'
    )(d_dense1)
    
    model = tf.keras.models.Model(
        inputs=[e_inp, d_inp],
        outputs=d_final,
        name='seq2seq_with_attention'
    )
    
    return model

# Build model
encoder = get_encoder(en_vocab, en_vectorizer)
model_with_attention = get_final_seq2seq_model_with_attention(
    de_vocab, encoder, de_vectorizer
)
model_with_attention.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
```

**Explanation**: The model combines the encoder with the attention decoder. Key difference: `en_states` (the encoder outputs at every timestep) is passed to the attention layer, not just the final context vector. Inside the layer, the attention context vector is concatenated with the decoder input before being passed to the GRUCell; the layer then returns the GRUCell outputs for every decoder timestep.

---

### Program 3: Model Training

```python
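import numpy as np

def train_model(model, de_vectorizer, train_df, valid_df, test_df,
                epochs=5, batch_size=128):
    # BLEUMetric, prepare_data and de_vocabulary are assumed to be defined
    # earlier (as in the Chapter 11 code)
    bleu_metric = BLEUMetric(de_vocabulary)
    data = prepare_data(train_df, valid_df, test_df)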
    
    for epoch in range(epochs):
        # Training
        train_data = data['train']
        indices = np.random.permutation(len(train_data['encoder_inputs']))
        for key in train_data:
            train_data[key] = train_data[key][indices]
        
        loss_list, acc_list, bleu_list = [], [], []
        n_batches = len(train_data['encoder_inputs']) // batch_size
        
        for i in range(n_batches):
            start, end = i * batch_size, (i + 1) * batch_size
            x = [
                train_data['encoder_inputs'][start:end],
                train_data['decoder_inputs'][start:end]
            ]
            y = de_vectorizer(train_data['decoder_labels'][start:end])
            
            model.train_on_batch(x, y)
            # Re-evaluate on the same batch to log loss, accuracy and BLEU
            loss, acc = model.evaluate(x, y, verbose=0)
            pred_y = model.predict(x, verbose=0)
            bleu = bleu_metric.calculate_bleu(y, pred_y)

            loss_list.append(loss)
            acc_list.append(acc)
            bleu_list.append(bleu)
        
        print(f"Epoch {epoch+1}: Loss={np.mean(loss_list):.4f}, "
              f"Acc={np.mean(acc_list):.4f}, BLEU={np.mean(bleu_list):.4f}")

train_model(model_with_attention, de_vectorizer, train_df, valid_df, test_df)
```

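**Explanation**: The training loop is the same as before: shuffled mini-batch training with loss, accuracy and BLEU computed per batch. The listing reports training metrics only; `valid_df` and `test_df` are prepared, but the validation and test passes are not shown here. Key result: a BLEU score of ~0.20 (a 2x improvement over the model without attention).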

---

### Program 4: Attention Visualizer Model

```python
def attention_visualizer(save_path):
    model = tf.keras.models.load_model(save_path)
    
    # Encoder setup
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')
    en_model = model.get_layer("encoder")
    fwd_state, bwd_state, en_states = en_model(e_inp)
    e_vec_out = en_model.get_layer("e_vectorizer")(e_inp)
    
    # Decoder setup
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    d_vec_out = model.get_layer('d_vectorizer')(d_inp)
    d_emb_out = model.get_layer('d_embedding')(d_vec_out)
    
    # Attention
    d_init_state = tf.keras.layers.Concatenate(axis=-1)([fwd_state, bwd_state])
    d_attn_layer = model.get_layer("d_attention")
    attn_out, attn_states = d_attn_layer(
        [en_states, d_emb_out], initial_state=d_init_state
    )
    
    # Dense layers
    d_dense1_out = model.get_layer("d_dense_1")(attn_out)
    d_final_out = model.get_layer("d_dense_final")(d_dense1_out)
    
    # Model that outputs predictions, attention weights (attn_states) and vectorized tokens
    visualizer_model = tf.keras.models.Model(
        inputs=[e_inp, d_inp],
        outputs=[d_final_out, attn_states, e_vec_out, d_vec_out]
    )
    
    return visualizer_model

visualizer_model = attention_visualizer('models/seq2seq_attention')
```

**Explanation**: The visualizer model pulls the layers out of the trained model and rebuilds the forward pass so that intermediate outputs, especially the attention weights, can be captured. It has multiple outputs: the predictions, the attention weights, and the vectorized token IDs used to label the axes.

---

### Program 5: Attention Heatmap Visualization

```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_attention(visualizer_model, en_vocab, de_vocab,
                        sample_en_text, sample_de_text, save_path):
    # Get the predictions and attention weights
    d_pred, attention_weights, e_out, d_out = visualizer_model.predict(
        [np.array([sample_en_text]), np.array([sample_de_text])]
    )
    
    # Convert the predictions to token IDs
    d_pred_ids = np.argmax(d_pred[0], axis=-1)
    
    # Build y-labels (English words)
    y_labels = []
    for e_id in e_out[0]:
        if en_vocab[e_id] == "":  # Stop at padding
            break
        y_labels.append(en_vocab[e_id])
    
    # Build x-labels (German words)
    x_labels = []
    for d_id in d_pred_ids:
        if de_vocab[d_id] == 'eos':  # Stop at EOS
            break
        x_labels.append(de_vocab[d_id])
    
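    # attention_weights[0] has shape (decoder_steps, encoder_steps): keep the
    # predicted German steps and the real English tokens (prune padding/EOS),
    # then transpose so rows are English words (y-axis) and columns German (x-axis)
    attn_filtered = attention_weights[0, :len(x_labels), :len(y_labels)].T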
    
    # Create heatmap
    fig, ax = plt.subplots(figsize=(14, 14))
    im = ax.imshow(attn_filtered)
    
    # Set labels
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)
    ax.tick_params(labelsize=20, axis='x', labelrotation=90)
    
    plt.colorbar(im)
    plt.subplots_adjust(left=0.2, bottom=0.2)
    plt.savefig(save_path)
    plt.close()

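# Vocabulary lists (index -> token) from the fitted vectorizers; the labels are
# looked up by index, so lists are needed here rather than vocabulary sizes
en_vocabulary = en_vectorizer.get_vocabulary()
de_vocabulary = de_vectorizer.get_vocabulary()

# Visualize a few examples
for i in range(10):
    en_text = test_df["EN"].iloc[i]
    # Drop the last token of the German sentence to form the decoder input
    de_text = test_df["DE"].iloc[i:i+1].str.rsplit(n=1, expand=True).iloc[:, 0]
    visualize_attention(visualizer_model, en_vocabulary, de_vocabulary,
                        en_text, de_text, f'plots/attention_{i}.png')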
```

**Explanation**: The heatmap shows the attention weights as a colored grid, where lighter means higher attention. For each predicted German word (column), the heatmap shows which English words (rows) the model attended to. A diagonal pattern indicates that word order is correlated between the two languages.

---

## Conclusion

The Bahdanau attention mechanism is a simple yet powerful idea: instead of bottlenecking on a single context vector, the decoder accesses all encoder outputs and weights them dynamically according to the current decoding step. The results are a 2x BLEU improvement, better alignment between the languages, and interpretability through visualization of the attention weights. The next chapter explores the Transformer architecture, which is purely attention-based (no recurrence) and provides an even more powerful model for NLP tasks.
