<a href="https://colab.research.google.com/github/DaraRahma536/TensorFlow-in-Action/blob/main/Chapter_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chapter 5: State-of-the-Art in Deep Learning: Transformers**

# **1. Representing Text as Numbers**
---
### **A. The Basic Problem**
Deep learning models operate on numerical data, but text is categorical. The solution is to convert words into vectors.

### **B. Conversion Process**
* Tokenization: split a sentence into words or other units (tokens)
* Vocabulary building: create a dictionary that maps each word to a unique ID

In [None]:
I → 1, went → 2, to → 3, the → 4, beach → 5

* Padding & truncating: make all sentences the same length
  - Padding: append the `<PAD>` token (ID 0) to short sentences
  - Truncating: cut long sentences down to the maximum length
* One-hot encoding: each ID is converted into a binary vector with a single 1 at the index equal to the ID


In [None]:
ID 1 → [0, 1, 0, 0, 0, ...]
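
The conversion steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: it assumes whitespace tokenization, and the helper names `build_vocab`, `encode`, and `one_hot` are hypothetical names chosen for this example.

```python
PAD_ID = 0  # ID reserved for the <PAD> token, as described above

def build_vocab(sentences):
    """Assign each distinct word an ID starting at 1 (0 is kept for <PAD>)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():  # naive whitespace tokenization
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(sentence, vocab, max_len):
    """Map words to IDs, then truncate or pad to exactly max_len."""
    ids = [vocab[word] for word in sentence.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

def one_hot(ids, vocab_size):
    """Turn each ID into a binary vector of length vocab_size."""
    return [[1 if i == token_id else 0 for i in range(vocab_size)]
            for token_id in ids]

sentences = ["I went to the beach", "I went"]
vocab = build_vocab(sentences)

padded = encode("I went", vocab, max_len=5)                   # short sentence gets padded
truncated = encode("I went to the beach", vocab, max_len=3)   # long sentence gets cut
vectors = one_hot(padded, vocab_size=len(vocab) + 1)          # +1 for <PAD>

print(vocab)
print(padded, truncated)
print(vectors[0])
```

Note that the one-hot vectors have length `len(vocab) + 1`, because ID 0 is also a valid position.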

### **C. Weaknesses of One-Hot Encoding**
* Very high dimensionality (each vector is as long as the vocabulary size)
* No semantic relationship between words: every pair of distinct words is equally far apart
* Solution: word embeddings (such as Word2Vec and GloVe), which map words to dense vectors
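
To make the contrast with one-hot vectors concrete, here is a small NumPy sketch of an embedding lookup. The matrix `embedding_matrix` is filled with random numbers purely for illustration (in a real model the values are learned or loaded from pretrained vectors), and the sizes `vocab_size = 6` and `d_embed = 4` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 6   # 5 words + the <PAD> token (illustrative)
d_embed = 4      # embedding dimensionality (illustrative)

# One dense row per token ID; random here, learned in a real model.
embedding_matrix = rng.normal(size=(vocab_size, d_embed))

token_ids = np.array([1, 2, 3, 4, 5])  # IDs for "I went to the beach"

# Embedding lookup: select the rows that belong to the IDs.
dense_vectors = embedding_matrix[token_ids]

# The same result via one-hot vectors: a one-hot row times the matrix
# picks out exactly one row, so the lookup is the cheaper equivalent.
one_hot = np.eye(vocab_size)[token_ids]
dense_via_one_hot = one_hot @ embedding_matrix

print(dense_vectors.shape)  # (5, 4)
```

Each word is now described by 4 numbers instead of a vector as long as the vocabulary, and similar words can end up with similar vectors once the matrix is trained.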

# **2. Transformer Architecture**
---
### **A. Encoder-Decoder Architecture**
The Transformer follows an encoder-decoder pattern:
* Encoder: maps the input (e.g., an English sentence) to a latent representation
* Decoder: uses the latent representation to generate the output (e.g., the French translation)

**Analogy**: a human translator:
* Listens to the English sentence (encoder)
* Understands its meaning (latent representation)
* Renders it in French (decoder)

### **B. Encoder Layer Components**
Each encoder layer consists of:
* Self-attention sublayer
* Fully connected sublayer

**Self-Attention Layer**
Purpose: lets the model "look at" every word in the sentence at once while processing a single word.

**Comparison with RNNs:**
* RNN: processes the sentence word by word and can forget early words
* Self-attention: accesses all words in parallel

Self-attention computation:

In [None]:
Input: X (n_words × d_model)
Q = X * Wq  (Query)
K = X * Wk  (Key)
V = X * Wv  (Value)

Attention = softmax((Q * K^T) / √d_k) * V

**Three Key Components:**
* **Query**: the word currently being processed
* **Key**: the candidate words that may be attended to
* **Value**: the representations that are combined in a weighted sum
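
The attention formula can be traced step by step with NumPy. This is a minimal sketch for a single sentence with no batching; the weight matrices are random stand-ins for learned parameters, and the sizes (`n_words = 5`, `d_model = 8`, `d_k = 4`) are assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sentence.

    X: (n_words, d_model); Wq, Wk, Wv: (d_model, d_k).
    Returns the output (n_words, d_k) and the weights (n_words, n_words).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
n_words, d_model, d_k = 5, 8, 4          # illustrative sizes
X = rng.normal(size=(n_words, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, Wq, Wk, Wv)
print(output.shape, weights.shape)       # (5, 4) (5, 5)
```

Row `i` of `weights` says how much word `i` (the query) attends to each word in the sentence (the keys), and row `i` of `output` is the corresponding weighted sum of the values.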

### **C. Multi-Head Attention**
* Instead of a single attention head, use several heads (typically 8)
* Each head can learn a different pattern
* The outputs of all heads are concatenated

In [None]:
d_head = d_model / n_heads
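
The sketch below shows, under simplifying assumptions, how the head outputs fit back together: each head projects to `d_head = d_model / n_heads` dimensions, so concatenating `n_heads` outputs restores a width of `d_model`. The sizes and the random weights are illustrative only, and `attention_head` is a hypothetical helper for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 16, 4   # illustrative sizes
d_head = d_model // n_heads            # 16 / 4 = 4

X = rng.normal(size=(n_words, d_model))

head_outputs = []
for _ in range(n_heads):
    # Each head has its own projection matrices (random stand-ins here).
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    head_outputs.append(attention_head(X, Wq, Wk, Wv))

# Concatenate along the feature axis: n_heads * d_head == d_model.
multi_head_output = np.concatenate(head_outputs, axis=-1)
print(multi_head_output.shape)  # (5, 16)
```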

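### **D. Masked Self-Attention (Decoder)**
* Prevents the decoder from "peeking" at future words
* Produces a lower-triangular attention pattern by adding a large negative value to the scores of future positions, so their softmax weights become practically zero
* Essential for correct training, because future words are not available when the model generates text at inference time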
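
The effect of the mask can be checked on a toy example. This NumPy sketch assumes the convention that the mask holds 1 at positions to hide and 0 elsewhere, and it uses all-zero dummy scores so that the result is easy to read; `n_steps = 4` is an arbitrary choice.

```python
import numpy as np

n_steps = 4  # illustrative sequence length

# 1 above the diagonal (future positions), 0 on and below it.
mask = 1 - np.tril(np.ones((n_steps, n_steps)))

# Dummy scores: all zeros, so unmasked positions would get equal weight.
scores = np.zeros((n_steps, n_steps))
masked_scores = scores + mask * -1e9   # future positions get a huge negative score

# Softmax over each row.
exp = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

The resulting weight matrix is lower-triangular: position `i` spreads its attention evenly over positions `0..i` and gives zero weight to everything after it.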

### **E. Fully Connected Sublayer**
Two simple dense layers: the first uses ReLU, the second has no activation

In [None]:
h1 = ReLU(X * W1 + b1)
h2 = h1 * W2 + b2  (no activation)
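
The two equations translate directly into NumPy. This is a sketch with small, arbitrary sizes and random weights; `fully_connected_sublayer` is a hypothetical name used only for this example.

```python
import numpy as np

def fully_connected_sublayer(X, W1, b1, W2, b2):
    """Two dense layers: ReLU on the first, no activation on the second."""
    h1 = np.maximum(0, X @ W1 + b1)  # ReLU
    h2 = h1 @ W2 + b2                # linear output
    return h2

rng = np.random.default_rng(0)
n_words, d_model, d_hidden = 5, 8, 32   # illustrative sizes

X = rng.normal(size=(n_words, d_model))
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

h2 = fully_connected_sublayer(X, W1, b1, W2, b2)
print(h2.shape)  # (5, 8)
```

The same weights are applied to every word position independently, and the output has the same shape as the input.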

# **3. Implementation with TensorFlow/Keras**
---
### **A. SelfAttentionLayer (Custom Layer)**

In [None]:
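import tensorflow as tf
from tensorflow.keras import layers

class SelfAttentionLayer(layers.Layer):
    def __init__(self, d):
        super().__init__()
        self.d = d  # Output dimensionality of this attention head

    def build(self, input_shape):
        # Trainable projection matrices for query, key and value
        self.Wq = self.add_weight(...)
        self.Wk = self.add_weight(...)
        self.Wv = self.add_weight(...)

    def call(self, q_x, k_x, v_x, mask=None):
        q = tf.matmul(q_x, self.Wq)
        k = tf.matmul(k_x, self.Wk)
        v = tf.matmul(v_x, self.Wv)

        # Scaled dot-product scores: (Q K^T) / sqrt(d_k)
        scores = tf.matmul(q, k, transpose_b=True) / (self.d ** 0.5)

        # Masking for the decoder: mask is 1 at positions to hide
        if mask is not None:
            scores += mask * -1e9

        attention = tf.nn.softmax(scores)
        output = tf.matmul(attention, v)
        return output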

### **B. EncoderLayer**

In [None]:
class EncoderLayer(layers.Layer):
    def __init__(self, d, n_heads):
        super().__init__()
        self.attention_heads = [SelfAttentionLayer(d//n_heads) for _ in range(n_heads)]
        self.fc_layer = FCLayer(2048, d)

    def call(self, x):
        # Multi-head attention
        head_outputs = [head(x, x, x) for head in self.attention_heads]
        concat = tf.concat(head_outputs, axis=-1)
        # Fully connected
        output = self.fc_layer(concat)
        return output

### **C. DecoderLayer**

In [None]:
class DecoderLayer(layers.Layer):
    def __init__(self, d, n_heads):
        super().__init__()
        self.masked_attention = [SelfAttentionLayer(d//n_heads) for _ in range(n_heads)]
        self.encoder_decoder_attention = [SelfAttentionLayer(d//n_heads) for _ in range(n_heads)]
        self.fc_layer = FCLayer(2048, d)

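    def call(self, decoder_input, encoder_output, mask):
        # Masked self-attention: every head receives the look-ahead mask
        masked_out = tf.concat(
            [head(decoder_input, decoder_input, decoder_input, mask) for head in self.masked_attention],
            axis=-1)
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        attn_out = tf.concat(
            [head(masked_out, encoder_output, encoder_output) for head in self.encoder_decoder_attention],
            axis=-1)
        # Fully connected
        output = self.fc_layer(attn_out)
        return output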

### **D. Model Transformer Lengkap**

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

# Hyperparameters
n_steps = 25        # Max sentence length
n_en_vocab = 300    # English vocabulary size
n_de_vocab = 400    # French vocabulary size
d_model = 512
n_heads = 8

# Look-ahead mask: 1 above the diagonal (future positions to hide), 0 elsewhere
mask = 1 - tf.linalg.band_part(tf.ones((n_steps, n_steps)), -1, 0)

# Encoder
encoder_input = layers.Input(shape=(n_steps,))
encoder_emb = layers.Embedding(n_en_vocab, d_model)(encoder_input)
encoder_out = EncoderLayer(d_model, n_heads)(encoder_emb)

# Decoder
decoder_input = layers.Input(shape=(n_steps,))
decoder_emb = layers.Embedding(n_de_vocab, d_model)(decoder_input)
decoder_out = DecoderLayer(d_model, n_heads)(decoder_emb, encoder_out, mask)
decoder_pred = layers.Dense(n_de_vocab, activation='softmax')(decoder_out)

# Model
transformer = models.Model(
    inputs=[encoder_input, decoder_input],
    outputs=decoder_pred,
    name='MiniTransformer'
)
transformer.compile(loss='categorical_crossentropy', optimizer='adam')

# **4. Advantages of Transformers**
---
### **A. Compared with RNN/LSTM**
* Parallel processing: self-attention processes all words at once
* Long-term dependencies: any two words are connected directly through attention, so information does not have to pass through many recurrent steps, where RNNs are prone to vanishing gradients
* Scalability: easier to scale to large datasets

### **B. Applications**
* Machine Translation
* Text Summarization
* Question Answering
* Text Generation
* Has also been extended to computer vision (Vision Transformers)