# **Introduction to Transformers in Keras**

- **Transformers** have **revolutionized Natural Language Processing (NLP)** and are also being applied in other fields such as image processing and time series forecasting.
- The model was introduced in the paper **["Attention Is All You Need"](https://neuron-ai.at/attention-is-all-you-need/)** by Vaswani et al.
- **Difference from RNNs**: Transformers use Self-Attention instead of recurrence, so all tokens of a sequence can be processed in parallel, which makes training more efficient than with traditional sequential models.

## **Transformers Architecture**
Transformers are **divided into two main components**:

- **Encoder**
- **Decoder**

- Both **contain**:

  - **Self-Attention Mechanism**: Allows the **model to weigh words** in the **context of a sentence**.
  - **Feed-Forward Neural Networks (FFNN)**: **Transform** the data after the attention mechanism.

**Self-Attention**

Each **word** is **represented by three vectors**:
1. *Query* (Q): represents **what a word "asks" of the other words** in the sentence.
2. *Key* (K): represents **what each word "offers"** to the other words.
3. *Value* (V): represents the **actual information that each word "provides"**.

The **attention score is calculated as a dot product between Query and Key**, scaled by the square root of the key dimension and **normalized** via **Softmax**. The resulting weights are then used to average the Values.
This mechanism makes it possible to **capture dependencies** even between **distant words in the sequence**.
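
To make the computation concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function names, the toy shapes (3 tokens, 4 dimensions), and the random inputs are illustrative assumptions, not part of the Keras implementation that follows.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for single-head attention.

    q, k: arrays of shape (seq_len, d_k); v: array of shape (seq_len, d_v).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Toy example: 3 tokens with 4-dimensional Q/K/V (random, illustrative values).
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(q, k, v)
print(weights.sum(axis=-1))  # every row of the attention matrix sums to 1
print(output.shape)          # (3, 4)
```

Each row of `weights` says how much one token draws from every other token, which is why a distant token can contribute as directly as a neighboring one.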

## **Implementation in Keras**

1. **_Self-Attention Layer_**

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer

class SelfAttention(Layer):
    """Single-head scaled dot-product self-attention."""

    def __init__(self, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim  # Size of the embeddings.
        # Dense layers that project the inputs into query, key, and value matrices.
        self.query_dense = tf.keras.layers.Dense(embed_dim)
        self.key_dense = tf.keras.layers.Dense(embed_dim)
        self.value_dense = tf.keras.layers.Dense(embed_dim)
        # Softmax layer that normalizes the attention scores over the key axis.
        self.softmax = tf.keras.layers.Softmax(axis=-1)

    def call(self, inputs):
        # Compute the query, key, and value matrices from the inputs.
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)

        # Dot product of query and key, scaled by the square root of the embedding size.
        scale = tf.math.sqrt(tf.cast(self.embed_dim, query.dtype))
        attention_scores = tf.matmul(query, key, transpose_b=True) / scale
        # Softmax turns the attention scores into attention weights.
        attention_weights = self.softmax(attention_scores)
        # The output is a weighted average of the values under the attention weights.
        return tf.matmul(attention_weights, value)
```

## **Encoders in Transformers**
- An **encoder stacks multiple layers**, each combining **self-attention** with a **feed-forward network**.
- Each layer **includes residual connections** and **layer normalization to stabilize training**.
- **Positional encoding** is added to the input embeddings and is **used to maintain word order**, which self-attention on its own does not see.
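
The notes do not say which positional encoding is meant. As one possibility, the sketch below assumes the fixed sinusoidal scheme from the original paper (learned position embeddings are a common alternative); the function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, embed_dim):
    """Return a (seq_len, embed_dim) matrix of sinusoidal position encodings.

    Assumes embed_dim is even, so sine and cosine columns interleave cleanly.
    """
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    dims = np.arange(0, embed_dim, 2)[np.newaxis, :]          # (1, embed_dim / 2)
    angles = positions / np.power(10000.0, dims / embed_dim)  # (seq_len, embed_dim / 2)

    encoding = np.zeros((seq_len, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # even columns
    encoding[:, 1::2] = np.cos(angles)  # odd columns
    return encoding

pe = sinusoidal_positional_encoding(seq_len=50, embed_dim=16)
token_embeddings = np.zeros((50, 16))  # stand-in for real token embeddings
encoder_input = token_embeddings + pe  # position information is added, not concatenated
print(pe.shape)  # (50, 16)
```

Because every position gets a distinct pattern of sines and cosines, two identical tokens at different positions reach the encoder as different vectors.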

2. **_Implementing a Transformer Encoder_**

```python
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # Multi-head attention lets the model focus on different parts of the
        # input sequence simultaneously. Note that key_dim is the size of each head.
        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Feed-forward network applied to every position.
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        # Normalization layers to stabilize and accelerate training;
        # epsilon is a small value that avoids division by zero.
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Dropout layers to reduce overfitting.
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.attention(inputs, inputs)  # Multi-head self-attention.
        attn_output = self.dropout1(attn_output, training=training)  # Dropout on the attention output.
        out1 = self.layernorm1(inputs + attn_output)  # Residual connection, then normalization.

        ffn_output = self.ffn(out1)  # Pass the normalized output through the feed-forward network.
        ffn_output = self.dropout2(ffn_output, training=training)  # Dropout on the feed-forward output.
        return self.layernorm2(out1 + ffn_output)  # Second residual connection and normalization.
```

The encoder block uses multi-head attention to capture dependencies between words in the input sequence and a feed-forward network to transform the attention output. Residual connections and layer normalization help stabilize and improve model training.
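
The residual-plus-normalization step that appears twice in the encoder can be sketched without TensorFlow. This is a simplified illustration: it omits the learnable scale and shift that `LayerNormalization` has, and the helper names and random inputs are assumptions.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize each token vector (last axis) to zero mean and unit variance.

    The learnable scale and shift used by real layers are omitted here.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                # 5 tokens, 8 features each
sublayer_output = rng.normal(size=(5, 8))  # stand-in for attention or FFN output
out = add_and_norm(x, sublayer_output)
print(out.shape)  # (5, 8)
```

The residual term `x + sublayer_output` gives the signal a path around the sub-layer, and the normalization keeps every token vector on a comparable scale from one layer to the next.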

## **Decoder in Transformers**
- The **decoder is similar to the encoder**, but **includes a cross-attention mechanism to connect to the encoder output**.
- It **generates sequences based on the context** provided by the encoder.
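
When generating text, the decoder's self-attention is normally causally masked so that position *i* cannot attend to later positions. The NumPy sketch below shows the effect of such a mask on the attention weights; the helper names and the uniform scores are illustrative assumptions.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, giving disallowed positions zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform raw scores, for illustration only
weights = masked_softmax(scores, causal_mask(4))
print(weights)  # row i spreads its weight evenly over positions 0..i
```

With uniform scores, the first row puts all its weight on position 0 and the last row spreads it evenly over all four positions, so no token ever receives information from the future.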

3. **_Implementing a Transformer Decoder_**

```python
class TransformerDecoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # embed_dim: size of the embeddings; num_heads: number of attention heads;
        # ff_dim: hidden size of the feed-forward network; rate: dropout rate.
        # Multi-head self-attention layer for the decoder input.
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Multi-head cross-attention layer over the encoder output.
        self.attention2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Feed-forward network.
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        # Normalization layers, one per sub-layer.
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Dropout layers, one per sub-layer.
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, encoder_output, training=None):
        # Self-attention on the decoder input. The causal mask stops each position
        # from attending to later ones (use_causal_mask needs a recent TensorFlow/Keras release).
        attn1 = self.attention1(inputs, inputs, use_causal_mask=True)
        attn1 = self.dropout1(attn1, training=training)  # Dropout on the attention output.
        out1 = self.layernorm1(inputs + attn1)  # Residual connection, then normalization.

        # Cross-attention: queries come from the decoder, keys and values from the encoder output.
        attn2 = self.attention2(out1, encoder_output)
        attn2 = self.dropout2(attn2, training=training)  # Dropout on the cross-attention output.
        out2 = self.layernorm2(out1 + attn2)  # Residual connection around cross-attention, then normalization.

        ffn_output = self.ffn(out2)  # Pass through the feed-forward network.
        ffn_output = self.dropout3(ffn_output, training=training)  # Dropout on the feed-forward output.
        return self.layernorm3(out2 + ffn_output)  # Final residual connection and normalization.
```

- **Transformers** have largely displaced **RNNs** in many sequence tasks, in large part because of their **ability to process inputs in parallel**.
- Their **self-attention mechanism is the heart of the model**, allowing for better understanding of context.
- They are **used in NLP**, **vision**, **time-series forecasting**, and many other applications.
- Their architecture is **based on encoders and decoders**, with self-attention mechanisms and feed-forward networks.

At first glance, Transformers **look** similar to RNNs because both process text sequences, but their architecture is **fundamentally different**. Schematically, the flow is:

1. **Tokenized Input**:
- Text is converted into tokens (numbers representing words or subwords).
- Tokens are transformed into **dense vectors** with **Word Embeddings** (e.g. Word2Vec, GloVe, or directly learned by the model).

2. **Positional Encoding**:
- Since Transformers **do not process tokens step by step like RNNs**, a mechanism is needed to maintain the order of tokens.
- A **positional encoding** is added to the vectors so that the model can tell where each token sits in the sequence.

3. **Encoder** (Repeated Blocks):
- **Self-Attention**: Each word compares itself to all other words in the sentence, calculating how relevant each of them is to it (attention weights).
- **Feed-Forward Neural Network (FFNN)**: A neural network transforms the processed vector.
- **Layer Norm + Residual Connections**: Stabilize the training.

4. **Decoder** (Blocks similar to the encoder but with the addition of Cross-Attention):
- **Masked Self-Attention**: Similar to Self-Attention, but prevents looking at future tokens (avoids cheating in text generation).
- **Cross-Attention**: Allows the decoder to "look" at the encoder output.
- **FFNN + Residuals & Normalization**.

5. **Final Output**:
- The decoder produces, for each output position, a probability distribution over the words in the vocabulary.
- In the simplest (greedy) strategy, the most likely token is selected at each step to form the answer.
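
The last step of the flow above can be sketched as a softmax over the decoder's scores followed by a greedy pick. The toy vocabulary and the logit values are hypothetical, chosen only to illustrate the mechanics.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

vocab = ["<eos>", "the", "cat", "sat"]   # hypothetical toy vocabulary
logits = np.array([0.1, 2.0, 0.5, 1.2])  # hypothetical decoder scores for the next token

probs = softmax(logits)
next_id = int(np.argmax(probs))  # greedy choice: the highest-probability token
print(vocab[next_id])  # the
```

Greedy selection is only one decoding strategy; sampling from `probs` or beam search are common alternatives when more varied or higher-scoring sequences are wanted.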

---

### **Main Differences between RNN and Transformers**
| Feature | RNN (Recurrent Neural Network) | Transformer |
|---------|--------------------------------|-----------------------------------|
| **Processing** | Sequential (one token at a time) | Parallel (all tokens simultaneously) |
| **Context Handling** | Limited short-term memory, prone to "forgetting" | Global self-attention, captures long-range dependencies |
| **Architecture** | Recurrent structure, state depends on previous inputs | Attention-based, all tokens interact directly |
| **Training Speed** | Slow, difficult to parallelize effectively | Fast, highly parallelizable |
| **Long-Range Dependencies** | Struggles to capture information from distant tokens | Excels at capturing long-range relationships |
| **Gradient Flow** | Prone to vanishing/exploding gradients in long sequences | More stable gradient flow, helped by residual connections and direct attention paths between tokens |
| **Computational Complexity** | Linear with sequence length | Quadratic with sequence length (in self-attention) |
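
The last row of the table can be made concrete with a small count, assuming one recurrent step per token for an RNN and one score per query-key pair for self-attention. The function names are illustrative.

```python
def rnn_step_count(seq_len):
    """Number of sequential recurrent steps needed to read the sequence."""
    return seq_len

def attention_score_count(seq_len):
    """Number of query-key scores in one self-attention matrix."""
    return seq_len * seq_len

for n in (128, 256, 512):
    print(n, rnn_step_count(n), attention_score_count(n))
# 128 128 16384
# 256 256 65536
# 512 512 262144
```

Doubling the sequence length doubles the RNN's sequential steps but quadruples the attention scores. The attention scores can be computed in parallel, while the RNN steps cannot.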



One idea is to **integrate EWC (Elastic Weight Consolidation)** with **Self-Attention** to create a **mechanism that maintains memory over time**, making the **model less likely to forget previously learned information**.

1. **Basic Concept**
- **Self-Attention** allows the **model to weigh words** based on the **context of a sentence**.
- **EWC protects the critical weights** for previous tasks, **using the Fisher information matrix to estimate the importance of each weight**.
- If we **combine EWC with Self-Attention**, we can **stabilize the connections between keywords**, making the **model "remember" important relationships between words over time**.

If **Self-Attention distributes weights dynamically** and **EWC stabilizes some of these weights**, how can we balance **learning new information** without limiting the model's adaptability too much?
How much should we **protect memory** and how much **should we allow the model to adapt**?
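
In EWC this trade-off is usually controlled by a single strength coefficient on a quadratic penalty that is added to the loss of the new task. The sketch below shows that penalty with a diagonal Fisher estimate. The parameter values are hypothetical, and treating them as attention projection weights is an assumption made for illustration.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)

old_params = np.array([1.0, -2.0, 0.5])  # weights after the previous task
fisher = np.array([10.0, 0.1, 0.0])      # hypothetical diagonal Fisher estimates
params = np.array([1.5, -1.0, 3.0])      # candidate weights for the new task

for lam in (0.0, 1.0, 100.0):
    # The total loss would be: new_task_loss + ewc_penalty(...)
    print(lam, ewc_penalty(params, old_params, fisher, lam))
```

With `lam = 0` the model adapts freely and may forget. A large `lam` pins down weights that have high Fisher values, while weights with near-zero Fisher values, like the third one here, stay free to change.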