# **Transformer Encoder Architecture**
- **Positional Encoding**: Injects **information about the position of each token**, since Transformers have **no built-in notion of order**, unlike RNNs, which read tokens one at a time.
- **Multi-Head Self-Attention**: **Each token "weighs"** all the others to **understand the context**.
- **Feed-Forward Layer**: After attention, an MLP **processes each token's representation** independently.
- **Layer Normalization & Dropout**: **Stabilize training** and reduce overfitting.

In [2]:
# Positional Encoding 
import tensorflow as tf
import numpy as np

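class PositionalEncoding(tf.keras.layers.Layer):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, sequence_length, d_model, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.d_model = d_model

        # The encoding is constant, so compute it once here instead of on every call.
        positions = np.arange(sequence_length)[:, np.newaxis]  # (sequence_length, 1)
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

        pos_encoding = np.zeros((sequence_length, d_model))
        pos_encoding[:, 0::2] = np.sin(positions * div_term)  # even dimensions
        # Slice div_term so that an odd d_model does not cause a shape mismatch.
        pos_encoding[:, 1::2] = np.cos(positions * div_term[: d_model // 2])  # odd dimensions
        self.pos_encoding = tf.constant(pos_encoding, dtype=tf.float32)

    def call(self, inputs):
        # inputs: (batch, sequence_length, d_model); the encoding broadcasts over the batch axis.
        return inputs + self.pos_encoding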

# Adds information about the position of tokens without any learned parameters.
# Uses sine and cosine functions of different frequencies so that each position gets a distinct pattern.
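
To make the sinusoidal pattern concrete, here is a minimal sketch in plain Python (no TensorFlow needed) that builds the same table for a tiny sequence. The function name `sinusoidal_encoding` and the toy sizes are assumptions chosen for illustration, not part of the model above.

```python
import math

def sinusoidal_encoding(sequence_length, d_model):
    """Return a sequence_length x d_model table of sinusoidal position values."""
    table = []
    for pos in range(sequence_length):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_encoding(sequence_length=4, d_model=6)
for pos, row in enumerate(table):
    print(pos, [round(v, 3) for v in row])
```

Position 0 always maps to alternating 0 and 1 (since sin(0) = 0 and cos(0) = 1), and every value stays within [-1, 1], so the encoding does not swamp the embeddings it is added to.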

In [None]:
# Multi-Head Self-Attention Layer
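class MultiHeadSelfAttention(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.embed_dim = embed_dim
        self.projection_dim = embed_dim // num_heads

        self.query_dense = tf.keras.layers.Dense(embed_dim)
        self.key_dense = tf.keras.layers.Dense(embed_dim)
        self.value_dense = tf.keras.layers.Dense(embed_dim)
        self.combine_heads = tf.keras.layers.Dense(embed_dim)

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]

        # Query, Key and Value projections: (batch, seq_len, embed_dim)
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)

        # Split into multiple heads: (batch, seq_len, num_heads, projection_dim)
        query = tf.reshape(query, (batch_size, -1, self.num_heads, self.projection_dim))
        key = tf.reshape(key, (batch_size, -1, self.num_heads, self.projection_dim))
        value = tf.reshape(value, (batch_size, -1, self.num_heads, self.projection_dim))

        # Move the head axis ahead of the sequence axis: (batch, num_heads, seq_len, projection_dim).
        # Without this step the matmul below would compare heads with heads instead of tokens with tokens.
        query = tf.transpose(query, perm=[0, 2, 1, 3])
        key = tf.transpose(key, perm=[0, 2, 1, 3])
        value = tf.transpose(value, perm=[0, 2, 1, 3])

        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len)
        attention_scores = tf.matmul(query, key, transpose_b=True)
        attention_scores /= tf.math.sqrt(tf.cast(self.projection_dim, tf.float32))
        attention_weights = tf.nn.softmax(attention_scores, axis=-1)

        # Weighted sum of the values, then merge the heads back: (batch, seq_len, embed_dim)
        attention_output = tf.matmul(attention_weights, value)
        attention_output = tf.transpose(attention_output, perm=[0, 2, 1, 3])
        attention_output = tf.reshape(attention_output, (batch_size, -1, self.embed_dim))
        return self.combine_heads(attention_output)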
    
# ✔ Each attention head can learn to focus on a different kind of relationship between tokens.
# ✔ Softmax normalizes the attention scores so that each token's weights sum to 1.
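
The core of one attention head can be traced by hand on a toy input. The following is a minimal sketch in plain Python, assuming a single head and no learned projections (the queries, keys, and values are simply the toy embeddings themselves); the helper names `softmax` and `single_head_attention` are illustrative only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # One score per (query token, key token) pair, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
outputs, weights = single_head_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

The weight matrix has one row and one column per token (here 3 x 3), which is the shape the score tensor should have inside each head of the layer above.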

In [4]:
# Transformer Encoder Layer
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, training=False):
        attn_output = self.attention(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.norm1(inputs + attn_output)  # Residual Connection

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.norm2(out1 + ffn_output)  # Residual Connection
    
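# ✔ The encoder layer is the core building block of the Transformer: self-attention followed by a feed-forward network.
# ✔ Residual connections, dropout, and layer normalization keep training stable.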
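
The "add, then normalize" pattern used twice in the layer above can be sketched for a single token vector in plain Python. This is a simplified illustration under stated assumptions: it omits the learned scale and shift parameters of `LayerNormalization` and omits dropout, and the names `layer_norm` and `residual_block` are hypothetical.

```python
import math

def layer_norm(vector, epsilon=1e-6):
    """Normalize one vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]

def residual_block(x, sublayer):
    """Post-norm residual connection: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
# A stand-in sublayer; in the encoder this would be attention or the feed-forward network.
out = residual_block(x, lambda v: [0.5 * a for a in v])
print([round(v, 3) for v in out])
```

Because the input `x` is added back before normalization, the sublayer only has to learn a correction to its input, which is one reason deep stacks of such layers remain trainable.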

In [5]:
# Complete model with Input Layer

def build_transformer_encoder(sequence_length=100, embed_dim=128, num_heads=8, ff_dim=256):
    inputs = tf.keras.Input(shape=(sequence_length, embed_dim))
    x = PositionalEncoding(sequence_length, embed_dim)(inputs)

    transformer_block = TransformerEncoderLayer(embed_dim, num_heads, ff_dim)
    x = transformer_block(x)

    x = tf.keras.layers.Flatten()(x)  # Let's flatten the tensor for the final output
    outputs = tf.keras.layers.Dense(1, activation="linear")(x)  # For regression (change to "sigmoid" if binary)

    return tf.keras.Model(inputs, outputs)

model = build_transformer_encoder()
model.summary()
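
As a sanity check on what `model.summary()` reports, the trainable parameter count can be worked out by hand. The sketch below is plain Python and assumes the default arguments of `build_transformer_encoder` and standard `Dense` (weights plus bias) and `LayerNormalization` (scale plus shift) parameter counts; the helper names are illustrative.

```python
def dense_params(in_dim, out_dim):
    """Weights plus biases of a fully connected layer."""
    return in_dim * out_dim + out_dim

def encoder_model_params(sequence_length=100, embed_dim=128, ff_dim=256):
    """Hand-computed parameter counts for the model defined above."""
    counts = {
        # Query, key, value, and output projections.
        "attention": 4 * dense_params(embed_dim, embed_dim),
        # Two dense layers: embed_dim -> ff_dim -> embed_dim.
        "feed_forward": dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim),
        # Two layer norms, each with a scale and a shift vector.
        "layer_norms": 2 * (2 * embed_dim),
        # Final Dense(1) on the flattened (sequence_length * embed_dim) tensor.
        "output_head": dense_params(sequence_length * embed_dim, 1),
    }
    counts["total"] = sum(counts.values())
    return counts

counts = encoder_model_params()
for name, value in counts.items():
    print(f"{name}: {value}")
```

Note that `num_heads` does not appear in the calculation: splitting `embed_dim` across more heads changes how the projections are used, not how many weights they contain. The positional encoding contributes no parameters because it is fixed.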

# **Summary of Transformers**  

**Transformers** are a **successor to recurrent sequence architectures** such as **RNNs and LSTMs**, offering several advantages:  

### **Why Are They Better?**  
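- More stable gradients → Attention gives every pair of tokens a direct connection, and residual connections help gradients flow, which mitigates the **vanishing gradient** problem that affects RNNs (and, to a lesser extent, LSTMs) on long sequences.  
- Process data in parallel → Unlike RNNs, which work **sequentially**, Transformers **process all positions of a sequence at once** during training, making training faster on parallel hardware.  
- Self-Attention → Each token **receives an importance weight** relative to the others, improving **context understanding**.  
- Better at capturing long-range dependencies → Can **link distant concepts** more effectively than RNNs, which must carry information through a fixed-size hidden state.  
- Trade-off: quadratic scaling → Attention computation and memory grow as **O(n²)** in the sequence length, compared with **O(n)** for RNNs, so very long sequences become expensive.  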
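
To put a number on the O(n²) term in the list above, this small plain-Python sketch counts the attention scores a single layer computes for one sequence. The function name and the default of 8 heads are assumptions taken from the example model, not a general rule.

```python
def attention_score_entries(sequence_length, num_heads=8):
    """Number of attention scores one layer computes for a single sequence."""
    # Every head compares every token with every token.
    return num_heads * sequence_length * sequence_length

for n in (100, 200, 400):
    print(n, attention_score_entries(n))
```

Doubling the sequence length quadruples the number of scores, which is why long inputs are the main cost concern for standard self-attention.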

### **How Do They Work?**  
1. **Self-Attention:** Each word connects to all others to determine **which are most relevant** in the context.  
2. **Positional Encoding:** Adds position-related information since Transformers **lack intrinsic temporal memory**.  
3. **Multi-Head Attention:** Multiple attention heads work in parallel to **capture different aspects of meaning**.  
4. **Feed-Forward Layer:** After attention, an MLP processes and transforms the data.  
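
The four steps above correspond to the following standard formulas, written here as a compact reference. The symbol names are assumptions chosen to match the code in this notebook: $d_{\text{model}}$ is `embed_dim`, $d_k$ is `projection_dim`, $H$ is `num_heads`, and $W^O$ is the `combine_heads` projection.

```latex
% Positional encoding (fixed, not learned)
PE(pos, 2i)   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

% Scaled dot-product attention for one head
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

% Multi-head self-attention on input X
\mathrm{head}_h = \mathrm{Attention}(X W_h^{Q},\; X W_h^{K},\; X W_h^{V}), \qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}

% Position-wise feed-forward network with ReLU
\mathrm{FFN}(x) = W_2 \max(0,\; W_1 x + b_1) + b_2
```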

### **Conclusion**  
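Compared with RNNs/LSTMs, they are typically **faster to train on parallel hardware, often more accurate on tasks with long-range context, and easier to scale**, at the price of attention cost that grows quadratically with sequence length.  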
They can be used for **NLP, enhanced memory, time-series analysis, and generative AI**.  