# Transformers

---
## Encoder-Decoder Architecture

The encoder–decoder framework is a foundational approach designed to convert one sequence into another. It is often used in sequence-to-sequence learning and has been applied to a wide range of tasks, including language translation, text summarization and speech recognition.

The model is built to handle pairs of sequences. The input sequence (e.g., a sentence in English) is transformed into an output sequence (e.g., the same sentence translated into French).

The architecture divides the overall task into two main components:
 - **Encoder:** Processes the entire input sequence and compresses it into a compact representation, often called a context vector.
 - **Decoder:** Uses this context vector to generate the output sequence, one token at a time.

![](https://miro.medium.com/v2/resize:fit:1400/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg)

### How It Works

1. **Encoding:**

The encoder reads the input sequence, processing it token by token. At each step, it updates its internal state to capture the context and semantics of the input. The final hidden state of the encoder is then treated as a distilled summary of the entire input sequence. This summary is expected to capture all relevant information needed for generating the output.

2. **Decoding:**

Once the encoder has produced the context vector, the decoder begins generating the output sequence. Starting with an initial state (often derived from the context vector), the decoder predicts the first output token. It then uses that token, along with its current state and the context vector, to predict the next token, and so on. This process continues until a special end-of-sequence token is generated, signaling that the output is complete.
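
The two phases above can be sketched in a few lines of NumPy. This is a toy illustration with random, untrained weights rather than a working translator; the names `encode`, `decode`, `SOS`, `EOS`, and the `max_len` cap are assumptions introduced here, and the decoder reuses the encoder's weights only to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 8, 4          # hypothetical toy sizes
SOS, EOS = 0, 1               # assumed start/end-of-sequence token ids

# Random (untrained) parameters, for illustration only
E = rng.normal(size=(VOCAB, HIDDEN))       # token embeddings
W_x = rng.normal(size=(HIDDEN, HIDDEN))    # input-to-hidden weights
W_h = rng.normal(size=(HIDDEN, HIDDEN))    # hidden-to-hidden weights
W_out = rng.normal(size=(HIDDEN, VOCAB))   # hidden-to-vocabulary weights

def encode(tokens):
    """Read the input token by token and return the final hidden state."""
    h = np.zeros(HIDDEN)
    for tok in tokens:
        h = np.tanh(E[tok] @ W_x + h @ W_h)
    return h  # plays the role of the context vector

def decode(context, max_len=10):
    """Greedy decoding: feed each predicted token back in until EOS."""
    state, tok, output = context, SOS, []
    for _ in range(max_len):
        # Real models use separate decoder weights; shared here for brevity
        state = np.tanh(E[tok] @ W_x + state @ W_h)
        tok = int(np.argmax(state @ W_out))
        if tok == EOS:
            break
        output.append(tok)
    return output

context = encode([2, 3, 4])
generated = decode(context)
```

The `max_len` cap is a practical safeguard: an untrained (or poorly trained) decoder may never emit the end-of-sequence token.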


### Strengths And Limitations

1. **Strengths:**
    - **Modularity:** By separating the encoding and decoding processes, the model can be more flexible and easier to adapt to different tasks.
    - **Applicability:** The architecture is versatile and has been successfully applied to numerous tasks involving sequence transformations.
    
2. **Limitations:**
    - **Information Bottleneck:** Compressing an entire input sequence into a single fixed-length context vector can result in the loss of fine-grained details, particularly for long or complex inputs.
    - **Sequential Dependency:** Traditional implementations often rely on sequential processing (using RNNs, for example), which prevents parallelization across time steps and can make it challenging to capture long-range dependencies in the input.
    

The encoder–decoder model set the stage for subsequent innovations in sequence modeling. While early implementations often used Recurrent Neural Networks (RNNs) for both the encoder and decoder, the fundamental idea of transforming one sequence into another has inspired a range of advanced architectures. These include models that incorporate attention mechanisms to alleviate the information bottleneck and, eventually, the development of transformer architectures that further improve on both performance and scalability.

---
## Encoder–Decoder RNNs

The encoder–decoder model with RNNs is a specific implementation of the general encoder–decoder framework, where both the encoder and decoder are built using recurrent neural networks. This configuration leverages the sequential processing capabilities of RNNs to capture the temporal dynamics inherent in language, speech, and other sequential data.

In traditional sequence-to-sequence tasks, the model must transform an input sequence into an output sequence, such as converting a sentence in one language to another. RNNs naturally handle sequential data by updating their hidden state with each new token. However, a single RNN that emits one output per input step cannot easily map an input to an output of a different length, and it can struggle with long-range dependencies.

The encoder–decoder structure with RNNs addresses the length mismatch by using one RNN to read the input and a second RNN to generate the output:

**Encoder RNN:**
- **Function:** Processes the input sequence token by token.
- **Mechanism:** At each time step, the encoder updates its hidden state using the current token and the previous hidden state. This recursive process allows the network to build an internal representation of the entire sequence.
- **Output:** The final hidden state acts as a compressed summary (context vector) of the entire input sequence.
- **Benefit:** Captures sequential patterns and dependencies in the input, albeit in a fixed-length vector.


**Decoder RNN:**
- **Function:** Generates the output sequence based on the encoded representation.
- **Mechanism:** Starting from the context vector, the decoder predicts the output token at each time step, using its previous outputs and hidden state to generate the next token.
- **Output:** A sequence of tokens that represents the target sequence, such as a translated sentence.
- **Benefit:** Provides a structured way to generate sequences while maintaining context across time steps.

### Mathematical Formulation

#### Encoder

Given an input sequence $ X = \{x_1, x_2, \dots, x_T\} $, the encoder processes each token step-by-step using a recurrent formula:
$$
h_t = f(W_{xh} \, x_t + W_{hh} \, h_{t-1} + b_h)
$$
- $ h_t $: Hidden state at time $ t $
- $ W_{xh} $ and $ W_{hh} $: Weight matrices
- $ b_h $: Bias term
- $ f $: Activation function (e.g., $\tanh$, ReLU)

The final hidden state $ h_T $ serves as the **context vector**:
$$
c = h_T
$$
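
The encoder recurrence above can be traced in NumPy as follows. This is a minimal sketch with random weights and $f = \tanh$; the sizes and the variable names (`input_dim`, `hidden_dim`, `hidden_states`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 5, 4   # hypothetical sizes

W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

X = rng.normal(size=(T, input_dim))  # x_1 ... x_T as embedded vectors

h = np.zeros(hidden_dim)             # h_0
hidden_states = []
for x_t in X:
    # h_t = f(W_xh x_t + W_hh h_{t-1} + b_h), with f = tanh
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

c = hidden_states[-1]                # context vector c = h_T
```

Whatever the value of `T`, the context vector `c` always has `hidden_dim` entries, which is the fixed-length bottleneck discussed in this section.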

#### Decoder

The decoder RNN generates the output sequence $ Y = \{y_1, y_2, \dots, y_{T'}\} $ conditioned on the context vector $ c $. Its recurrence is:
$$
s_t = f(W_{ys} \, y_{t-1} + W_{cs} \, c + W_{ss} \, s_{t-1} + b_s)
$$

- $ s_t $: Decoder hidden state at time $ t $
- $ y_{t-1} $: Previously generated output (usually embedded)
- $ W_{ys} $, $ W_{cs} $, $ W_{ss} $: Weight matrices
- $ b_s $: Bias term

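The decoder state is mapped to a probability distribution over the output vocabulary, from which the token $ y_t $ is chosen (for example, by taking the most probable token):
$$
P(y_t \mid y_{<t}, c) = \text{softmax}(W_{out} \, s_t + b_{out})
$$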
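
A single decoder step following the recurrence and output layer above might look like this in NumPy. It is a sketch with random weights; `decoder_step`, the sizes, and the choice to initialise the decoder state from the context vector are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden_dim, vocab = 3, 5, 7   # hypothetical sizes

W_ys = rng.normal(scale=0.5, size=(hidden_dim, emb_dim))
W_cs = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_ss = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_s = np.zeros(hidden_dim)
W_out = rng.normal(scale=0.5, size=(vocab, hidden_dim))
b_out = np.zeros(vocab)

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev, c, s_prev):
    """One step: update the state, then score every vocabulary entry."""
    s_t = np.tanh(W_ys @ y_prev + W_cs @ c + W_ss @ s_prev + b_s)
    probs = softmax(W_out @ s_t + b_out)
    return s_t, probs

c = rng.normal(size=hidden_dim)        # stand-in for the encoder's h_T
y_prev = rng.normal(size=emb_dim)      # embedded previous output token
s_t, probs = decoder_step(y_prev, c, s_prev=c)
next_token = int(np.argmax(probs))     # greedy choice of the next token
```

The softmax yields a probability for every vocabulary entry; taking the argmax is the simplest (greedy) way to turn that distribution into a token.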

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTBvrDTed79lFHC8GMLQ757v11n_Y1nV0V_1Q&s)

### Limitations With This Approach

While encoder–decoder RNNs were a significant breakthrough, they come with inherent limitations:

- **Fixed-Length Context Vector:**
Compressing an entire input sequence into a single vector may result in loss of important details, particularly for long or complex sequences. This fixed-size bottleneck can limit performance when the input contains intricate or diverse information.
- **Sequential Bottleneck in Decoding:**
Because the decoder relies on sequential prediction, the generation process can be slow and may struggle with maintaining long-term dependencies over extended outputs.

These challenges led to the development of advanced techniques like attention mechanisms, which allow the decoder to dynamically refer back to different parts of the input sequence during generation, thereby alleviating the information bottleneck of a fixed-length context vector.

### TensorFlow Code Example

#### Import Libraries

In [1]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.models import Model
import numpy as np


#### Define The Encoder

In [2]:
class Encoder(Model):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super(Encoder, self).__init__()
        self.enc_units = enc_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        # return_sequences=True to get outputs at all time steps
        # return_state=True to get the final hidden state
        self.gru = GRU(enc_units, return_sequences=True, return_state=True)
    
    def call(self, x, hidden):
        # x: (batch_size, sequence_length)
        x = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        return output, state
    
    def initialize_hidden_state(self, batch_size):
        return tf.zeros((batch_size, self.enc_units))

#### Define The Decoder

In [3]:
class Decoder(Model):
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super(Decoder, self).__init__()
        self.dec_units = dec_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(dec_units, return_sequences=True, return_state=True)
        self.fc = Dense(vocab_size)
    
    def call(self, x, hidden):
        # x: (batch_size, 1) as we process one token at a time
        x = self.embedding(x)  # (batch_size, 1, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        # reshape output from (batch_size, 1, dec_units) to (batch_size, dec_units)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)  # (batch_size, vocab_size)
        return x, state

#### Set Up Training Parameters And Data

In [4]:
vocab_inp_size = 10   # Input vocabulary size
vocab_tar_size = 10   # Target vocabulary size
embedding_dim = 16    # Embedding dimension
units = 16            # Number of GRU units
BATCH_SIZE = 1        # Batch size (for simplicity)

# Create sample input and target sequences (batch_size=1)
input_seq = tf.constant([[1, 2, 3, 4, 5]], dtype=tf.int32)   # shape: (1, sequence_length)
target_seq = tf.constant([[1, 2, 3, 4, 6]], dtype=tf.int32)

# Initialize encoder and decoder models
encoder = Encoder(vocab_inp_size, embedding_dim, units)
decoder = Decoder(vocab_tar_size, embedding_dim, units)

# Define the optimizer and the loss function
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


#### Define Training Step

In [5]:
@tf.function
def train_step(input_seq, target_seq, teacher_forcing_ratio=0.5):
    loss = 0

    with tf.GradientTape() as tape:
        # Initialize encoder hidden state
        enc_hidden = encoder.initialize_hidden_state(BATCH_SIZE)
        # Encode the input sequence
        enc_output, enc_hidden = encoder(input_seq, enc_hidden)
        
        # Set initial state for decoder as the encoder's final hidden state
        dec_hidden = enc_hidden
        # Start-of-sequence token (here assumed to be index 0)
        dec_input = tf.expand_dims([0] * BATCH_SIZE, 1)  # shape: (batch_size, 1)
        
        # Iterate over each token in the target sequence
        for t in range(target_seq.shape[1]):
            # Pass the current token and state through the decoder
            predictions, dec_hidden = decoder(dec_input, dec_hidden)
            # Compute loss comparing predicted token with actual target token
            loss += loss_object(target_seq[:, t], predictions)
            
            # Decide whether to use teacher forcing. The draw uses tf.random so it is
            # re-sampled on every call; np.random inside @tf.function runs only once, at trace time.
            use_teacher_forcing = tf.random.uniform(()) < teacher_forcing_ratio
            # Teacher forcing feeds the target token; otherwise feed the decoder's own prediction
            predicted_ids = tf.argmax(predictions, axis=1, output_type=tf.int32)
            next_ids = tf.where(use_teacher_forcing, target_seq[:, t], predicted_ids)
            dec_input = tf.expand_dims(next_ids, 1)
    
    # Compute gradients and update model parameters
    batch_loss = loss / int(target_seq.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    
    return batch_loss

#### Run Training

In [6]:
loss_val = train_step(input_seq, target_seq)
print("Loss:", loss_val.numpy())

Loss: 2.3087447


### PyTorch Code Example

#### Import Libraries

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

#### Define The Encoder

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    
    def forward(self, input, hidden):
        # Embed the input token and reshape for the GRU
        embedded = self.embedding(input).view(1, 1, self.hidden_size)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

#### Define The Decoder

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, input, hidden):
        # Embed the input token and reshape for the GRU
        embedded = self.embedding(input).view(1, 1, self.hidden_size)
        output, hidden = self.gru(embedded, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

#### Training Loop For One Pair Of Sequences

In [None]:
if __name__ == "__main__":
    # Hyperparameters
    input_size = 10    # vocabulary size for input
    output_size = 10   # vocabulary size for output
    hidden_size = 16
    teacher_forcing_ratio = 0.5

    # Example input and target sequences (represented as indices)
    input_seq = torch.tensor([1, 2, 3, 4, 5], dtype=torch.long)   # Input sequence of length 5
    target_seq = torch.tensor([1, 2, 3, 4, 6], dtype=torch.long)  # Target sequence of length 5

    encoder = EncoderRNN(input_size, hidden_size)
    decoder = DecoderRNN(hidden_size, output_size)
    
    # Optimizers and loss function for training
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)
    criterion = nn.NLLLoss()

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Encode the input sequence
    encoder_hidden = encoder.initHidden()
    for token in input_seq:
        encoder_output, encoder_hidden = encoder(token.unsqueeze(0), encoder_hidden)
    
    # Initialize decoder: use a start-of-sequence token (e.g., index 0)
    decoder_input = torch.tensor([0], dtype=torch.long)
    decoder_hidden = encoder_hidden

    loss = 0
    # Decode the sequence using teacher forcing
    for di, target_token in enumerate(target_seq):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        loss += criterion(decoder_output, target_token.unsqueeze(0))
        
        # Decide if we are going to use teacher forcing or not
        use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
        
        if use_teacher_forcing:
            # Teacher forcing: next input is current target token
            decoder_input = target_token.unsqueeze(0)
        else:
            # Without teacher forcing: next input is decoder's own prediction
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach().unsqueeze(0)

    # Backpropagate the error and update weights
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    print(f"Loss: {loss.item()}")

## RNNs with Attention

To address the bottleneck problem in encoder–decoder architectures, the attention mechanism was introduced. Instead of relying solely on the final encoder state, attention allows the decoder to dynamically focus on different parts of the input sequence at each step of the output generation. This results in a more flexible model that can better capture long-range dependencies.

### How Attention Works

#### Alignment Scores

At each decoding step $ t $, the model calculates an alignment score between the decoder’s previous hidden state $ s_{t-1} $ and each encoder hidden state $ h_j $. 

For instance, in Bahdanau attention:
$$
e_{tj} = \text{score}(s_{t-1}, h_j) = v_a^\top \tanh(W_a s_{t-1} + U_a h_j)
$$

- $ v_a $, $ W_a $, $ U_a $: Learnable parameters

#### Attention Weights and Context Vector

The alignment scores are normalized using the softmax function to yield attention weights:
$$
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}
$$

These weights are then used to compute a dynamic context vector for each decoding step:
$$
c_t = \sum_{j=1}^{T} \alpha_{tj} h_j
$$

This context vector provides targeted information from the input sequence, allowing the decoder to focus on the most relevant parts.
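
The three formulas above (alignment scores, attention weights, context vector) can be checked numerically with a small NumPy sketch. The weights are random, and the sizes and the intermediate dimension `attn_dim` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 6, 4, 4, 8   # hypothetical sizes

H = rng.normal(size=(T, enc_dim))       # encoder states h_1 ... h_T
s_prev = rng.normal(size=dec_dim)       # decoder state s_{t-1}

W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

# e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j), computed for all j at once
scores = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a      # shape (T,)

# alpha_tj: softmax over the input positions j
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# c_t = sum_j alpha_tj h_j
context = weights @ H                                  # shape (enc_dim,)
```

Because the weights are non-negative and sum to one, `context` is a convex combination of the encoder states, so its size does not depend on the input length `T`.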

#### Updated Decoder Computation

The decoder now incorporates the dynamic context vector $ c_t $ into its recurrence:
$$
s_t = f(W_{ys} \, y_{t-1} + W_{cs} \, c_t + W_{ss} \, s_{t-1} + b_s)
$$
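
To show how the context vector is recomputed at every step, the sketch below runs three decoder steps with attention in NumPy. All weights are random, the three "previous tokens" are stand-in embedding vectors, and the helper name `attend` is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, emb = 5, 4, 3                      # hypothetical sizes

H = rng.normal(size=(T, d))              # encoder states h_1 ... h_T
W_a, U_a = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v_a = rng.normal(size=d)
W_ys = rng.normal(scale=0.5, size=(d, emb))
W_cs = rng.normal(scale=0.5, size=(d, d))
W_ss = rng.normal(scale=0.5, size=(d, d))
b_s = np.zeros(d)

def attend(s_prev):
    """Return the context vector c_t for decoder state s_{t-1}."""
    scores = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H

s = H[-1]                                # s_0 initialised from h_T
contexts = []
for y_prev in rng.normal(size=(3, emb)): # three stand-in embedded tokens
    c_t = attend(s)                      # context depends on s_{t-1}
    s = np.tanh(W_ys @ y_prev + W_cs @ c_t + W_ss @ s + b_s)
    contexts.append(c_t)
```

Unlike the fixed context vector $ c $ of the plain encoder–decoder, `c_t` is derived from the current decoder state, so each step can draw on a different mix of encoder states.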

### TensorFlow Example

#### Import Libraries

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense, Concatenate
from tensorflow.keras.models import Model
import numpy as np

#### Define The Encoder

In [3]:
class Encoder(Model):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super(Encoder, self).__init__()
        self.enc_units = enc_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(enc_units, return_sequences=True, return_state=True)
    
    def call(self, x, hidden):
        x = self.embedding(x)  # (batch, seq_len, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        return output, state
    
    def initialize_hidden_state(self, batch_size):
        return tf.zeros((batch_size, self.enc_units))

#### Bahdanau Attention Definition

In [4]:
class BahdanauAttention(Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)
    
    def call(self, query, values):
        # query: previous decoder hidden state s_{t-1} (batch, hidden)
        # values: encoder outputs (batch, seq_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)  # (batch, 1, hidden)
        # Score: (batch, seq_len, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        # Attention weights: (batch, seq_len, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # Context vector: weighted sum of encoder outputs (batch, hidden)
        context_vector = attention_weights * values  # (batch, seq_len, hidden)
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

#### Define The Decoder

In [5]:
class Decoder(Model):
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super(Decoder, self).__init__()
        self.dec_units = dec_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(dec_units, return_sequences=True, return_state=True)
        self.fc = Dense(vocab_size)
        
        self.attention = BahdanauAttention(dec_units)
    
    def call(self, x, hidden, enc_output):
        # x: (batch, 1) -> current input token
        x = self.embedding(x)  # (batch, 1, embedding_dim)
        # Calculate attention
        context_vector, attention_weights = self.attention(hidden, enc_output)
        context_vector = tf.expand_dims(context_vector, 1)  # (batch, 1, dec_units)
        # Concatenate context vector with embedding
        x = tf.concat([context_vector, x], axis=-1)  # (batch, 1, dec_units+embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        output = tf.reshape(output, (-1, output.shape[2]))  # (batch, dec_units)
        x = self.fc(output)  # (batch, vocab_size)
        return x, state, attention_weights

#### Training Example

In [6]:
vocab_inp_size = 10
vocab_tar_size = 10
embedding_dim = 16
units = 16
BATCH_SIZE = 1

# Sample input and target sequences (batch size = 1)
input_seq = tf.constant([[1, 2, 3, 4, 5]], dtype=tf.int32)   # (batch, seq_len)
target_seq = tf.constant([[1, 2, 3, 4, 6]], dtype=tf.int32)

encoder = Encoder(vocab_inp_size, embedding_dim, units)
decoder = Decoder(vocab_tar_size, embedding_dim, units)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(input_seq, target_seq, teacher_forcing_ratio=0.5):
    loss = 0
    with tf.GradientTape() as tape:
        batch_size = input_seq.shape[0]
        enc_hidden = encoder.initialize_hidden_state(batch_size)
        enc_output, enc_hidden = encoder(input_seq, enc_hidden)
        
        dec_hidden = enc_hidden
        # Start-of-sequence token assumed to be 0
        dec_input = tf.expand_dims([0] * batch_size, 1)  # (batch, 1)
        
        for t in range(target_seq.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_object(target_seq[:, t], predictions)
            
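            # Draw with tf.random so the choice is re-sampled on every call;
            # np.random inside @tf.function would be evaluated only once, at trace time
            use_teacher_forcing = tf.random.uniform(()) < teacher_forcing_ratio
            predicted_ids = tf.argmax(predictions, axis=1, output_type=tf.int32)
            next_ids = tf.where(use_teacher_forcing, target_seq[:, t], predicted_ids)
            dec_input = tf.expand_dims(next_ids, 1)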
    
    batch_loss = loss / int(target_seq.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss

loss_val = train_step(input_seq, target_seq)
print("TensorFlow Loss:", loss_val.numpy())

TensorFlow Loss: 2.3010743


### PyTorch Example

#### Import Libraries

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

#### Define The Encoder

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, self.hidden_size)  # (1, batch, hidden_size)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

#### Bahdanau Attention Definition

In [None]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.hidden_size = hidden_size
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.rand(hidden_size))
    
    def forward(self, hidden, encoder_outputs):
        # hidden: (1, batch, hidden_size) current decoder hidden state
        # encoder_outputs: (seq_len, batch, hidden_size)
        seq_len = encoder_outputs.size(0)
        batch_size = encoder_outputs.size(1)
        
        # Repeat decoder hidden state seq_len times
        hidden = hidden.repeat(seq_len, 1, 1)  # (seq_len, batch, hidden_size)
        # Concatenate encoder outputs and repeated hidden state along the last dim
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), 2)))  # (seq_len, batch, hidden_size)
        # Compute alignment scores (dot product with v)
        energy = energy.permute(1, 0, 2)  # (batch, seq_len, hidden_size)
        v = self.v.repeat(batch_size, 1).unsqueeze(1)  # (batch, 1, hidden_size)
        scores = torch.bmm(v, energy.permute(0, 2, 1))  # (batch, 1, seq_len)
        attn_weights = torch.softmax(scores, dim=2)  # (batch, 1, seq_len)
        # Compute context vector as weighted sum of encoder_outputs
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # (batch, seq_len, hidden_size)
        context = torch.bmm(attn_weights, encoder_outputs)  # (batch, 1, hidden_size)
        context = context.permute(1, 0, 2)  # (1, batch, hidden_size)
        return context, attn_weights

#### Define The Decoder

In [None]:
class DecoderRNNWithAttention(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNNWithAttention, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        # Combine context vector and embedding before feeding into GRU
        self.gru = nn.GRU(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, input, hidden, encoder_outputs):
        # Embed input token
        embedded = self.embedding(input).view(1, 1, self.hidden_size)
        # Compute attention context vector
        context, attn_weights = self.attention(hidden, encoder_outputs)
        # Concatenate embedded input and context vector
        rnn_input = torch.cat((embedded, context), 2)  # (1, 1, 2*hidden_size)
        output, hidden = self.gru(rnn_input, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden, attn_weights

#### Example Training Loop

In [None]:
if __name__ == "__main__":
    # Hyperparameters
    input_size = 10
    output_size = 10
    hidden_size = 16
    teacher_forcing_ratio = 0.5
    
    # Dummy input and target sequences (indices)
    input_seq = torch.tensor([1, 2, 3, 4, 5], dtype=torch.long)    # Length = 5
    target_seq = torch.tensor([1, 2, 3, 4, 6], dtype=torch.long)   # Length = 5
    
    encoder = EncoderRNN(input_size, hidden_size)
    decoder = DecoderRNNWithAttention(hidden_size, output_size)
    
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)
    criterion = nn.NLLLoss()
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    encoder_hidden = encoder.initHidden()
    encoder_outputs = torch.zeros(len(input_seq), 1, hidden_size)
    
    # Encode the input sequence
    for t, token in enumerate(input_seq):
        encoder_output, encoder_hidden = encoder(token.unsqueeze(0), encoder_hidden)
        encoder_outputs[t] = encoder_output[0]
    
    decoder_input = torch.tensor([0], dtype=torch.long)  # Start-of-sequence token
    decoder_hidden = encoder_hidden
    loss = 0
    
    # Decode with attention and teacher forcing
    for t, target_token in enumerate(target_seq):
        decoder_output, decoder_hidden, attn_weights = decoder(decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_token.unsqueeze(0))
        use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
        
        if use_teacher_forcing:
            decoder_input = target_token.unsqueeze(0)
        else:
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach().unsqueeze(0)
    
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    print("PyTorch Loss:", loss.item())

## Transformers

Transformers mark a paradigm shift in sequence modeling by eliminating the need for recurrent architectures. Introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), transformers rely on attention mechanisms and position-wise feed-forward networks instead of recurrence. This design lets the model process all positions of a sequence in parallel during training, which greatly improves training efficiency and scalability.

Traditional sequence models like RNNs or LSTMs process tokens sequentially, meaning that each token's representation depends on the previous ones. While effective for capturing temporal dependencies, this sequential nature makes training inefficient and limits the model's ability to capture long-range dependencies due to vanishing gradients.

Transformers address these challenges by:

- **Parallel Processing:**  
  Instead of processing tokens one at a time, transformers analyze the entire sequence simultaneously. This parallelism speeds up training on modern hardware (e.g., GPUs) and allows for more efficient use of computational resources.

- **Long-Range Dependency Modeling:**  
  With attention mechanisms, each token in a sequence can directly interact with every other token. This direct connectivity helps the model capture long-range dependencies that are often lost in RNNs.

- **Scalability:**  
  The architecture scales well with increasing amounts of data and model size, making it suitable for large-scale natural language processing tasks and beyond. One caveat is that the cost of standard self-attention grows quadratically with sequence length.

![](https://aiml.com/wp-content/uploads/2023/09/Annotated-Transformers-Architecture.png)

## Core Building Blocks Of Transformers

Transformers are composed of several key components that work together to transform input sequences into meaningful outputs. 

### Self-Attention

Self-attention is the central mechanism that allows transformers to weigh the importance of different tokens in a sequence relative to one another. This mechanism helps the model understand context by dynamically adjusting the influence of each token based on its relationship with all other tokens.

#### Key, Query, and Value

Each input token is transformed into three distinct vectors:
- **Query (Q):** Represents the current token’s request for information.
- **Key (K):** Encodes the content of each token so that it can be matched with queries.
- **Value (V):** Contains the actual information or features of the token that can be passed to subsequent layers.

For an input matrix $X$ (with shape $T \times d_{\text{model}}$, where $T$ is the sequence length and $d_{\text{model}}$ is the embedding dimension), these vectors are computed as:

$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$

- **Learned Projections:**  
  The weight matrices $W^Q$, $W^K$, and $W^V$ are learned during training. They transform the input embeddings into different subspaces, allowing the model to capture various aspects of the data.
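
As a quick shape check, the projections can be written directly in NumPy. The matrices here are random stand-ins for learned weights, and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8              # hypothetical sizes

X = rng.normal(size=(T, d_model))      # one row per token
W_Q = rng.normal(size=(d_model, d_k))  # stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # each has shape (T, d_k)
```

Each row of `Q`, `K`, and `V` depends only on the corresponding row of `X`; tokens are mixed later, in the attention step.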

#### Scaled Dot-Product Attention

Once the queries, keys, and values are computed, the model determines how much attention to pay to each token. This is done using the scaled dot-product attention:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

- **Dot-Product:**  
  The product $QK^\top$ computes the similarity between each query and all keys.
- **Scaling Factor:**  
  Dividing by $\sqrt{d_k}$ (the square root of the key dimension) prevents the dot-product values from growing too large, which could lead to very small gradients when passed through the softmax function.
- **Softmax:**  
  The softmax function normalizes these scores into a probability distribution. Each weight in this distribution indicates the importance of the corresponding token.
- **Weighted Sum:**  
  The final output is a weighted sum of the value vectors $V$, where tokens with higher attention weights contribute more to the output.
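
The four steps above map onto a few lines of NumPy. This is a minimal single-sequence sketch without masking or batching; the function name `scaled_dot_product_attention` and the random inputs are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Row `t` of `weights` says how strongly token `t` attends to every token in the sequence. On the scaling: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the scores back to roughly unit scale.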

### Multi-Head Attention

Instead of computing a single attention function, transformers use multi-head attention to capture information from multiple representation subspaces simultaneously. This is achieved by splitting the queries, keys, and values into multiple “heads,” each with its own learned projections.

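For each head $i$, with $Q$, $K$, and $V$ here denoting the inputs to the attention layer (all equal to $X$ in self-attention):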

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

- **Multiple Heads:**  
  Each head learns different aspects or features of the input by operating in its own subspace.
- **Concatenation and Projection:**  
  The outputs from all heads are concatenated and then projected using an output matrix $W^O$:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$

This allows the model to integrate diverse information from different heads, resulting in a richer representation.
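
The per-head computation, concatenation, and output projection can be sketched as follows for the self-attention case, where queries, keys, and values all come from the same input `X`. The weight shapes and the choice $d_k = d_{\text{model}} / h$ are assumptions made for this illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v have shape (h, d_model, d_k); W_o has shape (h * d_k, d_model)
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(W_q.shape[0])]
    # Concatenate the heads along the feature axis, then apply the output projection
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4
d_k = d_model // h                        # per-head dimension (assumed)

X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)                          # (5, 16)
```

Because each head works in a `d_k`-dimensional subspace, the concatenated output has the same width as the input, so the layer can be stacked.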

### Position-Wise Feed-Forward Networks

After the multi-head attention layer, each token’s representation is processed independently through a feed-forward network. This network is applied identically to each position and is composed of two linear transformations with a non-linear activation function (typically ReLU) in between:

$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

- **Independent Processing:**  
  Unlike attention, which mixes information between tokens, the feed-forward network operates on each token independently, allowing the model to learn complex transformations at each position.
- **Enhancing Capacity:**  
  This network increases the representational power of the model and helps it capture intricate patterns in the data.
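
A small NumPy sketch can make the "independent processing" point concrete: applying the network to the whole sequence gives the same result for a token as applying it to that token alone. The sizes and random weights below are illustrative assumptions.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 8, 32               # illustrative sizes (assumptions)

X = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out_all = ffn(X, W1, b1, W2, b2)          # all positions at once
out_one = ffn(X[2], W1, b1, W2, b2)       # position 2 on its own

print(out_all.shape)                      # (4, 8)
print(np.allclose(out_all[2], out_one))   # True
```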

### Positional Encoding

Since transformers process tokens in parallel and lack any inherent notion of token order, positional encodings are added to the input embeddings to inject information about the position of each token. A common method is to use sine and cosine functions of varying frequencies:

$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

$$
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

- **Encoding Positions:**  
  Here, $pos$ is the position of the token in the sequence, and $i$ indexes the sine/cosine pair, so dimensions $2i$ and $2i+1$ share the same frequency. The sine and cosine functions provide a smooth, continuous way to encode positional information.
- **Incorporation into Embeddings:**  
  These positional encodings are added directly to the token embeddings before they are processed by the transformer layers, allowing the model to learn about the order of tokens.
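
The two formulas can be sketched directly in NumPy. The function name is hypothetical, and the sketch assumes an even `d_model` so that every sine has a matching cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 alternate

# The encoding is added to (not concatenated with) the token embeddings
embeddings = np.zeros((10, 16))            # placeholder embeddings for 10 tokens
inputs = embeddings + pe[:10]
```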

### TensorFlow Example

In [7]:
import tensorflow as tf
import numpy as np
import math

#### Positional Encoding

In [8]:
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, position, d_model):
        """
        Create sinusoidal positional encodings.
        Args:
            position: Maximum position (sequence length).
            d_model: Dimensionality of the embeddings.
        """
        super(PositionalEncoding, self).__init__()
        self.pos_encoding = self.positional_encoding(position, d_model)
    
    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / (10000 ** (2 * (i // 2) / tf.cast(d_model, tf.float32)))
        return pos * angle_rates
    
    def positional_encoding(self, position, d_model):
        angle_rads = self.get_angles(
            tf.range(position, dtype=tf.float32)[:, tf.newaxis],
            tf.range(d_model, dtype=tf.float32)[tf.newaxis, :],
            d_model)
        
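        # Apply sin to the even-indexed angle columns and cos to the odd-indexed ones
        sines = tf.math.sin(angle_rads[:, 0::2])
        cosines = tf.math.cos(angle_rads[:, 1::2])
        
        # Interleave as (sin, cos, sin, cos, ...) to match the formula; assumes d_model is even
        pos_encoding = tf.reshape(tf.stack([sines, cosines], axis=-1), (position, d_model))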
        pos_encoding = pos_encoding[tf.newaxis, ...]  # Shape: (1, position, d_model)
        return tf.cast(pos_encoding, tf.float32)
    
    def call(self, inputs):
        """
        Add positional encodings to the input embeddings.
        Args:
            inputs: Tensor of shape (batch_size, seq_len, d_model)
        Returns:
            Tensor with positional encodings added.
        """
        return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

#### Building The Transformer Model

In [9]:
class Transformer(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, dff, num_layers, maximum_position_encoding, dropout_rate=0.1):
        """
        A simple transformer for sequence-to-sequence tasks.
        Args:
            vocab_size: Vocabulary size (for both source and target).
            d_model: Dimensionality of the embeddings and transformer model.
            num_heads: Number of attention heads.
            dff: Dimensionality of the feed-forward network.
            num_layers: Number of encoder and decoder layers.
            maximum_position_encoding: Maximum sequence length.
            dropout_rate: Dropout rate.
        """
        super(Transformer, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        # Embedding layer and positional encoding for both source and target
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(maximum_position_encoding, d_model)
        
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        
        # Encoder: stack of multi-head attention and feed-forward networks
        self.enc_layers = [tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
                           for _ in range(num_layers)]
        self.ffn_layers_enc = [tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ]) for _ in range(num_layers)]
        # Layer norms are created once here so that their weights are tracked and trained
        self.enc_norm1 = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(num_layers)]
        self.enc_norm2 = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(num_layers)]
        
        # Decoder: each layer performs self-attention, cross-attention, and feed-forward
        self.dec_layers = [tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
                           for _ in range(num_layers)]
        # Cross-attention gets its own weights instead of reusing the self-attention layer
        self.cross_layers = [tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
                             for _ in range(num_layers)]
        self.ffn_layers_dec = [tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ]) for _ in range(num_layers)]
        self.dec_norm1 = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(num_layers)]
        self.dec_norm2 = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(num_layers)]
        self.dec_norm3 = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(num_layers)]
        
        # Final dense layer to project the decoder output to the vocabulary space
        self.final_layer = tf.keras.layers.Dense(vocab_size)
    
    def call(self, src, tgt, training):
        """
        Forward pass of the transformer.
        Args:
            src: Source sequence tensor of shape (batch, src_seq_len).
            tgt: Target sequence tensor of shape (batch, tgt_seq_len).
            training: Boolean flag for training mode.
        Returns:
            Output logits of shape (batch, tgt_seq_len, vocab_size).
        """
        # Embedding and positional encoding for source
        src_emb = self.embedding(src)  # (batch, src_seq_len, d_model)
        src_emb *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        src_emb = self.pos_encoding(src_emb)
        src_emb = self.dropout(src_emb, training=training)
        
        # Pass through encoder layers
        enc_output = src_emb
        for i in range(self.num_layers):
            # Self-attention: keys, queries, and values are all the encoder output
            attn_output = self.enc_layers[i](query=enc_output, value=enc_output, key=enc_output)
            attn_output = self.dropout(attn_output, training=training)
            # Residual connection and layer normalization
            enc_output = self.enc_norm1[i](enc_output + attn_output)
            
            # Feed-forward network with residual connection and layer normalization
            ffn_output = self.ffn_layers_enc[i](enc_output)
            ffn_output = self.dropout(ffn_output, training=training)
            enc_output = self.enc_norm2[i](enc_output + ffn_output)
        
        # Embedding and positional encoding for target
        tgt_emb = self.embedding(tgt)  # (batch, tgt_seq_len, d_model)
        tgt_emb *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        tgt_emb = self.pos_encoding(tgt_emb)
        tgt_emb = self.dropout(tgt_emb, training=training)
        
        # Pass through decoder layers
        dec_output = tgt_emb
        for i in range(self.num_layers):
            # Self-attention on target (for simplicity, look-ahead mask is omitted)
            attn1 = self.dec_layers[i](query=dec_output, value=dec_output, key=dec_output)
            attn1 = self.dropout(attn1, training=training)
            dec_output = self.dec_norm1[i](dec_output + attn1)
            
            # Cross-attention: target attends to encoder output
            attn2 = self.cross_layers[i](query=dec_output, value=enc_output, key=enc_output)
            attn2 = self.dropout(attn2, training=training)
            dec_output = self.dec_norm2[i](dec_output + attn2)
            
            # Feed-forward network with residual connection and layer normalization
            ffn_output = self.ffn_layers_dec[i](dec_output)
            ffn_output = self.dropout(ffn_output, training=training)
            dec_output = self.dec_norm3[i](dec_output + ffn_output)
        
        # Final projection to vocabulary size
        final_output = self.final_layer(dec_output)  # (batch, tgt_seq_len, vocab_size)
        return final_output
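
The decoder self-attention in the model above is commented as omitting the look-ahead mask. As a framework-independent sketch of what such a mask does, the NumPy code below sets the scores of future positions to negative infinity before the softmax, so each position can only attend to itself and earlier positions. The helper names are hypothetical.

```python
import numpy as np

def causal_mask(T):
    # True where attention is allowed: position t may attend to positions <= t
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, which becomes a weight of exactly 0
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # equal raw scores, for readability
weights = masked_softmax(scores, causal_mask(4))
print(weights)
# Row t spreads its weight evenly over positions 0..t and gives 0 to later ones
```

During training this is what lets the decoder see the whole target sequence at once without leaking future tokens into each prediction.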

#### Example Usage

In [10]:
if __name__ == "__main__":
    # Hyperparameters
    vocab_size = 1000        # Size of the vocabulary
    d_model = 128            # Embedding dimension and transformer model dimension
    num_heads = 4            # Number of attention heads
    dff = 512                # Dimension of the feed-forward network
    num_layers = 2           # Number of layers in both encoder and decoder
    maximum_position_encoding = 100  # Maximum sequence length
    dropout_rate = 0.1
    batch_size = 32
    src_seq_len = 20         # Source sequence length
    tgt_seq_len = 20         # Target sequence length

    # Create dummy source and target sequences (batch, seq_len)
    src = tf.random.uniform((batch_size, src_seq_len), minval=0, maxval=vocab_size, dtype=tf.int32)
    tgt = tf.random.uniform((batch_size, tgt_seq_len), minval=0, maxval=vocab_size, dtype=tf.int32)

    # Instantiate the transformer model
    transformer = Transformer(vocab_size, d_model, num_heads, dff, num_layers, maximum_position_encoding, dropout_rate)

    # Define optimizer and loss function
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    # Forward pass inside GradientTape so that gradients can be computed
    with tf.GradientTape() as tape:
        predictions = transformer(src, tgt, training=True)
        # predictions shape: (batch, tgt_seq_len, vocab_size)

        # Compute loss between target and predictions
        # (for demonstration only, the labels are the decoder input itself, unshifted)
        loss = loss_object(tgt, predictions)

    # Backpropagation and optimizer step
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    print("Transformer loss:", loss.numpy())

Transformer loss: 7.017174


### PyTorch Example

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

#### Positional Encoding

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        """
        Implements the sinusoidal positional encoding function.
        Args:
            d_model: the embedding dimension.
            dropout: dropout rate.
            max_len: maximum length of sequences.
        """
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create a long enough PEs matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Compute the positional encodings once in log space.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # Even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd indices
        pe = pe.unsqueeze(1)  # Shape: (max_len, 1, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (seq_len, batch_size, d_model)
        Returns:
            Tensor with positional encodings added.
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

#### Defining The Transformer Model

In [None]:
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, model_dim, num_heads, num_encoder_layers, 
                 num_decoder_layers, ff_dim, dropout=0.1):
        """
        Args:
            vocab_size: Size of the vocabulary (for both src and tgt).
            model_dim: Dimensionality of the embeddings and transformer.
            num_heads: Number of attention heads.
            num_encoder_layers: Number of encoder layers.
            num_decoder_layers: Number of decoder layers.
            ff_dim: Dimensionality of the feed-forward network.
            dropout: Dropout rate.
        """
        super(TransformerModel, self).__init__()
        self.model_dim = model_dim
        
        # Embedding layers for input tokens
        self.embedding = nn.Embedding(vocab_size, model_dim)
        # Positional encoding to inject sequence order information
        self.pos_encoder = PositionalEncoding(model_dim, dropout)
        # PyTorch's built-in Transformer module
        self.transformer = nn.Transformer(d_model=model_dim, nhead=num_heads,
                                          num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers,
                                          dim_feedforward=ff_dim,
                                          dropout=dropout)
        # Final linear layer to project the transformer output to vocab size
        self.fc_out = nn.Linear(model_dim, vocab_size)
    
    def generate_square_subsequent_mask(self, sz):
        """
        Generate a square mask for the sequence. The masked positions are filled with float('-inf').
        Args:
            sz: Size of the mask (typically the target sequence length)
        Returns:
            A tensor mask of shape (sz, sz)
        """
        mask = torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
        return mask

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        """
        Args:
            src: Source sequence tensor of shape (src_seq_len, batch_size)
            tgt: Target sequence tensor of shape (tgt_seq_len, batch_size)
            src_mask: Optional mask for the source sequence.
            tgt_mask: Optional mask for the target sequence.
        Returns:
            Output logits of shape (tgt_seq_len, batch_size, vocab_size)
        """
        # Embed the input tokens and scale them
        src = self.embedding(src) * math.sqrt(self.model_dim)
        src = self.pos_encoder(src)
        tgt = self.embedding(tgt) * math.sqrt(self.model_dim)
        tgt = self.pos_encoder(tgt)
        
        # Pass through the transformer
        output = self.transformer(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        output = self.fc_out(output)
        return output

#### Example Usage

In [None]:
if __name__ == "__main__":
    # Hyperparameters
    vocab_size = 100         # Example vocabulary size
    model_dim = 512          # Embedding dimension and transformer model dimension
    num_heads = 8            # Number of attention heads
    num_encoder_layers = 3   # Number of encoder layers
    num_decoder_layers = 3   # Number of decoder layers
    ff_dim = 2048            # Feed-forward network dimension
    dropout = 0.1
    batch_size = 32
    src_seq_len = 10         # Source sequence length
    tgt_seq_len = 10         # Target sequence length

    # Create dummy source and target sequences (each element is a token index)
    src = torch.randint(0, vocab_size, (src_seq_len, batch_size))
    tgt = torch.randint(0, vocab_size, (tgt_seq_len, batch_size))

    # Create model instance
    model = TransformerModel(vocab_size, model_dim, num_heads, num_encoder_layers,
                             num_decoder_layers, ff_dim, dropout)
    # Generate target mask for autoregressive decoding (prevents attending to future tokens)
    tgt_mask = model.generate_square_subsequent_mask(tgt_seq_len)

    # Define optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
    criterion = nn.CrossEntropyLoss()

    # Forward pass: obtain model predictions
    output = model(src, tgt, src_mask=None, tgt_mask=tgt_mask)
    # Output shape: (tgt_seq_len, batch_size, vocab_size)

    # For example purposes, assume the target labels are the same as tgt.
    # Reshape output and target for computing loss.
    output_flat = output.view(-1, vocab_size)
    tgt_flat = tgt.view(-1)
    loss = criterion(output_flat, tgt_flat)
    
    # Backpropagation and optimizer step
    optimizer.zero_grad()  # Clear any stale gradients before backpropagation
    loss.backward()
    optimizer.step()

    print("Transformer loss:", loss.item())
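
The example above uses `tgt` both as the decoder input and as the labels, which keeps the demo short. In a typical teacher-forcing setup the two are shifted by one position, so that the model predicts token `t + 1` from tokens up to `t`. A minimal sketch of that shift, using hypothetical special-token IDs:

```python
import numpy as np

SOS, EOS = 0, 1                               # hypothetical special-token IDs
target = np.array([SOS, 7, 4, 9, EOS])        # one tokenized target sentence

decoder_input = target[:-1]                   # what the decoder sees
labels = target[1:]                           # what it should predict at each step

print(decoder_input)                          # [0 7 4 9]
print(labels)                                 # [7 4 9 1]
```

With the `(seq_len, batch_size)` layout used in this example, the same shift would be applied along the first axis.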

## Real-World Example: Using Pre-Trained Models

### TensorFlow Example

In [7]:
from transformers import MarianTokenizer, TFAutoModelForSeq2SeqLM

# Specify the model for English-to-German translation
model_name = "Helsinki-NLP/opus-mt-en-de"

# Load pre-trained model and tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input text
text = "This class has way too much information and my students are dying."

# Tokenize and generate translation
inputs = tokenizer(text, return_tensors="tf", padding=True)
translated_tokens = model.generate(**inputs)
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print("Translated text:", translated_text)

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-de.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


Translated text: Diese Klasse hat viel zu viele Informationen und meine Schüler sterben.


### PyTorch Example

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# Specify the model for English-to-German translation
model_name = "Helsinki-NLP/opus-mt-en-de"

# Load pre-trained model and tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Input text
text = "Hello, how are you?"

# Tokenize and generate translation
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated_tokens = model.generate(**inputs)
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print("Translated text:", translated_text)

## Real-World Example Without Pre-Trained Models

Below are examples for both TensorFlow and PyTorch that build a model from framework-provided layers and include a simple (toy) inference routine that accepts an English phrase and produces a German translation.

Keep in mind:
- The models below are instantiated but never trained; a real system would first train them on a parallel corpus.
- The tokenizers shown here are minimal “dummy” implementations; in production you’d use robust tokenizers (e.g. SentencePiece or other subword-based methods).
- The models are for demonstration. **Without training, the outputs will be random.**

These snippets serve as a template for how you might set up and call such a system in a real-world project. For a meaningful translation, you will need to train the model on a corpus and use a proper tokenizer.

### TensorFlow Example

In [14]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense, MultiHeadAttention, LayerNormalization, Dropout, Add
from tensorflow.keras.models import Model
import numpy as np

# --- Model definition (simplified functional-API variant of the earlier model) ---
def transformer_block(x, num_heads, ff_dim, dropout_rate=0.1):
    # Self-attention using built-in MultiHeadAttention
    attn_output = MultiHeadAttention(num_heads=num_heads, key_dim=x.shape[-1])(x, x)
    attn_output = Dropout(dropout_rate)(attn_output)
    out1 = Add()([x, attn_output])
    out1 = LayerNormalization(epsilon=1e-6)(out1)
    
    # Feed-forward network
    ffn_output = Dense(ff_dim, activation='relu')(out1)
    ffn_output = Dense(x.shape[-1])(ffn_output)
    ffn_output = Dropout(dropout_rate)(ffn_output)
    out2 = Add()([out1, ffn_output])
    out2 = LayerNormalization(epsilon=1e-6)(out2)
    return out2

# Define inputs (source and target sequences)
input_seq = Input(shape=(None,), name="source")
target_seq = Input(shape=(None,), name="target")

vocab_size = 10000  # Example vocabulary size
d_model = 64        # Embedding dimension

# Create embeddings for source and target
src_embed = Embedding(vocab_size, d_model, name="src_embedding")(input_seq)
tgt_embed = Embedding(vocab_size, d_model, name="tgt_embedding")(target_seq)

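# For demonstration, apply a transformer block independently on encoder and decoder sides.
# Note: there is no cross-attention here, so decoder_output (and therefore the prediction)
# never depends on encoder_output; a real translation model must attend to the source.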
encoder_output = transformer_block(src_embed, num_heads=4, ff_dim=128)
decoder_output = transformer_block(tgt_embed, num_heads=4, ff_dim=128)

# Final softmax layer over vocabulary for output
final_output = Dense(vocab_size, activation='softmax', name="output")(decoder_output)

# Build the model
tf_model = Model(inputs=[input_seq, target_seq], outputs=final_output)
tf_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
tf_model.summary()

# --- Dummy Tokenizers and Inference Routine ---
# For simplicity, we define a minimal vocabulary and dummy tokenization.
# In practice, you would use a proper tokenizer.

# Dummy vocabularies
src_vocab = {"hello": 2, "how": 3, "are": 4, "you": 5}
tgt_vocab = {"<sos>": 0, "<eos>": 1, "hallo": 2, "wie": 3, "geht": 4, "es": 5, "dir": 6}
rev_tgt_vocab = {v: k for k, v in tgt_vocab.items()}

def english_tokenize(text):
    # Splits by whitespace and converts to token IDs; unknown words default to 0.
    return [src_vocab.get(word.lower(), 0) for word in text.split()]

def german_detokenize(token_ids):
    # Converts token IDs back into a string; unknown IDs become <unk>, special tokens are kept
    words = [rev_tgt_vocab.get(token, "<unk>") for token in token_ids]
    return " ".join(words)

def translate_tf(input_text, model, max_length=10):
    # Tokenize source sentence
    src_tokens = english_tokenize(input_text)
    src_tensor = tf.expand_dims(src_tokens, axis=0)  # Shape: (1, seq_len)
    
    # Start with the <sos> token for the target sequence
    tgt_tokens = [tgt_vocab["<sos>"]]
    
    for i in range(max_length):
        tgt_tensor = tf.expand_dims(tgt_tokens, axis=0)  # Shape: (1, current_len)
        # Predict next tokens; note: real models require masks and proper training.
        predictions = model([src_tensor, tgt_tensor], training=False)
        # Get the token with the highest probability from the last time step
        next_token = int(tf.argmax(predictions[0, -1, :]).numpy())
        tgt_tokens.append(next_token)
        if next_token == tgt_vocab["<eos>"]:
            break

    # Convert token IDs (skipping <sos>) to a string
    return german_detokenize(tgt_tokens[1:])

# Example usage:
input_phrase = "Hello how are you"
translated = translate_tf(input_phrase, tf_model)
print("Translated text (TF demo):", translated)


Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 target (InputLayer)            [(None, None)]       0           []                               
                                                                                                  
 tgt_embedding (Embedding)      (None, None, 64)     640000      ['target[0][0]']                 
                                                                                                  
 multi_head_attention_9 (MultiH  (None, None, 64)    66368       ['tgt_embedding[0][0]',          
 eadAttention)                                                    'tgt_embedding[0][0]']          
                                                                                                  
 dropout_15 (Dropout)           (None, None, 64)     0           ['multi_head_attention_9[0]

### PyTorch Example

In [None]:
import torch
import torch.nn as nn
import math

# --- Model definition (batch-first variant of the earlier model, with separate embeddings) ---
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TransformerMTModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048,
                 dropout=0.1, max_len=5000):
        super(TransformerMTModel, self).__init__()
        self.d_model = d_model
        
        self.src_tok_emb = nn.Embedding(src_vocab_size, d_model)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, dropout, max_len)
        
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers,
                                          num_decoder_layers, dim_feedforward, dropout)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
    
    def forward(self, src, tgt, src_mask=None, tgt_mask=None,
                src_key_padding_mask=None, tgt_key_padding_mask=None,
                memory_key_padding_mask=None):
        # src, tgt: (batch_size, seq_len)
        src_emb = self.src_tok_emb(src) * math.sqrt(self.d_model)
        tgt_emb = self.tgt_tok_emb(tgt) * math.sqrt(self.d_model)
        
        src_emb = self.positional_encoding(src_emb)
        tgt_emb = self.positional_encoding(tgt_emb)
        
        # nn.Transformer expects shape (seq_len, batch_size, d_model)
        src_emb = src_emb.transpose(0, 1)
        tgt_emb = tgt_emb.transpose(0, 1)
        
        output = self.transformer(src_emb, tgt_emb,
                                  src_mask=src_mask, tgt_mask=tgt_mask,
                                  src_key_padding_mask=src_key_padding_mask,
                                  tgt_key_padding_mask=tgt_key_padding_mask,
                                  memory_key_padding_mask=memory_key_padding_mask)
        output = output.transpose(0, 1)  # Back to (batch_size, seq_len, d_model)
        return self.fc_out(output)

# --- Dummy tokenizers and vocabularies ---
src_vocab = {"hello": 2, "how": 3, "are": 4, "you": 5}
tgt_vocab = {"<sos>": 0, "<eos>": 1, "hallo": 2, "wie": 3, "geht": 4, "es": 5, "dir": 6}
rev_tgt_vocab = {v: k for k, v in tgt_vocab.items()}

def english_tokenize(text):
    return [src_vocab.get(word.lower(), 0) for word in text.split()]

def german_detokenize(token_ids):
    words = [rev_tgt_vocab.get(token, "<unk>") for token in token_ids]
    return " ".join(words)

# --- Inference routine with greedy decoding ---
def translate_torch(input_text, model, max_length=10, device='cpu'):
    model.eval()
    with torch.no_grad():
        # Tokenize source sentence
        src_tokens = english_tokenize(input_text)
        src_tensor = torch.tensor(src_tokens, dtype=torch.long, device=device).unsqueeze(0)  # (1, seq_len)
        
        # Initialize target with <sos>
        tgt_tokens = [tgt_vocab["<sos>"]]
        
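        for _ in range(max_length):
            tgt_tensor = torch.tensor(tgt_tokens, dtype=torch.long, device=device).unsqueeze(0)  # (1, current_len)
            # Causal mask so each position attends only to earlier target tokens, as it would in training
            tgt_mask = model.transformer.generate_square_subsequent_mask(tgt_tensor.size(1)).to(device)
            output = model(src_tensor, tgt_tensor, tgt_mask=tgt_mask)  # (1, current_len, tgt_vocab_size)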
            # Get the token with the highest probability from the last time step
            next_token = output[0, -1, :].argmax().item()
            tgt_tokens.append(next_token)
            if next_token == tgt_vocab["<eos>"]:
                break

    # Return decoded German sentence (skip <sos>)
    return german_detokenize(tgt_tokens[1:])

# --- Example usage ---
src_vocab_size = 10000
tgt_vocab_size = 10000
device = torch.device("cpu")
pt_model = TransformerMTModel(src_vocab_size, tgt_vocab_size, d_model=128, nhead=4,
                              num_encoder_layers=2, num_decoder_layers=2, dim_feedforward=256)
pt_model.to(device)

input_phrase = "Hello how are you"
translated_pt = translate_torch(input_phrase, pt_model, device=device)
print("Translated text (PyTorch demo):", translated_pt)

## Summary

1. **Encoder–Decoder RNNs:**
   - **What:** A two-part model where one RNN encodes the input into a context vector and another decodes it into an output.
   - **Why:** Simplifies mapping between sequences.
   - **Best For:** Moderate-length sequences and early sequence-to-sequence tasks (e.g., machine translation).

2. **RNNs with Attention:**
   - **What:** Enhances the encoder–decoder model by allowing the decoder to attend to different parts of the input dynamically.
   - **Why:** Mitigates the bottleneck of fixed-length context vectors and improves handling of long-range dependencies.
   - **Best For:** Complex translation, image captioning, and tasks requiring dynamic context focus.

3. **Transformers:**
   - **What:** A non-recurrent architecture relying entirely on self-attention, multi-head attention, and feed-forward networks.
   - **Why:** Enables parallel processing, scalability, and effective modeling of long sequences.
   - **Best For:** Large-scale NLP tasks, computer vision, and any task requiring modeling of complex dependencies across sequences.

This progression reflects the evolution of sequence modeling techniques as researchers sought more efficient, scalable, and context-aware models to tackle increasingly complex tasks in natural language processing and beyond.


### Additional Resources

- The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- Tensor2Tensor Visualizer: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb