**What is this code about?**



The provided code implements a simplified Transformer encoder, a core component of the Transformer architecture used in natural language processing. It consists of a MultiHeadAttention module for self-attention, a TransformerBlock that combines attention, skip connections, layer normalization, and a feed-forward network, and an Encoder that embeds the input tokens and stacks multiple TransformerBlocks. Padding masks are supported through the mask argument, but a complete Transformer also needs components that are not implemented here, such as the decoder, positional encoding, and dropout. The Transformer architecture is widely used for sequence-to-sequence tasks such as machine translation, as well as for language understanding.

In [67]:
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        # Batch size (number of sequences processed in parallel)
        N = query.shape[0]

        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads different pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(head_dim), the per-head key dimension, as in the original Transformer paper
        attention = torch.nn.functional.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out

Explanation:

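    embed_size: The size of the input embeddings. It must be divisible by heads.
    heads: The number of attention heads. Each head attends over a head_dim = embed_size // heads slice of the embedding, so more heads allow the model to focus on different parts of the input sequence simultaneously.
    The three linear layers (values, keys, and queries): These layers project each head's slice of the input. In this implementation each layer is a single head_dim x head_dim projection whose weights are shared across all heads.
    fc_out: This linear layer combines the concatenated outputs from the different attention heads.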
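
The two einsum calls in forward() carry most of the logic, so it can help to see them in isolation. The following is a minimal sketch with hypothetical tensor sizes and random inputs (no learned projections); it checks that the "nqhd,nkhd->nhqk" contraction equals a batched matrix product of queries and transposed keys. It scales by the square root of head_dim, the convention from the original Transformer paper.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration.
N, seq_len, heads, head_dim = 2, 5, 4, 8

queries = torch.randn(N, seq_len, heads, head_dim)
keys = torch.randn(N, seq_len, heads, head_dim)
values = torch.randn(N, seq_len, heads, head_dim)

# Dot product of every query with every key, computed separately per head.
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# The same computation with explicit permutes and a batched matmul.
q = queries.permute(0, 2, 1, 3)   # (N, heads, q_len, head_dim)
k = keys.permute(0, 2, 3, 1)      # (N, heads, head_dim, k_len)
energy_matmul = q @ k             # (N, heads, q_len, k_len)

# Scale, then normalize over the key dimension so each query's weights sum to 1.
attention = torch.softmax(energy / head_dim ** 0.5, dim=-1)

# Weighted sum of values, then merge the heads back into one vector per token.
out = torch.einsum("nhql,nlhd->nqhd", attention, values).reshape(
    N, seq_len, heads * head_dim
)

print(energy.shape)  # torch.Size([2, 4, 5, 5])
print(out.shape)     # torch.Size([2, 5, 32])
```

The reshape at the end is what lets fc_out treat the concatenated heads as a single vector of size heads * head_dim.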

In [68]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Add skip connection, run through normalization and finally a feed forward network
        x = self.norm1(attention + query)
        forward = self.feed_forward(x)
        out = self.norm2(forward + x)
        return out

Explanation:

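    embed_size, heads: Parameters for the attention mechanism and the feed-forward layers. The feed-forward expansion factor is fixed at 4 in this implementation.
    attention: Multi-Head Attention module.
    norm1 and norm2: Layer normalization, applied after each skip connection: once after the attention output is added to the query, and once after the feed-forward output is added to its input.
    feed_forward: A two-layer feed-forward network with a ReLU activation, applied to the normalized output of the attention layer.
    dropout: This simplified block does not apply dropout; a full implementation would typically add it after the attention and feed-forward sub-layers for regularization.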
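
The "add the input back, then normalize" pattern used twice in forward() can be sketched on its own. In the snippet below, the nn.Linear layer is only a stand-in for a sub-layer (attention or the feed-forward network), and embed_size = 16 is a hypothetical value chosen for readability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size = 16                                # hypothetical, small for readability
norm = nn.LayerNorm(embed_size)
sublayer = nn.Linear(embed_size, embed_size)   # stand-in for attention or feed_forward

x = torch.randn(2, 5, embed_size)              # (batch, seq_len, embed_size)

# Skip connection followed by layer normalization, as in TransformerBlock.forward().
y = norm(x + sublayer(x))

# LayerNorm normalizes each token vector independently over the last dimension,
# so every token ends up with mean close to 0 and variance close to 1.
token_means = y.mean(dim=-1)
token_vars = y.var(dim=-1, unbiased=False)

print(y.shape)  # torch.Size([2, 5, 16])
```

Because the normalization happens per token, the result does not depend on the batch size or on the other tokens in the sequence.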

In [69]:
class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_heads, num_layers):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(src_vocab_size, embed_size)
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(embed_size, num_heads) for _ in range(num_layers)]
        )

    def forward(self, src, mask):
        x = self.embedding(src)

        for transformer in self.transformer_blocks:
            x = transformer(x, x, x, mask)

        return x

Explanation:

    src_vocab_size, embed_size, num_heads, num_layers: Parameters for configuring the encoder.
    embedding: Embedding layer that maps each token ID in the input sequence to a vector of size embed_size.
    transformer_blocks: A stack of num_layers TransformerBlocks. Each block receives the previous block's output as value, key, and query (self-attention).
    Not included: This simplified encoder has no positional embedding and no dropout, so it carries no information about token order. A full implementation would add both.
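
The Encoder code above embeds tokens only, without adding position information. As a rough illustration of how order information could be supplied, here is a minimal sketch of a learned position embedding added to the token embedding. The class name TokenAndPositionEmbedding and the max_length parameter are hypothetical and not part of the original code; sinusoidal encodings would be another option.

```python
import torch
import torch.nn as nn


class TokenAndPositionEmbedding(nn.Module):
    """Hypothetical helper: token embedding plus a learned position embedding."""

    def __init__(self, vocab_size, embed_size, max_length):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

    def forward(self, src):
        N, seq_len = src.shape
        # Position indices 0..seq_len-1, repeated for every sequence in the batch.
        positions = torch.arange(seq_len, device=src.device).unsqueeze(0).expand(N, seq_len)
        return self.token_embedding(src) + self.position_embedding(positions)


embed = TokenAndPositionEmbedding(vocab_size=1000, embed_size=256, max_length=100)
tokens = torch.randint(0, 1000, (2, 20))
embedded = embed(tokens)
print(embedded.shape)  # torch.Size([2, 20, 256])
```

With such a layer, the same token ID yields a different vector at each position, which is what lets the attention layers distinguish word order.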

In [70]:
import torch
import torch.nn as nn

# The Transformer classes (MultiHeadAttention, TransformerBlock, and Encoder) go here...

# 1. Create an instance of the Encoder class
src_vocab_size = 1000  # Example vocabulary size
embed_size = 256
num_heads = 8
num_layers = 4

encoder = Encoder(src_vocab_size, embed_size, num_heads, num_layers)

# 2. Generate a sample input sequence
seq_length = 20
sample_input = torch.randint(low=0, high=src_vocab_size, size=(1, seq_length))

# 3. Define a mask (optional, for masking padded elements)
mask = (sample_input != 0).unsqueeze(1).unsqueeze(2)

# 4. Forward pass the input through the Transformer encoder
output = encoder(sample_input, mask)

# Print the output size
print("Output Size:", output.size())


Output Size: torch.Size([1, 20, 256])


Explanation:

    Creates an instance of the Encoder.
    Generates a random input sequence of token IDs.
    Builds an attention mask that treats token ID 0 as padding, so padded positions are ignored as keys in the attention computation.
    Prints the size of the output tensor.
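
To see what the mask does inside the attention computation, the sketch below applies the same masked_fill and softmax steps to dummy scores. The token IDs are made up, and treating ID 0 as padding is the same assumption the example above makes.

```python
import torch

# Token ID 0 is treated as padding (assumption carried over from the example).
tokens = torch.tensor([[5, 8, 3, 0, 0]])          # (N=1, seq_len=5)
mask = (tokens != 0).unsqueeze(1).unsqueeze(2)    # (1, 1, 1, 5)

heads = 2
energy = torch.zeros(1, heads, 5, 5)              # dummy scores: (N, heads, q_len, k_len)

# The mask broadcasts over heads and query positions;
# padded key positions receive a very large negative score.
energy = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(energy, dim=-1)

print(mask.shape)          # torch.Size([1, 1, 1, 5])
print(attention[0, 0, 0])  # tensor([0.3333, 0.3333, 0.3333, 0.0000, 0.0000])
```

Note that the mask only removes padded positions as keys. Padded positions still produce output vectors as queries, which downstream code usually ignores.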

The size of the output tensor from the Transformer encoder depends on the batch size, the length of the input sequence, and the embedding size. Its shape is (batch_size, seq_length, embed_size).

Let's break it down:

    batch_size: The number of sequences you process in parallel (in this case, it's 1 as we generated a single sequence).
    seq_length: The length of your input sequence (in this case, it's 20 as we generated a sequence of length 20).
    embed_size: The size of the embedding vectors you defined for each token (in this case, it's 256).

So, the expected output size is (1, 20, 256), which matches the printed result.
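
For comparison, PyTorch's built-in nn.TransformerEncoder follows the same shape contract when batch_first=True is set. The snippet below is a sketch for shape comparison only, not a drop-in replacement: the built-in layer expects already-embedded inputs and uses a different mask convention. The batch size of 3 is an arbitrary choice.

```python
import torch
import torch.nn as nn

embed_size, num_heads, num_layers = 256, 8, 4

layer = nn.TransformerEncoderLayer(
    d_model=embed_size,
    nhead=num_heads,
    dim_feedforward=4 * embed_size,   # same expansion factor as the custom block
    dropout=0.0,
    batch_first=True,                 # (batch, seq, feature), like the custom encoder
)
builtin_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

# The built-in encoder works on embeddings, so token IDs are embedded first.
embedding = nn.Embedding(1000, embed_size)
tokens = torch.randint(0, 1000, (3, 20))   # 3 sequences of 20 tokens each

out = builtin_encoder(embedding(tokens))
print(out.shape)  # torch.Size([3, 20, 256])
```

Only the first dimension changes with the batch size; the sequence length and embedding size pass through the encoder unchanged.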