# Transformer Architecture From Scratch.

Tokenization

1.Word-based Tokenization

2.Subword-based Tokenization    
3.Character-based Tokenization




**BPE - Byte -Pair Encoding**

*BPE is a subword-based tokenization technique that involves iteratively merging the most frequent pair of tokens (byte pairs) in the corpus until a predefined vocabulary size is reached.
example :
 Example of BPE:
 Initial vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 'w', 's', 't']
 Corpus: "low lower lowest newest widest"

 Iteration 1: Merge most frequent pair 'e' + 'r' -> 'er'
 Updated vocab: ['l', 'o', 'w', 'e', 'r', 'n', 'w', 's', 't', 'er']

 Iteration 2: Merge most frequent pair 'l' + 'o' -> 'lo'
 Updated vocab: ['l', 'o', 'w', 'e', 'r', 'n', 'w', 's', 't', 'er', 'lo']

 Iteration 3: Merge most frequent pair 'lo' + 'w' -> 'low'
 Updated vocab: ['l', 'o', 'w', 'e', 'r', 'n', 'w', 's', 't', 'er', 'lo', 'low']

 Final tokenization:
 "low" -> ['low']
 "lower" -> ['low', 'er']
 "lowest" -> ['low', 'est']
 "newest" -> ['n', 'ew', 'est']
 "widest" -> ['w', 'id', 'est']

In [1]:
!pip install torchtext==0.13.1

Collecting torchtext==0.13.1
  Downloading torchtext-0.13.1-cp310-cp310-manylinux1_x86_64.whl.metadata (6.9 kB)
Collecting torch==1.12.1 (from torchtext==0.13.1)
  Downloading torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl.metadata (22 kB)
Downloading torchtext-0.13.1-cp310-cp310-manylinux1_x86_64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m31.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl (776.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m776.3/776.3 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 2.4.0+cu121
    Uninstalling torch-2.4.0+cu121:
      Successfully uninstalled torch-2.4.0+cu121
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the followi

In [2]:
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Example sentence
sentence = "I love coding with PyTorch"


In [5]:

# Tokenizer (word-based)
tokenizer = get_tokenizer("basic_english")

# Tokenize the sentence
tokens = tokenizer(sentence)
print("Tokens:", tokens)



Tokens: ['i', 'love', 'coding', 'with', 'pytorch']


In [6]:
# Build Vocabulary using BPE (simulate BPE with torchtext)
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)


# Vocabulary object, with a max vocab size for BPE-like behavior
vocab = build_vocab_from_iterator(yield_tokens([sentence]), specials=["<unk>"], max_tokens=100)
vocab.set_default_index(vocab["<unk>"])

vocab


Vocab()

In [8]:
# Encode sentence into token IDs
token_ids = vocab(tokens)
print("Token IDs:", token_ids)

Vocab()
Token IDs: [2, 3, 1, 5, 4]


WordPice Tokenizer
**bold text**

In [9]:
from transformers import BertTokenizer


In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [11]:
sentence = "I love coding with PyTorch"

# Tokenize and encode the sentence
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['i', 'love', 'coding', 'with', 'p', '##yt', '##or', '##ch']
Token IDs: [1045, 2293, 16861, 2007, 1052, 22123, 2953, 2818]


Let's See Byte Pair Encoding (BPE) from the tiktoken library introduced by OpenAI

This technique can improve the performance of LLMs and handle rare and out-of-vocabulary words. The big difference between TikToken BPE and sentencepiece BPE is that TikToken BPE doesn't always split words into smaller parts if the whole word is already known.

For example, if "hugging" is in the vocabulary, it stays as one token instead of splitting into ["hug","ging"].

In [12]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.1 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m38.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [20]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")



50256

In [15]:
tokens = tokenizer.encode(sentence)
print("BPE Tokenization:", tokens)


BPE Tokenization: [40, 1842, 19617, 351, 9485, 15884, 354]


TypeError: object of type 'Encoding' has no len()

# Create Embeddings of tokenized words.

In [21]:
import torch
import torch.nn as nn

# Assume a vocab size of 30,000 and embedding dimension of 512
vocab_size = tokenizer.max_token_value
embedding_dim = 512  # Given in research Paper

# Create an embedding layer
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Example token IDs from a tokenized sentence
# token_ids = torch.tensor([101, 2204, 4539, 1022, 2007])

# Get the embeddings for the token IDs
embeddings = embedding_layer( torch.tensor(token_ids))

print("Embeddings shape:", embeddings.shape)  # Should be (5, 512) for 5 tokens
print("Embeddings:", embeddings)


Embeddings shape: torch.Size([8, 512])
Embeddings: tensor([[-1.5947,  0.3241, -0.4799,  ..., -0.1052, -0.4698, -0.2768],
        [ 0.4734,  1.6934, -0.8733,  ...,  1.0033,  1.6622, -0.6976],
        [-0.3605, -0.1322, -0.6009,  ..., -0.6075, -0.0437, -0.5079],
        ...,
        [ 0.3498, -1.2330, -0.3858,  ...,  1.5729,  0.6963, -1.3328],
        [-0.4225, -0.2490, -0.9021,  ..., -0.3376, -0.7263,  1.8447],
        [-0.8291, -1.1182,  0.6144,  ...,  0.5412,  0.0358,  1.2546]],
       grad_fn=<EmbeddingBackward0>)


# Tokenization to Embedding full steps

In [23]:
data = [
    "I love machine learning.",
    "Artificial intelligence is the future.",
    "Transformers are powerful models.",
    "Natural language processing is fascinating.",
    "Deep learning enables great advances in AI.",
    "PyTorch is a popular deep learning library.",
    "GPT models are widely used in NLP.",
    "Neural networks are the backbone of deep learning.",
    "Embedding layers convert tokens to vectors.",
    "Attention mechanisms help models focus on important parts of the input."
]


In [24]:
import string

## Creating our own vocab

def simple_tokenize(sentence):
    # Remove punctuation and lowercase the sentence
    sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
    print(sentence)
    # Split the sentence into tokens
    return sentence.split()

tokenized_data = [simple_tokenize(sentence) for sentence in data]

# Let's print the tokenized sentences
for i, tokens in enumerate(tokenized_data):
    print(f"Sentence {i+1}: {tokens}")

i love machine learning
artificial intelligence is the future
transformers are powerful models
natural language processing is fascinating
deep learning enables great advances in ai
pytorch is a popular deep learning library
gpt models are widely used in nlp
neural networks are the backbone of deep learning
embedding layers convert tokens to vectors
attention mechanisms help models focus on important parts of the input
Sentence 1: ['i', 'love', 'machine', 'learning']
Sentence 2: ['artificial', 'intelligence', 'is', 'the', 'future']
Sentence 3: ['transformers', 'are', 'powerful', 'models']
Sentence 4: ['natural', 'language', 'processing', 'is', 'fascinating']
Sentence 5: ['deep', 'learning', 'enables', 'great', 'advances', 'in', 'ai']
Sentence 6: ['pytorch', 'is', 'a', 'popular', 'deep', 'learning', 'library']
Sentence 7: ['gpt', 'models', 'are', 'widely', 'used', 'in', 'nlp']
Sentence 8: ['neural', 'networks', 'are', 'the', 'backbone', 'of', 'deep', 'learning']
Sentence 9: ['embedding',

In [26]:
from collections import defaultdict

vocab = defaultdict(lambda: len(vocab))

# Add a special token for unknown words
UNK = vocab["<UNK>"]
PAD = vocab["<PAD>"]  # Padding token

# we add padding for those sentence having less words in sentence. It helps to maintain sentence length equal.


# Populate the vocabulary with the tokenized data
for tokens in tokenized_data:
    for token in tokens:
        _ = vocab[token]

# Convert defaultdict to a regular dictionary
vocab = dict(vocab)

# Print the vocabulary
print("Vocabulary:", vocab)
print("Vocabulary Size:", len(vocab))


Vocabulary: {'<UNK>': 0, '<PAD>': 1, 'i': 2, 'love': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'is': 8, 'the': 9, 'future': 10, 'transformers': 11, 'are': 12, 'powerful': 13, 'models': 14, 'natural': 15, 'language': 16, 'processing': 17, 'fascinating': 18, 'deep': 19, 'enables': 20, 'great': 21, 'advances': 22, 'in': 23, 'ai': 24, 'pytorch': 25, 'a': 26, 'popular': 27, 'library': 28, 'gpt': 29, 'widely': 30, 'used': 31, 'nlp': 32, 'neural': 33, 'networks': 34, 'backbone': 35, 'of': 36, 'embedding': 37, 'layers': 38, 'convert': 39, 'tokens': 40, 'to': 41, 'vectors': 42, 'attention': 43, 'mechanisms': 44, 'help': 45, 'focus': 46, 'on': 47, 'important': 48, 'parts': 49, 'input': 50}
Vocabulary Size: 51


In [27]:
def tokens_to_ids(tokens, vocab):
    return [vocab.get(token, UNK) for token in tokens]
max_len = max(len(tokens) for tokens in tokenized_data)

token_ids_data = [tokens_to_ids(tokens, vocab) for tokens in tokenized_data]
padded_token_ids_data = [token_ids + [PAD] * (max_len - len(token_ids)) for token_ids in token_ids_data]


# Print the padded token IDs
for i, token_ids in enumerate(padded_token_ids_data):
    print(f"Padded Token IDs for Sentence {i+1}: {token_ids}")


Padded Token IDs for Sentence 1: [2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 1]
Padded Token IDs for Sentence 2: [6, 7, 8, 9, 10, 1, 1, 1, 1, 1, 1]
Padded Token IDs for Sentence 3: [11, 12, 13, 14, 1, 1, 1, 1, 1, 1, 1]
Padded Token IDs for Sentence 4: [15, 16, 17, 8, 18, 1, 1, 1, 1, 1, 1]
Padded Token IDs for Sentence 5: [19, 5, 20, 21, 22, 23, 24, 1, 1, 1, 1]
Padded Token IDs for Sentence 6: [25, 8, 26, 27, 19, 5, 28, 1, 1, 1, 1]
Padded Token IDs for Sentence 7: [29, 14, 12, 30, 31, 23, 32, 1, 1, 1, 1]
Padded Token IDs for Sentence 8: [33, 34, 12, 9, 35, 36, 19, 5, 1, 1, 1]
Padded Token IDs for Sentence 9: [37, 38, 39, 40, 41, 42, 1, 1, 1, 1, 1]
Padded Token IDs for Sentence 10: [43, 44, 45, 14, 46, 47, 48, 49, 36, 9, 50]


In [28]:
import torch
import torch.nn as nn

# Step 5: Create the Embedding Layer
vocab_size = len(vocab)
embedding_dim = 8

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Step 6: Convert Padded Token IDs to Embeddings
token_ids_tensor = [torch.tensor(token_ids) for token_ids in padded_token_ids_data]

# Get embeddings for each sentence
embeddings = [embedding_layer(token_ids) for token_ids in token_ids_tensor]

# Print the embeddings
for i, embedding in enumerate(embeddings):
    print(f"Embeddings for Padded Sentence {i+1}:")
    print(embedding)


Embeddings for Padded Sentence 1:
tensor([[-0.9132,  0.9468,  0.0817,  0.2827,  1.0028, -1.4293, -0.6571, -0.9337],
        [-0.4450,  0.5968, -1.0023,  0.3321, -1.6782,  0.4033, -0.6668, -0.4853],
        [ 0.4007, -0.2098,  0.6176,  2.8208,  1.4950,  1.0831,  0.1060, -0.1578],
        [-0.6790,  1.0265, -0.1804, -0.3918, -0.4139, -0.0575,  1.2878, -0.1764],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879],
        [ 0.6355,  0.4743,  0.0205, -1.0658, -1.4321,  0.6414,  0.3642, -2.0879]],
       grad_fn=<EmbeddingBackward0>)
Embeddings for Padded Sent

# ***Positional Encoding***

Positional Encoding (PE) in Transformers

We use positional encoding to provide information about the relative or absolute position of tokens in a sequence.
This is necessary because the self-attention mechanism in transformers is permutation-invariant.

Types of Positional Encoding:

1. Sinusoidal PE (used in original Transformer):
   - Uses sine and cosine functions to encode positions
   - Allows model to extrapolate to longer sequences

2. Learned PE (used in BERT and many other models):
   - Trainable embedding for each position
   - Can potentially capture more complex positional relationships

3. Rotary Position Embedding (RoPE) (used in LLaMA and Mixtral):
   - Applies rotation to key and query vectors in attention mechanism
   - Enables better relative positional understanding

4. ALiBi (Attention with Linear Biases) (used in some recent models):
   - Adds a bias term to attention scores based on relative positions
   - Allows for better extrapolation to longer sequences

Note: LLaMA and Mixtral specifically use RoPE, which has shown good performance in long-range dependencies.

# **Let's dive deeply into Positional encoding using Sin-cosine funtions**


In [None]:
import numpy as np

def get_positional_encoding(seq_len, d_model):
    """
    Generate a sinusoidal positional encoding matrix.

    :param seq_len: Length of the sequence.
    :param d_model: Dimension of the model (same as the embedding dimension).
    :return: A tensor of shape (seq_len, d_model) containing positional encodings.
    """
    # Initialize the positional encoding matrix
    pos_enc = np.zeros((seq_len, d_model))

    # Create a matrix of positions (i.e., 0, 1, 2, ..., seq_len-1)
    positions = np.arange(seq_len).reshape(-1, 1)

    # Define the denominator for the sine and cosine functions
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    # Calculate the positional encodings using sine and cosine functions
    pos_enc[:, 0::2] = np.sin(positions * div_term)
    pos_enc[:, 1::2] = np.cos(positions * div_term)

    # Convert the positional encoding matrix to a tensor
    return torch.tensor(pos_enc, dtype=torch.float32)

In [None]:
# Assume max_len is the length of the padded sequences
max_len = len(token_ids_tensor[0])  # This will be the same for all sentences after padding
d_model = embedding_dim  # The dimensionality of the embeddings

# Get positional encodings
positional_encoding = get_positional_encoding(max_len, d_model)

print(positional_encoding)



tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
          9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
          9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9996e-02,
          9.9955e-01,  3.0000e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,
          9.9920e-01,  4.0000e-03,  9.9999e-01],
        [-9.5892e-01,  2.8366e-01,  4.7943e-01,  8.7758e-01,  4.9979e-02,
          9.9875e-01,  5.0000e-03,  9.9999e-01],
        [-2.7942e-01,  9.6017e-01,  5.6464e-01,  8.2534e-01,  5.9964e-02,
          9.9820e-01,  6.0000e-03,  9.9998e-01],
        [ 6.5699e-01,  7.5390e-01,  6.4422e-01,  7.6484e-01,  6.9943e-02,
          9.9755e-01,  6.9999e-03,  9.9998e-01],
        [ 9.8936

In [None]:
# Add positional encoding to each embedding tensor
embeddings_with_pos = [embedding + positional_encoding for embedding in embeddings]

# Print the embeddings with positional encodings
for i, embedding_with_pos in enumerate(embeddings_with_pos):
    print(f"Embeddings with Positional Encoding for Padded Sentence {i+1}:")
    print(embedding_with_pos)

Embeddings with Positional Encoding for Padded Sentence 1:
tensor([[-0.1868,  2.9418, -1.9159,  0.8028, -0.1820,  0.1145, -1.6209,  1.3863],
        [-0.4423,  0.5764,  2.3257,  1.6117, -0.7483,  0.1450,  1.4288,  1.9461],
        [ 1.0055, -0.1113, -1.5196, -0.0966, -0.0895,  1.5557,  0.5989,  0.6294],
        [ 0.6427, -1.3467,  0.3895,  1.5610,  2.1572, -0.9829, -0.1190,  0.6251],
        [-1.1933,  1.9160,  1.1186,  1.2478, -1.4848,  2.3074, -0.1447,  0.8419],
        [-1.3954,  2.8533,  1.2086,  1.2043, -1.4749,  2.3070, -0.1437,  0.8419],
        [-0.7159,  3.5298,  1.2938,  1.1521, -1.4649,  2.3064, -0.1427,  0.8419],
        [ 0.2205,  3.3236,  1.3734,  1.0916, -1.4549,  2.3058, -0.1417,  0.8419],
        [ 0.5528,  2.4242,  1.4466,  1.0234, -1.4449,  2.3050, -0.1407,  0.8419],
        [-0.0244,  1.6585,  1.5125,  0.9483, -1.4350,  2.3042, -0.1397,  0.8419],
        [-0.9805,  1.7306,  1.5707,  0.8670, -1.4250,  2.3032, -0.1387,  0.8419]],
       grad_fn=<AddBackward0>)
Embeddi

**Positional Encoding - Learned Positional Embeddings**

How It Works
1.Initialize Positional Embeddings: Create a separate embedding layer for positions, similar to how token embeddings are created.

2.Add Positional Embeddings to Token Embeddings: During the forward pass, add the positional embeddings to the token embeddings.

3.Train the Model: The positional embeddings are updated through backpropagation along with other model parameters.




In [None]:
# Step 7: Learnable Positional Encoding
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super(LearnablePositionalEncoding, self).__init__()
        self.pos_embedding = nn.Parameter(torch.zeros(1, max_seq_len, d_model))

    def forward(self, x):
        # Add the positional embeddings to the input embeddings
        return x + self.pos_embedding[:, :x.size(1), :]

In [None]:
# Initialize the learnable positional encoding
learnable_pos_encoding = LearnablePositionalEncoding(max_len, embedding_dim)

# Add the learnable positional encodings to the embeddings
embeddings_with_pos = [learnable_pos_encoding(embedding.unsqueeze(0)) for embedding in embeddings]

# Print the embeddings with learnable positional encodings
for i, embedding_with_pos in enumerate(embeddings_with_pos):
    print(f"Embeddings with Learnable Positional Encoding for Padded Sentence {i+1}:")
    print(embedding_with_pos.squeeze(0))

Embeddings with Learnable Positional Encoding for Padded Sentence 1:
tensor([[-0.1868,  1.9418, -1.9159, -0.1972, -0.1820, -0.8855, -1.6209,  0.3863],
        [-1.2838,  0.0361,  2.2258,  0.6167, -0.7583, -0.8549,  1.4278,  0.9461],
        [ 0.0962,  0.3048, -1.7183, -1.0766, -0.1095,  0.5559,  0.5969, -0.3706],
        [ 0.5016, -0.3567,  0.0940,  0.6057,  2.1272, -1.9824, -0.1220, -0.3749],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581],
        [-0.4365,  2.5697,  0.7292,  0.3267, -1.5248,  1.3082, -0.1487, -0.1581]],
       grad_fn=<SqueezeBackw

# RoPE - (Rotational Positional Encoding)

Mathematical Formulation
Given an embedding vector
𝑥
x for a token at position
𝑝
p, RoPE applies a rotation matrix
𝑅
(
𝑝
)
R(p) to generate the positional encoding.

For each pair of dimensions
𝑖
i and
𝑖
+
1
i+1 in the embedding:

𝑥
′
[
𝑖
]
=
𝑥
[
𝑖
]
cos
⁡
(
𝜃
𝑝
)
−
𝑥
[
𝑖
+
1
]
sin
⁡
(
𝜃
𝑝
)


𝑥
′
[
𝑖
+
1
]
=
𝑥
[
𝑖
]
sin
⁡
(
𝜃
𝑝
)
+
𝑥
[
𝑖
+
1
]
cos
⁡
(
𝜃
𝑝
)




Where
𝜃
𝑝
θ
p
​
  is the angle determined by the position
𝑝
p.

In [None]:
import math

class RotationalPositionalEncoding(nn.Module):
    def __init__(self, d_model):
        super(RotationalPositionalEncoding, self).__init__()
        self.d_model = d_model
        assert d_model % 2 == 0, "Embedding dimension must be even for RoPE."

    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.float32, device=x.device)
        dim = torch.arange(self.d_model // 2, dtype=torch.float32, device=x.device)

        # Compute the angles for the rotations
        inv_freq = 1.0 / (10000 ** (2 * dim / self.d_model))
        sinusoid_inp = torch.einsum("i,j->ij", pos, inv_freq)

        sin = sinusoid_inp.sin()
        cos = sinusoid_inp.cos()

        # Apply rotation
        x_1, x_2 = x[..., 0::2], x[..., 1::2]
        x_rot = torch.cat([x_1 * cos - x_2 * sin, x_1 * sin + x_2 * cos], dim=-1)

        return x_rot

In [None]:

# Initialize RoPE
rot_pos_encoding = RotationalPositionalEncoding(embedding_dim)

# Apply RoPE to the embeddings
rotated_embeddings = [rot_pos_encoding(embedding.unsqueeze(0)) for embedding in embeddings]

# Print the result
print("Embeddings with Rotational Positional Encoding:\n", rotated_embeddings)

Embeddings with Rotational Positional Encoding:
 [tensor([[[-0.1868, -1.9159, -0.1820, -1.6209,  1.9418, -0.1972, -0.8855,
           0.3863],
         [-0.7240,  2.1531, -0.7497,  1.4268, -1.0608,  0.8358, -0.8624,
           0.9475],
         [-0.3172, -1.4701, -0.1206,  0.5976, -0.0394, -1.3965,  0.5536,
          -0.3694],
         [-0.4463, -0.0892,  2.1857, -0.1209,  0.4240,  0.6064, -1.9177,
          -0.3753],
         [ 2.2301,  0.5444, -1.5759, -0.1481, -1.3493,  0.5849,  1.2462,
          -0.1587],
         [ 2.3403,  0.4833, -1.5883, -0.1479,  1.1475,  0.6363,  1.2304,
          -0.1588],
         [ 0.2989,  0.4173, -1.6005, -0.1477,  2.5893,  0.6814,  1.2145,
          -0.1590],
         [-2.0173,  0.3472, -1.6126, -0.1476,  1.6505,  0.7197,  1.1984,
          -0.1591],
         [-2.4788,  0.2737, -1.6245, -0.1474, -0.8058,  0.7507,  1.1822,
          -0.1593],
         [-0.6613,  0.1973, -1.6362, -0.1473, -2.5212,  0.7743,  1.1659,
          -0.1594],
         [ 1.7642,  

# Self-Attention

Self-Attention in Transformers

Self-attention, also known as intra-attention, is a key component of the Transformer architecture. It allows the model to weigh the importance of different parts of the input sequence when processing each element. Here's why it's crucial:

1. Contextual understanding: Self-attention enables each word to attend to all other words in the sequence, capturing long-range dependencies and contextual information.

2. Parallelization: Unlike RNNs, self-attention can be computed in parallel for all positions, making it more efficient for training on modern hardware.

3. No sequential bottleneck: It doesn't suffer from the sequential nature of RNNs, allowing better handling of long-range dependencies.

4. Position-aware: When combined with positional encoding, it can take into account the relative or absolute positions of sequence elements.

5. Interpretability: The attention weights can be visualized to understand which parts of the input the model focuses on for each output.

In Transformers, self-attention is used in both the encoder and decoder, allowing the model to process input and generate output by considering the entire context of the sequence.


In [None]:

# Step 7: Implement Self-Attention
class SelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(SelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads."

        self.depth = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, x):
        batch_size = x.size(0)
        print(batch_size)
        print(x.shape)
        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))
        attention_weights = nn.functional.softmax(scores, dim=-1)

        out = torch.matmul(attention_weights, v)
        out = out.permute(0, 2, 1, 3).contiguous()
        out = out.view(batch_size, -1, self.d_model)

        return self.w_o(out)

# Example usage of Self-Attention
d_model = embedding_dim
num_heads = 2

self_attention = SelfAttention(d_model, num_heads)
attention_output = [self_attention(embeddings_with_pos[0].unsqueeze(0))]

# Print the attention output
print("Self-Attention Output:\n", attention_output)

1
torch.Size([1, 1, 11, 8])
Self-Attention Output:
 [tensor([[[ 0.2800, -0.3912,  0.3615,  0.2537, -0.0485, -0.3278, -0.4037,
           0.1846],
         [ 0.4607, -0.5139,  0.1797,  0.2159, -0.0106, -0.4759, -0.6212,
           0.1922],
         [ 0.3931, -0.4884,  0.2413,  0.2587, -0.0412, -0.4232, -0.5381,
           0.2028],
         [ 0.3274, -0.4345,  0.2283,  0.2571, -0.1562, -0.3656, -0.4148,
           0.0668],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038, -0.4634,  0.2993,  0.2249,  0.0493, -0.4266, -0.5735,
           0.2641],
         [ 0.4038

# Grouped Multi-query Attention

Grouped Multi-query Attention (GMQA)

Grouped Multi-query Attention (GMQA) is an optimization technique for transformer models that aims to reduce computational complexity and memory usage while maintaining model performance. It was introduced as an alternative to the standard multi-head attention mechanism.

Key features of GMQA:

1. Reduced parameter count: Instead of having separate key and value projections for each attention head, GMQA groups multiple query heads to share the same key and value projections.

2. Improved efficiency: By reducing the number of key and value projections, GMQA decreases the computational cost and memory requirements of the attention mechanism.

3. Scalability: GMQA allows for a higher number of attention heads without significantly increasing the model size, potentially leading to improved model capacity and performance.

4. Flexibility: The number of groups can be adjusted to balance between computational efficiency and model expressiveness.

GMQA has shown promising results in various natural language processing tasks, offering a good trade-off between model size, computational efficiency, and performance.


In [None]:
class GroupedMultiQueryAttention(nn.Module):
    def __init__(self, d_model, num_heads, num_groups):
        super(GroupedMultiQueryAttention, self).__init__()
        self.num_heads = num_heads
        self.num_groups = num_groups
        self.d_model = d_model
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads."
        assert num_heads % num_groups == 0, "num_heads must be divisible by num_groups."

        self.depth = d_model // num_heads
        self.group_depth = d_model // num_groups

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, self.group_depth)
        self.w_v = nn.Linear(d_model, self.group_depth)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size, num_heads):
        x = x.view(batch_size, -1, num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, x):
        batch_size = x.size(0)

        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)

        q = self.split_heads(q, batch_size, self.num_heads)

        # Split into groups first
        k = self.split_heads(k, batch_size, self.num_groups)
        v = self.split_heads(v, batch_size, self.num_groups)

        # Expand k and v to match the number of heads
        k = k.repeat(1, self.num_heads // self.num_groups, 1, 1)
        v = v.repeat(1, self.num_heads // self.num_groups, 1, 1)

        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))
        attention_weights = nn.functional.softmax(scores, dim=-1)

        out = torch.matmul(attention_weights, v)
        out = out.permute(0, 2, 1, 3).contiguous()
        out = out.view(batch_size, -1, self.d_model)

        return self.w_o(out)



In [None]:
# Example usage of Self-Attention
d_model = embedding_dim
num_heads = 4
num_groups = 2

grp_attention = GroupedMultiQueryAttention(d_model, num_heads,num_groups)
attention_output = [grp_attention(embeddings_with_pos[0].unsqueeze(0)) ]

# Print the attention output
print("grp-Attention Output:\n", attention_output)

grp-Attention Output:
 [tensor([[[-0.0975,  0.4815, -0.2181, -0.0645,  0.6781, -0.1469,  0.2447,
           0.5530],
         [-0.2061,  0.4858, -0.2481, -0.1751,  0.7303, -0.1108,  0.2562,
           0.5559],
         [-0.1628,  0.3477, -0.1717, -0.1807,  0.6270, -0.0950,  0.3227,
           0.4950],
         [-0.1191,  0.6156, -0.2846, -0.0753,  0.7080, -0.1475,  0.1373,
           0.5293],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479,  0.6505, -0.1287,  0.2982,
           0.5093],
         [-0.1412,  0.3695, -0.1708, -0.1479, 

# Feed-Forward Networks (FFN) in Transformers

Feed-Forward Networks (FFN) in Transformers

Feed-Forward Networks (FFN) are a crucial component of the Transformer architecture. They are applied after the self-attention mechanism in each encoder and decoder layer. The FFN helps in introducing non-linearity and increasing the model's capacity to learn complex patterns.

Architecture of FFN:
1. Input layer: Takes in the output from the self-attention layer (dimension: d_model)
2. First linear transformation: Expands the input to a higher dimension (d_ff, typically 4 times d_model)
3. Activation function: Usually ReLU (Rectified Linear Unit)
4. Second linear transformation: Projects back to the original dimension (d_model)
5. Residual connection: The output is added to the input
6. Layer normalization: Applied to the sum of the residual connection

The FFN can be represented as:

FFN(x) = max(0, xW1 + b1)W2 + b2

Where W1, W2 are weight matrices, and b1, b2 are bias vectors.

This architecture allows the Transformer to process information across all positions in the sequence, complementing the global dependencies captured by the self-attention mechanism.


In [None]:
import torch.nn.functional as F

# Define the Feed-Forward Network (FFN) with Layer Normalization and Residual Connection
class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardNetwork, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Apply the feed-forward network with ReLU activation
        residual = x
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        # Add residual connection and apply layer normalization
        x = self.layer_norm(x + residual)
        return x

# Example usage in the Transformer block
d_ff = 2048  # Feed-forward hidden dimension size (usually larger than d_model)

# Initialize the feed-forward network
ffn = FeedForwardNetwork(d_model=embedding_dim, d_ff=d_ff)
# Pass the output from attention through the feed-forward network
ffn_output = [ffn(attn_out) for attn_out in attention_output]

# Print the output of the feed-forward network
for i, output in enumerate(ffn_output):
    print(f"Feed-Forward Output for Sentence {i+1}:\n", output)

Feed-Forward Output for Sentence 1:
 tensor([[[-0.7320,  1.0657, -1.1681, -0.9003,  1.6522, -0.3548, -0.5565,
           0.9937],
         [-0.8963,  1.0240, -1.0854, -1.0707,  1.6280, -0.1372, -0.4415,
           0.9791],
         [-0.8567,  0.7859, -0.9993, -1.2850,  1.6923, -0.0752, -0.2852,
           1.0232],
         [-0.7094,  1.3196, -1.2032, -0.7750,  1.5363, -0.3225, -0.7072,
           0.8615],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.0017, -1.1716,  1.7443, -0.2256, -0.3844,
           1.0249],
         [-0.8087,  0.8230, -1.00

In [None]:
# SwiGLU Activation
class SwiGLU(nn.Module):
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)  # Split the input into two equal parts
        return F.silu(x1) * x2  # Apply Swish (SiLU) to one part and multiply with the other


In [None]:
import torch.nn.functional as F

# Define the Feed-Forward Network (FFN) with Layer Normalization and Residual Connection
class FeedForwardNetworkSwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardNetworkSwiGLU, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff*2)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.swiGLU = SwiGLU()

        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Apply the feed-forward network with ReLU activation
        residual = x

        x = self.fc1(x)
        x = self.swiGLU(x)
        x = self.fc2(x)

        # Add residual connection and apply layer normalization
        x = self.layer_norm(x + residual)
        return x


d_ff = 2048  # Feed-forward hidden dimension size (usually larger than d_model)

# Initialize the FFN with SwiGLU
ffn_swiglu = FeedForwardNetworkSwiGLU(embedding_dim, d_ff)

# Create some example input embeddings (from the previous steps)
print(attention_output[0].size())
# Pass the input through the FFN
output = ffn_swiglu(attention_output[0])

print("Output from Feed-Forward Network with SwiGLU:\n", output)

torch.Size([1, 11, 8])
Output from Feed-Forward Network with SwiGLU:
 tensor([[[-0.9865,  0.9154, -1.0368, -0.6631,  1.5247, -1.0750,  0.2606,
           1.0608],
         [-1.1463,  0.9018, -0.9634, -0.8691,  1.5587, -0.8105,  0.3332,
           0.9956],
         [-1.1431,  0.6535, -0.8548, -1.0026,  1.5162, -0.8577,  0.6576,
           1.0309],
         [-0.9485,  1.2198, -1.1016, -0.6020,  1.5146, -0.9444, -0.0580,
           0.9201],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],
         [-1.0865,  0.6887, -0.8617, -0.9044,  1.5588, -0.9826,  0.5404,
           1.0473],


# Full Implementation till EncoderLayer

In [None]:
import torch
import torch.nn as nn
import string
from collections import defaultdict

# Step 1: Prepare the Data
data = [
    "I love machine learning.",
    "Artificial intelligence is the future.",
    "Transformers are powerful models.",
    "Natural language processing is fascinating.",
    "Deep learning enables great advances in AI.",
    "PyTorch is a popular deep learning library.",
    "GPT models are widely used in NLP.",
    "Neural networks are the backbone of deep learning.",
    "Embedding layers convert tokens to vectors.",
    "Attention mechanisms help models focus on important parts of the input."
]

# Step 2: Simple Tokenization
def simple_tokenize(sentence):
    sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
    return sentence.split()

tokenized_data = [simple_tokenize(sentence) for sentence in data]

# Step 3: Build a Vocabulary with Padding and Unknown Tokens
vocab = defaultdict(lambda: len(vocab))
PAD = vocab["<PAD>"]
UNK = vocab["<UNK>"]

# Populate the vocabulary with the tokenized data
for tokens in tokenized_data:
    for token in tokens:
        _ = vocab[token]

# Convert defaultdict to a regular dictionary
vocab = dict(vocab)







# Step 4: Convert Tokens to IDs and Pad Sequences
def tokens_to_ids(tokens, vocab):
    return [vocab.get(token, UNK) for token in tokens]

max_len = max(len(tokens) for tokens in tokenized_data)
token_ids_data = [tokens_to_ids(tokens, vocab) for tokens in tokenized_data]
padded_token_ids_data = [token_ids + [PAD] * (max_len - len(token_ids)) for token_ids in token_ids_data]







# Step 5: Create the Embedding Layer
vocab_size = len(vocab)
embedding_dim = 8
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Convert Padded Token IDs to Embeddings
token_ids_tensor = torch.tensor(padded_token_ids_data)
print(token_ids_tensor)

embeddings = embedding_layer(token_ids_tensor)





# Step 6: Positional Encoding (Sinusoidal)
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0).transpose(0, 1)

    def forward(self, x):
        return x + self.pe[:x.size(0), :]




pos_encoder = PositionalEncoding(embedding_dim, max_len)
embeddings_with_pos = pos_encoder(embeddings)

# Step 7: Self-Attention Mechanism
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)
        q = self.w_q(x).view(batch_size, -1, self.num_heads, self.depth).transpose(1, 2)
        k = self.w_k(x).view(batch_size, -1, self.num_heads, self.depth).transpose(1, 2)
        v = self.w_v(x).view(batch_size, -1, self.num_heads, self.depth).transpose(1, 2)

        scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))
        attention_weights = nn.functional.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, v)

        attention_output = attention_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.depth)
        output = self.w_o(attention_output)
        return output

num_heads = 4
attention = MultiHeadSelfAttention(embedding_dim, num_heads)
attention_output = attention(embeddings_with_pos)






# Step 8: Feed-Forward Network with SwiGLU
class SwiGLU(nn.Module):
    def forward(self, x):
        return x * torch.nn.functional.silu(x)

class FeedForwardNetworkSwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardNetworkSwiGLU, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff * 2)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        x = self.fc1(x)

        # SwiGLU: Split the output into two parts
        x1, x2 = x.chunk(2, dim=-1)

        # Apply SiLU activation function to one part
        x = x1 * torch.nn.functional.silu(x2)

        # Apply the second linear transformation
        x = self.fc2(x)

        # Add the residual connection and apply layer normalization
        x = self.layer_norm(x + residual)
        return x

d_ff = 32
ffn_swiglu = FeedForwardNetworkSwiGLU(embedding_dim, d_ff)
ffn_output = ffn_swiglu(attention_output)

# Step 9: Combine into an Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(EncoderLayer, self).__init__()
        self.self_attention = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = FeedForwardNetworkSwiGLU(d_model, d_ff)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Apply self-attention and add residual connection
        attn_output = self.self_attention(x)
        x = self.layer_norm(x + attn_output)

        # Apply feed-forward network and add residual connection
        ffn_output = self.ffn(x)
        x = self.layer_norm(x + ffn_output)
        return x

# Instantiate and apply the encoder layer
encoder_layer = EncoderLayer(embedding_dim, num_heads, d_ff)
encoder_output = encoder_layer(embeddings_with_pos)

# Print the final encoder output
print("Final Encoder Output:\n", encoder_output)


tensor([[ 2,  3,  4,  5,  0,  0,  0,  0,  0,  0,  0],
        [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0,  0],
        [11, 12, 13, 14,  0,  0,  0,  0,  0,  0,  0],
        [15, 16, 17,  8, 18,  0,  0,  0,  0,  0,  0],
        [19,  5, 20, 21, 22, 23, 24,  0,  0,  0,  0],
        [25,  8, 26, 27, 19,  5, 28,  0,  0,  0,  0],
        [29, 14, 12, 30, 31, 23, 32,  0,  0,  0,  0],
        [33, 34, 12,  9, 35, 36, 19,  5,  0,  0,  0],
        [37, 38, 39, 40, 41, 42,  0,  0,  0,  0,  0],
        [43, 44, 45, 14, 46, 47, 48, 49, 36,  9, 50]])
Final Encoder Output:
 tensor([[[-0.3615,  2.0050, -0.6459, -1.4127,  0.7535, -0.4096, -0.5554,
           0.6265],
         [-1.1762, -0.4951, -0.9052,  0.9795, -1.2819,  0.8444,  0.7284,
           1.3061],
         [-1.1402,  0.9215,  0.0189,  1.5088, -0.6020,  0.1075, -1.6047,
           0.7903],
         [ 0.1036,  0.5697, -1.5493,  0.8538, -1.4637,  1.3849,  0.5352,
          -0.4343],
         [-1.2121,  1.6602, -0.1858, -0.9283,  0.1794,  1.4658, 

In [None]:
## Connecting multiple encoderlayers to make Encoder


class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_len)
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Apply embedding and positional encoding
        x = self.embedding(x)
        x = self.pos_encoder(x)

        # Pass through each encoder layer
        for encoder_layer in self.encoder_layers:
            x = encoder_layer(x)

        # Apply final layer normalization
        x = self.layer_norm(x)
        return x

In [None]:
# Example usage:
num_layers = 6
encoder = TransformerEncoder(embedding_dim, num_heads, d_ff, num_layers, vocab_size, max_len)

# Convert input data to tensor format
input_tensor = torch.tensor(padded_token_ids_data)

# Pass input through the full encoder
encoded_output = encoder(input_tensor)

print("Encoded Output:\n", encoded_output)

Encoded Output:
 tensor([[[-2.5182e+00,  3.1315e-01,  1.3221e-01,  4.9132e-01,  1.0794e+00,
           2.3023e-01, -1.6487e-02,  2.8838e-01],
         [-5.0042e-01,  3.8617e-01, -1.2510e-01, -1.9037e-01, -2.1190e+00,
           1.4153e-01,  9.3809e-01,  1.4691e+00],
         [-5.1613e-01,  1.4354e+00, -4.9616e-01,  1.3677e+00, -6.6916e-01,
           9.4237e-01, -1.2448e+00, -8.1924e-01],
         [ 9.0937e-01,  7.5987e-02, -3.1832e-01,  1.1996e+00, -1.3432e+00,
          -1.6139e+00, -1.2916e-02,  1.1034e+00],
         [ 7.3825e-01,  1.0895e+00, -1.0489e+00,  8.0652e-01, -1.9039e+00,
          -6.4912e-01,  5.2264e-01,  4.4491e-01],
         [ 7.3825e-01,  1.0895e+00, -1.0489e+00,  8.0652e-01, -1.9039e+00,
          -6.4912e-01,  5.2264e-01,  4.4491e-01],
         [ 7.3825e-01,  1.0895e+00, -1.0489e+00,  8.0652e-01, -1.9039e+00,
          -6.4912e-01,  5.2264e-01,  4.4491e-01],
         [ 7.3825e-01,  1.0895e+00, -1.0489e+00,  8.0652e-01, -1.9039e+00,
          -6.4912e-01,  5.2264e-0

# Masked-MultiHeadAttention And DecoderLayer

In [None]:
### Masked Muti head attention


class MaskedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MaskedMultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, x, kv=None, mask=None):
        batch_size = x.size(0)

        # If kv is None, we're performing self-attention
        if kv is None:
            kv = x

        q = self.w_q(x)  # Query
        k = self.w_k(kv)  # Key
        v = self.w_v(kv)  # Value

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = nn.functional.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, v)

        # Concatenate heads
        attention_output = attention_output.permute(0, 2, 1, 3).contiguous()
        attention_output = attention_output.view(batch_size, -1, self.num_heads * self.depth)

        # Final linear layer
        output = self.w_o(attention_output)
        return output


In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(DecoderLayer, self).__init__()
        self.masked_self_attention = MaskedMultiHeadSelfAttention(d_model, num_heads)
        self.encoder_attention = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = FeedForwardNetworkSwiGLU(d_model, d_ff)
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)
        self.layer_norm_3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, mask=None):
        # Masked self-attention
        residual = x
        x = self.masked_self_attention(x, x, mask)
        x = self.layer_norm_1(x + residual)

        # Encoder-decoder attention
        residual = x
        x = self.encoder_attention(x, encoder_output)
        x = self.layer_norm_2(x + residual)

        # Feed-forward network
        residual = x
        x = self.ffn(x)
        x = self.layer_norm_3(x + residual)

        return x


In [None]:
## Decoder
class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_len)
        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.layer_norm = nn.LayerNorm(d_model)
        self.output_layer = nn.Linear(d_model, vocab_size)

    def forward(self, x, encoder_output, mask=None):
        # Apply embedding and positional encoding
        x = self.embedding(x)
        x = self.pos_encoder(x)

        # Pass through each decoder layer
        for decoder_layer in self.decoder_layers:
            x = decoder_layer(x, encoder_output, mask)

        # Apply final layer normalization and output layer
        x = self.layer_norm(x)
        output = self.output_layer(x)
        return output


In [None]:

# Example usage:
num_layers = 6
decoder = TransformerDecoder(embedding_dim, num_heads, d_ff, num_layers, vocab_size, max_len)

# Create some example target input (e.g., shifted right target sequence for training)
target_input_tensor = torch.tensor(padded_token_ids_data)

# Pass target input and encoder output through the decoder
decoder_output = decoder(target_input_tensor, encoded_output)

print("Decoder Output:\n", decoder_output)

Decoder Output:
 tensor([[[-0.9084, -0.3984,  0.3829,  ..., -0.3335,  0.2679, -0.6934],
         [ 0.0345,  1.5045,  0.5387,  ...,  0.3364,  0.5554, -0.5450],
         [ 0.7700,  0.8883,  0.2676,  ...,  0.6521, -0.4089,  0.3865],
         ...,
         [ 0.2712,  1.4132,  0.3575,  ..., -0.1094,  0.3532, -0.5686],
         [ 0.2712,  1.4132,  0.3575,  ..., -0.1094,  0.3532, -0.5686],
         [ 0.2712,  1.4132,  0.3575,  ..., -0.1094,  0.3532, -0.5686]],

        [[-0.1309,  1.1669,  0.3967,  ..., -0.2471,  0.2829,  0.0891],
         [-0.0665,  0.8651,  0.2818,  ...,  0.8058, -0.3688,  0.2312],
         [ 0.9910,  0.2022, -0.3763,  ..., -0.5187, -1.0557,  0.4144],
         ...,
         [ 0.4884,  0.9727, -0.2119,  ..., -0.6482,  0.0232,  0.0171],
         [ 0.4884,  0.9727, -0.2119,  ..., -0.6482,  0.0232,  0.0171],
         [ 0.4884,  0.9727, -0.2119,  ..., -0.6482,  0.0232,  0.0171]],

        [[ 1.0777,  0.1037, -0.2088,  ..., -0.4967, -0.7784,  0.6806],
         [ 0.9500,  0.0276, 

# Bringing all Features together to build Transformer - Encode and Decoder

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import string
from collections import defaultdict

# Define SwiGLU activation function
class SwiGLU(nn.Module):
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)  # Split the input into two equal parts
        return F.silu(x1) * x2  # Apply Swish (SiLU) to one part and multiply with the other

# Define Rotatory Positional Encoding
class RotatoryPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super(RotatoryPositionalEncoding, self).__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len

        # Create a tensor of shape (max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(np.log(10000.0) / d_model))
        self.register_buffer('pe', torch.zeros(max_seq_len, d_model))
        self.pe[:, 0::2] = torch.sin(position * div_term)
        self.pe[:, 1::2] = torch.cos(position * div_term)

    def forward(self, x):
        return x + self.pe[:x.size(1)]

# Define Multi-Head Self Attention with Masking
class MaskedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MaskedMultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, x, mask=None):
        batch_size = x.size(0)

        q = self.split_heads(self.w_q(x), batch_size)
        k = self.split_heads(self.w_k(x), batch_size)
        v = self.split_heads(self.w_v(x), batch_size)

        scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))

        if mask is not None:
            scores += mask  # Apply mask to scores

        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, v)
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.num_heads * self.depth)
        return self.w_o(output)

# Define Feed-Forward Network with SwiGLU
class FeedForwardNetworkSwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardNetworkSwiGLU, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff*2)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.swiGLU = SwiGLU()
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        x = self.fc1(x)
        x = self.swiGLU(x)
        x = self.fc2(x)
        x = self.layer_norm(x + residual)
        return x

# Define Transformer Encoder
class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = RotatoryPositionalEncoding(max_seq_len, d_model)
        self.encoder_layers = nn.ModuleList([
            nn.ModuleList([
                MultiHeadSelfAttention(d_model, num_heads),
                FeedForwardNetworkSwiGLU(d_model, d_ff)
            ]) for _ in range(num_layers)
        ])
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for self_attn, ffn in self.encoder_layers:
            x = self_attn(x)
            x = ffn(x)
        return self.layer_norm(x)

# Define Transformer Decoder
class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = RotatoryPositionalEncoding(max_seq_len, d_model)
        self.decoder_layers = nn.ModuleList([
            nn.ModuleList([
                MaskedMultiHeadSelfAttention(d_model, num_heads),  # Masked MHA
                MultiHeadSelfAttention(d_model, num_heads),  # Cross-attention
                FeedForwardNetworkSwiGLU(d_model, d_ff)
            ]) for _ in range(num_layers)
        ])
        self.layer_norm = nn.LayerNorm(d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, x, encoder_output):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for self_attn, cross_attn, ffn in self.decoder_layers:
            seq_len = x.size(1)
            mask = torch.triu(torch.ones(seq_len, seq_len) * float('-inf'), diagonal=1).to(x.device)
            x = self.layer_norm(x + self_attn(x, mask))
            x = self.layer_norm(x + cross_attn(x, encoder_output))
            x = self.layer_norm(x + ffn(x))
        return self.fc_out(x)


In [None]:


# Fake dataset
data = [
    "I love machine learning.",
    "Artificial intelligence is the future.",
    "Transformers are powerful models.",
    "Natural language processing is fascinating.",
    "Deep learning enables great advances in AI.",
    "PyTorch is a popular deep learning library.",
    "GPT models are widely used in NLP.",
    "Neural networks are the backbone of deep learning.",
    "Embedding layers convert tokens to vectors.",
    "Attention mechanisms help models focus on important parts of the input."
]

# Tokenize data
def simple_tokenize(sentence):
    sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
    return sentence.split()

tokenized_data = [simple_tokenize(sentence) for sentence in data]

# Build vocabulary with padding and unknown tokens
vocab = defaultdict(lambda: len(vocab))
PAD = vocab["<PAD>"]
UNK = vocab["<UNK>"]

for tokens in tokenized_data:
    for token in tokens:
        _ = vocab[token]

vocab = dict(vocab)

def tokens_to_ids(tokens, vocab):
    return [vocab.get(token, UNK) for token in tokens]

max_len = max(len(tokens) for tokens in tokenized_data)

token_ids_data = [tokens_to_ids(tokens, vocab) for tokens in tokenized_data]
padded_token_ids_data = [token_ids + [PAD] * (max_len - len(token_ids)) for token_ids in token_ids_data]

# Model parameters
d_model = 8
num_heads = 4
d_ff = 32
num_layers = 2
vocab_size = len(vocab)
max_seq_len = max_len

# Initialize models
encoder = TransformerEncoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)
decoder = TransformerDecoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)

# Example input sequences
batch_size = 2
src = torch.tensor(padded_token_ids_data[:batch_size])
tgt = torch.tensor(padded_token_ids_data[batch_size:batch_size*2])

# Forward pass through the encoder and decoder
encoder_output = encoder(src)
decoder_output = decoder(tgt, encoder_output)

# Print the results
print("Source sequences (src):")
print(src)
print("\nTarget sequences (tgt):")
print(tgt)
print("\nEncoder output:")
print(encoder_output)
print("\nDecoder output:")
print(decoder_output)



Source sequences (src):
tensor([[ 2,  3,  4,  5,  0,  0,  0,  0,  0,  0,  0],
        [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])

Target sequences (tgt):
tensor([[11, 12, 13, 14,  0,  0,  0,  0,  0,  0,  0],
        [15, 16, 17,  8, 18,  0,  0,  0,  0,  0,  0]])

Encoder output:
tensor([[[-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  0.0998,
          -0.0271],
         [-0.3966, -1.9151,  1.0020,  0.6535,  1.4238, -0.8403,  

In [None]:
from torch.utils.data import Dataset, DataLoader

class TransformerDataset(Dataset):
    def __init__(self, src_sequences, tgt_sequences):
        self.src_sequences = src_sequences
        self.tgt_sequences = tgt_sequences

    def __len__(self):
        return len(self.src_sequences)

    def __getitem__(self, idx):
        src = torch.tensor(self.src_sequences[idx])
        tgt = torch.tensor(self.tgt_sequences[idx])
        return src, tgt

# Example dataset (using the padded sequences from before)
src_sequences = padded_token_ids_data[:8]  # Source sequences
tgt_sequences = padded_token_ids_data[1:9]  # Target sequences (shifted by 1 for simplicity)

dataset = TransformerDataset(src_sequences, tgt_sequences)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True)


In [None]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)
        self.decoder = TransformerDecoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)

    def forward(self, src, tgt):
        encoder_output = self.encoder(src)
        decoder_output = self.decoder(tgt, encoder_output)
        return decoder_output


In [None]:
import torch.optim as optim

# Initialize the transformer model
model = Transformer(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # Ignoring padding tokens in loss

# Training loop
num_epochs = 100

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for src, tgt in data_loader:
        tgt_input = tgt[:, :-1]  # All but last token (teacher forcing)
        tgt_output = tgt[:, 1:]  # All but first token (labels)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        loss = criterion(output.view(-1, vocab_size), tgt_output.contiguous().view(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")


Epoch 1/150, Loss: 4.0442
Epoch 2/150, Loss: 3.6850
Epoch 3/150, Loss: 3.4424
Epoch 4/150, Loss: 3.2484
Epoch 5/150, Loss: 3.0421
Epoch 6/150, Loss: 2.8623
Epoch 7/150, Loss: 2.6885
Epoch 8/150, Loss: 2.4912
Epoch 9/150, Loss: 2.2966
Epoch 10/150, Loss: 2.1498
Epoch 11/150, Loss: 1.9810
Epoch 12/150, Loss: 1.8318
Epoch 13/150, Loss: 1.6873
Epoch 14/150, Loss: 1.5451
Epoch 15/150, Loss: 1.4129
Epoch 16/150, Loss: 1.3051
Epoch 17/150, Loss: 1.1938
Epoch 18/150, Loss: 1.0882
Epoch 19/150, Loss: 0.9945
Epoch 20/150, Loss: 0.9219
Epoch 21/150, Loss: 0.8340
Epoch 22/150, Loss: 0.7451
Epoch 23/150, Loss: 0.6584
Epoch 24/150, Loss: 0.6055
Epoch 25/150, Loss: 0.5418
Epoch 26/150, Loss: 0.4851
Epoch 27/150, Loss: 0.4511
Epoch 28/150, Loss: 0.3996
Epoch 29/150, Loss: 0.3627
Epoch 30/150, Loss: 0.3240
Epoch 31/150, Loss: 0.2923
Epoch 32/150, Loss: 0.2656
Epoch 33/150, Loss: 0.2427
Epoch 34/150, Loss: 0.2169
Epoch 35/150, Loss: 0.2013
Epoch 36/150, Loss: 0.1905
Epoch 37/150, Loss: 0.1878
Epoch 38/1

In [None]:
def evaluate(model, data_loader):
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for src, tgt in data_loader:
            tgt_input = tgt[:, :-1]  # All but last token (teacher forcing)
            tgt_output = tgt[:, 1:]  # All but first token (labels)

            output = model(src, tgt_input)
            loss = criterion(output.view(-1, vocab_size), tgt_output.contiguous().view(-1))

            total_loss += loss.item()

    avg_loss = total_loss / len(data_loader)
    return avg_loss

# Evaluate the model
validation_loss = evaluate(model, data_loader)
print(f"Validation Loss: {validation_loss:.4f}")


Validation Loss: 0.0063


In [None]:
def generate_sequence(model, src, max_len=max_seq_len):
    model.eval()
    src = torch.tensor(src).unsqueeze(0)  # Add batch dimension
    tgt = torch.zeros((1, max_len), dtype=torch.long)  # Initialize target sequence with zeros
    print("Initial Target Sequence:", tgt)

    with torch.no_grad():
        for i in range(max_len):
            tgt_input = tgt[:, :i+1]  # All tokens up to position i (inclusive)
            print("Target Input Shape:", tgt_input.shape)

            # Generate predictions
            output = model(src, tgt_input)
            print("Model Output Shape:", output.shape)

            # Output should be of shape (batch_size, seq_len, vocab_size)
            if output.dim() == 3:
                # Handle cases where output is empty
                if output.size(1) > 0:
                    next_token = output[:, -1, :].argmax(dim=-1)  # Get the most likely next token
                    tgt[:, i] = next_token
                else:
                    print("Output sequence is empty.")
                    break
            else:
                print(f"Unexpected output shape: {output.shape}")
                break

            # Stop if padding token is generated
            if next_token.item() == PAD:
                break

    return tgt.squeeze().tolist()

# Generate a sequence
example_src = padded_token_ids_data[9]  # Use an example source sequence
print(f"Example Source Sequence: {example_src}")
generated_sequence = generate_sequence(model, example_src)
print(f"Generated Sequence: {generated_sequence}")


Example Source Sequence: [43, 44, 45, 14, 46, 47, 48, 49, 36, 9, 50]
Initial Target Sequence: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Target Input Shape: torch.Size([1, 1])
Model Output Shape: torch.Size([1, 1, 51])
Target Input Shape: torch.Size([1, 2])
Model Output Shape: torch.Size([1, 2, 51])
Target Input Shape: torch.Size([1, 3])
Model Output Shape: torch.Size([1, 3, 51])
Target Input Shape: torch.Size([1, 4])
Model Output Shape: torch.Size([1, 4, 51])
Target Input Shape: torch.Size([1, 5])
Model Output Shape: torch.Size([1, 5, 51])
Target Input Shape: torch.Size([1, 6])
Model Output Shape: torch.Size([1, 6, 51])
Target Input Shape: torch.Size([1, 7])
Model Output Shape: torch.Size([1, 7, 51])
Target Input Shape: torch.Size([1, 8])
Model Output Shape: torch.Size([1, 8, 51])
Target Input Shape: torch.Size([1, 9])
Model Output Shape: torch.Size([1, 9, 51])
Target Input Shape: torch.Size([1, 10])
Model Output Shape: torch.Size([1, 10, 51])
Target Input Shape: torch.Size([1, 11])


In [None]:
from torch.optim.lr_scheduler import StepLR

# Define a learning rate scheduler
scheduler = StepLR(optimizer, step_size=5, gamma=0.7)

# Modify the training loop to include the scheduler
for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for src, tgt in data_loader:
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]

        optimizer.zero_grad()
        output = model(src, tgt_input)
        loss = criterion(output.view(-1, vocab_size), tgt_output.contiguous().view(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

    # Update the learning rate
    scheduler.step()


Epoch 1/150, Loss: 0.0062
Epoch 2/150, Loss: 0.0061
Epoch 3/150, Loss: 0.0061
Epoch 4/150, Loss: 0.0059
Epoch 5/150, Loss: 0.0059
Epoch 6/150, Loss: 0.0059
Epoch 7/150, Loss: 0.0058
Epoch 8/150, Loss: 0.0057
Epoch 9/150, Loss: 0.0057
Epoch 10/150, Loss: 0.0057
Epoch 11/150, Loss: 0.0056
Epoch 12/150, Loss: 0.0056
Epoch 13/150, Loss: 0.0056
Epoch 14/150, Loss: 0.0055
Epoch 15/150, Loss: 0.0055
Epoch 16/150, Loss: 0.0054
Epoch 17/150, Loss: 0.0054
Epoch 18/150, Loss: 0.0054
Epoch 19/150, Loss: 0.0054
Epoch 20/150, Loss: 0.0054
Epoch 21/150, Loss: 0.0054
Epoch 22/150, Loss: 0.0053
Epoch 23/150, Loss: 0.0053
Epoch 24/150, Loss: 0.0053
Epoch 25/150, Loss: 0.0053
Epoch 26/150, Loss: 0.0053
Epoch 27/150, Loss: 0.0053
Epoch 28/150, Loss: 0.0052
Epoch 29/150, Loss: 0.0052
Epoch 30/150, Loss: 0.0052
Epoch 31/150, Loss: 0.0052
Epoch 32/150, Loss: 0.0052
Epoch 33/150, Loss: 0.0052
Epoch 34/150, Loss: 0.0052
Epoch 35/150, Loss: 0.0052
Epoch 36/150, Loss: 0.0051
Epoch 37/150, Loss: 0.0052
Epoch 38/1

In [None]:
# Save the model
torch.save(model.state_dict(), 'transformer_model.pth')

# Load the model
model = Transformer(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)
model.load_state_dict(torch.load('transformer_model.pth'))


<All keys matched successfully>

In [None]:
# Save model and optimizer state
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'transformer_model.pth')


In [None]:
# Initialize the model and optimizer
model = TransformerEncoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_seq_len)
optimizer = torch.optim.Adam(model.parameters())

# Load the model and optimizer state
checkpoint = torch.load('transformer_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model.eval()  # Set the model to evaluation mode


RuntimeError: Error(s) in loading state_dict for TransformerEncoder:
	Missing key(s) in state_dict: "embedding.weight", "pos_encoding.pe", "encoder_layers.0.0.w_q.weight", "encoder_layers.0.0.w_q.bias", "encoder_layers.0.0.w_k.weight", "encoder_layers.0.0.w_k.bias", "encoder_layers.0.0.w_v.weight", "encoder_layers.0.0.w_v.bias", "encoder_layers.0.0.w_o.weight", "encoder_layers.0.0.w_o.bias", "encoder_layers.0.1.fc1.weight", "encoder_layers.0.1.fc1.bias", "encoder_layers.0.1.fc2.weight", "encoder_layers.0.1.fc2.bias", "encoder_layers.0.1.layer_norm.weight", "encoder_layers.0.1.layer_norm.bias", "encoder_layers.1.0.w_q.weight", "encoder_layers.1.0.w_q.bias", "encoder_layers.1.0.w_k.weight", "encoder_layers.1.0.w_k.bias", "encoder_layers.1.0.w_v.weight", "encoder_layers.1.0.w_v.bias", "encoder_layers.1.0.w_o.weight", "encoder_layers.1.0.w_o.bias", "encoder_layers.1.1.fc1.weight", "encoder_layers.1.1.fc1.bias", "encoder_layers.1.1.fc2.weight", "encoder_layers.1.1.fc2.bias", "encoder_layers.1.1.layer_norm.weight", "encoder_layers.1.1.layer_norm.bias", "layer_norm.weight", "layer_norm.bias". 
	Unexpected key(s) in state_dict: "encoder.embedding.weight", "encoder.pos_encoding.pe", "encoder.encoder_layers.0.0.w_q.weight", "encoder.encoder_layers.0.0.w_q.bias", "encoder.encoder_layers.0.0.w_k.weight", "encoder.encoder_layers.0.0.w_k.bias", "encoder.encoder_layers.0.0.w_v.weight", "encoder.encoder_layers.0.0.w_v.bias", "encoder.encoder_layers.0.0.w_o.weight", "encoder.encoder_layers.0.0.w_o.bias", "encoder.encoder_layers.0.1.fc1.weight", "encoder.encoder_layers.0.1.fc1.bias", "encoder.encoder_layers.0.1.fc2.weight", "encoder.encoder_layers.0.1.fc2.bias", "encoder.encoder_layers.0.1.layer_norm.weight", "encoder.encoder_layers.0.1.layer_norm.bias", "encoder.encoder_layers.1.0.w_q.weight", "encoder.encoder_layers.1.0.w_q.bias", "encoder.encoder_layers.1.0.w_k.weight", "encoder.encoder_layers.1.0.w_k.bias", "encoder.encoder_layers.1.0.w_v.weight", "encoder.encoder_layers.1.0.w_v.bias", "encoder.encoder_layers.1.0.w_o.weight", "encoder.encoder_layers.1.0.w_o.bias", "encoder.encoder_layers.1.1.fc1.weight", "encoder.encoder_layers.1.1.fc1.bias", "encoder.encoder_layers.1.1.fc2.weight", "encoder.encoder_layers.1.1.fc2.bias", "encoder.encoder_layers.1.1.layer_norm.weight", "encoder.encoder_layers.1.1.layer_norm.bias", "encoder.layer_norm.weight", "encoder.layer_norm.bias", "decoder.embedding.weight", "decoder.pos_encoding.pe", "decoder.decoder_layers.0.0.w_q.weight", "decoder.decoder_layers.0.0.w_q.bias", "decoder.decoder_layers.0.0.w_k.weight", "decoder.decoder_layers.0.0.w_k.bias", "decoder.decoder_layers.0.0.w_v.weight", "decoder.decoder_layers.0.0.w_v.bias", "decoder.decoder_layers.0.0.w_o.weight", "decoder.decoder_layers.0.0.w_o.bias", "decoder.decoder_layers.0.1.w_q.weight", "decoder.decoder_layers.0.1.w_q.bias", "decoder.decoder_layers.0.1.w_k.weight", "decoder.decoder_layers.0.1.w_k.bias", "decoder.decoder_layers.0.1.w_v.weight", "decoder.decoder_layers.0.1.w_v.bias", "decoder.decoder_layers.0.1.w_o.weight", "decoder.decoder_layers.0.1.w_o.bias", "decoder.decoder_layers.0.2.fc1.weight", "decoder.decoder_layers.0.2.fc1.bias", "decoder.decoder_layers.0.2.fc2.weight", "decoder.decoder_layers.0.2.fc2.bias", "decoder.decoder_layers.0.2.layer_norm.weight", "decoder.decoder_layers.0.2.layer_norm.bias", "decoder.decoder_layers.1.0.w_q.weight", "decoder.decoder_layers.1.0.w_q.bias", "decoder.decoder_layers.1.0.w_k.weight", "decoder.decoder_layers.1.0.w_k.bias", "decoder.decoder_layers.1.0.w_v.weight", "decoder.decoder_layers.1.0.w_v.bias", "decoder.decoder_layers.1.0.w_o.weight", "decoder.decoder_layers.1.0.w_o.bias", "decoder.decoder_layers.1.1.w_q.weight", "decoder.decoder_layers.1.1.w_q.bias", "decoder.decoder_layers.1.1.w_k.weight", "decoder.decoder_layers.1.1.w_k.bias", "decoder.decoder_layers.1.1.w_v.weight", "decoder.decoder_layers.1.1.w_v.bias", "decoder.decoder_layers.1.1.w_o.weight", "decoder.decoder_layers.1.1.w_o.bias", "decoder.decoder_layers.1.2.fc1.weight", "decoder.decoder_layers.1.2.fc1.bias", "decoder.decoder_layers.1.2.fc2.weight", "decoder.decoder_layers.1.2.fc2.bias", "decoder.decoder_layers.1.2.layer_norm.weight", "decoder.decoder_layers.1.2.layer_norm.bias", "decoder.layer_norm.weight", "decoder.layer_norm.bias", "decoder.fc_out.weight", "decoder.fc_out.bias". 

In [None]:
def generate_sequence(encoder, decoder, src, max_len=11):
    encoder.eval()
    decoder.eval()

    src = torch.tensor(src).unsqueeze(0)  # Add batch dimension
    tgt = torch.zeros((1, max_len), dtype=torch.long)  # Initialize target sequence with zeros
    print("Initial Target Sequence:", tgt)

    with torch.no_grad():
        # Encode the source sequence
        encoder_output = encoder(src)

        for i in range(max_len):
            tgt_input = tgt[:, :i+1]  # All tokens up to position i (inclusive)
            print("Target Input Shape:", tgt_input.shape)

            # Forward pass through the decoder
            output = decoder(tgt_input, encoder_output)
            print("Model Output Shape:", output.shape)

            # Output should be of shape (batch_size, seq_len, vocab_size)
            if output.dim() == 3:
                if output.size(1) > 0:
                    next_token = output[:, -1, :].argmax(dim=-1)  # Get the most likely next token
                    tgt[:, i] = next_token
                else:
                    print("Output sequence is empty.")
                    break
            else:
                print(f"Unexpected output shape: {output.shape}")
                break

            if next_token.item() == PAD:
                break

    return tgt.squeeze().tolist()

# Generate a sequence using the loaded models
example_src = padded_token_ids_data[8]  # Use an example source sequence
print(f"Example Source Sequence: {example_src}")
generated_sequence = generate_sequence(encoder, decoder, example_src)
print(f"Generated Sequence: {generated_sequence}")


Example Source Sequence: [37, 38, 39, 40, 41, 42, 0, 0, 0, 0, 0]
Initial Target Sequence: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Target Input Shape: torch.Size([1, 1])
Model Output Shape: torch.Size([1, 1, 51])
Target Input Shape: torch.Size([1, 2])
Model Output Shape: torch.Size([1, 2, 51])
Target Input Shape: torch.Size([1, 3])
Model Output Shape: torch.Size([1, 3, 51])
Target Input Shape: torch.Size([1, 4])
Model Output Shape: torch.Size([1, 4, 51])
Target Input Shape: torch.Size([1, 5])
Model Output Shape: torch.Size([1, 5, 51])
Target Input Shape: torch.Size([1, 6])
Model Output Shape: torch.Size([1, 6, 51])
Target Input Shape: torch.Size([1, 7])
Model Output Shape: torch.Size([1, 7, 51])
Target Input Shape: torch.Size([1, 8])
Model Output Shape: torch.Size([1, 8, 51])
Target Input Shape: torch.Size([1, 9])
Model Output Shape: torch.Size([1, 9, 51])
Target Input Shape: torch.Size([1, 10])
Model Output Shape: torch.Size([1, 10, 51])
Target Input Shape: torch.Size([1, 11])
Mode

In [None]:
vocab = {
    0: '<PAD>',
    1: '<UNK>',
    2: 'I',
    3: 'love',
    4: 'machine',
    5: 'learning',
    6: '.',
    7: 'Artificial',
    8: 'intelligence',
    9: 'is',
    10: 'the',
    11: 'future',
    12: 'Transformers',
    13: 'are',
    14: 'powerful',
    15: 'models',
    16: 'Natural',
    17: 'language',
    18: 'processing',
    19: 'fascinating',
    20: 'Deep',
    21: 'learning',
    22: 'enables',
    23: 'great',
    24: 'advances',
    25: 'in',
    26: 'AI',
    27: 'PyTorch',
    28: 'popular',
    29: 'deep',
    30: 'library',
    31: 'GPT',
    32: 'models',
    33: 'widely',
    34: 'used',
    35: 'NLP',
    36: 'Neural',
    37: 'networks',
    38: 'are',
    39: 'the',
    40: 'backbone',
    41: 'of',
    42: 'Embedding',
    43: 'layers',
    44: 'convert',
    45: 'tokens',
    46: 'to',
    47: 'vectors',
    48: 'Attention',
    49: 'mechanisms',
    50: 'help',
    51: 'focus',
    52: 'on',
    53: 'important',
    54: 'parts',
    55: 'input'
}

def tokens_to_sentence(tokens, vocab):
    return ' '.join(vocab.get(token, '<UNK>') for token in tokens)

# Convert to sentence

sentence = tokens_to_sentence(generated_sequence, vocab)
print(f"Generated Sentence: {sentence}")

Generated Sentence: enables enables layers are are are are are are are are


Next steps depend on your goals and current project needs. Here are some potential directions:

Fine-Tuning and Evaluation:

Fine-Tune: If you’re working with a specific dataset or task, fine-tune the model to improve performance.
Evaluate: Use metrics like BLEU score for sequence generation tasks, or other relevant metrics for your specific application.
Model Optimization:

Optimize Performance: Explore techniques to optimize the model for faster inference and reduced memory usage.
Quantization: Implement model quantization for deployment on edge devices or for efficiency.
Expand the Model:

Additional Features: Implement additional features or improvements, such as more advanced positional encodings or attention mechanisms.
Experimentation: Try different architectures or hyperparameters to see if performance improves.
Integration:

Deployment: Integrate the model into a production environment or application. Ensure it works well with real-world data and under different conditions.
User Interface: Develop a user interface for interacting with the model, such as a web or mobile application.
Documentation and Sharing:

Document: Create comprehensive documentation for your model, including how to use it, its limitations, and its performance.
Share: Share your work with the community or stakeholders. You might consider publishing a paper, creating a project repository, or presenting your work.
Continual Learning:

Keep Learning: Stay updated with the latest research and techniques in the field. Implement new ideas and improvements as you learn more.