In [1]:
import os
import torch
import torch.nn as nn
import numpy as np
import random

# Reproducibility

https://pytorch.org/docs/stable/notes/randomness.html

In [2]:
def set_seed(seed_value=42):
    """Sets the seed for reproducibility across different libraries."""

    # PYTHONHASHSEED is an environment variable that seeds Python's hash randomization.
    # Hash randomization affects the iteration order of sets and anything else that depends on
    # raw hash values. It does not directly affect PyTorch computations such as random tensors or
    # model training, but it can affect training/evaluation indirectly, e.g. when you shuffle or
    # split a dataset based on the iteration order of a set of keys.

    # Since Python 3.7, dict preserves insertion order, so dict iteration order is stable.
    # The hash values themselves (what hash(key) returns for str and bytes) are still randomized
    # per process unless PYTHONHASHSEED is set. If any sampling, tokenization or worker-shuffling
    # code depends on these hashes, the result is non-deterministic behaviour.
    # Note: the interpreter reads PYTHONHASHSEED at startup, so setting it here only affects
    # subprocesses started afterwards. To fix hashing for this process, export it before launching Python.
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)  # if you are using multiple GPUs.
    

    # cuDNN is NVIDIA's CUDA Deep Neural Network library: a low-level GPU-accelerated library that
    # PyTorch, TensorFlow and other frameworks use under the hood for convolutions, activation
    # functions, RNNs, etc. cuDNN has multiple convolution implementations (e.g. GEMM-based,
    # Winograd), each better in different scenarios: some are faster for small kernels, others
    # use less memory. With benchmarking enabled, the first convolution call for a given set of
    # input sizes times the candidate implementations and reuses the fastest one in later calls.
    # Whenever the sizes change, the benchmark is rerun. Because timings are noisy, a different
    # implementation can win on a different run, which introduces non-determinism.
    # With benchmarking disabled, the algorithm is selected deterministically, possibly at the
    # cost of speed. So: if input sizes are fixed and speed matters more than exact
    # reproducibility, enable benchmarking; if input sizes keep changing, disable it.
    torch.backends.cudnn.benchmark = False

    # Disabling benchmarking makes the algorithm selection deterministic, but the selected cuDNN
    # algorithm may itself be non-deterministic. Setting torch.backends.cudnn.deterministic = True
    # restricts cuDNN to deterministic algorithms so that it behaves the same on every run.
    torch.backends.cudnn.deterministic = True
    
    print(f"Random seed set as {seed_value}")

# Usage
set_seed(42)


# Finally, the main thing: how to identify the best seed. In theory the random seed should not matter; models should on average perform the same across different seeds. In practice, what I have observed is that results can differ slightly between seeds, especially when the model is unstable, the dataset is small, or the training is sensitive (e.g. RL training).
# Some good practices:
#   1. If you have the resources, go for a seed sweep, i.e. run multiple times with different seeds and choose the one that gives better results; otherwise stick to commonly used seed values like 42, 0, 1234
#   2. One can assign different seed values to different libraries, but I find that using the same seed for all initializations reduces confusion when comparing/monitoring runs
#   3. Finally, always log your seed.


Random seed set as 42
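
The seed sweep from practice 1 can be sketched as below. `train_and_eval` is a hypothetical stand-in for a real training run (here it only returns a noisy score), and `seed_sweep` is an illustrative helper name; the point is the loop structure: one run per seed, every seed logged next to its result, and the spread reported.

```python
import random
import statistics

def train_and_eval(seed):
    """Hypothetical stand-in for a full training run; returns a validation score."""
    rng = random.Random(seed)          # local RNG so runs do not interfere with each other
    return 0.80 + rng.gauss(0, 0.01)   # pretend accuracy with seed-dependent noise

def seed_sweep(seeds):
    """Run one training per seed and summarize the scores."""
    results = {seed: train_and_eval(seed) for seed in seeds}
    scores = list(results.values())
    summary = {"mean": statistics.mean(scores), "stdev": statistics.stdev(scores)}
    return results, summary

results, summary = seed_sweep([0, 1, 42, 1234, 2024])
for seed, score in results.items():
    print(f"seed={seed} score={score:.4f}")   # always log the seed with its result
print(f"mean={summary['mean']:.4f} stdev={summary['stdev']:.4f}")
```

Looking at the mean and standard deviation alongside the best seed gives a rough idea of how much of a difference between two runs is just seed noise.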


# Pseudo Code
1. Vocab Layer input: B X text, output: B X N
    Create Tokenizer for the data
    Generate token to index mapping
    In case num_tokens < model sequence length, we will [PAD] it
2. Embedding Layer (Input: B X N,  Output: B X N X d_model)
    For each token id, map it to the corresponding embedding
    Create an Embedding Matrix V X d_model
    The input and output layers can share the same weights, which is called weight tying
3. Positional Encoding Layer - GPT uses learnable position embeddings (Input: B X N,  Output: B X N X d_model)
    Create Positional Encoding for the Sequence Length S
    The fixed sinusoidal alternative from the original Transformer is:
    for even dimension 2i: sin(pos/10000**(2*i/d_model))
    for odd dimension 2i+1: cos(pos/10000**(2*i/d_model))
4. Input Layer -> Embedding Layer + Positional Encoding Layer - (Input: B X N,  Output: B X N X d_model)
5. Transformer Layer - (Input: B X N X d_model, Output: B X N X d_model)
    1. Multi Head Attention
        1. Single Head Causal Attention (Input: B X N X d_model, Output: B X N X d_hidden)
            Performs Softmax(Q @ K.T/sqrt(d_hidden)) @ V
            Attention Scores will be masked
            Parameters - W_q, W_k, W_v (d_model X d_hidden)
        2. Stack Multiple Heads   Input: B X num_heads X N X d_hidden, Output: B X N X d_model
            d_model = d_hidden * num_heads
    2. LayerNorm - Normalization across feature dimension (Input: B X N X d_model, Output: B X N X d_model)
        Parameters: scale (d_model) and shift (d_model)
        Equation: scale*(x-mean)/std_dev + shift
    3. Residual Connection: x + f(x) (Input: B X N X d_model, Output: B X N X d_model)
    4. Feed Forward Network: (Input: B X N X d_model, Output: B X N X d_model)
        Passes through two linear layers
        1. First Layer d_model X 4*d_model
        2. Non-linearity - ReLU, GELU etc
        3. Second Layer - 4*d_model X d_model
    Transformer Layer Equations
            x = x + Dropout(MHA(Layer_Norm1(x)))
            x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
6. Output Layer
    1. Linear Layer (Input:B X N X d_model, Output: B X N X V)
        Parameters - d_model X V
    2. Softmax Layer (Input:B X N X V, Output: B X N X V)
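
For reference, the fixed sinusoidal formulas listed under step 3 can be sketched in plain Python. This is the fixed encoding from the original Transformer, not what GPT uses (GPT learns its position embeddings); the function name `sinusoidal_encoding` is illustrative. Note that sin/cos alternate over the embedding dimension, not over the position.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for dim in range(0, d_model, 2):              # dim is the even dimension index
            angle = pos / (10000 ** (dim / d_model))
            table[pos][dim] = math.sin(angle)         # even dimension -> sin
            if dim + 1 < d_model:
                table[pos][dim + 1] = math.cos(angle) # following odd dimension -> cos
    return table

pe = sinusoidal_encoding(seq_len=8, d_model=4)
print(pe[0])  # position 0: sin(0) = 0.0 on even dims, cos(0) = 1.0 on odd dims
```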

# What I learnt today
1. GPT doesn't use [PAD] tokens; masking and fixed length are handled through attention masks
2. Padding only affects attention and no other layer, because all the other layers process each position independently
3. There is a concept of pre-norm vs post-norm. The latter is the legacy layout and the former is now preferred
4. Dropout can be added to the feed forward network as well
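
A minimal sketch of the pre-norm vs post-norm difference from point 3. The `sublayer` here is just an `nn.Linear` standing in for MHA or the feed forward network, and the names `pre_norm` / `post_norm` are illustrative; only the position of the normalization relative to the residual connection matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 4
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for MHA or the feed forward network

def pre_norm(x):
    # Normalize the sublayer input; the residual path carries x through untouched.
    return x + sublayer(norm(x))

def post_norm(x):
    # Normalize after adding the residual (the layout of the original Transformer).
    return norm(x + sublayer(x))

x = torch.randn(2, 8, d_model)  # B X N X d_model
print(pre_norm(x).shape, post_norm(x).shape)
```

In the pre-norm layout the output is not normalized, which is why the model in this notebook applies one more LayerNorm after the last transformer layer.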

# Class Definitions

We will create a separate class definition for each layer so that we can plug in different variants and test which works better

In [3]:
from abc import ABC, abstractmethod
class Tokenizer(ABC):
    @abstractmethod
    def create_index(self, text):
        """
        This method creates the index by applying methods like BPE, WordPiece etc
        """
        pass
    
    @abstractmethod
    def get_index(self):
        """
        This method returns the text token to idx mapping
        """
        pass
    
    @abstractmethod
    def get_reverse_index(self):
        """
        This method returns the token idx to text mapping
        """
        pass

    @abstractmethod
    def encode(self, text):
        """
        Based on the tokenization technique, this method uses the index to convert text into a list of token ids.
        """
        pass

    @abstractmethod
    def decode(self, tokens):
        """
        Based on the tokenization technique, this method uses the reverse index to convert token ids back into tokens.
        """
        pass

In [4]:
class InputEmbedding(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # self.embedding_weights = nn.Parameter()
    
    def forward(self, x):
        """
        This method takes in a batch of input tokens of size B X N and returns embeddings corresponding to them of size B X N X d_model
        """
        pass

In [5]:
class PositionalEncoding(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_weights = nn.Parameter()
    
    def forward(self, x):
        """
        This method takes in a batch of input lengths of each sequence in the batch of size B X 1 and returns positional embeddings corresponding to them of size B X N X d_model
        """
        pass

In [6]:
class TransformerLayer(nn.Module):
    """
    This class performs the following equations
    x = x + Dropout(MHA(Layer_Norm1(x)))
    x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
    """
    def __init__(self):
        super().__init__()
        pass

    def forward(self, x):
        """
        Takes input of size B X N X d_model and returns output of size B X N X d_model
        """
        pass

class LayerNorm(nn.Module):
    """
    This class performs the Layer Normalization. 
    x_norm = scale*(x-mean)/std_dev + shift where scale and shift are learnable parameters of size (d_model)
    """
    def __init__(self):
        super().__init__()
        pass

    def forward(self, x):
        """
        Takes input of size B X N X d_model and performs normalization across the last dimension.
        
        """
        pass


class FeedForwardNetwork(nn.Sequential):
    """
    This class applies a feed forward network over the input. It essentially passes the input through two linear layers
        1. First Layer d_model X 4*d_model
        2. Non-linearity - Relu, Gelu etc
        3. Second Layer - 4*d_model X d_model
    Takes input of size B X N X d_model and applies a feed forward network over it and returns an output of size B X N X d_model
    """
    def __init__(self):
        super().__init__()
        pass


class Dropout(nn.Module):
    """
    """
    def __init__(self, prob):
        super().__init__()
        self.prob = prob
        pass

    def forward(self, x):
        """
        Takes input of size B X N X d_model and applies a Dropout layer over it and returns an output of size B X N X d_model
        """
        pass

class MultiHeadAttention(nn.Module):
    """
    This class applies the Multi-Head Attention layer over the input. The self-attention mechanism is run in parallel, each time with a different set of learned parameters for the query, key and value projections. If the number of heads is num_heads, then to keep the input and output dimensions constant we use the per-head hidden size d_k = d_model/num_heads. The outputs from all heads are concatenated and passed through a Linear layer, and dropout is applied to the result.
    MHA(x) = Linear(Concat([Self_Attention(x)_1, ...., Self_Attention(x)_num_heads]))
    """
    def __init__(self, num_heads):
        super().__init__()
        self.num_heads = num_heads
        pass

    def forward(self, x, attention_mask):
        """
        Takes input of size B X N X d_model and applies a MHA layer over it and returns an output of size B X N X d_model. Uses attention_mask to stop computing attention for padding tokens. 
        """
        pass


class Transformer(nn.Module):
    def __init__(self):
        super().__init__()
        pass

    def forward(self, input_ids, attention_mask):
        """
        Takes input of size B X N text token indices and attention maps and applies the Transformer over it and returns an output of size B X N X V
        """
        pass 
    


# Things to note

1. nn.Parameter - this is used when you need to define a tensor that is a learnable parameter of your PyTorch model. These parameters are tracked by the nn.Module and can be accessed via model.parameters() for optimization.
2. nn.Linear by default includes a bias term and computes Wx + b; the bias can be disabled (bias=False), leaving just Wx. So in the absence of a bias, both nn.Linear and nn.Parameter can be used for Wx.
3. nn.Sequential vs nn.Module - All neural networks are implemented with nn.Module. If the layers are used sequentially (self.layer3(self.layer2(self.layer1(x)))), you can use nn.Sequential and skip defining the forward function of the model.
4. torch.nn.functional (a.k.a. F) is a module in PyTorch that provides stateless, function-based implementations of various neural network operations (like activations, normalizations, etc.), as opposed to the nn.Module versions, which are stateful and class-based. In the functional form we pass all the parameters explicitly, and their creation and updates are handled outside the function.
E.g.    
        
        # nn.Module
        self.linear = nn.Linear(128, 64)            
        def forward(self, x):
            return self.linear(x)
        
        # torch.nn.functional
        import torch.nn.functional as F

        def forward(self, x):
            weight = self.weight  # Define yourself in __init__
            bias = self.bias
            return F.linear(x, weight, bias)
5. Difference between embedding formulations - nn.Embedding vs nn.Linear vs nn.Parameter - we can use all these to define embedding matrices but their forward functions operate differently.
    a. nn.Embedding selects the rows of the given matrix, given a list of integers
    b. nn.Linear does the einsum operation ...d, d e -> ...e
    c. nn.Parameter basically just makes a tensor trainable (receive gradients and updates on step). this is the lowest level you can go, so actually, you can define your entire deep neural network with just nn.Parameters and manually do all the above with gathers and matrix multiplies or einsums
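
A small sketch of point 5, checking that the three formulations give the same embeddings when they share the same matrix. The sizes and token ids are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4
token_ids = torch.tensor([[1, 3, 3], [0, 9, 2]])  # B X N

# a. nn.Embedding: row lookup
embedding = nn.Embedding(vocab_size, d_model)
out_embedding = embedding(token_ids)

# b. nn.Linear on one-hot inputs: a matmul that picks out the same rows.
#    nn.Linear stores its weight as (out_features, in_features), hence the transpose.
linear = nn.Linear(vocab_size, d_model, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
out_linear = linear(one_hot)

# c. nn.Parameter: a raw trainable tensor, indexed manually
table = nn.Parameter(embedding.weight.detach().clone())
out_parameter = table[token_ids]

print(out_embedding.shape)
print(torch.allclose(out_embedding, out_linear), torch.allclose(out_embedding, out_parameter))
```

The lookup in (a) and (c) avoids materializing the B X N X V one-hot tensor that (b) needs, which is the practical reason to prefer it for input embeddings.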

# Implementation

## Byte Pair Encoding Tokenization

In [7]:
# GPT uses BPE, we will implement and experiment with wordpiece and unigram later
from collections import defaultdict

class BytePairEncoding(Tokenizer):
    def __init__(self, max_vocab_size):
        """
        Init method for Byte Pair Encoding
        :param max_vocab_size: maximum size of the vocabulary
        """
        self.max_vocab_size = max_vocab_size
        self.id2token_index = None
        self.token2id_index = None
        self.pad_token = None
        self.merge_rules = []

    def _preprocess_text(self, texts):
        tokenized_text = {}
        initial_vocab = set()
        for i, text in enumerate(texts):
            tokenized_text[i] = list(text.strip())
            initial_vocab.update(tokenized_text[i])
        return tokenized_text, initial_vocab
    
    def _get_most_frequent_pair(self, tokenized_text):
        token_freq_map = defaultdict(int)
        for i, tokens in tokenized_text.items():
            for token_1, token2 in zip(tokens, tokens[1:]):
                token_freq_map[(token_1, token2)] += 1
        if not token_freq_map:
            return None, None
        # max_frequency = max(token_freq_map.values())
        # max_freq_pair = [i for i in token_freq_map if token_freq_map[i]==max_frequency][0]
        max_freq_pair = max(token_freq_map, key=lambda x: token_freq_map[x])
        return max_freq_pair, token_freq_map[max_freq_pair]
    
    def _merge_tokens(self, tokenized_text, max_freq_pair):
        """
        This method merges the occurrence of max_freq_pair tokens in tokenized_text
        :param tokenized_text: - A dictionary of index to list of tokens mapping
        :param max_freq_pair: - a tuple of two tokens which have the max frequency of occurring
        """
        for i, tokens in tokenized_text.items():
            updated_tokens = []
            j = 0
            while j < len(tokens) - 1:
                if tokens[j] == max_freq_pair[0] and tokens[j + 1] == max_freq_pair[1]:
                    updated_tokens.append(tokens[j] + tokens[j + 1])
                    j += 2
                else:
                    updated_tokens.append(tokens[j])
                    j += 1
            
            if j == len(tokens) - 1:  # Handle the last token if it doesn't form a pair with max_freq_pair
                updated_tokens.append(tokens[-1])
            
            tokenized_text[i] = updated_tokens
        return tokenized_text


    def create_index(self, texts):
        """
        Steps:
        1. Start with an initial vocab of the individual characters in the texts
        2. Find the frequency of each adjacent pair of tokens in the input texts
        3. Add the most frequently occurring pair to the vocab and merge it in the texts
        4. Repeat steps 2-3 until max_vocab_size is reached or no pairs are left
        
        :param texts: The list of texts from which the vocab is generated
        :return: The token to id mapping

        """
        tokenized_text, vocab = self._preprocess_text(texts)
        self.merge_rules = []
        while len(vocab)<self.max_vocab_size:
            max_freq_pair, max_frequency = self._get_most_frequent_pair(tokenized_text)
            if not max_freq_pair:
                break
            self.merge_rules.append(max_freq_pair)
            tokenized_text = self._merge_tokens(tokenized_text, max_freq_pair)
            vocab.add("".join(max_freq_pair))
        
        self.id2token_index = {i: token for i, token in enumerate(vocab)}
        self.token2id_index = {token: i for i, token in enumerate(vocab)}
        return self.token2id_index
    
    def get_index(self):
        return self.token2id_index

    def get_reverse_index(self):
        return self.id2token_index

    def encode(self, text):
        """
        Splits the text into characters, applies the learned merge rules in order, and returns the list of token ids.
        """
        tokens = {0: list(text.strip())}
        for rule in self.merge_rules:
            tokens = self._merge_tokens(tokens, rule)
        return [self.token2id_index[token] for token in tokens[0]]
         
        
    def decode(self, tokens):
        """
        Maps a list of token ids back to the list of token strings.
        """
        return [self.id2token_index[token_idx] for token_idx in tokens]


In [8]:
bpe = BytePairEncoding(20)
corpus = [
    "low",
    "lower",
    "lowest",
    "newer",
    "newest",
    "wide",
    "wider",
    "widest",
    "high",
    "higher",
    "highest",
    "bright",
    "brighter",
    "brightest",
    "dark",
    "darker",
    "darkest",
    "slow",
    "slower"
]
index = bpe.create_index(corpus)
print("Encoding slowest", bpe.encode("slowest"))
print("Decoding [19, 2, 10, 18, 1]", bpe.decode([19, 2, 10, 18, 1]))

Encoding slowest [11, 0, 7, 8, 12]
Decoding [19, 2, 10, 18, 1] ['t', 'er', 'igh', 'n', 'k']
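
`create_index` assigns ids by enumerating a `set`, and the iteration order of a set of strings depends on hash randomization, so the ids printed above can change between interpreter sessions unless `PYTHONHASHSEED` is fixed before Python starts. A minimal sketch of one alternative, assuming it is acceptable to number the sorted base characters first and then the merges in the order they were learned (the helper name `build_index` is hypothetical and not part of the class above):

```python
def build_index(initial_vocab, merge_rules):
    """Assign token ids deterministically: sorted base characters, then merges in learned order."""
    tokens = sorted(initial_vocab)
    seen = set(tokens)
    for left, right in merge_rules:
        merged = left + right
        if merged not in seen:      # two different merges can produce the same string
            seen.add(merged)
            tokens.append(merged)
    token2id = {token: i for i, token in enumerate(tokens)}
    id2token = {i: token for token, i in token2id.items()}
    return token2id, id2token

token2id, id2token = build_index({"l", "o", "w", "e", "r"}, [("l", "o"), ("lo", "w"), ("e", "r")])
print(token2id)
# {'e': 0, 'l': 1, 'o': 2, 'r': 3, 'w': 4, 'lo': 5, 'low': 6, 'er': 7}
```

With this ordering the same corpus and merge rules always produce the same ids, independent of the hash seed.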


## Transformer Implementation

In [None]:
class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.embedding_weights = nn.Embedding(vocab_size, d_model, dtype=torch.float32)
    
    def forward(self, x):
        """
        This method takes in a batch of input tokens of size B X N and returns embeddings corresponding to them of size B X N X d_model
        :param x: Input tokens of shape B X N 
        """
        return self.embedding_weights(x)
        
# Ideally this is not necessary but we will keep it in case we want to plug in another type of initialization

class PositionalEncoding(nn.Module):
    def __init__(self, seq_len, d_model):
        super().__init__()
        self.seq_len = seq_len
        self.d_model = d_model
        self.pos_embeddings = nn.Embedding(self.seq_len, self.d_model, dtype=torch.float32)
    
    def forward(self, x):
        return self.pos_embeddings(x)
    
class TransformerLayer(nn.Module):
    """
    This class performs the following equations
    x = x + Dropout(MHA(Layer_Norm1(x)))
    x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
    """
    def __init__(self, d_model, num_heads, dropout_p_mha, dropout_p_ffn):
        super().__init__()
        self.MHA = MultiHeadAttention(num_heads, d_model)
        self.layer_norm_1 = LayerNorm(d_model)
        self.dropout_1 = Dropout(dropout_p_mha)
        self.ffn  = FeedForwardNetwork(d_model)
        self.layer_norm_2 = LayerNorm(d_model)
        self.dropout_2 = Dropout(dropout_p_ffn)

    def forward(self, inputs):
        """
        Takes input of size B X N X d_model and returns output of size B X N X d_model
        """
        x, attention_mask = inputs[0], inputs[1]
        x = x + self.dropout_1(self.MHA(self.layer_norm_1(x), attention_mask))
        x = x + self.dropout_2(self.ffn(self.layer_norm_2(x)))
        return [x, attention_mask]

class LayerNorm(nn.Module):
    """
    This class performs the Layer Normalization. 
    x_norm = scale*(x-mean)/std_dev + shift where scale and shift are learnable parameters of size (d_model)
    """
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.d_model = d_model
        self.scale = nn.Parameter(torch.ones(self.d_model, dtype=torch.float32))
        self.shift = nn.Parameter(torch.zeros(self.d_model, dtype=torch.float32))
        self.eps = eps

    def forward(self, x):
        """
        Takes input of size B X N X d_model and performs normalization across the last dimension.
        :param x: Tensor of shape B X N X d_model
        """
        return (x - x.mean(dim=-1, keepdim=True))/(x.std(dim=-1, keepdim=True) + self.eps) * self.scale + self.shift



class FeedForwardNetwork(nn.Module):
    """
    This class applies a feed forward network over the input. It essentially passes the input through two linear layers
        1. First Layer d_model X 4*d_model
        2. Non-linearity - Relu, Gelu etc
        3. Second Layer - 4*d_model X d_model
    Takes input of size B X N X d_model and applies a feed forward network over it and returns an output of size B X N X d_model

    Some models use 8/3*d_model as hidden_dim instead
    """
    def __init__(self, d_model, hidden_dim=None):
        """
        """
        super().__init__()
        hidden_dim = hidden_dim or 4*d_model
        # use hidden_dim so that a custom width passed by the caller actually takes effect
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden_dim, dtype=torch.float32),
            nn.GELU(),
            nn.Linear(hidden_dim, d_model, dtype=torch.float32)
        )
    
    def forward(self, x):
        """
        Input and Output are of dimension B X N X d_model 
        """
        return self.ffn(x)


class Dropout(nn.Module):
    """
    """
    def __init__(self, prob):
        super().__init__()
        self.prob = prob

    def forward(self, x):
        """
        Takes input of size B X N X d_model and applies a Dropout layer over it and returns an output of size B X N X d_model
        """
        if not self.training or self.prob == 0:
            return x  # dropout is only active in training mode
        # keep each element with probability 1 - prob; rand_like builds the mask on x's device
        keep_mask = torch.rand_like(x) >= self.prob
        # out-of-place so the caller's tensor is not modified; rescale the survivors (inverted dropout)
        return x * keep_mask / (1 - self.prob)

def stable_softmax(x, dim=-1):
    x = x - torch.max(x, dim=dim, keepdim=True).values
    return torch.softmax(x, dim=dim)

class MultiHeadAttention(nn.Module):
    """
    This class applies the Multi-Head Attention layer over the input. The self-attention mechanism is run in parallel, each time with a different set of learned parameters for the query, key and value projections. If the number of heads is num_heads, then to keep the input and output dimensions constant we use the per-head hidden size d_k = d_model/num_heads. The outputs from all heads are concatenated and passed through a Linear layer, and dropout is applied to the result.
    MHA(x) = Linear(Concat([Self_Attention(x)_1, ...., Self_Attention(x)_num_heads]))
    """
    def __init__(self, num_heads, d_model):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        if self.d_model % num_heads:
            raise ValueError(f"d_model {d_model} is not divisible by num_heads {num_heads}")
        self.d_k = self.d_model // num_heads  # integer per-head dimension
        self.W_Q = nn.Linear(self.d_model, self.d_model, bias=False, dtype=torch.float32)
        self.W_K = nn.Linear(self.d_model, self.d_model, bias=False, dtype=torch.float32)
        self.W_V = nn.Linear(self.d_model, self.d_model, bias=False, dtype=torch.float32)
        self.out = nn.Linear(self.d_model, self.d_model, dtype=torch.float32)


    def forward(self, x, attention_mask):
        """
        Takes input of size B X N X d_model and applies a MHA layer over it and returns an output of size B X N X d_model. Uses attention_mask (shape B X N, 1 for real tokens and 0 for padding) to stop computing attention for padding tokens.
        Steps:
        1. Compute Query, Key and Value vectors
        2. Split the Q, K, V tensors to include the num_heads dimension: B X N X d_model -> B X num_heads X N X d_k
        3. Compute attention scores
        4. Apply the causal mask and the padding attention mask
        5. Compute attention weights using softmax
        6. Multiply with values and reshape
        7. Apply final linear layer and return
        """
        batch_size, seq_length, _ = x.shape
        queries, keys, values = self.W_Q(x), self.W_K(x), self.W_V(x)
        queries = queries.reshape(batch_size, seq_length, self.num_heads, -1).transpose(1, 2) # B X num_heads X N X d_k
        keys = keys.reshape(batch_size, seq_length, self.num_heads, -1).transpose(1, 2) # B X num_heads X N X d_k
        values = values.reshape(batch_size, seq_length, self.num_heads, -1).transpose(1, 2) # B X num_heads X N X d_k

        attention_scores = queries @ keys.transpose(2, 3) # B X num_heads X N X N
        # build the causal mask on the same device as the input so the layer also works on GPU
        causal_mask = torch.triu(torch.ones(seq_length, seq_length, device=x.device), diagonal=1).bool()
        attention_scores = attention_scores.masked_fill(causal_mask, -torch.inf)

        padding_mask = (attention_mask == 0).unsqueeze(1).unsqueeze(2) # B X 1 X 1 X N, True where padded
        attention_scores = attention_scores.masked_fill(padding_mask, -torch.inf)
        
        attention_weights = stable_softmax(attention_scores/(self.d_k**0.5)) # B X num_heads X N X N
        contexts = attention_weights @ values # B X num_heads X N X d_k
        contexts = contexts.transpose(1, 2).reshape(batch_size, seq_length, -1) # B X N X d_model
        contexts = self.out(contexts) # B X N X d_model
        return contexts


class Transformer(nn.Module):
    """
    x = x + Dropout(MHA(Layer_Norm1(x)))
    x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
    """
    def __init__(self, vocab_size, max_seq_length, num_layers, d_model, num_heads, dropout_p_ffn, dropout_p_mha):
        super().__init__()
        self.embedding = InputEmbedding(vocab_size=vocab_size, d_model=d_model)
        self.positional_encoding = PositionalEncoding(seq_len=max_seq_length, d_model=d_model)
        # TransformerLayer expects (d_model, num_heads, dropout_p_mha, dropout_p_ffn)
        self.transformer_blocks = nn.Sequential(*[TransformerLayer(d_model, num_heads, dropout_p_mha, dropout_p_ffn) for _ in range(num_layers)])
        self.layer_norm = LayerNorm(d_model)
        # Weight tying: reuse the embedding Parameter itself, so both names refer to one tensor with one gradient
        self.out_head = self.embedding.embedding_weights.weight

    def forward(self, input_ids, position_ids, attention_mask):
        """
        Takes input of size B X N text token indices and attention maps and applies the Transformer over it and returns an output of size B X N X V
        """
        x = self.embedding(input_ids) + self.positional_encoding(position_ids) # B X N X d_model
        x, _ = self.transformer_blocks([x, attention_mask]) # B X N X d_model
        x = self.layer_norm(x) # B X N X d_model

        # we are using weight tying here; to untie them, initialise a new
        # V X d_model parameter for the output head
        logits = x @ self.out_head.T # B X N X V
        return logits
    
class TransformerLayerModified(nn.Module):
    """
    This class performs the following equations
    x = x + Dropout(MHA(Layer_Norm1(x)))
    x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
    """
    def __init__(self, d_model, num_heads, dropout_p_mha, dropout_p_ffn):
        super().__init__()
        self.MHA = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.dropout_1 = nn.Dropout(dropout_p_mha)
        self.ffn  = FeedForwardNetwork(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)
        self.dropout_2 = nn.Dropout(dropout_p_ffn)

    def forward(self, inputs):
        """
        Takes input of size B X N X d_model and returns output of size B X N X d_model
        """
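        x, attention_mask = inputs[0], inputs[1]
        normed = self.layer_norm_1(x)
        seq_length = x.shape[1]
        # nn.MultiheadAttention takes (query, key, value) and returns (output, weights);
        # in its boolean masks, True marks positions that must NOT be attended to
        causal_mask = torch.triu(torch.ones(seq_length, seq_length, device=x.device), diagonal=1).bool()
        attn_out, _ = self.MHA(normed, normed, normed, attn_mask=causal_mask, key_padding_mask=(attention_mask == 0), need_weights=False)
        x = x + self.dropout_1(attn_out)
        x = x + self.dropout_2(self.ffn(self.layer_norm_2(x)))
        return [x, attention_mask]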


# Uncommented because the cells below instantiate and call TransformerModified
class TransformerModified(nn.Module):
    """
    x = x + Dropout(MHA(Layer_Norm1(x)))
    x = x + Dropout(FeedForwardNetwork(Layer_Norm2(x)))
    """
    def __init__(self, vocab_size, max_seq_length, num_layers, d_model, num_heads, dropout_p_ffn, dropout_p_mha):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, dtype=torch.float32)
        self.positional_encoding = nn.Embedding(max_seq_length, d_model, dtype=torch.float32)
        self.transformer_blocks = nn.Sequential(*[TransformerLayer(d_model, num_heads, dropout_p_mha, dropout_p_ffn) for _ in range(num_layers)])
        self.layer_norm = nn.LayerNorm(d_model)
        # Weight tying: share the embedding Parameter itself with the output head
        self.out_head = self.embedding.weight

    def forward(self, input_ids, position_ids, attention_mask):
        """
        Takes input of size B X N text token indices and attention maps and applies the Transformer over it and returns an output of size B X N X V
        """
        x = self.embedding(input_ids) + self.positional_encoding(position_ids) # B X N X d_model
        x, _ = self.transformer_blocks([x, attention_mask]) # B X N X d_model
        x = self.layer_norm(x) # B X N X d_model

        # we are using weight tying here; to untie them, initialise a new
        # V X d_model parameter for the output head
        logits = x @ self.out_head.T # B X N X V
        return logits



In [10]:
vocab_size = 50
max_seq_length = 8 
num_layers = 2 
d_model = 4 
num_heads = 2 
dropout_p_ffn = dropout_p_mha = 0.4
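
With a dropout probability of 0.4, inverted dropout zeroes roughly 40% of the activations and scales the survivors by 1/(1-0.4), so the expected value of each activation is unchanged. A small plain-Python check of that claim (illustrative only; the function name `inverted_dropout` is made up here and the `Dropout` class above works on tensors):

```python
import random

def inverted_dropout(values, p, rng):
    """Zero each value with probability p and scale the survivors by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)
values = [1.0] * 100_000
dropped = inverted_dropout(values, p=0.4, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Because the rescaling happens at training time, nothing needs to change at inference: dropout simply becomes the identity.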

In [None]:
inp_emb = InputEmbedding(vocab_size, d_model)
pos_enc = PositionalEncoding(max_seq_length, d_model)
trfm_layer = TransformerLayer(d_model, num_heads, dropout_p_mha, dropout_p_ffn)
layer_norm = LayerNorm(d_model)
ffn = FeedForwardNetwork(d_model)
dropout = Dropout(dropout_p_ffn)
mha = MultiHeadAttention(num_heads, d_model)
transformer = Transformer(vocab_size, max_seq_length, num_layers, d_model, num_heads, dropout_p_ffn, dropout_p_mha)
transformer_modified = TransformerModified(vocab_size, max_seq_length, num_layers, d_model, num_heads, dropout_p_ffn, dropout_p_mha)

In [13]:
batch_size = 2
seq_len = 8

# max seq len, full attention mask
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), dtype=torch.long)
position_ids = torch.arange(seq_len, dtype=torch.long).repeat(batch_size, 1)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long) 

emb = inp_emb(input_ids)
print(emb.shape)
pos_emb = pos_enc(position_ids)
print(pos_emb.shape)

print(emb[0][0][:])
# torch.std computes the sample standard deviation by default, i.e. it divides by (n-1)
emb_norm = layer_norm(emb)
print("After normalization")
print(emb_norm[0][0][:])

ffn_out = ffn(emb)
print(ffn_out.shape)

dropout_out = dropout(emb)
print(dropout_out.shape)


logits = mha(emb+pos_emb, attention_mask)
print(logits.shape)

layer_output = trfm_layer([emb, attention_mask])
print(layer_output[0].shape)


# final_output = transformer(input_ids, position_ids, attention_mask)
# print(final_output.shape)

final_output_2 = transformer_modified(input_ids, position_ids, attention_mask)
print(final_output_2.shape)

torch.Size([2, 8, 4])
torch.Size([2, 8, 4])
tensor([ 1.0608,  0.2083, -0.5778,  0.3255], grad_fn=<SliceBackward0>)
After normalization
tensor([ 1.2024, -0.0684, -1.2403,  0.1063], grad_fn=<SliceBackward0>)
torch.Size([2, 8, 4])
torch.Size([2, 8, 4])
torch.Size([2, 8, 4])
torch.Size([2, 8, 4])
torch.Size([2, 8, 50])
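
The normalized values printed above come from the custom `LayerNorm`, which uses the sample standard deviation (dividing by n-1) and adds `eps` to the standard deviation. PyTorch's built-in layer norm uses the population variance (dividing by n) with `eps` inside the square root, so the two differ slightly. A quick sketch on the row printed above, with the values copied by hand:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0608, 0.2083, -0.5778, 0.3255])  # emb[0][0] as printed above
eps = 1e-5

# Formula used by the custom LayerNorm: sample std (divides by n-1), eps added to the std
custom = (x - x.mean()) / (x.std() + eps)

# Built-in layer norm: population variance (divides by n), eps added inside the square root
builtin = F.layer_norm(x, normalized_shape=(4,), eps=eps)

print(custom)   # close to the "After normalization" row above
print(builtin)  # larger in magnitude by about sqrt(n / (n - 1)) = sqrt(4/3)
```

For d_model = 4 the gap is about 15%; it shrinks as d_model grows, but it matters when comparing this implementation against `nn.LayerNorm` output.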


In [19]:
transformer = Transformer(vocab_size, max_seq_length, num_layers, d_model, num_heads, dropout_p_ffn, dropout_p_mha)
final_output = transformer(input_ids, position_ids, attention_mask)
print(final_output.shape)

torch.Size([2, 8, 50])
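
A shape check alone does not show that the causal mask works. One way to test it is to change only the last token and confirm that the outputs at earlier positions do not move. The sketch below uses a stripped-down single-head attention with identity projections (an assumption made to keep the example self-contained; it is not the `MultiHeadAttention` class above), but the same check can be run against the full model in eval mode.

```python
import torch

torch.manual_seed(0)

def causal_self_attention(x):
    """Single-head causal attention with identity projections (illustrative only)."""
    seq_len, d_k = x.shape[-2], x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d_k ** 0.5                      # B X N X N
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))                 # hide future positions
    return torch.softmax(scores, dim=-1) @ x                           # B X N X d_k

x = torch.randn(1, 8, 4)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(4)   # perturb only the last position

out, out_changed = causal_self_attention(x), causal_self_attention(x_changed)
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))  # compares all earlier positions
print(torch.allclose(out[:, -1], out_changed[:, -1]))    # compares the perturbed position
```

If the first comparison fails, information from a future token is leaking into earlier positions.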


In [20]:
final_output

tensor([[[-6.7823e-01, -1.7284e-01,  8.8187e-01,  2.6132e+00, -8.7541e-02,
           2.2772e+00,  2.6209e+00,  2.8109e-01, -2.6724e+00,  2.4952e+00,
          -6.5319e-01, -6.2310e-01, -7.1868e-01, -1.0253e-01, -2.4708e+00,
           7.0503e-01, -7.6724e-02, -1.1606e+00,  3.9110e+00, -1.0433e+00,
          -2.6682e+00,  5.2357e-01, -3.6021e-01,  1.4262e+00, -2.0906e+00,
          -1.9009e+00,  9.2040e-01,  5.3079e-01,  1.3948e+00, -3.7902e+00,
           1.1983e+00, -9.6277e-01, -9.7645e-01,  1.7302e+00,  1.0513e+00,
          -1.8984e+00,  1.7777e+00, -4.0452e-01,  3.5390e-01, -1.2254e+00,
          -5.9447e-01, -2.5599e+00,  1.7808e+00, -4.1721e-01,  4.5951e-01,
           2.4297e+00,  1.0210e+00,  2.4919e+00, -9.7310e-02, -1.1566e+00],
         [ 3.5443e-01, -1.6540e+00,  1.1289e+00, -2.7177e+00,  1.1591e+00,
          -9.0957e-01, -5.1747e-01,  1.2390e-01,  2.9665e+00, -1.7177e+00,
           5.4432e-01,  1.6880e+00,  8.8386e-01,  3.7137e-01,  3.5914e+00,
          -4.4106e-01, -

In [14]:
# x = [0.6751, 0.1021, 1.4837, 0.1549]
# mean = sum(x)/len(x)
# std = (sum([(i-mean)**2 for i in x])/3)**0.5
# [(i-mean)/(std+10**-5) for i in x]

# sum(x)/len(x)
# x - 
# x - x.mean(dim=-1, keepdim=True)
# x.std(dim=-1) 



# for name, param in dropout.named_parameters():
#     print(name, type(param), param.size())


# dropout_out
final_output_2

tensor([[[-1.0437e+00, -3.1565e+00, -2.1469e+00,  2.9604e-01, -1.4018e+00,
          -7.8756e-01, -1.2427e+00,  1.0890e+00, -2.5194e+00, -8.3359e-01,
          -3.0718e+00, -1.3658e+00, -5.1145e-01, -9.1401e-01, -3.6644e+00,
          -1.3340e+00,  3.8116e+00,  8.2318e-01, -1.1969e+00,  4.1260e+00,
          -1.5146e+00, -1.5938e+00, -1.1659e+00, -8.2088e-01, -7.6612e-01,
          -1.6289e+00, -1.7641e+00,  2.9192e+00,  3.0643e+00, -1.5594e+00,
           1.4681e+00,  2.7616e-01, -1.7640e+00, -8.8701e-01,  2.1285e+00,
           2.9416e+00,  1.6224e+00, -1.5445e+00,  1.2851e+00, -2.7506e+00,
          -4.8113e-01,  2.6820e-01, -8.0649e-01,  3.4474e+00,  2.6646e+00,
          -4.5111e+00, -2.4852e-01,  5.2903e+00,  2.1930e-01, -1.8119e+00],
         [ 3.8756e-01,  2.7280e+00,  8.7870e-01, -6.6129e-01,  6.9335e-01,
          -1.0358e+00, -9.5327e-02, -1.7258e+00,  4.5726e+00,  1.9338e+00,
          -6.0985e-01,  5.4086e-02,  3.9197e-01,  2.8689e+00,  2.0765e+00,
           8.9890e-01, -

In [None]:
tensor(0.0775)
tensor(6.7979)

# Testing how close the values 

# Things to note
1. Inverted Dropout vs Regular Dropout: 
https://stackoverflow.com/questions/54109617/implementing-dropout-from-scratch
Regular Dropout
During training, randomly set a fraction p of the input units to zero without rescaling the surviving units. If test-time activations were passed through unchanged, their expected magnitude would be 1/(1-p) times larger than during training. Regular dropout therefore multiplies the test-time activations by (1-p), which means the inference path has to change with the chosen dropout rate.

Inverted Dropout
During training, randomly drop units with probability p and scale the surviving units by 1 / (1 - p). The expected activation then already matches the no-dropout case, so no scaling is needed at test time.

2. Why are biases removed in the attention layers? The attention layers in transformers are usually followed by normalization layers, which effectively remove the bias, so adding one is just overhead. https://ai.stackexchange.com/questions/40252/why-are-biases-typically-not-used-in-attention-mechanism

3. Which attention scores are masked with padding?
Padding tokens are added to equalize sequence lengths for efficient batching, not for learning. To keep real tokens from learning from pads, we mask the key positions of pad tokens during attention, which prevents queries (real tokens) from attending to them.
But why not mask the query positions of pad tokens as well?
As explained in the blog linked below, masking pad queries results in a uniform attention distribution (1/n) for those rows, so their context vectors become plain averages of the other tokens. This erases any distinct signal and is usually unnecessary, since outputs at pad positions are typically ignored later.
https://gmongaras.medium.com/how-do-self-attention-masks-work-72ed9382510f

4. masked_fill_ masks in place (the trailing underscore marks an in-place op); the boolean mask only needs to be broadcastable to the shape of the tensor being masked.

5. unsqueeze inserts a new dimension of size 1 at the specified dim index.
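The difference between the two dropout variants in note 1 can be sketched in a few lines. This is an illustrative implementation, not the `Dropout` class used in this notebook; the function names `regular_dropout` and `inverted_dropout` are made up for the example.

```python
import torch

def regular_dropout(x, p, training):
    """Drop units in training; scale by (1 - p) at test time instead."""
    if training:
        keep = (torch.rand_like(x) >= p).to(x.dtype)  # keep with probability 1 - p
        return x * keep
    return x * (1 - p)

def inverted_dropout(x, p, training):
    """Drop units and rescale survivors in training; identity at test time."""
    if training:
        keep = (torch.rand_like(x) >= p).to(x.dtype)
        return x * keep / (1 - p)
    return x

torch.manual_seed(0)
x = torch.ones(100_000)
p = 0.3

reg_train = regular_dropout(x, p, training=True)
reg_test = regular_dropout(x, p, training=False)
inv_train = inverted_dropout(x, p, training=True)
inv_test = inverted_dropout(x, p, training=False)

# The train and test means should roughly agree within each variant.
print(reg_train.mean().item(), reg_test.mean().item())
print(inv_train.mean().item(), inv_test.mean().item())
```

With inverted dropout the test-time path is the identity, so the dropout rate can be changed without touching inference code. `torch.nn.Dropout` follows the same convention and scales outputs by 1/(1-p) during training.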



In [75]:
# m = torch.distributions.Bernoulli(0.7)
# a  = m.sample((2, 2, 3, 3))

In [74]:
# x = torch.randn(2, 3)
# pos_embeddings = nn.Embedding(4, 3)
# batch_size, seq_len = x.shape
# pos_embeddings(x)

In [70]:
import torch
attention_scores = torch.randn(2, 2, 3, 3)
attention_mask = torch.tensor([[1, 1, 0],[1, 0, 0]])
# mask = 
# 
mask = torch.triu(torch.ones(2, 3, 3), diagonal=1)
# attention_mask = 1-attention_mask
# attention_mask = attention_mask.reshape(2, 1, -1)
# mask = mask + attention_mask

# print(attention_mask)
attention_mask = (1-attention_mask).bool().unsqueeze(1).unsqueeze(2)
print(attention_mask.shape)
# attention_mask.expand(2, 2, 3, 3)
attention_scores.masked_fill_(attention_mask, float("-inf"))  # float("-inf") avoids relying on the numpy import from an earlier cell

# torch.tril(temp)
# temp - torch.max(temp, dim=-1, keepdim=True).values
# temp.view(1, 2, )


torch.Size([2, 1, 1, 3])


tensor([[[[ 0.0588, -1.0991,    -inf],
          [ 1.3585, -0.4391,    -inf],
          [ 2.9100,  0.2492,    -inf]],

         [[-0.1958, -0.9341,    -inf],
          [-0.6880,  0.2391,    -inf],
          [-0.4189, -1.3787,    -inf]]],


        [[[ 1.8609,    -inf,    -inf],
          [ 0.8796,    -inf,    -inf],
          [-2.5083,    -inf,    -inf]],

         [[ 1.3594,    -inf,    -inf],
          [-0.7312,    -inf,    -inf],
          [-0.7218,    -inf,    -inf]]]])
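The cell above only masks padded key positions. A possible next step, sketched below under the assumption that padding is on the right (so the first key is always a real token), is to combine the padding mask with a causal mask and apply softmax. The names `pad_mask`, `causal_mask` and `combined` are illustrative and do not come from the notebook's `MultiHeadAttention`.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(2, 2, 3, 3)                        # (batch, heads, query, key)
attention_mask = torch.tensor([[1, 1, 0], [1, 0, 0]])   # 1 = real token, 0 = pad

# True marks positions that must NOT be attended to.
pad_mask = (attention_mask == 0)[:, None, None, :]      # (batch, 1, 1, key)
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)  # (query, key)

# Broadcasting yields shape (batch, 1, query, key).
combined = pad_mask | causal_mask

# Out-of-place masked_fill leaves the raw scores untouched.
weights = torch.softmax(scores.masked_fill(combined, float("-inf")), dim=-1)

print(combined.shape)
print(weights[1, 0])   # second sequence: only the first key receives weight
```

Every query row here keeps at least one unmasked key, so softmax stays well defined. If an entire row were filled with `-inf` (for example with left padding plus a causal mask), softmax would return NaN for that row, which is one reason some implementations use a large negative finite value instead.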

In [33]:
import ollama
    
# # Example 1: Simple chat interaction
# response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': 'Why is the sky blue?'}])
# print(response['message']['content'])

# # Example 2: Streaming responses
# stream = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': 'Tell me a story'}], stream=True)
# for chunk in stream:
#     print(chunk['message']['content'], end='', flush=True)

prompt = """Can you optimize this code and rewrite it:
 def _merge_tokens(self, tokenized_text, max_freq_pair):
        # This method merges the occurence of max_freq_pair tokens in tokenized_text
        # tokenized_text - A dictionary of index to list of tokens mapping
        # max_freq_pair - a tuple of two tokens which have the max frequency of occuring
        for i, tokens in tokenized_text.items():
            merge_indices = set()
            j = 0
            while j < (len(tokens)-1):
                if tokens[j] == max_freq_pair[0] and tokens[j+1] == max_freq_pair[1]:
                    merge_indices.add(j)
                    j += 1
                j += 1
            updated_tokens = []
            j = 0
            while j< len(tokens):
                if j in merge_indices:
                    updated_tokens.append(tokens[j]+tokens[j+1])
                    j += 1
                else:
                    updated_tokens.append(tokens[j])
                j += 1
            tokenized_text[i] = updated_tokens
        return tokenized_text


"""


prompt = """
Is there any optimal way to write this code by combining both sample and reshape methods
bernoulli_dist = torch.distributions.Bernoulli(torch.tensor([1-self.prob]))
return bernoulli_dist.sample(x.shape).reshape(x.shape)
"""
# Example 3: Generate text
response = ollama.generate(model='qwen2.5-coder:latest', prompt=prompt)
# print(response)
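The first prompt in the cell above is overwritten before the call, so it is never sent to the model. For reference, here is one possible single-pass simplification of the `_merge_tokens` logic it contains. It is a sketch only: it is written as a standalone function rather than a method, and it returns a new dictionary instead of mutating the input; both choices are assumptions made for illustration.

```python
def merge_tokens(tokenized_text, max_freq_pair):
    """Merge every non-overlapping occurrence of max_freq_pair, left to right."""
    first, second = max_freq_pair
    merged = {}
    for idx, tokens in tokenized_text.items():
        out = []
        j = 0
        while j < len(tokens):
            # Merge when the current and next tokens form the target pair.
            if j + 1 < len(tokens) and tokens[j] == first and tokens[j + 1] == second:
                out.append(first + second)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        merged[idx] = out
    return merged

example = {0: ["l", "o", "w"], 1: ["a", "a", "a"]}
print(merge_tokens(example, ("l", "o")))
print(merge_tokens(example, ("a", "a")))
```

The two-pass version first collects the merge indices and then rebuilds the list. Doing both in one loop removes the intermediate set while keeping the same left-to-right, non-overlapping matching.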

In [34]:
print(response['response'])

Certainly! You can combine the `sample` and `reshape` methods more efficiently by using a single operation to generate samples with the desired shape directly. Here's an optimized version of your code:

```python
import torch

def sample_bernoulli_like(x, prob):
    bernoulli_dist = torch.distributions.Bernoulli(1 - prob)
    return bernoulli_dist.sample(x.shape)

# Example usage:
x = torch.randn(3, 4)  # Replace with your actual tensor
prob = 0.5  # Replace with your actual probability
sampled_tensor = sample_bernoulli_like(x, prob)
print(sampled_tensor)
```

In this optimized version:
- We create a `Bernoulli` distribution directly using the desired probability.
- We use the `.sample()` method on the distribution and pass `x.shape` as an argument to generate samples with the same shape as `x`.

This approach avoids unnecessary reshaping operations, making the code more concise and potentially more efficient.
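The reason the original snippet needed `reshape` is worth spelling out: a `Bernoulli` built from a one-element tensor has batch shape `(1,)`, which is appended to the requested sample shape. A quick check, using an arbitrary `keep_prob` chosen for illustration:

```python
import torch

x = torch.randn(3, 4)
keep_prob = 0.7

# Probability given as a 1-element tensor: batch shape is (1,).
vector_dist = torch.distributions.Bernoulli(torch.tensor([keep_prob]))
# Probability given as a Python float: batch shape is ().
scalar_dist = torch.distributions.Bernoulli(keep_prob)

vector_sample = vector_dist.sample(x.shape)
scalar_sample = scalar_dist.sample(x.shape)

# Alternative that skips the distribution object entirely.
direct_sample = torch.bernoulli(torch.full_like(x, keep_prob))

print(vector_sample.shape)  # torch.Size([3, 4, 1])
print(scalar_sample.shape)  # torch.Size([3, 4])
print(direct_sample.shape)  # torch.Size([3, 4])
```

Either the scalar form or `torch.bernoulli` yields a mask of the same shape as `x` directly, so the extra `reshape` is not needed.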


# DataLoader

In [None]:
# In data loaders, multiple workers load batches in parallel on the CPU so that the GPU doesn't sit idle.
# Note: this is different from data parallelism. Data parallelism is a GPU-side optimization, while workers are a CPU-side one.
# Inside each worker, PyTorch by default sets its own seed to base_seed + worker_id, where base_seed is drawn by the main process from its RNG.
# Seeds for other libraries (e.g. NumPy) may, however, be duplicated when the workers are initialized, causing the workers to produce identical random numbers.
# Therefore, we seed the other libraries ourselves in a function that is passed as worker_init_fn to the DataLoader.
# For more details see https://pytorch.org/docs/stable/data.html#data-loading-randomness

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

tensor([[[1, 1, 1, 0, 0]]])
tensor([[[[1, 1, 1, 0, 0]]]])
tensor([[[[1, 1, 1, 0, 0]]]])
tensor([[[[0.7374, 0.0480, 0.0809,   -inf,   -inf],
          [0.3284, 0.6390, 0.8858,   -inf,   -inf],
          [0.4830, 0.2081, 0.7727,   -inf,   -inf],
          [0.4163, 0.5497, 0.0868,   -inf,   -inf],
          [0.4099, 0.7877, 0.0380,   -inf,   -inf]]]])
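The `seed_worker` function defined above only takes effect once it is passed to a `DataLoader`. The sketch below follows the pattern from the PyTorch reproducibility notes. The helper `make_loader`, the toy dataset and the seed value are assumptions for illustration, and `num_workers` defaults to 0 so the example runs anywhere; `worker_init_fn` is only invoked when `num_workers > 0`.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive the NumPy / random seeds from the per-worker PyTorch seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def make_loader(dataset, seed=42, num_workers=0):
    # A dedicated generator makes the shuffle order independent of global RNG state.
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=4,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=g,
    )

dataset = TensorDataset(torch.arange(16))
order_a = [batch[0].tolist() for batch in make_loader(dataset)]
order_b = [batch[0].tolist() for batch in make_loader(dataset)]
print(order_a == order_b)
```

Two loaders built with the same generator seed should yield the same shuffled batches, which is the property needed when comparing training runs.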
