## 1 Implementing the GPT-2 architecture

* We will implement GPT-2 small, the smallest version of GPT-2, which has 124 million parameters.
* In deep learning, parameters are the trainable weights of the model.
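
To make the notion of trainable parameters concrete, the following minimal sketch counts the weights of a small linear layer. The layer sizes (5 inputs, 6 outputs) are chosen only for illustration and are not part of the GPT-2 configuration.

```python
import torch.nn as nn

# A small linear layer: 5 inputs, 6 outputs (illustrative sizes)
layer = nn.Linear(5, 6)

# Each parameter tensor reports its number of elements via numel()
n_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_params)  # 5*6 weights + 6 biases = 36
```

The same one-liner works on any `nn.Module`, so it can be used to check the size of a full model.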

In [5]:
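# Configuration of the small (124M-parameter) GPT-2 model
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # vocabulary size
    "context_size": 1024,  # maximum number of input tokens
    "emb_dim": 768,        # embedding dimension
    "n_heads": 12,         # number of attention heads
    "n_layers": 12,        # number of transformer blocks
    "drop_rate": 0.1,      # dropout rate
    "qkv_bias": False      # query-key-value bias
}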

* `vocab_size` refers to a vocabulary of 50,257 tokens, as used by the BPE tokenizer.
* `context_size` denotes the maximum number of input tokens the model can handle; it sets the size of the positional embedding table.
* `emb_dim` represents the embedding size: each token is mapped to a 768-dimensional vector.
* `n_heads` indicates the number of attention heads in multi-head attention.
* `n_layers` specifies the number of transformer blocks in the model.
* `drop_rate` is the dropout probability: with 0.1, each activation is zeroed with probability 0.1 during training.
* `qkv_bias` determines whether to include a bias vector in the `Linear` layers of the multi-head attention for the query, key, and value computations.
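
As a rough sanity check on these settings, the sketch below derives a few sizes directly from the configuration values. The dictionary repeats the numbers above so the snippet runs on its own; `head_dim` assumes the embedding is split evenly across the attention heads, which is the usual convention but is not stated in the configuration itself.

```python
# Same values as in GPT_CONFIG_124M, repeated so the snippet is self-contained
cfg = {"vocab_size": 50257, "context_size": 1024, "emb_dim": 768, "n_heads": 12}

tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]    # token embedding table
pos_emb_params = cfg["context_size"] * cfg["emb_dim"]  # positional embedding table
head_dim = cfg["emb_dim"] // cfg["n_heads"]            # per-head width (assumes an even split)

print(f"{tok_emb_params:,}")  # 38,597,376
print(f"{pos_emb_params:,}")  # 786,432
print(head_dim)               # 64
```

The token embedding table alone accounts for roughly 38.6 million of the model's parameters.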

In [9]:
##placeholder for GPT model architecture class
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
  def __init__(self,cfg):
    super().__init__()
    self.tok_emb = nn.Embedding(cfg["vocab_size"],cfg["emb_dim"])
    self.pos_emb = nn.Embedding(cfg["context_size"],cfg["emb_dim"])
    self.drop_emb = nn.Dropout(cfg["drop_rate"])
    self.trf_blocks = nn.Sequential(
        *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
    )
    self.final_norm = DummyLayerNorm(cfg["emb_dim"])
    self.out_head = nn.Linear(
        cfg["emb_dim"],cfg["vocab_size"],bias=False
    )

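  def forward(self,in_idx):
    _, seq_len = in_idx.shape
    tok_embeds = self.tok_emb(in_idx)  # (batch, seq_len, emb_dim)
    # one embedding per position 0..seq_len-1, shape (seq_len, emb_dim)
    pos_embeds = self.pos_emb(
        torch.arange(seq_len,device=in_idx.device)
    )
    x = tok_embeds + pos_embeds  # broadcast over the batch dimension
    x = self.drop_emb(x)
    x = self.trf_blocks(x)
    x = self.final_norm(x)
    logits = self.out_head(x)  # (batch, seq_len, vocab_size)
    return logits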



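class DummyTransformerBlock(nn.Module):
  # placeholder: to be replaced by a real transformer block
  def __init__(self,cfg):
    super().__init__()

  def forward(self,x):
    return x  # returns the input unchanged

class DummyLayerNorm(nn.Module):
  # placeholder: the arguments are accepted but not used yet
  def __init__(self,normalized_shape, eps=1e-5):
    super().__init__()

  def forward(self,x):
    return x  # returns the input unchanged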

In [3]:
##use case
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 =" GPUs are high edge computing devices"
txt2 = "Google developed their TPUs"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch,dim=0)
print(batch)

tensor([[32516,   389,  1029,  5743, 14492,  4410],
        [11708,  4166,   511,   309,  5105,    82]])


* The result is the token IDs for the two texts, one row per text.
* Next, we initialize a `DummyGPTModel` instance with the 124M configuration and feed it the tokenized batch.

In [10]:
torch.manual_seed(123)  # call the function so the random initialization is reproducible
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape: ",logits.shape)
print(logits)

Output shape:  torch.Size([2, 6, 50257])
tensor([[[ 0.5425,  0.7848,  1.4292,  ..., -0.0170, -1.3822, -1.4976],
         [-1.1349,  0.1493, -0.5685,  ...,  0.9085, -0.7673, -1.2608],
         [-1.4393,  0.6421, -0.4532,  ...,  0.2140,  0.0739, -0.0706],
         [ 1.0051, -0.3934,  0.5284,  ...,  0.0330,  2.4598, -0.5697],
         [ 1.0517, -0.0035,  1.1096,  ..., -0.3031,  0.0613, -2.2065],
         [ 0.3834, -0.4588, -1.1387,  ...,  0.8991,  0.6771, -0.0166]],

        [[ 0.5440,  0.2645,  0.7814,  ...,  0.4434, -1.1043,  0.7317],
         [-0.7989, -1.1918, -0.1744,  ...,  0.4107, -1.0413, -1.5725],
         [-0.8457,  0.9681, -0.1388,  ...,  0.4167, -0.2729,  0.8394],
         [-0.2989, -0.7869, -0.5825,  ..., -1.4383,  1.2647, -0.0954],
         [ 0.4437,  0.8516, -0.3977,  ...,  0.6611, -0.6979, -0.7165],
         [-0.2315, -0.5627, -0.8839,  ...,  0.2181,  0.3894,  0.2681]]],
       grad_fn=<UnsafeViewBackward0>)


* The output tensor contains two rows, one per text sample. Each sample consists of 6 tokens, and each token is represented by a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.
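
To illustrate why the last dimension matches the vocabulary size, the sketch below turns a tensor of that shape into one probability distribution per token position. Random values stand in for the model output here, so the resulting token IDs carry no meaning; only the shapes are of interest.

```python
import torch

torch.manual_seed(123)
# Stand-in for the model output: 2 samples, 6 tokens, 50,257 scores per token
logits = torch.randn(2, 6, 50257)

# Softmax over the last dimension gives one distribution over the vocabulary per position
probs = torch.softmax(logits, dim=-1)
# Index of the highest-scoring vocabulary entry at each position
top_ids = torch.argmax(probs, dim=-1)

print(probs.shape)    # torch.Size([2, 6, 50257])
print(top_ids.shape)  # torch.Size([2, 6])
```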

## 1.2 Normalizing activations with layer normalization


* Training deep neural networks with many layers can sometimes prove challenging due to problems like vanishing or exploding gradients.
* These problems lead to unstable training dynamics and make it difficult for the network to adjust its weights effectively, which means the learning process struggles to find a set of parameters that minimizes the loss function.
* The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance.
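
Written out for one activation vector $x = (x_1, \dots, x_d)$, this idea can be sketched as follows. The symbols $\mu$, $\sigma^2$ and $\hat{x}_i$ are notation introduced here for illustration. The variance is shown in its population form (dividing by $d$), whereas PyTorch's `var` divides by $d - 1$ unless told otherwise.

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2}}
$$

By construction, the normalized values $\hat{x}_i$ have mean 0 and variance 1.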

In [11]:
## example
torch.manual_seed(123)
batch_example = torch.randn(2,5)
layer = nn.Sequential(nn.Linear(5,6),nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


In [13]:
## mean and variance
mean = out.mean(dim=-1,keepdim=True)
var  = out.var(dim=-1,keepdim=True)
print(f"Mean: {mean}, Variance: {var}")

Mean: tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>), Variance: tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


* `keepdim=True` in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor.
* `dim` specifies the dimension along which the statistic is calculated.
* `dim=-1` (equivalent to `dim=1` for a two-dimensional tensor) calculates the mean across the column dimension, giving one mean per row.
* `dim=0` calculates the mean across the row dimension, giving one mean per column.
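
The effect of `dim` and `keepdim` is easiest to see on a tiny tensor. The values below are made up for illustration.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])  # shape (2, 3)

row_means = x.mean(dim=-1, keepdim=True)  # one mean per row: 2.0 and 5.0
col_means = x.mean(dim=0, keepdim=True)   # one mean per column: 2.5, 3.5, 4.5
row_means_flat = x.mean(dim=-1)           # without keepdim the reduced dimension is dropped

print(row_means.shape)       # torch.Size([2, 1])
print(col_means.shape)       # torch.Size([1, 3])
print(row_means_flat.shape)  # torch.Size([2])

# Because row_means keeps shape (2, 1), it broadcasts against x row by row
centered = x - row_means
print(centered)
```

Keeping the reduced dimension is what lets the subtraction in the last step line up each row with its own mean.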

In [15]:
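# applying layer normalization: subtract the mean, then divide by the standard deviation
out_norm = (out - mean) / torch.sqrt(var)
# separate names keep the original statistics intact if the cell is re-run
norm_mean = out_norm.mean(dim=-1,keepdim=True)
norm_var = out_norm.var(dim=-1,keepdim=True)
print("Normalized layer output:\n",out_norm)
print(f"Mean:\n {norm_mean}")
print(f"Variance:\n{norm_var}")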

Normalized layer output:
 tensor([[ 1.4877e+00,  2.2845e+00, -6.5409e-08,  1.4591e+00, -6.5409e-08,
         -6.5409e-08],
        [ 1.0688e+00,  1.1998e+00, -9.9561e-08,  2.6049e+00,  1.6524e+00,
         -9.9561e-08]], grad_fn=<DivBackward0>)
Mean:
 tensor([[0.8719],
        [1.0876]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

* The means printed above (0.8719 and 1.0876) are not close to 0, and the printed values equal `out / torch.sqrt(var)` with no mean subtracted. This suggests the cell was run while `mean` held a stale, near-zero value. With `mean` and `var` taken from the preceding cell, each normalized row should have a mean of approximately 0 and a variance of 1.


In [18]:
## layernorm class
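class LayerNorm(nn.Module):
  def __init__(self,emb_dim):
    super().__init__()
    self.eps = 1e-5  # small constant that prevents division by zero
    self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable scale
    self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable shift

  def forward(self,x):
    mean = x.mean(dim=-1,keepdim=True)
    # note: Tensor.var defaults to the unbiased estimator (divides by n-1),
    # whereas torch.nn.LayerNorm uses the biased one (divides by n)
    var = x.var(dim=-1,keepdim=True)
    norm_x = (x - mean) / torch.sqrt(var + self.eps)
    return self.scale * norm_x + self.shift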

In [19]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1,keepdim=True)
var = out_ln.var(dim=-1,keepdim=True)
print("Mean:\n",mean)
print("Variance:\n",var)

Mean:
 tensor([[-1.4901e-08],
        [ 2.3842e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
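
As a cross-check, the sketch below compares a hand-written normalization against PyTorch's built-in `nn.LayerNorm`. The helper `manual_layer_norm` is a hypothetical name introduced here. It uses the biased variance (`unbiased=False`) because that is the estimator `nn.LayerNorm` applies, whereas `x.var` without that argument divides by `n - 1` and would give slightly different values for small `emb_dim`.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(2, 5)

def manual_layer_norm(x, eps=1e-5):
    # hypothetical helper: normalize over the last dimension with the biased variance
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

# nn.LayerNorm starts with scale 1 and shift 0, so it performs plain normalization here
builtin = nn.LayerNorm(5)
matches = torch.allclose(manual_layer_norm(x), builtin(x), atol=1e-5)
print(matches)
```

For the 768-dimensional embeddings used in the configuration, the gap between the two variance estimators is small, but matching the built-in layer exactly makes it easier to compare results against reference implementations.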


## 1.3 Implementing a feed forward network with GELU activations