In [None]:
# import tinystories dataset
!pip install datasets



In [None]:
from datasets import load_dataset
ds = load_dataset("roneneldan/TinyStories")

Now that the dataset is loaded, we need to tokenize it.
Vocabulary -> the list of all tokens the model knows. A larger vocabulary needs more compute.
Types of tokens: words, characters, and subwords. Subword tokenizers keep common words whole and split uncommon words into a stem plus smaller reusable pieces, which keeps the vocabulary small.
Byte pair encoding (BPE, used by GPT-2) -> builds subword tokens by repeatedly merging frequent pairs until the vocabulary reaches the size you want. This is what is used here.

Each token has a token ID.
The token IDs are stored in a .bin file.
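
To make the merging idea behind byte pair encoding concrete, here is a toy sketch in plain Python. It is not how tiktoken is implemented (the real GPT-2 tokenizer works on bytes and uses a fixed, pre-trained merge table); the helper names `most_frequent_pair` and `merge_pair`, the sample text, and the number of merges are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from single characters
for _ in range(2):                  # two merges, an arbitrary number for this demo
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After two merges the frequent stem "low" has become a single token, while the rarer endings are still split into characters.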


In [None]:
!pip install tiktoken



In [None]:
import tiktoken #(bpe library)
import os
import numpy as np
from tqdm.auto import tqdm #progress bar for loops

enc = tiktoken.get_encoding("gpt2")

def get_ids(story):
  ids= enc.encode_ordinary(story["text"])
  output={"ids":ids,"len":len(ids)}
  #enc.encode(text)          # normal encode, respects special tokens
  #enc.encode_ordinary(text) # ignores special tokens   choose based on the type of dataset
  return output

if not os.path.exists ("train.bin"):
  tokenized = ds.map(
      get_ids,
      remove_columns=["text"],# the text column is no longer needed, only the ids
      desc="tokenizing the splits", #progress bar label
      num_proc=8  #runs parallel across 8 CPU processes
  )

  for split, dset in tokenized.items():
    # tokenized.items() yields (split_name, dataset) pairs, e.g. ("train", ...) and ("validation", ...)
    # token IDs are never negative and the largest GPT-2 ID (50256) fits in 16 bits, so the file is stored as uint16
    arr_len = np.sum(dset['len'],dtype=np.uint64)
    filename= f'{split}.bin'
    arr= np.memmap(filename,dtype = np.uint16, mode='w+',shape=(arr_len,))
    total_batches=1024 # each shard holds roughly 1/1024 of the split, to avoid memory overload

    idx=0
    for batch_idx in tqdm(range(total_batches),desc=f'writing {filename}'):
      batch = dset.shard(num_shards=total_batches,index=batch_idx, contiguous=True).with_format('numpy')
      #.with_format('numpy') -> converts the dataset shard to NumPy arrays, which are faster for numerical operations
      #contiguous=True -> ensures the examples in the shard are taken consecutively from the dataset, not randomly scattered.
      arr_batch=np.concatenate(batch['ids'])
      arr[idx:idx+len(arr_batch)] = arr_batch
      idx +=len(arr_batch)
    arr.flush()
    #arr.flush() forces all pending changes to be written to the actual .bin file on disk.





## Step 3: Create Input-Output batches for the dataset



In [None]:
def get_batch(split): #returns one input and one output matrix
  if split == 'train':
    data = np.memmap('train.bin',dtype=np.uint16,mode='r')
  else :
    data = np.memmap('validation.bin',dtype=np.uint16,mode='r')

  xi= torch.randint(len(data) -block_size, (batch_size,)) # (batch_size,) -> size (tuple) – a tuple defining the shape of the output tensor.
  x= torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in xi])
  # creates a batch of input sequences (x) by stacking token slices of length block_size from random positions in the dataset.
  y= torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in xi])

  if device_type == "cuda":
    x,y =x.pin_memory().to(device, non_blocking =True),y.pin_memory().to(device, non_blocking =True)
    # (x.pin_memory()) locks the CPU memory for these tensors, so the GPU can access it directly without extra copying
    # (non_blocking=True) lets the CPU continue running other tasks while the data is being copied to the GPU.
  else:
    x, y = x.to(device), y.to(device)

  return x, y
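
The key idea in get_batch is that the target y is the input x shifted one token to the right. A minimal sketch with plain lists shows this; the helper `make_pair` and the token IDs 100..109 are hypothetical and only stand in for the memmap slices used above.

```python
def make_pair(data, start, block_size):
    """Return (input, target) windows; the target is the input shifted by one token."""
    x = data[start : start + block_size]
    y = data[start + 1 : start + 1 + block_size]
    return x, y

data = list(range(100, 110))   # hypothetical token IDs 100..109
x, y = make_pair(data, start=2, block_size=4)

print(x)  # [102, 103, 104, 105]
print(y)  # [103, 104, 105, 106]
```

At every position the model sees x up to that token and is trained to predict the token at the same position in y.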



The SLM Model Architecture

3 main steps
-> input
-> processor
-> output

INPUT
Converts all the token ids to embedding vectors.
The embedding vectors are stored in a token embedding matrix of shape (x, y), where x is the vocabulary size and y is the embedding dimension, a hyperparameter chosen by experiment.

This token embedding matrix acts as a lookup table for all the ids, so if an id is entered then the corresponding row is fetched.

The token embedding vectors are all randomly initialized and can be trained later.

IN INPUT: TOKENIZED TEXT -> TOKEN EMBEDDINGS -> POSITIONAL EMBEDDINGS

Positional embeddings are added so the model knows the order of tokens in the sequence, since attention alone does not encode positions.
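
The input stage can be sketched in a few lines of PyTorch. The sizes below (vocabulary of 10, context of 8, embedding dimension of 4) and the variable names are toy assumptions for illustration, not the values used for training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 10, 8, 4    # toy sizes, for illustration only

tok_table = nn.Embedding(vocab_size, n_embd)  # lookup table: one row per token ID
pos_table = nn.Embedding(block_size, n_embd)  # lookup table: one row per position

idx = torch.tensor([[3, 1, 3]])               # one sequence of three token IDs
pos = torch.arange(idx.size(1))               # positions 0, 1, 2

tok_emb = tok_table(idx)                      # (1, 3, 4): rows 3, 1, 3 of the table
x = tok_emb + pos_table(pos)                  # position rows broadcast over the batch

print(x.shape)  # torch.Size([1, 3, 4])
```

Token ID 3 appears twice and gets the same token embedding both times, but the two positions end up with different vectors once the positional embeddings are added.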

PROCESSOR:

Layer Normalization keeps token activations stable after embeddings or attention layers, preventing exploding or vanishing gradients.

Self-attention -> scores each token in a sequence against all the other tokens, capturing relationships between words.

Multi-head attention -> looks at different aspects of the context in parallel. After this step, the dimension size stays the same, but the vectors now contain richer information about the context.

We divide the dot products by the square root of the key dimension to keep the scores in a moderate range, so that very large values do not saturate the softmax.

After computing the attention scores, we apply a causal mask: the entries above the diagonal of the attention matrix are set to -inf (so they become 0 after softmax), ensuring that tokens cannot “see” future words. This is essential for autoregressive training, otherwise the model would cheat.

Softmax is applied to get attention probabilities, and dropout can also be applied to improve generalization.

The attention weights are multiplied with the values matrix to get context vectors. Another dropout can be applied here.
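
The steps above (scale, mask, softmax, multiply with values) can be sketched for a single head as follows. This is a simplified illustration, assuming random stand-ins for the query, key, and value projections and leaving out dropout and the multi-head reshaping.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 4, 8                                   # toy sequence length and head dimension
q, k, v = torch.randn(3, T, d).unbind(0)      # random stand-ins for Q, K, V

scores = (q @ k.T) / math.sqrt(d)             # scaled dot products, shape (T, T)
mask = torch.tril(torch.ones(T, T)) == 0      # True above the diagonal (future tokens)
scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(scores, dim=-1)           # each row sums to 1; future weights are 0
context = weights @ v                         # (T, d) context vectors
```

The first token can only attend to itself, so its context vector is exactly its own value vector.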

A linear output projection can be added at the end (optional), followed by dropout, for additional processing of the combined heads.

Shortcut connections are also added, similar to ResNet, allowing the input of a layer to go directly to the next layer without going through multi-head attention and dropout. This helps prevent vanishing gradient problems when multiple layers are stacked.

Layer normalization ensures each token vector has mean = 0 and variance = 1 before the learned scale and shift are applied.
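
As a quick check of the mean and variance claim, here is a hand-written sketch of the normalization for one token vector, without the learned weight and bias. The function name and the activation values are made up for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to mean 0 and variance 1 (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

token_vector = [2.0, 4.0, 6.0, 8.0]           # made-up activations for one token
normed = layer_norm(token_vector)

new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
```

Here `new_mean` comes out at (numerically) zero and `new_var` just below one; the small gap is due to eps in the denominator.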

Next is the feed-forward (FF) network:

The feed-forward network looks at each token on its own and improves its features, helping the model understand each token better after attention.

GELU activation function is applied (experimental choice).

Dropout layer can be added for regularization.
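
For reference, GELU can be written with the Gaussian error function. The sketch below uses the exact form, which is also the default of nn.GELU; the sample inputs are arbitrary.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  gelu={gelu(value):+.4f}  relu={relu(value):+.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output, and it approaches ReLU for large positive or negative inputs.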

OUTPUT:
A final layer normalization is applied.
The output head then produces the logits matrix for the next-token prediction task.
Each token position gets one vector of logits. Each number in this vector is a logit, which is the model’s “score” for how likely the corresponding word in the vocabulary is to be the next token.


In [None]:
import torch
import torch.nn as nn
#  nn.Module is the base class for all neural network layers and models in PyTorch, handling parameter storage, registration, and forward computation.
import torch.nn.functional as F
import math
from dataclasses import dataclass
import numpy as np
from tqdm.auto import tqdm
from contextlib import nullcontext
import os

class LayerNorm(nn.Module):
  # The learnable parameters (weight and bias) are stored inside your LayerNorm (nn) object and automatically registered and managed by PyTorch’s nn.Module system.
  def __init__(self,ndim,bias):
    super().__init__()
    self.weight = nn.Parameter(torch.ones(ndim))
    self.bias =nn.Parameter(torch.zeros(ndim)) if bias else None
  def forward(self,x):
    return F.layer_norm(x,self.weight.shape,self.weight,self.bias,1e-5) #1e-5 added to denominator to ensure its never zero

class CausalSelfAttention(nn.Module):
  def __init__(self, config):
    super().__init__()
    assert config.n_embd % config.n_head == 0
    # ensures that n_embd can be cleanly divided into equal parts
    self.c_attn = nn.Linear(config.n_embd, 3* config.n_embd,bias=config.bias) #computes queries (Q), keys (K), and values (V) at once.
    self.c_proj = nn.Linear(config.n_embd, config.n_embd,bias=config.bias) # Linear layer after attention to project the concatenated heads back to original embedding dimension.
    self.attn_dropout=nn.Dropout(config.dropout)
    self.resid_dropout=nn.Dropout(config.dropout) # dropout on the attention output before it is added to the residual stream.
    self.n_head =config.n_head
    self.n_embd =config.n_embd
    # Saves the number of heads and embedding dimension for later use.
    self.flash =hasattr(F,'scaled_dot_product_attention')
    if not self.flash :
      self.register_buffer("bias", torch.tril(torch.ones(config.block_size,config.block_size)).view(1,1,config.block_size , config.block_size))

    #  lower-triangular matrix (torch.tril) to mask out future tokens in attention.
    #  Shape [1,1,block_size,block_size] allows broadcasting across batches and heads.
    # .view() reshapes a tensor without copying data, requiring it to be contiguous

  def forward(self,x):
    B,T,C = x.size()
    q,k,v = self.c_attn(x).split(self.n_embd,dim=2)
    k=k.view(B,T, self.n_head , C//self.n_head).transpose(1,2)
    q=q.view(B,T, self.n_head , C//self.n_head).transpose(1,2)
    v =v.view(B,T, self.n_head , C//self.n_head).transpose(1,2)

    if self.flash :
      y= F.scaled_dot_product_attention(q,k,v,attn_mask=None,dropout_p=self.attn_dropout.p if self.training else 0.0 , is_causal=True)
      # This line computes causal scaled dot-product attention by combining queries, keys, and values, applying softmax, optional dropout, and masking future tokens, producing the attended output y.
    else:
      att = (q @ k.transpose(-2,-1)) * (1.0 / math.sqrt(k.size(-1))) #“scaled” dot-product attention.
      att = att.masked_fill(self.bias[:,:,:T,:T]== 0 , float('-inf'))
      # This line applies the causal mask to the attention scores, setting future positions to -inf so they get zero probability after softmax.
      att = F.softmax(att,dim=-1)
      att = self.attn_dropout(att)
      y= att@v

    y=y.transpose(1,2).contiguous().view(B,T,C)
    y = self.resid_dropout(self.c_proj(y))
    # output projection followed by dropout; the Block adds this result to the residual stream
    return y


class MLP(nn.Module):
  def __init__(self,config):
    super().__init__()
    # is the first linear layer of the Transformer feed-forward block, expanding the embedding dimension by 4× before applying a non-linear activation.
    self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
    self.gelu = nn.GELU()
    self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
    self.dropout = nn.Dropout(config.dropout)
  def forward(self, x):
      return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config.n_embd, config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln2 = LayerNorm(config.n_embd, config.bias)
        self.mlp = MLP(config)
    def forward(self, x):
        # Shortcut (residual) connection:
        # it preserves the original information. Without it, each layer would overwrite its input, causing information loss and vanishing gradients in deep stacks.
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

@dataclass
class GPTConfig:
    block_size: int # Maximum context length
    vocab_size: int
    n_layer: int # Number of Transformer blocks
    n_head: int
    n_embd: int # hidden size of each token vector.
    dropout: float = 0.0
    bias: bool = True


class GPT(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
          wte=nn.Embedding(config.vocab_size, config.n_embd),
          # embedding lookup table -> converts each input token ID into a dense vector of dimension n_embd
          wpe=nn.Embedding(config.block_size, config.n_embd),# Positional Embedding
          drop=nn.Dropout(config.dropout),
          h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),#block class
          ln_f=LayerNorm(config.n_embd, config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # maps the final token embeddings to logits over the vocabulary
        self.transformer.wte.weight = self.lm_head.weight  # weight tying
        # The vector used to represent a token as input is the same vector used to score it as output.
        self.apply(self._init_weights)
        # Recursively goes through all submodules of the model (all layers, blocks, embeddings, etc.) and calls _init_weights(module) on each.
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layer))
                # The residual projection weights (c_proj in attention and in the MLP) start smaller so that the model doesn’t blow up when it’s deep.


    def _init_weights(self, module):
        if isinstance(module, nn.Linear):#Checks if the layer is a fully connected (Linear) layer.
            nn.init.normal_(module.weight, mean=0.0, std=0.02) #Initializes the Linear layer’s weights with small random numbers from a normal distribution.
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):# Checks if the layer is an Embedding layer.
            nn.init.normal_(module.weight, mean=0.0, std=0.02) # Initializes all token vectors in the Embedding layer with small random numbers.

    def forward(self, idx, targets=None):
      device = idx.device
      b, t = idx.size()
      assert t <= self.config.block_size
      pos = torch.arange(0, t, dtype=torch.long, device=device)

      tok_emb = self.transformer.wte(idx) # Looks up the token embeddings for each input ID.
      pos_emb = self.transformer.wpe(pos)
      x = self.transformer.drop(tok_emb + pos_emb)
      for block in self.transformer.h:
          x = block(x)
      # Passes x through each Transformer block (h = stack of Block).
      # Each block applies self-attention + feedforward + residual connections.
      x = self.transformer.ln_f(x)

      if targets is not None:
          logits = self.lm_head(x) # Projects final embeddings x to vocabulary logits for each token.
          loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
          # Computes cross-entropy loss between predicted logits and target token IDs.

          return logits, loss
      else:
          logits = self.lm_head(x[:, [-1], :])
          return logits, None


    @torch.no_grad() # Tells PyTorch not to compute gradients during this function.
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):# maximum number of tokens we generate
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # Keep only the last block_size tokens if the sequence is too long, otherwise use the full sequence.
            logits, _ = self(idx_cond)
            # Pass the current sequence through GPT to get predicted scores for the next token.
            logits = logits[:, -1, :] / temperature # Selects the last token’s logits from the sequence. we only care about the next token after the current sequence.
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                # Find the top-k highest scores from the logits to limit candidate tokens for sampling.
                logits[logits < v[:, [-1]]] = -float('Inf')
                # Set all logits outside the top-k to negative infinity so they cannot be sampled.
            probs = F.softmax(logits, dim=-1)
            # Convert the logits into probabilities for all tokens so we can randomly sample the next token.
            idx_next = torch.multinomial(probs, num_samples=1)
            # Randomly pick the next token from the probability distribution predicted by the model.
            idx = torch.cat((idx, idx_next), dim=1)
        return idx


In [None]:
config = GPTConfig(
    vocab_size=50257,     # use the tokenizer's vocab size
    block_size=128,       # or whatever context size you're training with
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=True
)

model = GPT(config)

In [None]:
def estimate_loss(model):
  out ={}
  model.eval()
  # Sets the model to evaluation mode: disables dropout and freezes batch-normalization statistics, so the loss reflects inference-time behavior, not training randomness.
  with torch.inference_mode():
    for split in ['train','val']:
      losses= torch.zeros(eval_iters) # Creates a tensor to store the loss value of each mini-batch.
      for k in range(eval_iters):
        X, Y = get_batch(split)
        with ctx:
          logits,loss = model(X, Y)
        losses[k] = loss.item()
      out[split] = losses.mean()
  model.train()
  return out


Suppose the input is "I had always thought". For every position the model outputs a row of logits, one per vocabulary entry, and the predicted token ID is the one with the largest logit. For example, if token ID 23 is the correct next word after the current word, we need the probability of ID 23 to be high after softmax.

So after training, when we take the softmax over the vocabulary for the word "had", the correct next word "always" should have the highest probability.

Flatten -> the outputs for all sequences in the batch are combined into one long list of positions.
Cross-entropy loss -> the negative log-likelihood of the correct next tokens.

estimate_loss -> estimates the loss per batch and takes the mean of those losses. We run the model for a prescribed number of iterations; in each iteration we draw a different batch, pass it through the model to get the logits, and pass the logits through the loss function to get the loss.
From the logits we get the token IDs, which decode to the predicted next tokens.
For the sentence "I had always thought" the target output is "had always thought that".

Each word should assign a high probability to the word that follows it.

Logits for this sentence = 4 by vocab size.
The logits matrices of all sequences are flattened together into one big logits matrix.
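
The flatten-then-cross-entropy step can be written out by hand. The sketch below uses plain Python lists in place of tensors; the helper `cross_entropy`, the 3-token vocabulary, and all numbers are made up for illustration, and it is meant to mirror what F.cross_entropy computes on flattened logits rather than replace it.

```python
import math

def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood of the target IDs under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits_rows, targets):
        log_sum_exp = math.log(sum(math.exp(z) for z in row))
        total += log_sum_exp - row[target]       # equals -log softmax(row)[target]
    return total / len(targets)

# Two sequences of two positions each, over a toy 3-token vocabulary.
batch_logits = [
    [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
batch_targets = [[0, 1], [2, 0]]

# Flatten (B, T, V) -> (B*T, V) and (B, T) -> (B*T,), like logits.view(-1, V).
flat_logits = [row for seq in batch_logits for row in seq]
flat_targets = [t for seq in batch_targets for t in seq]

loss = cross_entropy(flat_logits, flat_targets)
```

A row with equal logits contributes log(vocabulary size) to the loss, which is the value an untrained model starts near.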

DEFINE SLM TRAINING CONFIG

In [None]:
import torch
from contextlib import nullcontext

learning_rate = 1e-4 # lower than the earlier 1e-3, for more stable training
max_iters = 20000  # number of training iterations (one batch each); with gradient accumulation the weights are updated once every 32 iterations
warmup_steps = 100
# The model starts updating weights immediately, from step 1.
# What warmup does is start with a reduced learning rate at step 1 and gradually increase it over the first warmup_steps (here 100) steps.
# By the end of warmup, the learning rate reaches its full intended value. After that, the decay schedule takes over.
min_lr = 5e-4
# min_lr is the value the cosine schedule ends at (eta_min). Note that 5e-4 is above learning_rate (1e-4), so with these values the rate rises toward 5e-4 after warmup instead of decaying; a true decay needs min_lr < learning_rate.
eval_iters = 500 # number of batches averaged per loss estimate; the training loop also evaluates every eval_iters iterations
batch_size = 32 # number of sequences in one forward/backward pass
block_size = 128 # context size: the number of tokens the model looks at in one input sequence

gradient_accumulation_steps = 32
# Normally, the model updates weights after every batch. With gradient accumulation, the model waits for 32 batches, adds up the gradients, and then updates weights once. This is like saving up small steps before taking a bigger step.


device =  "cuda" if torch.cuda.is_available() else "cpu"
device_type = 'cuda' if 'cuda' in device else 'cpu'

dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]

ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
# uses mixed-precision autocast on GPU and a do-nothing context on CPU.


torch.set_default_device(device)
torch.manual_seed(42)


In [None]:
from torch.optim.lr_scheduler import LinearLR,SequentialLR, CosineAnnealingLR

optimizer =  torch.optim.AdamW(model.parameters(),lr=learning_rate, betas =(0.9,0.95),weight_decay=0.1,eps=1e-9)

scheduler_warmup = LinearLR(optimizer, total_iters = warmup_steps) #Implement linear warmup
scheduler_decay = CosineAnnealingLR(optimizer,T_max = max_iters - warmup_steps, eta_min = min_lr) #Implement lr decay
scheduler = SequentialLR(optimizer, schedulers=[scheduler_warmup, scheduler_decay], milestones=[warmup_steps]) #Switching from warmup to decay

scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
#Essentially, the GradScaler temporarily boosts the tiny gradients so they don't disappear in float16, and then it corrects the boost just in time for the update.

Training flow: input, target -> LLM -> logits -> loss -> the loss is propagated backward to update the weights, and so on.

Gradient accumulation -> suppose we want an effective batch of 1024 samples but the GPU can only fit 32 samples at a time.
We then run 32 iterations with batches of 32, accumulating the gradients, and update the parameters once after those 32 iterations. This gives one update per 32 x 32 = 1024 samples.

gradient_accumulation_steps = 32
Gradients are accumulated for this many iterations, and then the weights are updated.
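
A small check that accumulation matches a bigger batch: the sketch below compares the gradient of one batch of 8 samples with the gradient accumulated over 4 micro-batches of 2. The linear model, the random data, and the sizes are toy assumptions chosen only to keep the example short.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)        # toy parameter vector
data = torch.randn(8, 3)                      # 8 made-up samples
targets = torch.randn(8)

def mse(batch_x, batch_y):
    return ((batch_x @ w - batch_y) ** 2).mean()

# One big batch of 8 samples.
mse(data, targets).backward()
full_grad = w.grad.clone()
w.grad = None

# Four micro-batches of 2 samples; each loss is divided by the number of steps.
accum_steps = 4
for xb, yb in zip(data.chunk(accum_steps), targets.chunk(accum_steps)):
    (mse(xb, yb) / accum_steps).backward()    # backward() adds into w.grad
accum_grad = w.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```

Dividing each micro-batch loss by the number of accumulation steps is what makes the summed gradient equal to the mean over the full batch, which is why the training loop divides the loss by gradient_accumulation_steps.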


We use the Adam optimizer with weight decay (AdamW).

For the learning rate we use a linear warmup at the start, followed by a cosine decay.
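
The shape of such a schedule can be sketched with a hand-written formula. This is not the exact behavior of LinearLR and CosineAnnealingLR (LinearLR, for instance, starts from a fraction of the base rate by default); the function `lr_at` and the values below are hypothetical and chosen so that min_lr is below base_lr.

```python
import math

def lr_at(step, base_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical values with min_lr < base_lr, so the curve really decays.
schedule = [lr_at(s, base_lr=1e-3, min_lr=1e-4, warmup_steps=10, max_steps=100)
            for s in range(101)]
```

The rate climbs linearly for the first 10 steps, peaks at base_lr, and then follows half a cosine wave down to min_lr at the last step.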


 In Adam (and AdamW), two moving averages are maintained:

m_t → the first moment (mean of past gradients)

v_t → the second moment (mean of squared gradients)


weight_decay=0.1 → Adds regularization by slightly shrinking large weights toward zero.



eps=1e-9
If the average of squared gradients becomes very small, the denominator could approach zero, causing division by zero or huge updates.

Adding a small constant eps (like 1e-9) ensures that doesn’t happen.

β₁ controls how much past gradient direction (momentum) is remembered, while β₂ controls how much past gradient magnitude (squared gradients) is remembered for adaptive learning rates.
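
Putting these pieces together, one AdamW update for a single scalar parameter can be sketched as below. It follows the standard AdamW update rule with the hyperparameters from this notebook as defaults; the function `adamw_step` is a hand-written illustration, and PyTorch's own implementation is vectorized and may differ in detail.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-4, beta1=0.9, beta2=0.95, eps=1e-9, weight_decay=0.1):
    """One AdamW update for scalar parameter p with gradient g at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * g            # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g        # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * weight_decay * p              # decoupled weight decay shrinks p toward zero
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

On the very first step the bias-corrected ratio m_hat / sqrt(v_hat) is 1 for any positive gradient, so the parameter moves by almost exactly lr, plus the small weight-decay shrink.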

In [None]:
best_val_loss = float('inf')
best_model_params_path = "best_model_params.pt"
train_loss_list, validation_loss_list = [], []

# Ensure model is on the correct device
model = model.to(device)

# Training loop: each "epoch" below is a single iteration on one batch, not a full pass over the dataset
for epoch in tqdm(range(max_iters)):
    if epoch % eval_iters == 0 and epoch != 0:
        # Ensure estimate_loss uses the correct device
        losses = estimate_loss(model)
        print(f"Epoch {epoch}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        print(f"The current learning rate: {optimizer.param_groups[0]['lr']:.5f}")
        train_loss_list += [losses['train']]
        validation_loss_list += [losses['val']]

        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(model.state_dict(), best_model_params_path)

    # Ensure X and y are on the correct device
    X, y = get_batch("train")
    X, y = X.to(device), y.to(device)

    with ctx:
        logits, loss = model(X, y)
        loss = loss / gradient_accumulation_steps
    # backward runs outside autocast, as recommended for mixed precision
    scaler.scale(loss).backward()

    if ((epoch + 1) % gradient_accumulation_steps == 0) or (epoch + 1 == max_iters):
        scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    scheduler.step()

In [None]:
import matplotlib.pyplot as plt
train_loss_list_converted = [i.cpu().detach() for i in train_loss_list]
validation_loss_list_converted = [i.cpu().detach() for i in validation_loss_list]

plt.plot(train_loss_list_converted, 'g', label='train_loss')
plt.plot(validation_loss_list_converted, 'r', label='validation_loss')
plt.xlabel(f"Evaluation step (every {eval_iters} iterations)")
plt.ylabel("Loss")
plt.legend()
plt.show()



In [None]:
#Load the model
model = GPT(config)  # re-create the model with same config
device =  "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device))) # load best model states
model.eval() # switch off dropout before generating text


In [None]:
sentence = "Once upon a time there was a pumpkin."
context = (torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim = 0))
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))

Generation loop: the input goes into the model, the trained model predicts the next token, that token is appended to the input, the extended input goes back into the model, and so on.
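
The loop can be sketched without a trained network. In the example below, `toy_model` is a hypothetical stand-in that always favors "last token + 1" over a 5-token vocabulary, and `sample_next` is a hand-written version of temperature scaling and top-k filtering, not the generate method itself.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from logits after temperature scaling and optional top-k filtering."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    peak = max(scaled)
    weights = [math.exp(z - peak) for z in scaled]   # unnormalized softmax; exp(-inf) is 0.0
    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]

def toy_model(tokens):
    """Hypothetical stand-in for the trained model: favors (last token + 1) mod 5."""
    logits = [0.0] * 5
    logits[(tokens[-1] + 1) % 5] = 5.0
    return logits

rng = random.Random(0)
tokens = [0]                                   # the prompt
for _ in range(6):                             # generate 6 new tokens
    next_id = sample_next(toy_model(tokens), top_k=1, rng=rng)
    tokens.append(next_id)                     # feed the extended sequence back in

print(tokens)  # [0, 1, 2, 3, 4, 0, 1]
```

With top_k=1 only the highest-scoring token survives the filter, so the output is deterministic; a larger top_k or a higher temperature lets lower-scoring tokens be sampled as well.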