# Training a Small Language Model (SLM) from scratch

An SLM with 15M parameters is trained on the TinyStories dataset (short stories that use only words a 3-4 year old understands, synthetically generated with GPT-3.5 and GPT-4) so that it can generate a story from a prompt.

SLMs are not trained on large, general-purpose datasets; instead they are trained on a small dataset (or a subset of a larger one) for a specific purpose. The goal is for a small model to learn enough English grammar and meaning from this small dataset to come up with stories on its own. The model learns to output coherent stories/text.

## Step 1: load datasets

In [2]:
# !pip install datasets

In [1]:
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")

# Choose a custom cache dir so we know where it lives
# ds = load_dataset("roneneldan/TinyStories", cache_dir="./hf_cache")

# # Optional: save as Arrow or JSON for fully offline use later
# ds.save_to_disk("./tiny_stories_local")

# ds = load_from_disk("./tiny_stories_local")



## Step 2: Data pre-processing

Since neural networks understand only numerical data, the sentences are first split into tokens and every token is mapped to an integer token ID, which the model later turns into a vector.

1. Word-based tokenisation: a vocabulary (table) maps each word to its token ID. The vocabulary size plays an important role in the resources used.

2. Subword tokenisation: tokens can be words, characters or subwords.
  Byte pair encoding (BPE) is the tokeniser used here; it is also used by GPT-2.

3. Character based tokenisation
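
As a rough illustration of how the three schemes differ, the sketch below tokenises one sentence at the word level and at the character level using only the Python standard library, and splits it by hand into subwords. The sentence, the `build_vocab` helper and the hand-picked subword split are made up for illustration; a real BPE tokeniser learns its merges from data.

```python
# Toy comparison of tokenisation granularities (illustrative only).
sentence = "the cat sat on the mat"

def build_vocab(tokens):
    """Map each distinct token to an integer ID (hypothetical helper)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# 1. Word-based: one token per whitespace-separated word.
word_tokens = sentence.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

# 3. Character-based: one token per character (spaces included).
char_tokens = list(sentence)
char_vocab = build_vocab(char_tokens)
char_ids = [char_vocab[t] for t in char_tokens]

# 2. Subword: hand-picked pieces, standing in for what BPE would learn.
subword_tokens = ["the", " c", "at", " s", "at", " on", " the", " m", "at"]

print(len(word_tokens), len(word_vocab))    # sequence length, vocabulary size
print(len(char_tokens), len(char_vocab))
print("".join(subword_tokens) == sentence)  # subwords reassemble the sentence
```

Word-level tokenisation gives short sequences but a vocabulary that grows with every new word; character-level tokenisation gives a tiny vocabulary but long sequences. Subword tokenisation sits between the two.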

Every story is tokenised, and instead of maintaining a dictionary of token IDs for each story, the token IDs of all stories are concatenated and stored together in a single .bin file. A .bin file lives on disk and is read on demand instead of occupying RAM; using it is one of the techniques that speeds up our code. We also do not need to retokenise every time we want to train the model, as the token IDs are already stored on disk.

We then use a memory-mapped array to read from and write to disk without overloading the RAM.
With np.memmap the file is stored on disk but can be used like a NumPy array, so we can write to it chunk by chunk without holding everything in RAM.

The training split of about 2.12M stories is divided into 1024 batches, and the token IDs from each batch of stories are written into a 'train.bin' file.

tiktoken is an OpenAI library that provides the tokenisers/encodings used by the different GPT models.

At the end of this block:
1. We have tokenised the dataset into token IDs.
2. We have created the files 'train.bin' and 'validation.bin', containing the token IDs of the respective splits of the dataset.
3. The token IDs are stored on disk rather than in RAM, which keeps memory usage low.
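
A minimal sketch of the chunk-by-chunk write described above, assuming NumPy is installed. The file name `toy.bin` and the three small chunks are hypothetical stand-ins for the real batches of token IDs.

```python
import os
import tempfile
import numpy as np

# Three hypothetical "batches" of token IDs (stand-ins for tokenised stories).
chunks = [np.array([5, 6, 7], dtype=np.uint16),
          np.array([8, 9], dtype=np.uint16),
          np.array([10, 11, 12, 13], dtype=np.uint16)]
total_len = sum(len(c) for c in chunks)

path = os.path.join(tempfile.mkdtemp(), "toy.bin")  # hypothetical file name

# Create a disk-backed array of the final size, then fill it chunk by chunk.
arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(total_len,))
idx = 0
for chunk in chunks:
    arr[idx:idx + len(chunk)] = chunk
    idx += len(chunk)
arr.flush()  # make sure everything is written to disk
del arr      # drop the write handle

# Re-open read-only: slices are read from disk on demand.
data = np.memmap(path, dtype=np.uint16, mode="r")
first_five = data[:5].tolist()
print(len(data), first_five)
```

Each token ID takes 2 bytes as `uint16`, so the file size on disk is simply twice the number of tokens.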

In [6]:
# !pip install tiktoken

import tiktoken
import os
import numpy as np
from tqdm.auto import tqdm

# we go through all stories; every story is converted into token ids and stored along with its length
def process(example):
  import tiktoken
  enc = tiktoken.get_encoding("gpt2")
  ids = enc.encode_ordinary(example["text"])
  out = {'ids': ids, 'len': len(ids)}
  return out

if not os.path.exists("train.bin"):
  tokenized = ds.map(process,
                     remove_columns=['text'],
                     desc="tokenising the splits",
                     num_proc=8)

  #concat all ids in each dataset into 1 large file for training
  for split,dset in tokenized.items():
    arr_len = np.sum(dset['len'], dtype=np.uint64)
    filename = f'{split}.bin' # create a bin file for each training and validation
    dtype = np.uint16 #since enc.max_token_value == 50256 is < 2**16
    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,)) # this is the memory mapped array
    total_batches = 1024 #entire dataset is split into 1024 batches

    idx = 0
    for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
      #batch samples for faster write
      batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
      arr_batch = np.concatenate(batch['ids']) # for each batch token ids are collected and stored into the array
      # write into mmap
      arr[idx:idx+len(arr_batch)] = arr_batch
      idx += len(arr_batch)
    arr.flush() #stores everything on disc finally

tokenising the splits (num_proc=8): 100%|██████████| 2119719/2119719 [01:07<00:00, 31237.34 examples/s]
tokenising the splits (num_proc=8): 100%|██████████| 21990/21990 [00:05<00:00, 3790.59 examples/s] 
writing train.bin:  22%|██▏       | 230/1024 [02:54<10:00,  1.32it/s]


KeyboardInterrupt: 

## Step 3: create input output pairs from dataset

The loss function used for training the model is ultimately based on these input-output pairs.
In regression or classification tasks in ML the dataset already contains the input and the correct output, but in language modelling we create both from the dataset itself.<br>
A traditional language model predicts the next token given the existing sequence of tokens (next-token prediction).

To create these pairs
1. Fix the context size: the maximum number of tokens the model looks at, at a time, to predict the next token. Chunk the entire dataset according to the context size. (In the illustration here each chunk contains 4 token IDs; the training configuration later uses block_size = 128.)
2. Fix the batch size: the loss is calculated per batch and then backpropagated.

Finally input batches are created like matrices with <br>
number of rows = batch size <br>
number of cols = context size <br>

The output batch is then created by just sliding the input window/token sequence by 1 token.

A single training pair with a context size of 4 therefore contains 4 prediction tasks and not just 1: every prefix of the input predicts the token that follows it.

The training objective is then simply next-token prediction.
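
The sliding-window construction can be sketched in plain Python. The token IDs below are made up, and `make_pair` is a hypothetical helper that mirrors the slicing done in `get_batch`.

```python
# Hypothetical token IDs standing in for a slice of train.bin.
tokens = [10, 20, 30, 40, 50, 60]
context_size = 4

def make_pair(data, start, block_size):
    """Return (input, target): the target is the input shifted by one token."""
    x = data[start:start + block_size]
    y = data[start + 1:start + 1 + block_size]
    return x, y

x, y = make_pair(tokens, 0, context_size)
print("x:", x)
print("y:", y)

# Each prefix of x predicts the next token, so one pair holds 4 tasks.
tasks = [(x[:i + 1], y[i]) for i in range(context_size)]
for context, target in tasks:
    print(context, "->", target)
```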

In [2]:
def get_batch(split): # this function just returns 1 input matrix and 1 output matrix
  if split == 'train':
    data = np.memmap('train.bin', dtype=np.uint16, mode='r')
  else:
    data = np.memmap('validation.bin', dtype=np.uint16, mode='r')
  ix = torch.randint(len(data) - block_size, (batch_size,)) #randomly samples batch_size start offsets; block_size = context size
  x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix]) #stacks the input windows x1, x2, ... into one matrix
  y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix]) #stacks the target windows y1, y2, ... (inputs shifted by 1 token)
  #for computational efficiency
  if device_type == 'cuda':
    x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
  else:
    x, y = x.to(device), y.to(device)
  return x, y

- pin_memory() locks the tensor's memory in RAM (page-locked memory), allowing for faster transfer of the tensor to the GPU.
- .to(device) usually blocks the CPU until copying is done. non_blocking=True lets the CPU continue with other work (like preparing the next batch).

## Step 4: SLM model architecture

The architecture consists of 3 parts
1. Input block
2. Processor
3. Output block

![alt text](images/transformer_block.png)

The journey of 1 row from the batch:
1. Token embedding: the token IDs of the context window are converted into higher-dimensional vectors. This is required because every word carries some meaning and contributes to the sentence. Words have a semantic notion, so they are represented in a higher-dimensional space that captures some information about their meaning and relationships. Similar tokens have vectors that lie closer together than those of opposite or unrelated tokens.<br>
This is done with a token embedding matrix that has one row per token in the vocabulary and a fixed embedding dimension for every embedding vector.<br>
The matrix is then simply used as a lookup table to retrieve the embedding corresponding to a token ID.
The embeddings are initialised randomly and are trained through backpropagation like the other network weights.

2. Layer norm: normalises the activations of each token to zero mean and unit variance, which stabilises the activations and makes training easier.

3. Multi-head attention: captures how one token is related to the other tokens. It augments the input embedding vector so that every vector carries information about its neighbouring vectors. The block converts each input embedding vector into a context vector: attention scores measure how the token relates to all the other tokens, and these scores are then used to go from the input embedding vector to the context vector.<br>
This context vector is much richer, as it holds information about how the token relates to the tokens surrounding it.<br>
Attention must be causal: every token should have attention scores only with itself and the tokens appearing before it, not after it. Hence the attention scores above the diagonal of the attention score matrix are masked out (set to -inf, which becomes a weight of 0 after the softmax). This is known as causal attention.
Capturing context is offloaded onto the trainable weight matrices q (query), k (key) and v (value), as we could not devise any other method to capture context.

4. Feed-forward network (MLP): an expansion-compression neural network whose input dimension equals its output dimension. The hidden layer is 4 times wider than the embedding dimension (for example 768 -> 4*768 = 3072; with the configuration used here, 384 -> 4*384 = 1536). Thanks to the expansion and contraction and the additional trainable parameters, the model can explore a much richer space to learn new things. Without this network the model is much worse at learning the patterns underlying the data and at answering queries; adding it noticeably changes the performance of the model. The activation function used here is GELU, which has been shown to give good results.

5. Output layer (output head): converts every vector to vocabulary size. The output is the logits tensor.
For 1 batch: 1 (batch size) \* 4 (number of tokens) \* 50257 (vocab size)
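
To make the causal attention step from item 3 concrete, here is a single-head sketch in plain Python. The 3-token example and the `causal_attention` helper are hypothetical; the model itself uses batched tensors and several heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention on lists of vectors (one per token)."""
    d = len(k[0])
    T = len(q)
    out = []
    weights = []
    for i in range(T):
        # Scaled dot-product scores; future positions (j > i) are masked.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            if j <= i else float("-inf")
            for j in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Context vector = weighted sum of the value vectors.
        out.append([sum(w[j] * v[j][c] for j in range(T))
                    for c in range(len(v[0]))])
    return out, weights

# Hypothetical 3-token example with 2-dimensional vectors.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

context, weights = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix sums to 1, and all entries above the diagonal are 0, so the first token can only attend to itself.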


$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \mathrm{weight} + \mathrm{bias}$$
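
As a quick numerical check of the layer norm formula, the sketch below applies it to one hypothetical activation vector in plain Python. It assumes the biased variance (division by n), with weight set to 1 and bias set to 0.

```python
import math

def layer_norm(x, weight, bias, eps=1e-5):
    """Normalise one vector to zero mean / unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance
    return [(v - mean) / math.sqrt(var + eps) * w + b
            for v, w, b in zip(x, weight, bias)]

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one token
out = layer_norm(x, weight=[1.0] * 4, bias=[0.0] * 4)
print([round(v, 4) for v in out])

# The normalised vector has (almost exactly) zero mean and unit variance.
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(round(out_mean, 6), round(out_var, 4))
```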

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
import numpy as np
from tqdm.auto import tqdm
from contextlib import nullcontext
import os

class LayerNorm(nn.Module):
  def __init__(self,ndim, bias):
    super().__init__()
    self.weight = nn.Parameter(torch.ones(ndim))
    self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
  def forward(self,x):
    return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5) #normalises across specified last N dimensions, then optionally applies scaling(weight) and shifting (bias) parameters.
    # x = input tensor to normalise
    # self.weight.shape = dimension to normalise over
    # weight and bias are learnable parameters  (γ, β in layer norm equation)
    # 1e-5 added to variance to avoid division by zero

class CausalSelfAttention(nn.Module): #self attention because we calc attention of the tokens of the sentence with themselves.
  def __init__(self,config):
    super().__init__()
    assert config.n_embd % config.n_head == 0
    self.c_attn = nn.Linear(config.n_embd, 3*config.n_embd, bias=config.bias)
    self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias = config.bias)
    self.attn_dropout = nn.Dropout(config.dropout)
    self.resid_dropout = nn.Dropout(config.dropout)
    self.n_head = config.n_head
    self.n_embd = config.n_embd
    self.flash = hasattr(F, 'scaled_dot_product_attention')
    if not self.flash:
      self.register_buffer("bias",torch.tril(torch.ones(config.block_size, config.block_size))
                           .view(1, 1, config.block_size, config.block_size))
  def forward(self,x):
    B, T, C = x.size()
    q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
    q = q.view(B, T, self.n_head, C//self.n_head).transpose(1,2) # query matrix
    k = k.view(B, T, self.n_head, C//self.n_head).transpose(1,2) # key matrix
    v = v.view(B, T, self.n_head, C//self.n_head).transpose(1,2) # value matrix

    if self.flash:
      y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.attn_dropout.p if self.training else 0.0, is_causal=True)
    else:
      att = (q @ k.transpose(-2,-1)) * (1.0 / math.sqrt(k.size(-1))) # attention score matrix (q*k^Transpose). divide by sqrt(keys_dimension) to contain the variance of the dot product, to stabilise the training.
      att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf')) # implements causal attention. -inf becomes 0 after softmax
      att = F.softmax(att, dim=-1) # attention score matrix is converted to an attention weights matrix. Every row sums up to 1.
      att = self.attn_dropout(att) #improves generalisation
      y = att @ v # attention weights matrix * values = context vector matrix

    y = y.transpose(1,2).contiguous().view(B, T, C)
    y = self.resid_dropout(self.c_proj(y)) #output projection layer. Final neural n/w added at the output. Preserves the shape of the output. + 1 more layer of dropout
    return y

class MLP(nn.Module): # expansion and contraction neural n/w. These additional trainable params allow neural net to explore a much bigger space.
  def __init__(self, config):
    super().__init__()
    self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias) # hidden layer has a dimension of 4 * embedding dim
    self.gelu = nn.GELU()
    self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias) # o/p layer again has dim of embedding dim/input-layer
    self.dropout = nn.Dropout(config.dropout)

  def forward(self,x):
    return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))

class Block(nn.Module): #implements the transformer block by assembling the layer norm, multihead attention and Feed Fwd nn
  def __init__(self, config):
    super().__init__()
    self.ln1 = LayerNorm(config.n_embd, bias=config.bias)
    self.attn = CausalSelfAttention(config)
    self.ln2 = LayerNorm(config.n_embd, bias=config.bias)
    self.mlp = MLP(config)
  def forward(self, x):
    x  = x + self.attn(self.ln1(x)) # skip connections added to avoid vanishing gradients by providing the gradient an alternative path to flow. (Like in ResNet)
    x = x + self.mlp(self.ln2(x))
    return x

@dataclass
class GPTConfig:
  block_size: int
  vocab_size: int
  n_layer: int
  n_head: int
  n_embd: int
  dropout: float = 0.0
  bias: bool = True

class GPT(nn.Module):
  def __init__(self, config):
    super().__init__()
    self.config = config
    self.transformer = nn.ModuleDict(dict(
        wte = nn.Embedding(config.vocab_size, config.n_embd),
        wpe = nn.Embedding(config.block_size, config.n_embd),
        drop = nn.Dropout(config.dropout),
        h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # we use multiple transformer blocks, n_layers tell us how many blocks are used
        ln_f = LayerNorm(config.n_embd, bias=config.bias)
    ))
    self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
    self.transformer.wte.weight = self.lm_head.weight #weight tying

    self.apply(self._init_weights)
    for pn, p in self.named_parameters():
      if pn.endswith('c_proj.weight'):
        torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

  def _init_weights(self, module): # different initialisations done for different sections of the n/w
    if isinstance(module, nn.Linear):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) # gaussian init
      if module.bias is not None:
        torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) # gaussian init

  def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= self.config.block_size, "context length exceeded"
    pos = torch.arange(0, t, dtype=torch.long, device=device)

    tok_emb = self.transformer.wte(idx) #token embedding
    pos_emb = self.transformer.wpe(pos) #position embedding
    x = self.transformer.drop(tok_emb + pos_emb) # the sum of both embeddings are passed as input to the transformer blocks
    for block in self.transformer.h: # multiple transformer blocks are used by the GPT
      x = block(x)
    x = self.transformer.ln_f(x) # final layer norm here before the output layer

    if targets is not None:
      logits = self.lm_head(x) # (output head)neural network with number of rows = embedding size & n.o of cols = vocab size.
      loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1) # the predicted token is compared with actual target tokens and cross entropy loss is calculated.
      return logits, loss
    else:
      logits = self.lm_head(x)
      return logits, None # softmax over logits used for predicting the next output. cross entropy loss used when training.

  @torch.no_grad()
  def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
      idx_cond = idx if idx.size(-1) <= self.config.block_size else idx[:,-self.config.block_size:]
      logits, _ = self(idx_cond)
      logits = logits[:,-1,:] / temperature
      if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = -float('Inf')
      probs = F.softmax(logits, dim=-1)
      idx_next = torch.multinomial(probs, num_samples=1)
      idx = torch.cat((idx, idx_next), dim=1)
    return idx



In [2]:
config = GPTConfig(
    vocab_size=50257,     # use the tokenizer's vocab size
    block_size=128,       # or whatever context size you're training with
    n_layer=6,            # number of transformer blocks we have
    n_head=6,             # number of attention heads we have in the attention blocks
    n_embd=384,           # embedding dimension
    dropout=0.1,
    bias=True
)

model = GPT(config)

weight initialisation for different layers:
![alt text](images/weight_initialisations.png)

## Step 5: Define the loss function

From the logits of the network we get, via softmax, the predicted probabilities of the true target token IDs. We want these probabilities to be as close as possible to 1, and the loss function is built from them. We use the negative log-likelihood (NLL) of these probabilities: for 4 prediction tasks, -1/4 (log p1 + log p2 + log p3 + log p4). Since we want to maximise p1, p2, ..., we minimise the NLL; the NLL is smallest when the probabilities p1, p2, ... are close to 1. This NLL is also called the cross-entropy loss, and it is the loss that is backpropagated in language models. Using the product p1*p2*p3*p4 directly is not ideal: differentiating a product of many terms is cumbersome, and a product of many small probabilities quickly underflows. Taking the log turns the product into a sum, whose derivative is easy to compute for backpropagation.
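
A small worked example of this loss in plain Python, using made-up logits for 4 positions over a 3-token vocabulary. It computes the NLL from the target probabilities and, as a cross-check, the same value written directly in terms of the logits.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 4 positions over a 3-token vocabulary.
logits = [[2.0, 0.5, 0.1],
          [0.2, 1.5, 0.3],
          [0.1, 0.2, 3.0],
          [1.0, 1.0, 1.0]]
targets = [0, 1, 2, 0]  # true next-token IDs

# Probability that the model assigns to each true target token.
p = [softmax(row)[t] for row, t in zip(logits, targets)]

# Negative log-likelihood averaged over the 4 prediction tasks.
nll = -sum(math.log(pi) for pi in p) / len(p)

# Same quantity written as cross entropy: log-sum-exp minus the target logit.
ce = sum(math.log(sum(math.exp(v) for v in row)) - row[t]
         for row, t in zip(logits, targets)) / len(targets)

print([round(pi, 3) for pi in p])
print(round(nll, 4), round(ce, 4))
```

The last position has uniform logits, so its target probability is 1/3 and it contributes log 3 to the sum; the more confident positions contribute less.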

In [4]:
def estimate_loss(model): #gets eval_iters batches per split and calculates the mean loss over them
  out = {}
  model.eval()
  with torch.inference_mode():
    for split in ['train', 'val']:
      losses = torch.zeros(eval_iters)
      for k in range(eval_iters):
        x, y = get_batch(split)
        with ctx:
          logits, loss = model(x, y)
        losses[k] = loss.item()
      out[split] = losses.mean()
  model.train()
  return out

## Step 6: Define the training loop

1. Use autocast to make the training faster. It enables automatic mixed precision (AMP), where the model uses a 16-bit format (float16 or bfloat16) where it is safe and float32 where needed.
![alt text](images/AMP.png)

2. Gradient accumulation: when we want to train with a batch size of 1024, but the GPU only fits a batch of 32, we set

    gradient_accumulation_steps = 32

    run 32 iterations with batch_size = 32

    accumulate gradients during loss.backward()
    then do optimiser.step() after 32 steps. {PS: 32 * 32 = 1024}

    Parameters are updated once every 32 steps with the accumulated gradient. In the training loop each loss is divided by gradient_accumulation_steps, so the accumulated gradient is the mean over the 32 micro-batches.

3. The AdamW optimiser is used, with a learning rate schedule of linear warmup followed by cosine decay.
![alt text](images/learning_rate.png)

4. Pretraining the SLM<br>
    For each training iteration we perform the following steps:<br>
    choose x & y -> pass x through the model to get logits -> compute the loss between logits and y (cross entropy) -> backpropagate the loss -> accumulate gradients until we reach gradient_accumulation_steps -> update parameters -> update LR -> evaluate and save the best model
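
Why dividing each loss by the number of accumulation steps reproduces the large-batch gradient can be checked on a toy problem. The sketch below uses a hypothetical one-parameter model `w * x` with a mean-squared-error loss and made-up data; it is an illustration of the idea, not the training code.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w for one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: 8 samples, processed as 4 micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size

# Gradient of the full batch in one go.
full_grad = grad_mse(w, xs, ys)

# Accumulate micro-batch gradients, each scaled by 1 / accumulation_steps.
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(round(full_grad, 6), round(accumulated, 6))
```

The two numbers agree up to floating-point error, which is why one optimiser step after the accumulation behaves like a step on the larger batch.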



In [6]:
# SLM training configuration

import torch
from contextlib import nullcontext

max_iters = 20000
learning_rate = 1e-4
warmup_steps = 1000
min_lr = 1e-5 #floor of the cosine decay; must be below learning_rate (this value is an assumption)
eval_iters = 500
batch_size = 32
block_size = 128 #context size

gradient_accumulation_steps = 32

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device_type = 'cuda' if 'cuda' in device else 'cpu'#for use in autocast

dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]

ctx = nullcontext() if device_type =='cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

torch.set_default_device(device)
torch.manual_seed(42)

<torch._C.Generator at 0x227047d9130>

In [7]:
from torch.optim.lr_scheduler import LinearLR, SequentialLR, CosineAnnealingLR

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, betas=(0.9,0.95), weight_decay=0.1, eps=1e-9) #weight decay for regularisation

scheduler_warmup = LinearLR(optimizer, total_iters=warmup_steps) #linear warmup
scheduler_decay = CosineAnnealingLR(optimizer, T_max = max_iters - warmup_steps, eta_min = min_lr) #cosine decay
scheduler = SequentialLR(optimizer, [scheduler_warmup, scheduler_decay], milestones=[warmup_steps]) #combines both

scaler = torch.amp.GradScaler('cuda', enabled=(dtype == 'float16')) #for mixed precision; only needed for float16
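
The shape of this schedule can be approximated in closed form without PyTorch. The sketch below assumes `LinearLR`'s default start factor of 1/3 and takes `base_lr`, `warmup_steps` and `max_iters` from the configuration above; `min_lr = 1e-5` is an assumed floor below the base learning rate. It is a sketch of the curve, not the library's implementation.

```python
import math

def lr_at(step, base_lr, warmup_steps, max_iters, min_lr, start_factor=1.0 / 3):
    """Closed-form learning rate: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from start_factor * base_lr up to base_lr.
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine decay over the remaining iterations.
    t = step - warmup_steps
    t_max = max_iters - warmup_steps
    return min_lr + (base_lr - min_lr) * (1 + math.cos(math.pi * t / t_max)) / 2

# min_lr is an assumption; the other values follow the configuration above.
args = dict(base_lr=1e-4, warmup_steps=1000, max_iters=20000, min_lr=1e-5)

for step in (0, 500, 1000, 10500, 20000):
    print(step, f"{lr_at(step, **args):.6f}")
```

Halfway through the warmup (iteration 500) the sketch gives about 0.00007, which is in line with the learning rate printed in the training log.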




In [8]:
# pretraining the SLM

best_val_loss = float('inf')
best_model_params_path = "best_model_params.pt"
train_loss_list, validation_loss_list = [], []

# Ensure model is on the correct device
model = model.to(device)

# In your training loop
for epoch in tqdm(range(max_iters)):
    if epoch % eval_iters == 0 and epoch != 0:
        # Ensure estimate_loss uses the correct device
        losses = estimate_loss(model)
        print(f"Epoch {epoch}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        print(f"The current learning rate: {optimizer.param_groups[0]['lr']:.5f}")
        train_loss_list += [losses['train']]
        validation_loss_list += [losses['val']]

        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(model.state_dict(), best_model_params_path)

    # Ensure X and y are on the correct device
    X, y = get_batch("train")
    X, y = X.to(device), y.to(device)

    with ctx:
        logits, loss = model(X, y)
        loss = loss / gradient_accumulation_steps # average over the accumulation steps
    scaler.scale(loss).backward() # backward pass runs outside the autocast context

    if ((epoch + 1) % gradient_accumulation_steps == 0) or (epoch + 1 == max_iters):
        scaler.unscale_(optimizer) # unscale the gradients so clipping sees their true norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    scheduler.step()

  2%|▎         | 500/20000 [01:39<1:11:33,  4.54it/s]

Epoch 500: train loss 9.4312, val loss 9.4369
The current learning rate: 0.00007


  4%|▍         | 782/20000 [04:05<1:40:40,  3.18it/s]  


KeyboardInterrupt: 

Plotting the SLM loss function

In [9]:
import matplotlib.pyplot as plt
train_loss_list_converted = [i.cpu().detach() for i in train_loss_list]
validation_loss_list_converted = [i.cpu().detach() for i in validation_loss_list]

plt.plot(train_loss_list_converted, 'g', label='train_loss')
plt.plot(validation_loss_list_converted, 'r', label='validation_loss')
plt.xlabel(f"Steps - Every {eval_iters} iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()




## Step 7: Inference

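Loop over
1. Pass the current token sequence to the model to get the logits
2. Take the logits of the last position (scaled by the temperature, optionally restricted to the top-k values)
3. Convert them to probabilities with softmax and sample the next token ID (greedy decoding would instead take the index of the highest value)
4. Append the new token ID to the previous tokens
5. Pass the extended sequence to the model and continue the loop
6. When the loop ends, decode the token IDs back to text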
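
The sampling step of this loop can be sketched in plain Python. The logits and the `sample_next` helper below are hypothetical; the helper follows the same idea as `generate` (temperature scaling, optional top-k filtering, softmax, then one random draw) but works on a single list of logits.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from logits with temperature and optional top-k."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        # Keep the top_k largest logits; mask the rest with -inf.
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [v if v >= kth else float("-inf") for v in scaled]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the probabilities.
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over a 4-token vocabulary
rng = random.Random(0)

token_id, probs = sample_next(logits, temperature=0.8, top_k=2, rng=rng)
print(token_id, [round(p, 3) for p in probs])

greedy_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax decoding
print(greedy_id)
```

A lower temperature sharpens the distribution towards the most likely token, while top-k removes unlikely tokens from the draw altogether.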

In [3]:
#Load the model
model = GPT(config)  # re-create the model with same config
device =  "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device))) # load best model states




<All keys matched successfully>

In [4]:
import tiktoken
sentence = "Once upon a time there was a pumpkin."
enc = tiktoken.get_encoding("gpt2")
context = (torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim = 0))
model.eval() # switch off dropout for inference
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))

Once upon a time there was a pumpkin. It was so big and soft and nowhere to go. Every day the soil would not feel Molly was never like she continued to play. She was always taller in its backyard, so she got to rest. 

One night, kittens decided up to explore and discover what she had done. Whenever Kitty's son said it was time to show her surprises, but Molly said it could have an best. 

So Jane swam and her friends were very proud of themselves.

Anna was so happy that Mia hurt. She had a so much fun time when she said goodbye as delivered them to her family. 

"Yes, I don't think it is a very important lesson. I don't Okay before you want you to help from disgusting animals us.”

The family smiled and thanked Daisy's mother and hugged her young brother harder excitedly. 

The endill and Abina became best friends and they understanding.Ben and Jill were dreaming to

