# Lab 5 - our own Large (or not that large) Language Model (LLM)

In this lab, we will build our own Large Language Model (LLM) from scratch … or, to be precise, a highly simplified version that will allow us to understand the basics without requiring enormous computational resources or massive datasets.

We will of course use the transformer architecture, which we introduced in the previous class. But first, let’s go through some important concepts.


## What is a Language Model (LM)?

A language model is a probabilistic model that operates on sequences of words or characters and predicts the next word or character in a sequence based on the previous ones. Language models are used in many natural language processing (NLP) applications, such as machine translation, speech recognition, text generation, and many others.
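
Formally, predicting each token from the previous ones corresponds to the chain-rule factorization of the probability of a sequence. As a brief sketch (the symbols $x_t$ for the token at position $t$ and $T$ for the sequence length are notation introduced here for illustration):

$$
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
$$

Each factor on the right-hand side is one "next token" prediction, which is the quantity a next-token language model is trained to estimate.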

In recent years, transformer-based language models have achieved remarkable results across a wide range of NLP tasks. These models are trained on massive text datasets and are capable of generating and understanding long and coherent texts.

We will look at two main types of transformer-based language models:
1. **Autoregressive Model**: This type of model generates text sequentially, predicting the next word based on the previous ones. An example is GPT (Generative Pre-trained Transformer), developed primarily by OpenAI. Autoregressive models are often used for tasks such as text generation, machine translation, and text completion.
2. **Masked Language Model**: This type of model is trained to predict missing words in a sentence, where some words are “masked” (hidden). A well-known example is BERT (Bidirectional Encoder Representations from Transformers), developed by Google. Masked models are often used for tasks such as text classification, sentiment analysis, and question answering.


## How do these two models differ?

The simplest difference lies in their architecture. Autoregressive models typically consist only of decoders, while masked models consist only of encoders. Of course, there are also hybrid approaches, such as T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers), which combine features of both. However, for the purpose of this lab, we will focus on the pure implementations.

In autoregressive models (e.g. GPT), the decoder is trained to predict the next word in a sequence based on all previous words. To achieve this, masking is applied to prevent the decoder from accessing future words during training (causal masking, which we discussed in the previous class). During text generation, the model produces words one by one, using its previously generated output as context.

In masked language models (e.g. BERT), the encoder is trained to predict masked words in a sentence by using context from both the left and the right. During training, some words are randomly masked, and the model learns to predict them based on the remaining words. Masked models are generally used for tasks that require understanding the full context of a sentence or document.
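
To make the difference concrete, here is a minimal sketch that compares causal (GPT-style) attention weights with unmasked, bidirectional (BERT-style) ones. The sequence length and the all-zero scores are arbitrary placeholders chosen only for illustration.

```python
import torch

seq_len = 4

# Causal mask (GPT-style): position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Dummy attention scores; masked positions get -inf so softmax assigns them weight 0.
scores = torch.zeros(seq_len, seq_len)
causal_attn = torch.softmax(scores.masked_fill(causal_mask == 0, float("-inf")), dim=-1)

# Bidirectional attention (BERT-style): no mask, every position sees the whole sequence.
full_attn = torch.softmax(scores, dim=-1)

print(causal_attn)  # lower-triangular: no weight on future positions
print(full_attn)    # uniform: every position attends everywhere
```

In the causal case the first row puts all of its weight on the first token, while in the bidirectional case every row spreads its weight over the entire sequence.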

<img src=https://raw.githubusercontent.com/vision-agh/DNN-Course-media/refs/heads/main/lab4_5_transformer/figures/gpt_vs_bert.png width=600>

Although, based on what we have learned so far, we could implement both types of models, in this lab we will focus on the autoregressive model. It is more intuitive, easier to understand for beginners, and more interesting in the context of text generation.

For those interested, I encourage you to explore and experiment with masked models such as BERT to broaden your knowledge of different language model architectures. For example, you can follow this tutorial: [link](https://medium.com/@adnanmasood/a-tiny-bert-style-model-from-scratch-a-detailed-exploration-5bc47d59bff5).

Finally, the code below is partly adapted from this tutorial: [link](https://www.youtube.com/watch?v=kCc8FmEb1nY) and the original paper [Attention is All You Need](https://arxiv.org/abs/1706.03762).


## Import and Setup

As usual, we start by importing the necessary libraries and setting up the device.

In [6]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import os, glob

from typing import Optional

import wandb

wandb.login()  # Log in to your W&B account

torch.manual_seed(42)  # For reproducibility

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cpu


Since we are going to implement an autoregressive model, we will need to initialize and use the following components:

1. `PositionalEncoding` class – for adding positional information to the input embeddings.
2. `make_causal_mask` function - for creating a causal mask that prevents the model from attending to future tokens during training.
3. `Head`, `MultiHeadAttention` and `DecoderBlock` classes - for constructing the architecture.

If we prepared part1 correctly, we can simply import these components directly from part1.ipynb.

In [7]:
# --- Positional Encoding ---
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

In [8]:
# --- Single Attention Head ---
class Head(nn.Module):
    def __init__(self, emb_size, head_size, dropout=0.0, bias=False):
        super().__init__()
        self.key   = nn.Linear(emb_size, head_size, bias=bias)
        self.query = nn.Linear(emb_size, head_size, bias=bias)
        self.value = nn.Linear(emb_size, head_size, bias=bias)
        self.scale = math.sqrt(head_size)  
        self.dropout = nn.Dropout(dropout)
        self.attn_weights = None          

    def forward(self, q, k, v, mask=None):
        Q = self.query(q)  # [batch, seq_len, head_size]
        K = self.key(k)    # [batch, seq_len, head_size]
        V = self.value(v)  # [batch, seq_len, head_size]
        return self.scaled_dot_product_attention(Q, K, V, mask)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale 

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)  
        self.attn_weights = attn.detach()
        out = torch.matmul(attn, V)  
        return out

In [9]:
# --- Multi-Head Attention ---
class MultiHeadAttention(nn.Module):
    def __init__(self, emb_size, num_heads, dropout=0.0, bias=False):
        super().__init__()
        assert emb_size % num_heads == 0
        head_size = emb_size // num_heads
        self.heads = nn.ModuleList([
            Head(emb_size, head_size, dropout, bias) for _ in range(num_heads)
        ])
        self.linear = nn.Linear(head_size*num_heads, emb_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        out = [head(q, k, v, mask) for head in self.heads]
        out = torch.cat(out, dim=-1)
        out = self.linear(out)      # project the concatenated heads back to emb_size
        return self.dropout(out)    # apply dropout once

    def get_attention_maps(self):
        return [h.attn_weights for h in self.heads if h.attn_weights is not None]

In [10]:
# --- Decoder Block ---
class FeedForwardBLock(nn.Module):
    def __init__(self, emb_size, expansion, dropout):
        super().__init__()
        self.linear1 = nn.Linear(emb_size, emb_size*expansion)
        self.gelu = nn.GELU()
        self.linear2 = nn.Linear(emb_size*expansion, emb_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = self.gelu(x)
        x = self.linear2(x)
        x = self.dropout(x)

        return x
    
class DecoderBlock(nn.Module):
    def __init__(self, emb_size, num_heads, dropout=0.0, expansion=4, use_cross_attn=False):
        super().__init__()
        self.cross_attn = None

        self.self_attn = MultiHeadAttention(emb_size, num_heads, dropout)
        self.norm1 = nn.LayerNorm(emb_size)

        if use_cross_attn:
            self.cross_attn = MultiHeadAttention(emb_size, num_heads, dropout)
            self.norm2 = nn.LayerNorm(emb_size)

        self.norm3 = nn.LayerNorm(emb_size)
        self.ff = FeedForwardBLock(emb_size, expansion, dropout)

        self.dropout = nn.Dropout(dropout)

    def forward(self,
                x: torch.Tensor,
                enc_out: Optional[torch.Tensor] = None,
                tgt_mask: Optional[torch.Tensor] = None):
        if self.cross_attn is None and enc_out is not None:
            raise ValueError("Cross-attention is not enabled in this DecoderBlock, but enc_out was provided.")

        self_attn_out = self.self_attn(x, x, x, tgt_mask)   
        x = x + self.dropout(self_attn_out)                  
        x = self.norm1(x)

        if enc_out is not None and self.cross_attn is not None:
            cross_attn_out = self.cross_attn(x, enc_out, enc_out)
            x = x + self.dropout(cross_attn_out)
            x = self.norm2(x)


        ff_out = self.ff(x)
        x = x + self.dropout(ff_out)
        x = self.norm3(x)

        return x

## Now let's implement our own autoregressive Mini-GPT model

We will now implement our own autoregressive Mini-GPT model using the components we have prepared earlier.

Compared to our previous Transformer implementation, this model introduces one additional parameter — `block_size`, which defines the maximum sequence length that the model can handle. This parameter is crucial for autoregressive models because it determines how many previous tokens the model can consider when predicting the next one.

If we allowed the model to use all previous tokens without any limit, we could quickly run into memory issues, especially for long sequences. The `block_size` acts as a sliding context window to keep computation and memory manageable.
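
As a small illustration of why the window matters, the sketch below crops a token list to the last `block_size` entries and counts how many attention scores a single head needs for different sequence lengths. Both helper functions are hypothetical and exist only for this example.

```python
def crop_context(tokens, block_size):
    """Keep only the last `block_size` tokens (hypothetical helper)."""
    return tokens[-block_size:]

def attention_scores_per_head(seq_len):
    """Number of entries in one head's (seq_len x seq_len) attention matrix."""
    return seq_len * seq_len

tokens = list(range(10))
print(crop_context(tokens, block_size=4))    # [6, 7, 8, 9]
print(crop_context(tokens, block_size=64))   # shorter than the window, so unchanged

# The attention matrix grows quadratically with the sequence length.
for seq_len in (64, 512, 4096):
    print(seq_len, attention_scores_per_head(seq_len))
```

Doubling the context length quadruples the size of each attention matrix, which is why a fixed upper bound on the context is convenient.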

#### Text generation with generate()

We will also implement a generate() method for text generation. This method will allow the model to produce text autoregressively — one token at a time — starting from an initial input sequence.

In addition, we will use a `temperature` parameter to control the randomness of the predictions.
- A higher `temperature` results in more random and creative outputs.
- A lower `temperature` makes the model’s predictions more focused and deterministic.

> How does it work?
> Once we obtain the model’s output logits, we can convert them into probabilities using the softmax function. The temperature is applied by dividing the logits by the temperature value before softmax.

After obtaining the probability distribution, we sample from it to select the next token.

For example, if the logits for the next token are [2.0, 1.0, 0.1] and the temperature is set to 0.5, the scaled logits become [4.0, 2.0, 0.2].
Applying softmax to these scaled values results in a probability distribution that is more concentrated on the highest value — making it more likely that the token corresponding to the largest logit will be selected.
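
The effect can be checked numerically. The following plain-Python sketch (the function name `softmax_with_temperature` is made up for this example) applies softmax to the logits from the example above at several temperatures.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (plain-Python sketch)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.5, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a temperature of 1.0 the largest logit receives a probability of roughly 0.66; at 0.5 this rises to roughly 0.86, and at 2.0 the distribution becomes noticeably flatter.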

- `__init__`:
    - Initializes the parameter `block_size`.
    - Initializes token and positional embeddings.
    - Creates a list of decoder blocks (note that `use_cross_attn=False`, since we use only the decoder architecture).
    - Adds a final linear layer to project the decoder output to the vocabulary size.

- `forward`:
    Similar to the previous Transformer implementation:
    1. The input sequence `src` is passed through the embedding layer, positional encoding, and then through the decoder blocks. Since this is a decoder-only model, we use a `causal mask` and set `enc_out=None`.
    2. The output is projected to the vocabulary dimension using the final linear layer.
    3. We compute the cross-entropy loss inside the model for convenience.
        - If `targets` is None, the loss is set to None.
        - Otherwise, the predicted logits and target tokens are reshaped to match the expected format for `F.cross_entropy`:
            - `logits` -> shape `(batch_size * seq_len, vocab_size)`
            - `targets` -> shape `(batch_size * seq_len)`
    4. The forward() method returns both the logits and the loss.

- `generate`:
    This method generates text autoregressively.
    1. It takes an initial input sequence `src` and generates `max_new_tokens` additional tokens.
    2. For each iteration:
        - Slice the input to ensure it does not exceed `block_size` (`src[:, -self.block_size:]` works even when the sequence is shorter).
        - Create a causal mask for the current sequence length.
        - Pass the current input and mask to the model to obtain logits.
        - Extract the logits for the last token in the sequence (`logits[:, -1, :]`) and divide them by the temperature to adjust randomness.
        - If `top_k` is specified, apply top-k filtering to keep only the top `k` most likely tokens, setting the rest to negative infinity:

            ```python
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
            ```
        - Apply softmax to obtain a probability distribution over the vocabulary.
        - Sample the next token from this distribution using `torch.multinomial`.
        - Concatenate the sampled token to `src` for the next iteration.
    3. Finally, the method returns the extended sequence containing both the original and generated tokens.    
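
The filtering and sampling steps of `generate` can be tried in isolation. This is a minimal sketch with made-up logits for a vocabulary of five tokens; it uses `masked_fill` instead of in-place indexing, which is a stylistic choice of this example rather than something the method requires.

```python
import torch

torch.manual_seed(0)

logits = torch.tensor([[2.0, 1.0, 0.1, -1.0, 3.0]])  # (batch=1, vocab=5), made-up values
top_k = 2

# Keep the k largest logits per row; everything below the k-th largest becomes -inf.
topk_vals, _ = torch.topk(logits, top_k, dim=-1)
filtered = logits.masked_fill(logits < topk_vals[:, [-1]], float("-inf"))

probs = torch.softmax(filtered, dim=-1)               # filtered tokens get probability 0
next_token = torch.multinomial(probs, num_samples=1)  # shape (1, 1)

print(probs)
print(next_token)
```

Only the tokens with indices 4 and 0 (logits 3.0 and 2.0) keep a non-zero probability, so the sampled token is always one of those two.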

In [11]:
# --- MiniGPT model ---
class MiniGPT(nn.Module):
    def __init__(self, voc_size, emb_size=64, num_heads=2, num_layers=1, block_size=32, use_cross_attn=False):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(voc_size, emb_size)
        self.pos_emb   = nn.Embedding(block_size, emb_size)

        self.blocks = nn.ModuleList([
            DecoderBlock(emb_size, num_heads, 0.1, use_cross_attn=use_cross_attn)
            for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(emb_size, voc_size)

    def forward(self, src, targets=None, tgt_mask=None):
        x = self.token_emb(src) + self.pos_emb(torch.arange(src.size(1), device=src.device))
        for block in self.blocks:
            x = block(x, tgt_mask=tgt_mask)
        logits = self.fc_out(x)

        if targets is None:
            return logits, None

        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)  # flatten targets to match the flattened logits
        loss = F.cross_entropy(logits, targets)
        return logits, loss


    @torch.no_grad()  # gradients are not needed during generation
    def generate(self, src, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop src to the last block_size tokens
            src_cut = src[:, -self.block_size:]
            T = src_cut.size(1)
            mask = torch.tril(torch.ones(T, T, device=src.device)).unsqueeze(0)  # causal mask, shape (1, T, T)

            logits, _ = self.forward(src_cut, targets=None, tgt_mask=mask)
            logits = logits[:, -1, :] / temperature  # logits for the last token only

            # top-k filtering
            if top_k is not None:
                topk_vals, _ = torch.topk(logits, top_k, dim=-1)
                mask_topk = logits < topk_vals[:, [-1]]
                logits[mask_topk] = -float('Inf')

            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            src = torch.cat([src, next_token], dim=1)

        return src
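
The reshaping done in `forward` before calling `F.cross_entropy` can be verified on random tensors. This is a small sketch with arbitrary shapes; it also computes the same loss by hand as the mean negative log-probability of the correct tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, C = 2, 5, 7                       # batch size, sequence length, vocabulary size (arbitrary)
logits = torch.randn(B, T, C)           # stand-in for the output of the final linear layer
targets = torch.randint(0, C, (B, T))   # one target token id per position

# F.cross_entropy expects (N, C) logits and (N,) class indices, so batch and time are flattened.
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Equivalent manual computation: mean negative log-probability of the correct tokens.
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss_flat.item(), loss_manual.item())
```

A useful sanity check follows from this view of the loss: a model that predicts a uniform distribution over a vocabulary of size `V` has a loss of `ln(V)`, which is about 4.6 for `V = 100`.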

Now that we have our model implemented, we can proceed to prepare the dataset and train it on some text data. We could use any publicly available text dataset already prepared for language modeling, but that would be too easy. Instead, we’ll create our own small dataset based on the poems of Adam Mickiewicz, one of the greatest Polish poets.

Although the dataset will be relatively small, it should be sufficient for our educational purposes. It will consist of the following works:

1. "Pan Tadeusz" 
2. "Dziady" part II
3. "Dziady" part III
4. "Dziady" part IV

All these poems are available in PDF format from [Wolne Lektury](https://wolnelektury.pl/katalog/autor/adam-mickiewicz/). To read the PDFs and extract the text, we will use the PyMuPDF library (install it first if you do not have it, e.g. with `pip install pymupdf`). To remove unnecessary content such as prefaces, introductions, and appendices, we will skip the first and last page of each PDF document.

Since the texts are in Polish, we must also ensure that our tokenizer correctly handles Polish characters. In some PDFs, Polish letters are extracted incorrectly due to encoding issues (for example, `ł` may appear as `³`), so we map the mis-decoded characters back to the correct Polish letters. Additionally, we will remove page numbers and any other extraneous elements.
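
The cleaning idea can be sketched on a tiny string without any PDF. The example below uses `str.translate` with a small subset of character pairs and drops lines that contain only digits; the sample text and the helper name `clean_text` are invented for illustration.

```python
# Small subset of mis-decoded -> correct character pairs (illustrative only).
FIXES = str.maketrans({"³": "ł", "¿": "ż", "œ": "ś", "¹": "ą"})

def clean_text(raw):
    """Repair mis-decoded letters and drop lines that contain only a page number."""
    fixed = raw.translate(FIXES)
    kept = [line for line in fixed.splitlines() if not line.strip().isdigit()]
    return "\n".join(kept)

sample = "Z³ota ksi¹¿ka\n12\nœwiat"
print(clean_text(sample))
```

`str.translate` performs all single-character replacements in one pass, which is an alternative to calling `replace` once per character.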

To generate the dataset, we will extract the text from the PDFs, clean it by removing unwanted characters and lines, and concatenate all the content into a single string. This text will then serve as the input for creating a custom dataset to train our Mini-GPT model.

Of course, you can use different text data in any language, and I encourage you to try this approach with other works as well. Just make sure the text is in the public domain or that you have the right to use it for your experiments.

In [12]:
import fitz  # PyMuPDF

def read_pdf_fixed(path):
    doc = fitz.open(path)
    text = ""
    for page in doc[1:-1]:                  # skip first and last page
        page_text = page.get_text("text")
        if page_text:
            text += page_text + "\n"

    # mapping of bad characters to good ones
    mapping = {
        "Ê": "Ę", "ê": "ę",
        "Œ": "Ś", "œ": "ś",
        "³": "ł", "£": "Ł",
        "¯": "Ż", "¿": "ż",
        "Ÿ": "Ź", "ÿ": "ź",
        "ñ": "ń", "Ñ": "Ń",
        "¹": "ą", "¡": "Ą",
        "æ": "ć", "Æ": "Ć",
        "¢": "ó",  
    }

    # replace bad characters
    for bad, good in mapping.items():
        text = text.replace(bad, good)

    lines = text.splitlines()
    cleaned = []
    for line in lines:
        if line.strip().isdigit():          # remove page numbers
            continue
        cleaned.append(line)                # append cleaned line
    
    return "\n".join(cleaned)               # return cleaned text


# --- Read and clean text from multiple PDF files ---
text = ""
for file in ["data/tadeusz.pdf", "data/dziady2.pdf", "data/dziady3.pdf", "data/dziady4.pdf"]: 
    text += read_pdf_fixed(file) + "\n"


# --- Print a snippet of the cleaned text and its length ---

print(text[:1000])
print("-" * 20)
print("Size of the text:", len(text))

KSIĘGA PIERWSZA
Gospodarstwo
Powrót panicza — Spotkanie się pierwsze w pokoiku, drugie u stołu — Ważna
Sędziego nauka o grzeczności — Podkomorzego uwagi polityczne nad modami
— Początek sporu o Kusego i Sokoła — Żale Wojskiego — Ostatni Woźny
Trybunału — Rzut oka na ówczesny stan polityczny Litwy i Europy
Litwo! Ojczyzno moja! ty jesteś jak zdrowie:
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
Panno święta, co Jasnej bronisz Częstochowy
I w Ostrej świecisz Bramie! Ty, co gród zamkowy
Nowogródzki ochraniasz z jego wiernym ludem!
Jak mnie dziecko do zdrowia powróciłaś cudem
(Gdy od płaczącej matki, pod Twoją opiekę
Ofiarowany, martwą podniosłem powiekę;
I zaraz mogłem pieszo, do Twych świątyń progu
Iść za wrócone życie podziękować Bogu),
Tak nas powrócisz cudem na Ojczyzny łono.
Tymczasem przenoś moją duszę utęsknioną
Do tych pagórków leśnych, do tych łąk zielonych,
Szeroko nad błękitnym Niemnem rozcią

### Hyperparameters and Data Preparation

In this part, we define the main hyperparameters and prepare the dataset for training our Mini-GPT model.

First, we set parameters such as the batch size, embedding dimension, number of attention heads and layers, and the `block_size`, which defines how many previous tokens the model can see when predicting the next one.

Next, we create functions to **encode** text into integers and **decode** integers back into text.  
We also build our vocabulary by finding all unique characters in the dataset.

Then, we split the data into **training** and **validation** sets (90% / 10%).

Finally, we define the `get_batch()` function, which randomly selects short sequences of tokens from the dataset.  
For each batch:
- `src` is the input sequence,  
- `tgt` is the same sequence shifted one step ahead, so the model learns to predict the next token.
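
Before looking at the full cell, here is a minimal plain-Python sketch of the same ideas on a toy string. The names prefixed with `toy_` are made up for this example, and the start index is fixed here, whereas `get_batch()` draws it at random.

```python
toy_text = "litwo! ojczyzno moja!"          # tiny stand-in corpus, not the real dataset

toy_chars = sorted(set(toy_text))           # vocabulary = all unique characters
toy_char2idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx2char = {i: ch for ch, i in toy_char2idx.items()}

def toy_encode(s):
    return [toy_char2idx[c] for c in s]

def toy_decode(ids):
    return "".join(toy_idx2char[i] for i in ids)

toy_data = toy_encode(toy_text)
toy_block_size = 8
start = 3                                   # fixed here; random in get_batch()

toy_src = toy_data[start : start + toy_block_size]          # model input
toy_tgt = toy_data[start + 1 : start + toy_block_size + 1]  # same window shifted by one token

print(repr(toy_decode(toy_src)))
print(repr(toy_decode(toy_tgt)))
```

At every position `i`, `toy_tgt[i]` is the character that follows `toy_src[i]` in the text, so a single window provides `toy_block_size` training examples.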


In [13]:
# --- Hyperparameters ---

batch_size = 16         # how many independent sequences will we process in parallel?
emb_size = 128          # embedding dimension for each token
num_heads = 8           # number of attention heads in each decoder block
num_layers = 6          # number of decoder blocks
block_size = 64         # maximum context length for predictions

learning_rate = 1e-3    # learning rate
max_iters = 5000        # number of training iterations
eval_interval = 50      # interval for evaluation
eval_iters = 200        # number of iterations for evaluation (not used below; Trainer takes eval_batches instead)

# --- Prepare the encode/decode functions ---

uniq_chars = sorted(list(set(text)))            # get all unique characters that occur in this text
vocab_size = len(uniq_chars)                    # the size of the vocabulary

char2idx = { ch:i for i,ch in enumerate(uniq_chars) }       # mapping from characters to integers
idx2char = { i:ch for i,ch in enumerate(uniq_chars) }       # mapping from integers to characters

encode = lambda seq: [char2idx[c] for c in seq]             # encoder: take a string, output a list of integers
decode = lambda seq: ''.join([idx2char[i] for i in seq])    # decoder: take a list of integers, output a string

# --- Prepare the dataset ---

data = torch.tensor(encode(text), dtype=torch.long)         # encode the entire text dataset and store it in a torch.Tensor
split = int(0.9*len(data)) 
train_data, val_data = data[:split], data[split:]           # split the data into train and validation sets

# --- Function to get a batch of data ---
def get_batch(split):
    data = train_data if split == 'train' else val_data             # choose the dataset
    ix = torch.randint(0, len(data) - block_size, (batch_size,))    # random starting indices for the batch (from 0 to len(data) - block_size)
    src = data[ix.unsqueeze(1) + torch.arange(block_size)]          # get the sequences of length block_size for each index in the batch
    tgt = data[ix.unsqueeze(1) + torch.arange(1, block_size + 1)]   # targets are the same as src but shifted by one position
    return src, tgt

src, tgt = get_batch('train')

print(src[0])
print(tgt[0])

print("-" * 20)

print(decode(src[0].tolist()))
print(decode(tgt[0].tolist()))

tensor([ 1, 66, 52, 47, 69, 52, 62, 69,  1, 58, 61, 87, 68,  1, 69, 87, 58, 63,
        48,  6,  1, 62, 61, 48, 45, 61, 57, 48, 16,  1, 50, 47, 68,  1, 63, 61,
        85, 45, 44, 46, 69, 48,  0, 32, 58, 45, 64, 47, 54, 85,  1, 27, 58, 90,
        46, 52, 64, 62, 69, 54, 58, 66, 62, 54])
tensor([66, 52, 47, 69, 52, 62, 69,  1, 58, 61, 87, 68,  1, 69, 87, 58, 63, 48,
         6,  1, 62, 61, 48, 45, 61, 57, 48, 16,  1, 50, 47, 68,  1, 63, 61, 85,
        45, 44, 46, 69, 48,  0, 32, 58, 45, 64, 47, 54, 85,  1, 27, 58, 90, 46,
        52, 64, 62, 69, 54, 58, 66, 62, 54, 81])
--------------------
 widzisz orły złote, srebrne? gdy trębacze
Pobudkę Kościuszkowsk
widzisz orły złote, srebrne? gdy trębacze
Pobudkę Kościuszkowską


### Model Initialization and Training Setup

Here we create an instance of our **Mini-GPT** model and prepare the optimizer and learning rate scheduler.

- The model is initialized with the vocabulary size and hyperparameters defined earlier, then moved to the selected device (CPU or GPU).
- We use the **AdamW** optimizer, which is well-suited for training transformer-based models because it applies weight decay separately from the gradient update (decoupled weight decay).
- A **Cosine Annealing** learning rate scheduler gradually decreases the learning rate over time (`T_max = max_iters`), helping the model converge more smoothly during training.
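
To get a feeling for the schedule, the closed-form cosine annealing formula can be evaluated directly. This is a plain-Python sketch that assumes a minimum learning rate of 0 (the default `eta_min` of `CosineAnnealingLR`) and reuses the values `learning_rate = 1e-3` and `max_iters = 5000` as defaults; the function name is made up for this example.

```python
import math

def cosine_annealing_lr(step, base_lr=1e-3, t_max=5000, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

for step in (0, 1250, 2500, 3750, 5000):
    print(step, f"{cosine_annealing_lr(step):.6f}")
```

The learning rate starts at `1e-3`, reaches half of that value in the middle of training, and approaches 0 at the last step.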


In [14]:
model = MiniGPT(vocab_size, emb_size, num_heads, num_layers, block_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

In [15]:

def make_causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)


class Trainer():
    def __init__(self, 
                 model,
                 optimizer,
                 scheduler,
                 get_batch,
                 make_causal_mask,
                 device,
                 max_iters=5000,
                 eval_interval=50,
                 eval_batches=10,          # number of batches to average during evaluation
                 name="Patryk Lorenc"):
        
        self.model = model
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.get_batch = get_batch
        self.make_causal_mask = make_causal_mask
        self.device = device
        self.max_iters = max_iters
        self.eval_interval = eval_interval
        self.eval_batches = eval_batches
        self.name = name
        self.best_val_loss = float('inf')

    def training_step(self):
        """Perform one training step on a randomly sampled batch."""
        self.model.train()
        src, target = self.get_batch('train')
        mask = self.make_causal_mask(src.size(1))

        src, target, mask = src.to(self.device), target.to(self.device), mask.to(self.device)

        logits, loss = self.model(src, target, mask)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.scheduler.step()

        return loss.item()

    def evaluate(self):
        """Evaluate model performance on multiple validation batches."""
        self.model.eval()
        total_loss = 0.0

        with torch.no_grad():
            for _ in range(self.eval_batches):
                src, target = self.get_batch('val')
                mask = self.make_causal_mask(src.size(1))
                src, target, mask = src.to(self.device), target.to(self.device), mask.to(self.device)

                _, val_loss = self.model(src, target, mask)
                total_loss += val_loss.item()

        avg_loss = total_loss / self.eval_batches
        return avg_loss

    def train(self):
        """Main training loop with Weights & Biases logging and model checkpointing."""
        wandb.init(
            project="lab5-mini-gpt",
            entity="deep-neural-network-course",
            group="mini-gpt",
            name=self.name,
            settings=wandb.Settings(save_code=False)
        )
        wandb.watch(self.model, log="all")

        for step in range(self.max_iters):
            train_loss = self.training_step()
            wandb.log({"train/loss": train_loss, "lr": self.scheduler.get_last_lr()[0]}, step=step)

            if step % self.eval_interval == 0:
                val_loss = self.evaluate()
                wandb.log({"val/loss": val_loss}, step=step)
                print(f"Step [{step}/{self.max_iters}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

                # Save best model
                if val_loss < self.best_val_loss:
                    self.best_val_loss = val_loss
                    torch.save(self.model.state_dict(), "best_mini_gpt.pth")
                    print(f"New best model saved (Val Loss: {self.best_val_loss:.4f})")
        
        wandb.unwatch(self.model)
        wandb.finish()

trainer = Trainer(model, optimizer, scheduler, get_batch, make_causal_mask, device, max_iters, eval_interval)

In [11]:
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    get_batch=get_batch,
    make_causal_mask=make_causal_mask,
    device=device,
    max_iters=max_iters,
    eval_interval=eval_interval,
    name="Patryk Lorenc"
)

trainer.train()

Step [0/5000], Train Loss: 4.7450, Val Loss: 4.1695
New best model saved (Val Loss: 4.1695)
Step [50/5000], Train Loss: 2.7319, Val Loss: 2.6906
New best model saved (Val Loss: 2.6906)
Step [100/5000], Train Loss: 2.7320, Val Loss: 2.6457
New best model saved (Val Loss: 2.6457)
Step [150/5000], Train Loss: 2.5622, Val Loss: 2.6111
New best model saved (Val Loss: 2.6111)
Step [200/5000], Train Loss: 2.6070, Val Loss: 2.6057
New best model saved (Val Loss: 2.6057)
Step [250/5000], Train Loss: 2.5215, Val Loss: 2.5652
New best model saved (Val Loss: 2.5652)
Step [300/5000], Train Loss: 2.7155, Val Loss: 2.5709
Step [350/5000], Train Loss: 2.5033, Val Loss: 2.5393
New best model saved (Val Loss: 2.5393)
Step [400/5000], Train Loss: 2.4787, Val Loss: 2.4739
New best model saved (Val Loss: 2.4739)
Step [450/5000], Train Loss: 2.5215, Val Loss: 2.4841
Step [500/5000], Train Loss: 2.4420, Val Loss: 2.4158
New best model saved (Val Loss: 2.4158)
Step [550/5000], Train Loss: 2.4348, Val Loss: 2.

Run history (W&B sparklines omitted): lr, train/loss, val/loss

Run summary:
lr: 0.0
train/loss: 1.75381
val/loss: 1.84115


In [18]:
# generate from the model
model.load_state_dict(torch.load("best_mini_gpt.pth", map_location=device))  # load the best checkpoint onto the current device
model.eval()  # disable dropout for generation

context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = model.generate(context, max_new_tokens=1000, temperature=0.5, top_k=2)[0].tolist()
print(decode(generated))


W koniec podobnie siedzie z niebie,
Potem szeli się w polski zabardzieli się z nie pod drogiej wyradzie,
Podkomorzył się z przed do siedział i pod się z polskiej wiedzie,
A nie pod nie stoła szaliwy się przyjecie,
W sobie w dobrze po starony wyrządził w szlachtym w polskiem.
Przeciwszy się z pomierzył w do szlachty,
A w tam nie podamienie, jak za na drugiem się z pod na skrzydła,
A który się z pod starzenia zamku wystrzymał i podawie,
I sztuki się z na podarze i nim się nie podobny,
A wyszał się w police w przed nie wielkiej widoście,
W powiedział się w kolec wieszczone wieściej w pod się wiedzieć,
Pod podawienia w powodał się w polskiej stary.
Widział szczęśli z nim za przystał się z przeciesia,
A to przecież na przyjacie się pod nie pod drogiem,
A w polskiego w krzykładnie w szary się nie przeczą.
Wyszczak wystać w drogi się w drugi strzelach,
W kończnie szlachta z nimię w podkomowił,
I w koło się stary się w dzieciał się po szlachcia,
A tak pod pod stronie pod na pod strzelach się 

In [19]:
input_text = "Litwo! Ojczyzno moja!"
context = torch.tensor(encode(input_text), dtype=torch.long, device=device).unsqueeze(0)
generated = model.generate(context, max_new_tokens=500, temperature=0.7, top_k=10)[0].tolist()
print(decode(generated))

Litwo! Ojczyzno moja!
Jedyli szpodzie zając w teraki do ma słowo
Już nie podli wyskocie, przebany wszystkie do wiem,
Jeśli przed szlicy zacze obrocinie,
Odgród z tyłowa otwarzego w zgodziem drogiem
Jak odgrębić dotko się w przostawie, i z tyle ostatnie
Wzneterze szarska przy walce kochania i skoroszy
W nim wierzach znasz i szalał się szczernie,
To na całe polskach, przed grzeby do woli osłące

Jakim niemi polisz i pełnie skroni, a nie w nie zabajem
I strona wszystkiem wystał na przeciesiał pierany,
I stojąc w to swo


In [20]:
print("Model architecture:")
print(model)

print("Number of parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad) / 1_000_000, "M")

Model architecture:
MiniGPT(
  (token_emb): Embedding(100, 128)
  (pos_emb): Embedding(64, 128)
  (blocks): ModuleList(
    (0-5): 6 x DecoderBlock(
      (self_attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-7): 8 x Head(
            (key): Linear(in_features=128, out_features=16, bias=False)
            (query): Linear(in_features=128, out_features=16, bias=False)
            (value): Linear(in_features=128, out_features=16, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (linear): Linear(in_features=128, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (ff): FeedForwardBLock(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (gelu): GELU(approximate='none')
        (linear2): Linear(in_features=512, out_featu

As we can see, we now have our own Mini-GPT model capable of generating Polish-looking text with recognizable words and verse structure, despite being trained on a small dataset. The model itself is quite compact: with only about **1.2 million parameters**, it can be easily trained on a single GPU.
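
The parameter count can also be reproduced by hand from the layer sizes. The sketch below is plain arithmetic; the helper names are invented for this example, and the default values are read off the hyperparameters and the printed architecture (vocabulary of 100 characters, no cross-attention, so only `norm1` and `norm3` exist).

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of an nn.Linear layer."""
    return n_in * n_out + (n_out if bias else 0)

def count_mini_gpt_params(vocab=100, emb=128, block=64, heads=8, layers=6, expansion=4):
    """Count trainable parameters of the Mini-GPT configuration used in this lab."""
    head = emb // heads
    attention = heads * 3 * linear_params(emb, head, bias=False)  # key, query, value per head
    attention += linear_params(emb, emb)                          # output projection
    layer_norms = 2 * (2 * emb)                                   # norm1 and norm3: weight + bias
    feed_forward = linear_params(emb, emb * expansion) + linear_params(emb * expansion, emb)
    per_block = attention + layer_norms + feed_forward
    embeddings = vocab * emb + block * emb                        # token + positional embeddings
    return embeddings + layers * per_block + linear_params(emb, vocab)

print(count_mini_gpt_params())  # 1221220
```

Almost all of the parameters sit in the six decoder blocks; the embeddings and the output layer together contribute fewer than 35 thousand.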

For comparison, the original GPT-3 model contains around **175 billion parameters**, while GPT-4 is unofficially estimated to have about **1.8 trillion parameters** (OpenAI has not published the figure; if the reported Mixture of Experts architecture is accurate, only a subset of the parameters is used during inference). This highlights just how much smaller and lighter our implementation is.

One potential improvement would be to cache the key and value projections of previously processed tokens during text generation (a KV cache). This would allow the model to avoid recomputing them for the whole context at each step, which is a significant optimization for longer sequences. However, for the sake of simplicity, we did not implement this feature here.
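
As a rough illustration of the caching idea, the sketch below implements a single attention head with random stand-in weights (not the `Head` class from this lab) and checks that processing one token at a time with cached keys and values gives the same result as recomputing full causal attention. All names and sizes are assumptions made for this example.

```python
import math
import torch

torch.manual_seed(0)
d = 8                                                   # head size (arbitrary)
Wq, Wk, Wv = (torch.randn(d, d, dtype=torch.float64) for _ in range(3))  # stand-in projections
x = torch.randn(5, d, dtype=torch.float64)              # embeddings of a 5-token sequence

def full_causal_attention(x):
    """Recompute attention for every position from scratch."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / math.sqrt(d)
    mask = torch.tril(torch.ones(len(x), len(x)))
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

# Incremental version: keep the keys and values of earlier tokens and
# compute only the query, key and value of the newest token.
k_cache, v_cache, outputs = [], [], []
for t in range(len(x)):
    x_t = x[t : t + 1]                                  # (1, d): only the newest token
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    K, V = torch.cat(k_cache), torch.cat(v_cache)       # (t + 1, d)
    attn = torch.softmax((x_t @ Wq) @ K.T / math.sqrt(d), dim=-1)
    outputs.append(attn @ V)

cached = torch.cat(outputs)
print(torch.allclose(cached, full_causal_attention(x)))
```

No causal mask is needed in the incremental loop, because the cache contains only past and current tokens. Note that with learned absolute positional embeddings and a cropped context, as in our model, such a cache stays valid only while the sequence fits into `block_size`; once the window starts sliding, the positions of the cached tokens change.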