In [1]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

from transformers import GPT2TokenizerFast

# Use GPT-2 tokeniser so we don't have to build one from scratch yet.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
vocab_size = tokenizer.vocab_size



-----


This code imports the libraries used in the rest of the notebook and loads a ready-made tokenizer for turning text into numbers.

*   **`import math`**: This line brings in a collection of math tools, like how to calculate square roots.
*   **`import torch`**: This brings in a big tool called 'PyTorch,' which is good for making smart computer programs (like AI).
*   **`import torch.nn as nn`**: This part of PyTorch helps build the 'brain' (neural network) for our smart program.
*   **`import torch.nn.functional as F`**: These are more useful functions for the 'brain' of our program.
*   **`from transformers import GPT2TokenizerFast`**: This line gets a special tool from the Hugging Face library called 'transformers.' This tool breaks text into small pieces called tokens, which can be whole words or parts of words.
*   **`tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")`**: This loads the tokenizer that the famous 'GPT2' program uses, so our text is broken into pieces in exactly the same way. Only the tokenizer is loaded here; the GPT-2 model itself is not used.
*   **`vocab_size = tokenizer.vocab_size`**: This finds out how many different tokens the 'GPT2' tokenizer knows (50,257). It's like counting all the entries in its dictionary.

-----

In [2]:
# Model shape
block_size = 128     # max context length (how many tokens the model can "see")
n_embd     = 256     # embedding size (vector size for each token)
n_head     = 8       # attention heads
n_layer    = 4       # number of transformer blocks
dropout    = 0.1

# Training
batch_size = 8
lr         = 3e-4
steps      = 2000

device = "cuda" if torch.cuda.is_available() else "cpu"


---


This part of the code sets up some important numbers that control how our smart computer program (the AI model) will be built and trained.

### Model Shape (How the AI is Built):
*   **`block_size = 128`**: This means the AI can look at 128 pieces of text (tokens) at a time to understand what's happening. It's like its short-term memory length.
*   **`n_embd = 256`**: Each piece of text (token) will be turned into a special code (vector) that is 256 numbers long. This helps the computer understand the meaning of each token.
*   **`n_head = 8`**: Our AI has 8 different 'attention' parts. Each part helps the AI focus on different important words in the text at the same time.
*   **`n_layer = 4`**: The AI has 4 main layers, like levels in a building. Each layer helps process information more deeply.
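*   **`dropout = 0.1`**: This is a trick that helps the AI generalise instead of memorising. During training, each value passing through a dropout layer has a 10% chance of being set to zero. This prevents the AI from relying too much on any single connection. Dropout is switched off when the model is evaluated or used to generate text.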

### Training (How the AI Learns):
*   **`batch_size = 8`**: The AI will learn from 8 chunks of text at once (together they form one batch) before updating its knowledge.
*   **`lr = 3e-4`**: This is the 'learning rate.' It's a small number (0.0003) that tells the AI how big of a step to take when adjusting its knowledge during learning.
*   **`steps = 2000`**: The AI will make 2000 learning updates, each one using a fresh random batch.

### Device (Where the AI Runs):
*   **`device = "cuda" if torch.cuda.is_available() else "cpu"`**: This checks if your computer has a special fast chip called a 'GPU' (often called 'cuda' for NVIDIA GPUs). If it does, the AI will use that to learn much faster. If not, it will use the main computer chip ('cpu').
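To make the `dropout = 0.1` setting more concrete, here is a small sketch of how a PyTorch dropout layer behaves in training mode and in evaluation mode. The tensor of ones and the seed are assumptions chosen only for illustration; they are not part of the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.1)   # same rate as the `dropout` setting above
x = torch.ones(10_000)     # toy input, chosen so the effect is easy to see

drop.train()               # training mode: dropout is active
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
# Values that survive are scaled by 1 / (1 - p) so the expected sum stays the same
kept_value = y_train[y_train != 0][0].item()

drop.eval()                # evaluation mode: dropout does nothing
y_eval = drop(x)

print(f"fraction zeroed in train mode: {zero_fraction:.3f}")
print(f"value of surviving entries:    {kept_value:.4f}")
print(f"eval output equals input:      {torch.equal(y_eval, x)}")
```

Roughly 10% of the values are zeroed in training mode, and nothing changes in evaluation mode. This is why the later cells call `model.train()` and `model.eval()` before running the model.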

-----

In [3]:
# Read text
with open("/content/xasan.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Convert to token IDs
data = torch.tensor(tokenizer.encode(text), dtype=torch.long)

# Train/val split
n = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]


Token indices sequence length is longer than the specified maximum sequence length for this model (7449 > 1024). Running this sequence through the model will result in indexing errors


---

This code is about getting the text ready for our AI model to learn from.

### Reading and Preparing Text:
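*   **`with open("/content/xasan.txt", "r", encoding="utf-8") as f: text = f.read()`**: This line opens the file `/content/xasan.txt` (`/content` is the working folder of a Colab session), reads all the text inside it, and puts it into a variable called `text`.
*   **`data = torch.tensor(tokenizer.encode(text), dtype=torch.long)`**: Here, the `tokenizer` (the tool we set up earlier) breaks all of `text` into tokens and replaces each token with its ID number. These numbers are like a secret code that the computer understands. Then, these numbers are stored as `data` in a special format called a `torch.tensor`. The warning printed above (`7449 > 1024`) only says that the text is longer than GPT-2's own 1024-token limit. It can be ignored here, because we never feed the whole text to a model at once; we cut it into chunks of `block_size` tokens.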

### Splitting Data for Learning and Testing:
*   **`n = int(0.9 * len(data))`**: This calculates a point, `n`, which is 90% of the way through our `data` (the list of numbers).
*   **`train_data = data[:n]`**: The first 90% of the data is put into `train_data`. This is the part the AI will use to learn.
*   **`val_data = data[n:]`**: The last 10% of the data is put into `val_data`. This is the part the AI will use to test itself, to see how well it has learned without ever having seen this part before.
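The split sizes can be worked out by hand. The sketch below uses the 7,449 tokens reported in the tokenizer warning above, together with `block_size`, `batch_size` and `steps` from the earlier cell. The "passes" figure is only a rough estimate, because batches are sampled at random and overlap each other.

```python
# Numbers taken from the notebook: 7,449 tokens (tokenizer warning),
# block_size = 128, batch_size = 8, steps = 2000.
total_tokens = 7449
block_size, batch_size, steps = 128, 8, 2000

n_train = int(0.9 * total_tokens)        # size of train_data
n_val = total_tokens - n_train           # size of val_data

tokens_per_step = batch_size * block_size
tokens_seen = steps * tokens_per_step    # tokens processed over the whole run
passes = tokens_seen / n_train           # rough number of passes over train_data

print(f"train tokens: {n_train}, val tokens: {n_val}")
print(f"tokens per step: {tokens_per_step}, tokens seen in training: {tokens_seen:,}")
print(f"approximate passes over the training data: {passes:.0f}")
```

With such a small training set, the model sees the same text hundreds of times, which is worth keeping in mind when reading the training and validation losses later.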

In [4]:
def get_batch(split: str):
    source = train_data if split == "train" else val_data

    # Random starting positions
    ix = torch.randint(0, len(source) - block_size - 1, (batch_size,))

    # Build batch by slicing
    x = torch.stack([source[i : i + block_size] for i in ix])
    y = torch.stack([source[i + 1 : i + 1 + block_size] for i in ix])

    return x.to(device), y.to(device)


### What this function does (simple explanation)

- Picks either the **training** data or the **validation** data based on the input.
- Randomly chooses several starting positions in the data.
- From each position, it takes a short continuous chunk of tokens.
- These chunks become the **input batch**.
- The **target batch** is the same chunks, but shifted one step forward.
- Both inputs and targets are moved to the chosen device (CPU or GPU).
- Returns the input–target pair ready for model training.
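The "shifted one step forward" idea is easier to see on toy data. In this sketch the "tokens" are simply the numbers 0 to 49, and `toy_get_batch`, `toy_block_size` and `toy_batch_size` are hypothetical stand-ins for the notebook's `get_batch`, `block_size` and `batch_size`.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (assumptions for illustration): the "tokens" are just 0..49
toy_data = torch.arange(50)
toy_block_size = 5
toy_batch_size = 3

def toy_get_batch(source, block_size, batch_size):
    # Same slicing logic as get_batch, without the train/val switch or device move
    ix = torch.randint(0, len(source) - block_size - 1, (batch_size,))
    x = torch.stack([source[i : i + block_size] for i in ix])
    y = torch.stack([source[i + 1 : i + 1 + block_size] for i in ix])
    return x, y

x, y = toy_get_batch(toy_data, toy_block_size, toy_batch_size)
print("inputs:\n", x)
print("targets:\n", y)
```

Because the toy data counts upward, every target is exactly the input plus one, which shows that position `t` of `y` holds the token that follows position `t` of `x`.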


---------

In [5]:
class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        assert n_embd % n_head == 0

        self.n_head = n_head
        self.head_dim = n_embd // n_head

        # One layer to produce Q, K, V together
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.drop = nn.Dropout(dropout)

        # Causal mask: lower-triangular matrix
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape  # Batch, Time, Channels(=n_embd)

        qkv = self.qkv(x)                       # (B, T, 3C)
        q, k, v = qkv.split(C, dim=2)           # each (B, T, C)

        # reshape into heads: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # attention scores: (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # apply causal mask (only allow looking backward)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))

        # softmax -> probabilities
        att = F.softmax(att, dim=-1)
        att = self.drop(att)

        # weighted sum of values -> (B, n_head, T, head_dim)
        out = att @ v

        # merge heads back: (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.drop(self.proj(out))
        return out


---

This code defines a special part of our AI's 'brain' called `CausalSelfAttention`. Think of it like this:

Imagine the AI is reading a sentence. When it tries to understand a word, it doesn't just look at that word; it also looks at *all the words before it* to get context. This `CausalSelfAttention` mechanism helps the AI figure out how important each previous word is to the current word it's focusing on.

### How it works (simplified):

**`__init__` (Setting things up):**
*   It gets details like the size of the word codes (`n_embd`), how many 'focus points' the AI has (`n_head`), how many words it can look back at (`block_size`), and a `dropout` value to prevent overthinking.
*   It creates special layers (`qkv`, `proj`, `drop`) that help transform the word codes.
*   It prepares a 'mask' (`self.register_buffer("mask", mask)`) that makes sure the AI *only* looks at words that came before the current word, not future words. This is what 'causal' means – it respects the order of events.

**`forward` (Processing the words):**
*   **Input `x`**: This is the batch of words (as numerical codes) the AI is currently processing.
*   **`q, k, v = qkv.split(C, dim=2)`**: The input words are first transformed into three different versions: 'query' (Q), 'key' (K), and 'value' (V). Think of Q as what you're looking for, K as what's available, and V as the actual information.
*   **`q = q.view(...)`, `k = k.view(...)`, `v = v.view(...)`**: These lines rearrange the Q, K, V versions so that each of the AI's 'focus points' (`n_head`) can work on them independently.
*   **`att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)`**: This is where the AI calculates 'attention scores'. It compares each 'query' (Q) with every 'key' (K) in the sequence. The higher the score, the more relevant that word is. Dividing by the square root of `head_dim` keeps the scores from growing too large.
*   **`att = att.masked_fill(...)`**: This applies the 'causal mask'. Scores for words that come *after* the current word are replaced with negative infinity, so they end up with zero weight in the next step.
*   **`att = F.softmax(att, dim=-1)`**: The attention scores are turned into probabilities, so they sum up to 1. This means the AI has a clear idea of how much 'attention' to give to each previous word.
*   **`out = att @ v`**: The AI then takes a weighted average of the 'values' (V) of the previous words, using the calculated attention probabilities. This creates a new, enriched representation of the current word, incorporating relevant context.
*   **`out = out.transpose(...)`**: Finally, the results from all the independent 'focus points' are combined back together.
*   **`out = self.drop(self.proj(out))`**: The combined result is passed through a final layer and dropout, producing the output for this attention step.

In essence, this block helps the AI intelligently combine information from earlier parts of the text to better understand the current part, but *only* from the past, not the future.
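The masking and softmax steps can be tried on a tiny example. This is a minimal sketch with a single head and made-up sizes (`T = 4`, `head_dim = 8`), not the notebook's full module.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, head_dim = 4, 8                       # toy sizes, chosen for readability
q = torch.randn(T, head_dim)             # one query vector per position
k = torch.randn(T, head_dim)             # one key vector per position

scores = (q @ k.T) / math.sqrt(head_dim)             # (T, T) raw attention scores
mask = torch.tril(torch.ones(T, T))                  # 1 on and below the diagonal
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)                  # each row becomes probabilities

print(weights)
```

Row `t` of `weights` shows how much position `t` attends to each position. Everything above the diagonal is zero, so no position can use information from later positions, and the first position can only attend to itself.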

--------

In [7]:
class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


---

This code defines another important building block for our AI's 'brain' called `FeedForward`.

After the AI has used its 'attention' (from `CausalSelfAttention`) to understand how different words relate to each other, this `FeedForward` part helps it think more deeply about each individual word's meaning.

### How it works (simplified):

**`__init__` (Setting things up):**
*   It takes in the size of the word codes (`n_embd`) and a `dropout` value.
*   It creates a sequence of layers (`self.net`) that will process the word codes:
    *   **`nn.Linear(n_embd, 4 * n_embd)`**: This is like a 'thought expander'. It takes the word's code and makes it much longer (4 times its original size). This gives the AI more room to process complex ideas about the word.
    *   **`nn.GELU()`**: This is a special 'thinking' step that adds non-linearity, allowing the AI to learn more complex patterns than just simple additions and multiplications.
    *   **`nn.Linear(4 * n_embd, n_embd)`**: This is a 'thought compressor'. After expanding and thinking, it brings the word's code back to its original size. It's like distilling the refined meaning of the word.
    *   **`nn.Dropout(dropout)`**: Just like before, this randomly turns off some connections to prevent the AI from overthinking or memorizing specific examples too much.

**`forward` (Processing the words):**
*   When `FeedForward` receives a word's processed code (`x`), it simply passes it through the sequence of layers (`self.net`) it set up. This allows the AI to perform a quick, independent 'analysis' on each word's representation before moving on.

In short, this `FeedForward` block helps the AI enrich the understanding of each word by first expanding its representation, applying a non-linear transformation, and then compressing it back, making the word's meaning more robust and informative.
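One way to check the "independent analysis on each word" claim is to run the same layers on a whole batch and on a single token and compare the results. The sketch below rebuilds the expand, GELU, compress sequence with a small made-up `n_embd` of 16 and leaves dropout out so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embd = 16                                  # toy size, smaller than the notebook's 256
net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),           # expand
    nn.GELU(),                               # non-linearity
    nn.Linear(4 * n_embd, n_embd),           # compress
)

x = torch.randn(2, 5, n_embd)                # (batch, tokens, n_embd)
with torch.no_grad():
    full = net(x)                            # all tokens at once
    one_token = net(x[0, 3])                 # a single token on its own

same = torch.allclose(full[0, 3], one_token, atol=1e-5)
print(full.shape)
print("single token matches its row in the batch:", same)
```

The output has the same shape as the input, and each token's result does not depend on the other tokens. Mixing information between tokens is left entirely to the attention layer.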

-------

In [8]:
class Block(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = FeedForward(n_embd, dropout)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual connection
        x = x + self.ff(self.ln2(x))    # residual connection
        return x


---

## The `Block`: One Full Thinking Unit

This code defines a **`Block`** — a core building unit in the model.  
If you’re thinking in Transformer terms, this is one complete **reasoning step** in the network.

A `Block` combines:
- **Causal Self-Attention** (context awareness)
- **FeedForward networks** (per-token reasoning)
- **Layer Normalisation** (stability)
- **Residual connections** (information preservation)

Together, these make the model deep *without* making it fragile.

---

## What happens inside the Block

### 1. `__init__` — wiring the machinery

The constructor sets up everything the block needs:

- **Inputs**
  - `n_embd`: embedding size (how rich each token’s representation is)
  - `n_head`: number of attention heads
  - `block_size`: maximum context length
  - `dropout`: regularisation strength

- **Components**
  - `ln1 = nn.LayerNorm(n_embd)`  
    Normalises the input before attention. This keeps activations well-behaved and training stable.
  - `attn = CausalSelfAttention(...)`  
    Lets each token selectively attend to relevant *past* tokens.
  - `ln2 = nn.LayerNorm(n_embd)`  
    Another normalisation pass, this time before deeper processing.
  - `ff = FeedForward(...)`  
    Applies non-linear, per-token transformation — essentially the model “thinking harder” about each word.

---

### 2. `forward` — doing the actual work

This is where the data flows:

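- **Attention + residual**
  ```python
  x = x + self.attn(self.ln1(x))
  ```
  - Input is normalised
  - Passed through causal self-attention
  - Added back to the original input (the residual connection)

- **FeedForward + residual**
  ```python
  x = x + self.ff(self.ln2(x))
  ```
  - Output of the attention step is normalised again
  - Passed through the FeedForward network
  - Added back to preserve the signal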

------

### Why this design works

A Block:

- Stabilises signals with layer normalisation
- Understands context with attention
- Deepens meaning with FeedForward layers
- Preserves information using residual connections

Stack many of these blocks, and you get a model that is deep, expressive, and trainable — exactly what modern language models rely on.
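The "stabilises signals" point can be illustrated with a freshly created `nn.LayerNorm`. In this sketch the input is deliberately shifted and scaled (mean near 5, spread near 3); those numbers are assumptions picked only to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embd = 256
ln = nn.LayerNorm(n_embd)

# Deliberately off-centre, large-scale input: (batch, tokens, n_embd)
x = 5.0 + 3.0 * torch.randn(2, 4, n_embd)
y = ln(x)

means = y.mean(dim=-1)                       # one mean per token
stds = y.std(dim=-1, unbiased=False)         # one standard deviation per token

print("per-token mean before:", x.mean(dim=-1)[0])
print("per-token mean after: ", means[0])
print("per-token std after:  ", stds[0])
```

Each token's vector is rescaled on its own to a mean of about 0 and a standard deviation of about 1. The layer also has a learnable scale and shift, which start at 1 and 0, so a trained model can move away from these values.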

In [9]:
class SmallLanguageModel(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, n_head, n_layer, dropout):
        super().__init__()
        self.block_size = block_size

        # Token + position embeddings
        self.token_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb   = nn.Embedding(block_size, n_embd)

        # Transformer blocks
        self.blocks = nn.Sequential(*[
            Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)
        ])

        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # positions: 0..T-1
        pos = torch.arange(0, T, device=idx.device)

        x = self.token_emb(idx) + self.pos_emb(pos)  # (B, T, n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)                        # (B, T, vocab_size)

        # If targets given, compute loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, vocab_size),
                targets.view(-1)
            )

        return logits, loss


---

## `SmallLanguageModel`: A Minimal Transformer Language Model

This class defines **`SmallLanguageModel`**, a compact but complete Transformer-based language model. It takes token indices as input and predicts the next token at every position.

Conceptually, it is made of:
- Token and position embeddings
- A stack of Transformer `Block`s
- A final normalisation and projection to vocabulary logits

Despite being “small”, this structure mirrors the core design of modern large language models.

---

## Model structure

### `__init__` — building the model

The constructor assembles all major components:

- **Parameters**
  - `vocab_size`: number of unique tokens in the vocabulary
  - `block_size`: maximum context length
  - `n_embd`: embedding dimension
  - `n_head`: number of attention heads per block
  - `n_layer`: number of Transformer blocks
  - `dropout`: regularisation strength

- **Embeddings**
  - `self.token_emb = nn.Embedding(vocab_size, n_embd)`  
    Converts token IDs into dense vector representations.
  - `self.pos_emb = nn.Embedding(block_size, n_embd)`  
    Encodes positional information so the model knows token order.

- **Transformer stack**
  - `self.blocks = nn.Sequential(...)`  
    A stack of `n_layer` identical `Block`s, each refining the representation using attention and FeedForward layers.

- **Output layers**
  - `self.ln_f = nn.LayerNorm(n_embd)`  
    Final normalisation before prediction.
  - `self.head = nn.Linear(n_embd, vocab_size)`  
    Projects embeddings to vocabulary-sized logits.

---

## Forward pass

### `forward(self, idx, targets=None)`

- **Inputs**
  - `idx`: token indices of shape `(B, T)`
  - `targets` (optional): ground-truth tokens for loss computation

- **Step-by-step flow**

  - Extract batch and sequence length:
    ```python
    B, T = idx.shape
    ```

  - Create positional indices:
    ```python
    pos = torch.arange(0, T, device=idx.device)
    ```

  - Combine token and position embeddings:
    ```python
    x = self.token_emb(idx) + self.pos_emb(pos)
    ```
    This produces a `(B, T, n_embd)` tensor representing both meaning and position.

  - Pass through Transformer blocks:
    ```python
    x = self.blocks(x)
    ```

  - Apply final normalisation:
    ```python
    x = self.ln_f(x)
    ```

  - Project to vocabulary logits:
    ```python
    logits = self.head(x)
    ```
    Shape: `(B, T, vocab_size)`

---

## Loss computation (optional)

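If `targets` are provided, the model computes cross-entropy loss. `F.cross_entropy` expects one row of scores per prediction, so the logits are flattened from `(B, T, vocab_size)` to `(B*T, vocab_size)` and the targets from `(B, T)` to `(B*T)`. Note that `vocab_size` inside `forward` is read from the notebook-level variable, not from the constructor argument: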

```python
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)

```
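A useful sanity check for this loss: a model that knows nothing and gives every token the same probability has a cross-entropy of ln(vocab_size). The sketch below assumes `vocab_size` is 50,257, the size of the GPT-2 tokenizer's vocabulary.

```python
import math

vocab_size = 50257                      # assumed value of tokenizer.vocab_size for GPT-2

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = math.log(vocab_size)

print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

The first training loss logged later in the notebook (10.9717 at step 0) is close to this value, which is what one would expect from a model with random initial weights.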


## Output

```python
return logits, loss
```

- `logits`: raw prediction scores for every token in the vocabulary, at each position
- `loss`: the training loss, or `None` when no `targets` are passed

----

## Summary

`SmallLanguageModel` embeds tokens, injects positional information, refines representations through stacked Transformer blocks, and predicts the next token at every position. It is a minimal yet faithful implementation of a modern autoregressive language model.
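To get a feel for the model's size, the parameters of each layer described above can be counted by hand. This is a back-of-the-envelope sketch: `count_parameters` is a hypothetical helper, and `vocab_size=50257` is an assumption based on the GPT-2 tokenizer. The causal mask is a buffer, not a parameter, so it is left out.

```python
def count_parameters(vocab_size, block_size, n_embd, n_layer):
    """Count trainable parameters of the architecture described above."""
    token_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd                                   # weight + bias

    # Attention: qkv linear (n_embd -> 3*n_embd) and proj linear (n_embd -> n_embd)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # FeedForward: n_embd -> 4*n_embd -> n_embd
    ff = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    block = 2 * layer_norm + attn + ff                        # ln1, attn, ln2, ff
    head = n_embd * vocab_size + vocab_size                   # weight + bias
    return token_emb + pos_emb + n_layer * block + layer_norm + head

total = count_parameters(vocab_size=50257, block_size=128, n_embd=256, n_layer=4)
print(f"total parameters: {total:,}")
```

The number of heads does not appear in the count, because splitting into heads only reshapes the same Q, K and V projections. Most of the parameters sit in the token embedding and the output head, since both scale with the vocabulary size. In the notebook itself, `sum(p.numel() for p in model.parameters())` would give the count directly.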

In [10]:
model = SmallLanguageModel(
    vocab_size=vocab_size,
    block_size=block_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    dropout=dropout
).to(device)

optim = torch.optim.AdamW(model.parameters(), lr=lr)

for step in range(steps):
    model.train()
    x, y = get_batch("train")

    logits, loss = model(x, y)

    optim.zero_grad(set_to_none=True)
    loss.backward()
    optim.step()

    if step % 100 == 0:
        model.eval()
        with torch.no_grad():
            vx, vy = get_batch("val")
            _, vloss = model(vx, vy)
        print(f"step {step} | train loss {loss.item():.4f} | val loss {vloss.item():.4f}")


step 0 | train loss 10.9717 | val loss 10.9417
step 100 | train loss 5.7057 | val loss 7.9696
step 200 | train loss 4.1628 | val loss 7.5968
step 300 | train loss 3.0546 | val loss 8.4096
step 400 | train loss 2.1384 | val loss 9.2352
step 500 | train loss 1.7609 | val loss 8.5423
step 600 | train loss 1.2647 | val loss 9.3160
step 700 | train loss 0.9250 | val loss 9.3784
step 800 | train loss 0.5399 | val loss 9.1878
step 900 | train loss 0.3689 | val loss 9.9592
step 1000 | train loss 0.2546 | val loss 9.5096
step 1100 | train loss 0.2093 | val loss 10.3400
step 1200 | train loss 0.1582 | val loss 10.2176
step 1300 | train loss 0.1195 | val loss 10.1162
step 1400 | train loss 0.1250 | val loss 9.9437
step 1500 | train loss 0.1185 | val loss 10.0561
step 1600 | train loss 0.0991 | val loss 10.6218
step 1700 | train loss 0.0957 | val loss 10.5738
step 1800 | train loss 0.0679 | val loss 9.8559
step 1900 | train loss 0.0937 | val loss 10.7936


---

This code block is where we bring our `SmallLanguageModel` to life and teach it to understand language. It sets up the model, prepares how it will learn, and then runs a training process.

### 1. **`model = SmallLanguageModel(...)`** (Creating the AI):
*   This line creates an actual instance of our `SmallLanguageModel` using all the settings we defined earlier (like `vocab_size`, `block_size`, `n_embd`, `n_head`, `n_layer`, `dropout`).
*   `.to(device)`: This moves the entire model to the selected device, either the super-fast 'cuda' (GPU) if available, or the 'cpu' (main processor).

### 2. **`optim = torch.optim.AdamW(model.parameters(), lr=lr)`** (The Learning Coach):
*   This sets up an 'optimizer' named `AdamW`. Think of this as the coach who helps the AI adjust its 'brain' to learn better.
*   `model.parameters()`: These are all the adjustable parts of the AI's brain that the coach will help tune.
*   `lr=lr`: This is the 'learning rate' (a small number like 0.0003) which tells the coach how big of a step to take when making adjustments.

### 3. **`for step in range(steps):`** (The Training Sessions):
*   This starts the main training loop, which runs `steps` times (2000 here).

    *   **`model.train()`**: This puts the model in 'training mode'. Some layers (like `dropout`) behave differently during training.

    *   **`x, y = get_batch("train")`**: We get a batch of training data. `x` is the input (the text the model sees), and `y` is the target (what the model should predict next).

    *   **`logits, loss = model(x, y)`**: The model looks at `x`, tries to predict `y`, and then calculates how wrong it was. This 'how wrong' is called the `loss`.

    *   **`optim.zero_grad(set_to_none=True)`**: Before the coach makes new adjustments, it clears out any old instructions from the previous step.

    *   **`loss.backward()`**: This is where the AI figures out how each part of its 'brain' contributed to the `loss` (how wrong it was). It calculates 'gradients'.

    *   **`optim.step()`**: The coach uses these gradients to update the AI's brain, making it a little better at predicting.

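    *   **`if step % 100 == 0:`** (Checking Progress):
        *   Every 100 steps, the model pauses training to check how well it's doing.
        *   **`model.eval()`**: It switches to 'evaluation mode', where `dropout` layers are turned off because we want consistent predictions.
        *   **`with torch.no_grad():`**: No gradients are tracked during the check, which saves memory and time because nothing is being learned here.
        *   **`vx, vy = get_batch("val")`** and **`_, vloss = model(vx, vy)`**: The model is tested on one random batch from the validation data, which it never trains on.
        *   **`print(...)`**: The step number, the training loss, and the validation loss are printed.

### 4. Reading the printed losses:
*   The training loss falls from 10.97 to around 0.1, but the validation loss reaches its lowest value (7.60) at step 200 and then climbs back above 10. This gap is the usual sign of overfitting: the model is memorising the training text rather than learning patterns that carry over to unseen text. This is likely because the dataset is very small (7,449 tokens).
*   Each validation number comes from a single batch of 8 chunks, so it jumps around from one check to the next.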
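The progress check in the loop looks at a single validation batch, so the printed number is noisy. A common alternative is to average the loss over several batches. The sketch below shows one way to do that; `estimate_loss`, `TinyModel` and `toy_val_batch` are hypothetical names, and the tiny model and random data exist only to make the example runnable on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def estimate_loss(model, get_batch_fn, eval_iters=20):
    """Average the loss over several random batches (hypothetical helper)."""
    was_training = model.training
    model.eval()                              # turn dropout off while measuring
    losses = torch.zeros(eval_iters)
    for i in range(eval_iters):
        x, y = get_batch_fn()
        _, loss = model(x, y)
        losses[i] = loss.item()
    model.train(was_training)                 # restore the previous mode
    return losses.mean().item()

# --- Tiny stand-in model and data, only to make the sketch runnable ---
class TinyModel(nn.Module):
    def __init__(self, vocab_size=11, n_embd=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.head(self.emb(idx))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

torch.manual_seed(0)
toy_data = torch.randint(0, 11, (200,))       # random "tokens" as toy validation data

def toy_val_batch(block_size=6, batch_size=4):
    ix = torch.randint(0, len(toy_data) - block_size - 1, (batch_size,))
    x = torch.stack([toy_data[i : i + block_size] for i in ix])
    y = torch.stack([toy_data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y

toy_model = TinyModel()
toy_model.train()
avg = estimate_loss(toy_model, toy_val_batch, eval_iters=20)
print(f"average loss over 20 batches: {avg:.4f}")
```

In the notebook, a call along the lines of `estimate_loss(model, lambda: get_batch("val"))` could replace the single-batch check. The helper returns the model to whatever mode it was in before, so it does not matter whether `model.train()` is called again afterwards.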

In [11]:
@torch.no_grad()
def generate(model, prompt: str, max_new_tokens=80, temperature=1.0, top_k=50):
    model.eval()

    idx = torch.tensor(tokenizer.encode(prompt), dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        # Keep only last block_size tokens (context limit)
        idx_cond = idx[:, -block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # last position

        # top-k sampling (optional, helps quality)
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")

        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)

        idx = torch.cat([idx, next_id], dim=1)

    return tokenizer.decode(idx[0].tolist())

print(generate(model, "Xasan "))


Xasan  People former Chief of Staff and Benadir regional administration in an effort to avail itself of the occasion to plant as the occasion to plant as the U.S. They also announced that the top Somali would not attend the establishment of the Somali Film Council.


According to Hassan, the Somali federal government delegation is scheduled to meet with its U.S. partners to discuss further ways to promote


---

This code defines the **`generate` function**, which is responsible for taking a starting prompt and having our trained `SmallLanguageModel` creatively write new text based on it.

Think of it like giving the AI a sentence beginning and asking it to complete the story.

### How it works (simplified):

**`@torch.no_grad()`**: This is a special instruction that tells PyTorch we don't need to calculate gradients during this process. Since we're not training, this saves memory and makes text generation faster.

**`def generate(model, prompt: str, max_new_tokens=80, temperature=1.0, top_k=50):`**
*   **`model`**: This is our trained `SmallLanguageModel`.
*   **`prompt`**: This is the starting text (e.g., "Xasan ") that we give the model.
*   **`max_new_tokens`**: The maximum number of new words or parts of words the model should generate (default is 80).
*   **`temperature`**: This controls how 'creative' or 'random' the generated text is. A lower temperature (e.g., 0.5) makes the output more focused and predictable, while a higher temperature (e.g., 1.5) makes it more diverse and surprising.
*   **`top_k`**: This limits the model's choices for the next word to only the `k` most probable options. This often helps produce more coherent and relevant text.

**Step-by-step generation process:**

1.  **`model.eval()`**: Puts the model in 'evaluation mode', so it behaves consistently without things like dropout.

2.  **`idx = torch.tensor(tokenizer.encode(prompt), ...)`**: The initial `prompt` text is converted into numbers (tokens) that the model understands. It's also prepared as a PyTorch tensor and sent to the correct device (CPU or GPU).

3.  **`for _ in range(max_new_tokens):`**: This loop runs for each new token we want to generate.

    *   **`idx_cond = idx[:, -block_size:]`**: The model can only look at a certain number of previous tokens (`block_size`) for context. This line makes sure we only feed the most recent relevant tokens into the model.

    *   **`logits, _ = model(idx_cond)`**: The model takes the current context (`idx_cond`) and predicts what the *next* word should be. `logits` are the raw scores for each possible word in its vocabulary.

    *   **`logits = logits[:, -1, :] / temperature`**: We take the predictions for the *very last* word in the context and apply the `temperature` to them. This influences the randomness of the next word selection.

    *   **`if top_k is not None: ...`**: If `top_k` is set, it filters the `logits` so that only the top `k` most likely words have a chance of being picked. All other words get a score of negative infinity, meaning they will never be chosen.

    *   **`probs = F.softmax(logits, dim=-1)`**: The `logits` (raw scores) are turned into `probs` (probabilities), where all probabilities add up to 1.

    *   **`next_id = torch.multinomial(probs, num_samples=1)`**: Based on these probabilities, the model randomly picks *one* word (token ID) to be the next word. More probable words are more likely to be chosen.

    *   **`idx = torch.cat([idx, next_id], dim=1)`**: The newly generated word's ID is added to our sequence of tokens, extending the text.

4.  **`return tokenizer.decode(idx[0].tolist())`**: After generating all the new tokens, the entire sequence of token IDs is converted back into human-readable text and returned.

In essence, the `generate` function allows the trained language model to act like a text continuation engine, creatively extending a given prompt into a longer passage of text.
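The effect of `temperature` and `top_k` is easiest to see on a handful of made-up scores. In this sketch, `next_token_probs` is a hypothetical helper that repeats the scaling, filtering and softmax steps from `generate` on a toy 5-token vocabulary.

```python
import torch
import torch.nn.functional as F

# Toy scores for a 5-token vocabulary (made up for illustration)
logits = torch.tensor([[2.0, 1.0, 0.5, 0.0, -1.0]])

def next_token_probs(logits, temperature=1.0, top_k=None):
    logits = logits / temperature                        # temperature scaling
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)                 # the k largest scores
        # Everything below the k-th largest score can no longer be sampled
        logits = logits.masked_fill(logits < v[:, [-1]], float("-inf"))
    return F.softmax(logits, dim=-1)

p_default = next_token_probs(logits)
p_cold = next_token_probs(logits, temperature=0.5)
p_top2 = next_token_probs(logits, top_k=2)

print("temperature 1.0:", p_default)
print("temperature 0.5:", p_cold)
print("top_k = 2:      ", p_top2)
```

A lower temperature pushes more probability onto the highest-scoring token, while `top_k=2` gives zero probability to everything except the two best candidates. `torch.multinomial` then samples from whichever distribution results.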
