

## Token-by-Token Generation in LLMs

When you input "how about" to an LLM during inference, here's what actually happens:

1. First, your input is tokenized. Depending on the tokenizer, "how about" might be split into ["how", "about"] or into ["how", " about"]. Byte-level BPE tokenizers such as GPT-2's typically keep the leading space as part of the token.

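2. The model processes these tokens together in a single forward pass. Causal masking in the attention layers controls what each position can see:
   - The position for "how" attends only to "how"
   - The position for "about" attends to both "how" and "about"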

3. The model doesn't generate output for "how" and then separately for "about". Instead, it uses the full context ["how", "about"] to predict the next token.
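
As a minimal sketch of what "using the full context" means inside a decoder-only transformer: all positions go through the model in one pass, and a causal mask stops each position from attending to later ones. The example below is an illustration under assumptions: it uses random vectors and skips the learned query/key projections, so only the masking pattern is meaningful.

```python
import torch

torch.manual_seed(0)
seq_len, dim = 2, 4              # two tokens, e.g. "how" and " about" (toy sizes)
x = torch.randn(seq_len, dim)    # stand-in for token representations

# Toy attention scores: every position against every position
scores = x @ x.T

# Causal mask: True above the diagonal marks "future" positions
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

# Each row is one position's attention weights over the context
weights = torch.softmax(scores / dim**0.5, dim=-1)
print(weights)
```

Row 0 puts all of its weight on the first token, while row 1 spreads its weight over both tokens. That second row is the "full context" the next-token prediction is based on.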

## The Output Mechanism

For each prediction step, the final layer of the LLM maps from the embedding space to vocabulary space:

- The output shape is indeed [batch, seq, vocabulary_size]
- For the last position in the sequence, the model applies softmax to get a probability distribution over the entire vocabulary
- The next token is then chosen from this distribution: either the token with the highest probability (greedy decoding) or one sampled according to a strategy such as temperature sampling
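
The selection step can be sketched as follows. This is an illustration, not a specific library's decoding routine: the logits are random stand-ins for a real model's output, and the sizes and the temperature value are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
batch, seq, vocab_size = 1, 2, 10               # toy sizes, not a real model's
logits = torch.randn(batch, seq, vocab_size)    # stand-in for the model output

# Keep only the last position: (batch, seq, vocab) -> (batch, vocab)
last_logits = logits[:, -1, :]
probs = torch.softmax(last_logits, dim=-1)

# Greedy decoding: pick the most probable token id
greedy_token = torch.argmax(probs, dim=-1, keepdim=True)        # (batch, 1)

# Temperature sampling: rescale the logits, then sample from the result
temperature = 0.8                                # illustrative value
scaled_probs = torch.softmax(last_logits / temperature, dim=-1)
sampled_token = torch.multinomial(scaled_probs, num_samples=1)  # (batch, 1)
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while a temperature above 1 flattens it.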

## Concrete Example

Let's trace through what happens when you input "how about":

1. Input: "how about"
2. Tokenized: ["how", "about"]
3. Model processes both tokens
4. For the position after "about", the model outputs a probability distribution over the entire vocabulary
5. Let's say "you" has the highest probability
6. "you" is generated and added to the context
7. New context: ["how", "about", "you"]
8. Model then uses this full context to predict the next token

The key insight is that LLMs don't generate a separate response for each token in isolation; they use the full accumulated context to predict the next token.
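
The trace above can be mimicked with a toy loop. Everything here is hypothetical: `TOY_TABLE`, `next_token_probs`, and `generate` are made-up names, and the probabilities are invented for illustration. A real model computes the distribution with a neural network rather than a lookup table.

```python
# Hypothetical toy "model": maps the full context to next-token probabilities.
TOY_TABLE = {
    ("how", "about"): {"you": 0.6, "that": 0.3, "we": 0.1},
    ("how", "about", "you"): {"?": 0.7, "and": 0.3},
}


def next_token_probs(context):
    """Look up a distribution for the whole context, not just its last token."""
    return TOY_TABLE.get(tuple(context), {"<end>": 1.0})


def generate(context, max_new_tokens):
    """Greedy generation: append the most probable token, then repeat."""
    context = list(context)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        context.append(next_token)
    return context


result = generate(["how", "about"], max_new_tokens=3)
print(result)  # → ['how', 'about', 'you', '?']
```

Each iteration passes the entire accumulated context back in, which is the point of the example.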

## About the Final Layer

You're correct about the final layer dimensions:
- Input to the final layer: [batch_size, sequence_length, embedding_dim]
- Output from the final layer: [batch_size, sequence_length, vocabulary_size]

For each position in the sequence, the final layer outputs a vector of logits over the entire vocabulary, which softmax converts into a probability distribution. During inference, we typically only care about the prediction for the last position in the current sequence, as that's the next token we'll generate.
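
A quick shape check makes these dimensions concrete. This is a sketch with deliberately tiny, made-up sizes. Note that the linear layer itself produces unnormalized scores (logits); softmax is what turns them into probabilities.

```python
import torch
import torch.nn as nn

batch_size, sequence_length = 2, 4
embedding_dim, vocabulary_size = 8, 50   # toy sizes chosen for illustration

# Stand-in for the hidden states coming out of the last transformer block
hidden = torch.randn(batch_size, sequence_length, embedding_dim)

# Final layer: maps each position from embedding space to vocabulary space
output_layer = nn.Linear(embedding_dim, vocabulary_size, bias=False)

logits = output_layer(hidden)
print(logits.shape)              # → torch.Size([2, 4, 50])

# During inference, only the last position is needed for the next token
next_token_logits = logits[:, -1, :]
print(next_token_logits.shape)   # → torch.Size([2, 50])
```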


Let me clarify this point:

- During training, the model calculates losses across all positions in the sequence because we're teaching it to predict each token given the previous ones.

- During inference (text generation):
  1. We process the entire sequence through the model
  2. We only need the probability distribution for the last position
  3. We apply softmax only to that final position's logits
  4. We select the next token from this distribution
  5. We append this token to the sequence
  6. Repeat the process with the extended sequence

This is more computationally efficient than computing softmax for all positions when we only need the prediction for the final position. The model architecture still processes the entire sequence through all its layers - we just don't need to convert all the outputs to probability distributions.

Some implementations might still compute softmax for all positions and then only use the last one, but optimized inference engines will calculate it only for the position of interest.
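
The contrast between training and inference can be sketched like this. The logits and targets are random stand-ins and the sizes are arbitrary assumptions; the targets are assumed to be the input tokens already shifted by one position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, vocab_size = 2, 4, 10
logits = torch.randn(batch, seq, vocab_size)           # stand-in for model output
targets = torch.randint(0, vocab_size, (batch, seq))   # stand-in next-token ids

# Training: cross-entropy over every position.
# Flatten (batch, seq, vocab) -> (batch * seq, vocab) to match the targets.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Inference: softmax over the last position only ...
probs_last = torch.softmax(logits[:, -1, :], dim=-1)

# ... gives the same result as the last row of a softmax over all positions.
probs_all = torch.softmax(logits, dim=-1)
print(torch.allclose(probs_last, probs_all[:, -1, :]))  # → True
```

Softmax is applied independently to each position, which is why skipping the other positions changes nothing about the prediction for the last one.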


In [4]:
import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


In [8]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

In [6]:
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.MultiHeadAttentionLayer = MultiHeadAttention(
            d_in=config["emb_dim"],
            d_out=config["emb_dim"],
            context_length=config["context_length"],
            num_heads=config["n_heads"],
            dropout=config["drop_rate"],
            qkv_bias=config["qkv_bias"])

        self.ff = FeedForward(config)
        self.norm1 = LayerNorm(config["emb_dim"])
        self.norm2 = LayerNorm(config["emb_dim"])
        self.drop_shortcut = nn.Dropout(config["drop_rate"])

    def forward(self, x):
        residual = x
        x = self.norm1(x)
        x = self.MultiHeadAttentionLayer(x)
        x = self.drop_shortcut(x)
        x = x + residual

        residual = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)

        return residual + x




In [9]:
Transformer = TransformerBlock(GPT_CONFIG_124M)
x = torch.randn(8, 4, 768)  # (batch, num_tokens, emb_dim)

Transformer(x)

tensor([[[ 0.7727, -3.1875,  2.1552,  ..., -0.2826,  0.3347,  1.8124],
         [ 1.2231, -0.3580,  0.1217,  ...,  0.3407, -0.7173, -0.4626],
         [ 0.2187, -1.1171,  0.0045,  ..., -0.2622, -1.1334, -1.1482],
         [-1.4615, -0.8247,  0.8350,  ...,  0.4794,  0.5317, -0.1688]],

        [[ 0.2279, -0.4291, -0.5787,  ...,  1.5443,  1.0600,  0.1656],
         [ 1.3236, -1.4566,  0.4060,  ...,  0.0772,  1.1229, -1.9444],
         [ 0.1444,  1.3710,  0.4250,  ..., -1.1423, -1.0724,  1.0222],
         [-0.0227, -0.4061, -0.0684,  ...,  0.7543, -0.1317,  0.2042]],

        [[ 1.0829, -0.3685,  0.1287,  ..., -1.3880,  0.4187,  0.1547],
         [ 0.3416, -1.3509,  1.0612,  ..., -0.7149,  2.2754, -0.9094],
         [ 0.7451, -0.1145, -2.0107,  ..., -2.0818, -1.0672,  0.2608],
         [ 0.5343,  0.6967,  0.3529,  ..., -1.0157,  1.2025,  0.0644]],

        ...,

        [[-1.1599, -0.6663, -0.0805,  ...,  0.3496, -0.4883, -1.4553],
         [ 0.1229, -0.5143,  2.4017,  ..., -0.7917, -1.28

In [10]:
Transformer

TransformerBlock(
  (MultiHeadAttentionLayer): MultiHeadAttention(
    (W_query): Linear(in_features=768, out_features=768, bias=False)
    (W_key): Linear(in_features=768, out_features=768, bias=False)
    (W_value): Linear(in_features=768, out_features=768, bias=False)
    (out_proj): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (ff): FeedForward(
    (layers): Sequential(
      (0): Linear(in_features=768, out_features=3072, bias=True)
      (1): GELU()
      (2): Linear(in_features=3072, out_features=768, bias=True)
    )
  )
  (norm1): LayerNorm()
  (norm2): LayerNorm()
  (drop_shortcut): Dropout(p=0.1, inplace=False)
)

In [28]:
class GPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding_layer = nn.Embedding(config["vocab_size"], config["emb_dim"])
        self.positional_embedding = nn.Embedding(config["context_length"], config["emb_dim"])
        self.dropout = nn.Dropout(config["drop_rate"])
        self.Transformer_blocks = nn.Sequential(
            *[TransformerBlock(config) for _ in range(config["n_layers"])]
        )
        self.Final_Layer_norm = LayerNorm(config["emb_dim"])
        self.output_layer = nn.Linear(
            config["emb_dim"], config["vocab_size"], bias=False
        )

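    def forward(self, x):
        _, seq = x.shape
        embedding = self.embedding_layer(x)
        # Create the position indices on the input's device so the model also runs on GPU
        positions = torch.arange(seq, device=x.device)
        positional_encoding = self.positional_embedding(positions)
        x = embedding + positional_encoding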
        x = self.dropout(x)
        x = self.Transformer_blocks(x)
        x = self.Final_Layer_norm(x)

        return self.output_layer(x)


In [29]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [31]:
Gpt = GPTModel(GPT_CONFIG_124M)
Gpt(batch).shape

torch.Size([2, 4, 50257])

In [33]:
total_num = sum(p.numel() for p in Gpt.parameters())
total_num

163009536
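
The count above is about 163M rather than the 124M in the config name. A back-of-the-envelope tally, assuming the layer layout defined in the cells above (no QKV bias, a bias on `out_proj` and the feed-forward layers, an output layer without bias), reproduces the number. The variable names are illustrative only.

```python
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768, "n_layers": 12}
d = cfg["emb_dim"]

tok_emb = cfg["vocab_size"] * d          # token embedding table
pos_emb = cfg["context_length"] * d      # positional embedding table

# One transformer block
attention = 3 * d * d + (d * d + d)      # Q, K, V without bias; out_proj with bias
feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers with bias
layer_norms = 2 * (2 * d)                # scale and shift for norm1 and norm2
block = attention + feed_forward + layer_norms

final_norm = 2 * d
output_layer = d * cfg["vocab_size"]     # bias=False

total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + output_layer
total_tied = total - output_layer        # if the output layer reused tok_emb's weights

print(f"{total:,}")        # → 163,009,536
print(f"{total_tied:,}")   # → 124,412,160
```

The gap is the output layer, which has as many weights as the token embedding table. The original GPT-2 shares those weights between the two layers (weight tying), which is the likely origin of the "124M" label. The `GPTModel` above keeps them separate.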