---
title: "Information-Theoretic Bounds and Training Dynamics of Transformers"
description: "Explaining transformer training through entropy, cross-entropy, and bits-per-token"
author: "Your Name"
date: "2025-09-14"
categories:
  - theory
  - transformers
  - information-theory
---


## Overview

This post sketches how training a transformer can be framed in information-theoretic terms:

- Training minimizes empirical cross-entropy, i.e., the average negative log-likelihood of the training tokens.
- The lowest achievable expected loss is the entropy rate of the data; a restricted model class can only reach that rate plus a KL gap.
- Bits-per-token (bpt) is simply loss in base-2 units.
- Generalization can be discussed via compression and mutual information.



## Image placeholder

Add your figure here. Replace the path once you upload the image file into this folder.

```{=html}
<div style="text-align:center;">
  <img src="/Users/idea/comm4190_F25_Using_LLMs_Blog/entropy-27-00589-g001-550.jpg" alt="Information-theoretic view of training" width="60%"/>
  <p><em>Figure 1. Placeholder for information-theoretic diagram. Replace with your actual image file.</em></p>
</div>
```



## Cross-entropy, entropy rate, and bits-per-token

- Cross-entropy (training loss) estimates how many nats per token the model spends to encode the data.
- Bits-per-token (bpt) is just loss in base-2: bpt = loss_nat / ln 2.
- The theoretical lower bound is the entropy rate H of the data-generating process P; if the model class is misspecified, the best achievable loss is H + D(P||Q*), where Q* is the model in the class closest to P in per-token KL divergence.
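
The decomposition in the last bullet can be checked numerically. The sketch below is a simplification: it assumes i.i.d. tokens, so the entropy rate reduces to the entropy of a single-token distribution. The distributions `p_true` and `q_model` are made-up illustrations, not measurements from any model.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Cross-entropy H(P, Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence D(P||Q) in nats; same support assumption as above."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 4-token vocabulary: true distribution vs. an imperfect model.
p_true = [0.5, 0.25, 0.125, 0.125]
q_model = [0.4, 0.3, 0.2, 0.1]

h = entropy(p_true)
ce = cross_entropy(p_true, q_model)
kl = kl_divergence(p_true, q_model)

print(f"H(P)       = {h:.4f} nats")
print(f"D(P||Q)    = {kl:.4f} nats")
print(f"H(P) + D   = {h + kl:.4f} nats")
print(f"H(P, Q)    = {ce:.4f} nats")
```

The last two printed values agree: the loss exceeds the entropy by exactly the KL term, and it reaches the entropy only when the model matches the data distribution.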



```python
# Convert loss (nats/token) to bits-per-token and perplexity
import math
from typing import Tuple


def loss_to_metrics(loss_nats: float) -> Tuple[float, float]:
    """Return (bits_per_token, perplexity) from loss in nats/token."""
    bits_per_token = loss_nats / math.log(2)
    perplexity = math.exp(loss_nats)
    return bits_per_token, perplexity


example_losses = [3.5, 2.8, 2.2, 1.9]
for step, loss in enumerate(example_losses, start=1):
    bpt, ppl = loss_to_metrics(loss)
    print(f"step={step:02d} loss={loss:.3f} nats -> bpt={bpt:.3f}, ppl={ppl:.2f}")
```
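
The example losses above are hard-coded. As a rough sketch of where such a number comes from, the snippet below computes the average negative log-likelihood from per-position logits using only the standard library. The names `log_softmax`, `empirical_cross_entropy`, `toy_logits`, and `toy_targets` are illustrative assumptions; a real training loop would use a framework's built-in loss.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def empirical_cross_entropy(
    logits_per_position: Sequence[Sequence[float]],
    targets: Sequence[int],
) -> float:
    """Average negative log-likelihood (nats/token) of the target tokens."""
    nll = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, targets)
    ]
    return sum(nll) / len(nll)


# Hypothetical toy batch: 3 positions, vocabulary of 4 tokens.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction: costs ln(4) nats
    [-1.0, 3.0, 0.2, 0.0],
]
toy_targets = [0, 2, 1]

toy_loss = empirical_cross_entropy(toy_logits, toy_targets)
print(f"loss = {toy_loss:.4f} nats/token = {toy_loss / math.log(2):.4f} bits/token")
```

A uniform prediction over a vocabulary of size V always costs ln V nats, which gives a quick sanity check for the loss at initialization.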


## Training dynamics as compression

- Minimizing cross-entropy is equivalent to minimizing expected code length.
- Early training captures the most frequent, highly predictable patterns; later training picks up rarer structure.
- Capacity vs. data curve: adding parameters lowers the achievable cross-entropy until compute or data becomes the bottleneck.
- Generalization: under the minimum description length (MDL) view, good models compress both train and test data with similar code lengths.
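
To make the code-length reading concrete, here is a minimal sketch that stands in a smoothed character-level unigram model for a transformer (a deliberate simplification). It measures the ideal code length, the sum of -log2 q(x), on a training string and a held-out string. The texts, the `alpha` smoothing constant, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict


def fit_unigram(text: str, alphabet: str, alpha: float = 1.0) -> Dict[str, float]:
    """Add-alpha smoothed character probabilities over a fixed alphabet."""
    counts = Counter(text)
    total = len(text) + alpha * len(alphabet)
    return {ch: (counts[ch] + alpha) / total for ch in alphabet}


def code_length_bits(text: str, model: Dict[str, float]) -> float:
    """Ideal code length in bits: sum of -log2 q(x) over the characters."""
    return sum(-math.log2(model[ch]) for ch in text)


# Hypothetical toy corpora; the alphabet covers both so every symbol has a code.
train_text = "the cat sat on the mat and the dog sat on the log"
test_text = "a dog and a cat sat on the mat"
alphabet = "".join(sorted(set(train_text + test_text)))

model = fit_unigram(train_text, alphabet)
uniform_bits = math.log2(len(alphabet))  # cost per character with no model

for name, text in [("train", train_text), ("test", test_text)]:
    bits = code_length_bits(text, model)
    print(f"{name}: {bits:.1f} bits total, {bits / len(text):.3f} bits/char "
          f"(uniform baseline {uniform_bits:.3f})")
```

Comparing the two bits-per-character figures is the MDL-style check: a large gap between train and test code lengths suggests the model has fit idiosyncrasies of the training data rather than shared structure.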

