The dominant paradigm for improving language models has been scaling training: more parameters, more data, more compute. But there's another dimension: **test-time compute**—spending more resources at inference time to get better answers.

Humans do this naturally. Hard problems require more thought. You don't solve a complex proof in one mental step; you work through intermediate results, backtrack when stuck, and verify your reasoning. What happens when we teach AI systems to do the same?



## The Test-Time Compute Hypothesis

The core idea: a small model that "thinks" for 10 seconds might outperform a large model that answers in 100 milliseconds. If you're willing to spend compute at inference time, it can substitute for some of the compute you would otherwise spend on training a bigger model.

This isn't free. Inference compute costs money and time. But for many applications—complex reasoning, code generation, mathematical proofs—accuracy matters more than latency. And inference compute is more flexible: you can allocate it dynamically based on problem difficulty.

The question is how to spend that compute productively. Simply generating more tokens doesn't help if those tokens are noise. You need structured ways to explore, evaluate, and refine.



## Chain-of-Thought Prompting

The simplest form of test-time compute: ask the model to show its work.

**Chain-of-thought (CoT)** prompting includes examples with step-by-step reasoning, then asks the model to produce similar intermediate steps. This dramatically improves performance on math, logic, and multi-step reasoning problems.

Why does it work? Several hypotheses:

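- **Serialized computation**: A transformer has fixed depth, so it does a bounded amount of sequential computation per token. Generating intermediate tokens lets later steps build on earlier results, which effectively adds more "layers" of computation.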
- **Error decomposition**: Breaking problems into steps exposes intermediate results that are easier to verify and correct.
- **Training distribution**: Models are trained on text that includes reasoning. Prompting them to reason recovers abilities learned during training.

**Zero-shot CoT**: Just adding "Let's think step by step" to the prompt induces reasoning without examples. This suggests CoT is unlocking something the model already knows how to do.
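The difference between a direct prompt, a zero-shot CoT prompt, and a few-shot CoT prompt comes down to string construction. The sketch below is illustrative only: the `Q:`/`A:` layout and the function names are assumptions made for this example, not a format any particular model requires.

```python
COT_TRIGGER = "Let's think step by step."

def direct_prompt(question: str) -> str:
    """Ask for the answer with no reasoning instruction."""
    return f"Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """Same question, but the answer is primed to begin with reasoning."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def few_shot_cot_prompt(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Prepend worked examples given as (question, reasoning, answer) triples."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(zero_shot_cot_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```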

Limitation: CoT produces a single sampled chain. The model generates one line of reasoning, token by token, and commits to it. Nothing in the procedure makes it explore alternatives or backtrack.



## Best-of-N and Self-Consistency

Generate multiple answers; pick the best one.

**Best-of-N sampling**: Generate N independent completions, score them somehow (model confidence, an external verifier, or ground truth when a checkable answer exists), and return the best. Simple but effective. Gains typically flatten as N grows, so a handful of samples (often 4-8) tends to capture most of the benefit.
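As a minimal sketch of best-of-N, assume a `generate` callable that samples one candidate and a `score` callable that rates it. Both are hypothetical stand-ins here: the toy "model" guesses integers and the toy "verifier" prefers guesses near 42.

```python
import random
from typing import Callable

def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n independent candidates and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (assumptions for illustration, not a real model or verifier).
def toy_generate(rng: random.Random) -> str:
    return str(rng.randint(0, 100))

def toy_score(candidate: str) -> float:
    return -abs(int(candidate) - 42)

for n in (1, 4, 16):
    print(n, best_of_n(toy_generate, toy_score, n))
```

Because the seed is fixed, the candidates for a larger N include those for a smaller N, so the best score can only stay the same or improve as N grows. The quality of the result is bounded by the quality of `score`.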

**Self-consistency**: For problems with a single correct answer (math, factual questions), generate N reasoning chains and take the majority vote on the final answer. Different reasoning paths might make different intermediate errors but converge on the correct final answer.
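Majority voting over final answers can be sketched in a few lines. The example assumes each chain ends with `Answer: <value>`, which is a convention chosen for this sketch; real outputs usually need more careful answer extraction.

```python
from collections import Counter

def extract_answer(chain: str) -> str:
    """Take whatever follows the last 'Answer:' marker (an assumed convention)."""
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistency(chains: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of chains that agree with it."""
    answers = [extract_answer(c) for c in chains]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

chains = [
    "17 + 25 = 42. Answer: 42",
    "25 + 10 = 35, then 35 + 7 = 42. Answer: 42",
    "17 + 25 = 32. Answer: 32",  # an arithmetic slip in one chain
]
answer, agreement = self_consistency(chains)
print(answer, agreement)
```

The agreement fraction doubles as a rough confidence signal: low agreement suggests the question is hard for the model.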

These methods use test-time compute to reduce variance—sampling multiple times and aggregating. They work because model stochasticity produces diverse solutions, some of which are better than others.

Limitation: Cost is linear in N, because each sample is a complete generation. And if the model is systematically wrong, so that most chains converge on the same wrong answer, more sampling doesn't help.



## Tree Search for Language

Instead of generating complete sequences and scoring afterward, build a tree of partial sequences and search.

**Monte Carlo Tree Search (MCTS)**: The approach that powered AlphaGo. Treat sequence generation as a game tree. At each position, expand promising branches, simulate to completion, and backpropagate value estimates. Balance exploration (trying new branches) and exploitation (deepening good ones).

Applied to language:

1. Start with the prompt
2. Generate several candidate continuations (tokens, phrases, or whole reasoning steps)
3. Score each branch with a value estimate
4. Expand the most promising branches
5. Continue until a branch reaches a complete answer, then backpropagate its score to the nodes above it

This enables backtracking: if a reasoning path leads to a dead end, you can return to an earlier branch and try differently.
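To make the backtracking idea concrete, here is a sketch of best-first search over partial paths. It is deliberately simpler than MCTS: there are no rollouts and no backpropagated visit statistics, only a priority queue ordered by a value estimate. The toy task (reach a target number by applying arithmetic "steps") and the `value` heuristic are assumptions standing in for reasoning steps and a learned value model.

```python
import heapq
from typing import Optional

# Hypothetical "reasoning steps": each one transforms the current state.
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}

def value(state: int, target: int) -> float:
    """Stand-in value estimate: closer to the target is better."""
    return -abs(target - state)

def best_first_search(start: int, target: int, max_depth: int = 6,
                      max_expansions: int = 1000) -> Optional[list[str]]:
    """Expand the highest-value partial path first. Lower-value branches stay
    in the frontier, so the search can return to them later (backtracking)."""
    # Heap entries: (negated value, unique tie-breaker, state, path so far)
    frontier = [(-value(start, target), 0, start, [])]
    counter = 1
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, path = heapq.heappop(frontier)
        if state == target:
            return path
        if len(path) == max_depth:
            continue  # dead end: abandon this branch and pop another
        for name, op in OPS.items():
            child = op(state)
            heapq.heappush(frontier, (-value(child, target), counter, child, path + [name]))
            counter += 1
    return None

print(best_first_search(start=1, target=11))
```

Swapping the heuristic for a learned scorer, and the arithmetic operations for sampled continuations, gives the general shape of value-guided search over text.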

**Process Reward Models (PRMs)**: Train a model to score intermediate reasoning steps, not just final answers. A PRM can identify when reasoning goes wrong before reaching the conclusion, enabling earlier pruning.

**Outcome Reward Models (ORMs)**: Score only final answers. Simpler to train (you just need answer labels) but less useful for guiding search.
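The practical difference between the two shows up on chains that reach the right answer for the wrong reasons. In the sketch below, exact arithmetic checks stand in for learned reward models (an assumption made so the example is runnable); steps must have the form `a + b = c`.

```python
from typing import Optional

def step_is_valid(step: str) -> bool:
    """Check one step of the form 'a + b = c' (the only format handled here)."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)

def orm_score(steps: list[str], expected: int) -> float:
    """Outcome-style score: looks only at the final result."""
    final = int(steps[-1].split("=")[1])
    return 1.0 if final == expected else 0.0

def prm_scores(steps: list[str]) -> list[float]:
    """Process-style scores: one score per intermediate step."""
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]

def first_bad_step(steps: list[str]) -> Optional[int]:
    """Index at which a process scorer could prune the chain, if any."""
    for i, score in enumerate(prm_scores(steps)):
        if score < 0.5:
            return i
    return None

# Computing 2 + 3 + 4: two compensating errors still land on 9.
lucky_chain = ["2 + 3 = 6", "6 + 4 = 9"]
print(orm_score(lucky_chain, expected=9))  # outcome scorer sees only the final 9
print(prm_scores(lucky_chain))             # process scorer flags both steps
print(first_bad_step(lucky_chain))         # search could prune after the first step
```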

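Tree search guided by PRMs is one widely discussed recipe for "reasoning models," though vendors publish few details about what they actually run at inference time. In this picture, the model doesn't just generate one chain; it explores a tree of possibilities, guided by learned value estimates.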



## Self-Critique and Iterative Refinement

Generate once, then improve.

**Self-critique**: Ask the model to evaluate its own answer. "What might be wrong with this solution?" "Are there any errors in this proof?" The model often identifies issues it failed to avoid during generation.

**Iterative refinement**: Generate → critique → revise → repeat. Each pass can fix errors from the previous one. Analogous to how humans edit their writing.

**Constitutional AI-style loops**: Define principles ("be helpful, harmless, honest"), generate candidate responses, rank them by the principles, and train on the rankings. The same idea applies at inference: generate, critique against principles, and revise.

Why can models catch errors on review that they made during generation? Partly because generation is autoregressive: once tokens are emitted, the model is committed to them, and they condition everything that follows. Review operates on a complete artifact and can consider global coherence.
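The generate, critique, revise loop can be sketched as a small driver function. The `critique` and `revise` callables below are toy stand-ins (assumptions for illustration); in practice each would be a model call with its own prompt.

```python
from typing import Callable

def refine(
    draft: str,
    critique: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Critique and revise until the critic has no complaints or the round
    budget runs out. Returns (final text, number of revisions made)."""
    for round_idx in range(max_rounds):
        issues = critique(draft)
        if not issues:
            return draft, round_idx
        draft = revise(draft, issues)
    return draft, max_rounds

# Toy critic and reviser, standing in for model calls.
def toy_critique(text: str) -> list[str]:
    issues = []
    if "  " in text:
        issues.append("double space")
    if not text.endswith("."):
        issues.append("missing period")
    return issues

def toy_revise(text: str, issues: list[str]) -> str:
    # Fix only the first issue per round, to mimic partial revisions.
    if issues[0] == "double space":
        return " ".join(text.split())
    return text + "."

print(refine("The  proof holds", toy_critique, toy_revise))
```

The round budget matters: without it, a critic that always finds something to complain about would loop forever.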



## Verifier-Guided Generation

Use a separate verifier (a test suite, a proof checker, a retrieval system, or another model) to check outputs, and let that verification guide generation.

**Code**: Generate code → run tests → if tests fail, generate again with error context. The test suite is an external verifier. Many coding agents work this way in practice.

**Math**: Generate proof steps → check with a formal verifier (Lean, Coq) → if verification fails, backtrack. The theorem prover provides ground truth.

**Factual claims**: Generate → retrieve sources → verify claims against sources → revise. Retrieval-augmented generation as verification.

External verifiers are powerful because they provide reliable signal. If your code doesn't compile, that's ground truth—no model uncertainty involved. The challenge is that not all tasks have clean external verification.
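Here is a sketch of the code case: run a candidate against tests, and feed any error message into the next attempt. The "model" is a fixed list of attempts (a hypothetical stand-in), and the tests for an `add` function are invented for the example. Executing generated code with `exec` is only acceptable inside a sandbox.

```python
from typing import Callable, Optional

def run_tests(source: str) -> Optional[str]:
    """Execute candidate source and test it. Returns None on success,
    or an error message to feed back into the next attempt."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code outside a sandbox
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
        assert namespace["add"](-1, 1) == 0, "add(-1, 1) should be 0"
    except Exception as exc:  # syntax errors, missing names, failed asserts
        return f"{type(exc).__name__}: {exc}"
    return None

def generate_until_verified(
    generate: Callable[[list[str]], str], max_attempts: int = 5
) -> tuple[Optional[str], list[str]]:
    """Call generate(feedback) until the tests pass or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        error = run_tests(candidate)
        if error is None:
            return candidate, feedback
        feedback.append(error)
    return None, feedback

# Hypothetical "model": picks an attempt based on how much feedback it has seen.
ATTEMPTS = [
    "def add(a, b) return a + b",        # syntax error
    "def add(a, b):\n    return a - b",  # wrong behaviour
    "def add(a, b):\n    return a + b",  # correct
]

def toy_generate(feedback: list[str]) -> str:
    return ATTEMPTS[min(len(feedback), len(ATTEMPTS) - 1)]

code, errors = generate_until_verified(toy_generate)
print(errors)
```

Passing tests only shows the code satisfies those tests, so the verifier is as strong as the test suite behind it.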



## Compute-Optimal Inference

When should you think more? Not every query deserves the same effort.

**Adaptive compute**: Estimate problem difficulty and allocate inference compute accordingly. Simple questions get one-shot answers; hard questions get tree search and verification.

Difficulty estimation is its own challenge:

- Model confidence (for example, entropy over next-token distributions) is one signal, but an unreliable one, since models can be confidently wrong
- Query characteristics (length, complexity, domain) provide hints
- Start with cheap computation; escalate if initial results seem uncertain
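As a rough sketch of the confidence signal, the code below computes mean per-token entropy from next-token probability distributions and escalates when a cheap first pass looks uncertain. The threshold is an arbitrary placeholder, and the function names are assumptions made for this example.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions: list[list[float]]) -> float:
    """Average per-token entropy over a generated sequence."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def choose_budget(distributions: list[list[float]], threshold: float = 0.5) -> str:
    """Escalate when the first pass looked uncertain.
    The threshold is a placeholder, not a recommended value."""
    return "escalate" if mean_entropy(distributions) > threshold else "answer directly"

confident = [[0.98, 0.01, 0.01]] * 3   # peaked distributions
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3  # uniform distributions

print(choose_budget(confident))
print(choose_budget(uncertain))
```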

**Cascading**: Try a small, fast model first. If confidence is low, hand off to a larger model. Most queries might be handled cheaply; only hard cases pay the full cost.
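A cascade is a short loop over models ordered from cheapest to most expensive. In this sketch each model returns an answer with a confidence score; both toy models and the 0.8 cutoff are assumptions for illustration.

```python
from typing import Callable

Model = Callable[[str], tuple[str, float]]  # returns (answer, confidence in [0, 1])

def cascade(query: str, models: list[Model], min_confidence: float = 0.8) -> tuple[str, int]:
    """Try models from cheapest to most expensive and stop at the first one
    whose confidence clears the bar. Returns (answer, index of model used)."""
    answer = ""
    for i, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= min_confidence:
            return answer, i
    # Nobody was confident: keep the last (largest) model's answer.
    return answer, len(models) - 1

# Hypothetical stand-ins: the small model is only confident on short queries.
def small_model(query: str) -> tuple[str, float]:
    confidence = 0.95 if len(query.split()) <= 5 else 0.3
    return "small-answer", confidence

def large_model(query: str) -> tuple[str, float]:
    return "large-answer", 0.9

print(cascade("What is 2 + 2?", [small_model, large_model]))
print(cascade("Prove that the sum of two odd integers is always even.", [small_model, large_model]))
```

The savings depend on how well the small model's confidence tracks its actual accuracy; an overconfident small model will keep hard queries it should have handed off.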

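**Speculative decoding**: Use a small model to draft several tokens ahead, then check the whole draft with the large model in a single parallel pass. Every drafted token the large model agrees with is kept, so a good draft yields many tokens for the price of one large-model step. At the first disagreement, the rest of the draft is discarded and the large model's own token is used instead. The output matches what the large model would have produced on its own, so this buys speed rather than quality.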
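The accept-or-replace logic is easiest to see in the greedy case. The sketch below is a simplified variant under stated assumptions: both models are deterministic next-token functions (toy stand-ins), and the target is called once per position for readability, whereas a real system would score all drafted positions in one batched pass. Sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable

NextToken = Callable[[list[str]], str]  # greedy next-token function

def speculative_decode(target: NextToken, draft: NextToken, prompt: list[str],
                       num_tokens: int, k: int = 4) -> tuple[list[str], int]:
    """Greedy speculative decoding. Returns (generated tokens, number of
    draft-and-verify rounds)."""
    out: list[str] = []
    rounds = 0
    while len(out) < num_tokens:
        rounds += 1
        # 1. Draft k tokens with the cheap model.
        drafted: list[str] = []
        for _ in range(k):
            drafted.append(draft(prompt + out + drafted))
        # 2. Verify: keep drafted tokens for as long as the target agrees.
        accepted: list[str] = []
        for tok in drafted:
            expected = target(prompt + out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # All k drafted tokens accepted: the target adds one more.
            accepted.append(target(prompt + out + accepted))
        out.extend(accepted)
    return out[:num_tokens], rounds

# Toy models (assumptions): the draft agrees with the target most of the time.
TEXT = "the quick brown fox jumps over the lazy dog".split()

def toy_target(context: list[str]) -> str:
    return TEXT[len(context) % len(TEXT)]

def toy_draft(context: list[str]) -> str:
    return "???" if len(context) % 4 == 3 else toy_target(context)

tokens, rounds = speculative_decode(toy_target, toy_draft, [], num_tokens=9)
print(tokens, rounds)
```

The output is identical to decoding with `toy_target` alone, but it takes fewer verification rounds than tokens generated. That gap is where the speedup comes from.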



## OpenAI's "o-series" and the Reasoning Model Paradigm

OpenAI's o1 (reportedly developed under the codename "Strawberry") represents this paradigm taken seriously. Key characteristics:

- **Extended reasoning**: The model generates substantial internal reasoning before answering. This might be hidden from users but consumes inference compute.
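- **Process supervision**: Possibly trained with reward on intermediate reasoning steps, not just final answers. OpenAI describes the training as large-scale reinforcement learning on chains of thought but has published few specifics.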
- **Search**: Likely uses some form of tree search or best-of-N at inference time.
- **Specialized for reasoning**: Optimized for math, code, and logic where verification is possible.

The result: dramatically better performance on hard reasoning benchmarks (AIME math, competition programming) at the cost of higher latency and expense.

This points toward a future where you might choose between:

- Fast, cheap models for simple queries
- Slow, expensive reasoning models for hard problems

Compute becomes a knob you turn based on problem difficulty and quality requirements.



## Implications and Trade-offs

**Latency vs. quality**: More thinking means slower responses. Acceptable for some applications (research, coding, analysis) but not others (chat, real-time decisions).

**Cost**: Inference compute isn't free. Tree search with process reward models can cost 10-100× single-pass inference. This changes the economics of AI applications.
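A back-of-the-envelope cost model shows how such a multiplier can arise. Every number below is hypothetical, chosen only to illustrate the arithmetic; prices are in arbitrary units, and the model ignores prompt tokens and caching.

```python
def single_pass_cost(tokens: int, price_per_token: float) -> float:
    """Cost of one complete generation."""
    return tokens * price_per_token

def best_of_n_cost(n: int, tokens: int, price_per_token: float) -> float:
    """N full generations; scoring cost is ignored here."""
    return n * single_pass_cost(tokens, price_per_token)

def tree_search_cost(nodes_expanded: int, branching: int, tokens_per_step: int,
                     price_per_token: float, scorer_price_per_step: float) -> float:
    """Every expanded node generates `branching` candidate steps, and each
    candidate step is scored once by a process reward model."""
    steps = nodes_expanded * branching
    return steps * (tokens_per_step * price_per_token + scorer_price_per_step)

# Hypothetical workload: a 500-token answer at 1 unit per token.
baseline = single_pass_cost(tokens=500, price_per_token=1.0)
sampled = best_of_n_cost(n=8, tokens=500, price_per_token=1.0)
searched = tree_search_cost(nodes_expanded=50, branching=4, tokens_per_step=50,
                            price_per_token=1.0, scorer_price_per_step=25.0)

print(sampled / baseline)   # multiplier for best-of-8
print(searched / baseline)  # multiplier for the tree search
```

Under these assumptions the search costs 30 times the single pass, mostly because candidate steps that get pruned still have to be generated and scored.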

**Training incentives**: If models are deployed with test-time search, training should optimize for search performance, not just single-pass accuracy. This is an active research area.

**Transparency**: Hidden reasoning (as in some o1 deployments) trades interpretability for performance. Users may not understand why answers take time or why they cost more.

**Diminishing returns**: At some point, no amount of inference compute helps. If the model doesn't have the right knowledge or capability, thinking longer doesn't create it.



## The Bigger Picture

Test-time compute represents a shift in how we think about AI capability. Instead of a fixed model with fixed abilities, we have a continuum: spend more compute, get better answers.

This is closer to how intelligence works in nature. Humans don't have one-shot answers to hard problems. We think, revise, verify, and iterate. The question is whether current architectures—transformers with autoregressive generation—can fully exploit this paradigm, or whether deeper architectural changes are needed.

Either way, the message is clear: model quality at deployment isn't determined only by training. How you use the model matters too.





```{=html}
<div style="text-align:center;">
  <img src="image.png" alt="Figure" width="65%"/>
  <p><em>Figure 1. Trading off training compute vs. inference compute</em></p>
</div>
```

