Implementation of "HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness" (arXiv:2605.02396v1 · authors' original repository).
This repository is an independent packaging of the paper's skill as an installable Claude Code plugin. For the canonical reference implementation, evaluation code, and any official artifacts from the authors, see wjn1996/HeavySkill.
HeavySkill is a training-free reasoning amplification technique. Faced with a complex problem, an agent equipped with this skill:
- Spawns K independent reasoning agents in parallel — each one solves the problem from scratch with no access to the others' work.
- Performs sequential deliberation over the K trajectories — explicitly not a majority vote, but a critical meta-analysis that can override consensus, promote a well-reasoned minority answer, or reject all trajectories and re-derive the result.
The paper reports that the same SKILL.md document works unchanged under Claude Code and custom orchestration harnesses, and that the resulting HM@K consistently beats majority voting (V@K) and approaches the theoretical P@K upper bound on AIME25, HMMT25-Feb, and GPQA-Diamond.
HeavySkill/
├── .claude-plugin/
│   └── plugin.json          # Claude Code plugin manifest
├── skills/
│   └── heavy-thinking/
│       └── SKILL.md         # The canonical skill document (verbatim from the paper)
├── examples/
│   └── usage.md             # Concrete activation examples
├── LICENSE
└── README.md
The SKILL.md file is the entire implementation. Everything else is packaging.
From inside Claude Code:
/plugin install https://github.com/Sandjab/HeavySkill
Or clone the repo and install locally:
git clone https://github.com/Sandjab/HeavySkill ~/.claude/plugins/heavyskill
Restart Claude Code. The heavy-thinking skill becomes available globally and auto-activates on problems matching the description in the skill frontmatter.
Copy skills/heavy-thinking/SKILL.md into your global skills directory:
mkdir -p ~/.claude/skills/heavy-thinking
cp skills/heavy-thinking/SKILL.md ~/.claude/skills/heavy-thinking/
The skill is harness-agnostic. The deliberation prompt template and the parallel-trajectories prompting function are both inside SKILL.md. A custom harness only needs:
- A way to call the policy `K` times in parallel on the same query.
- The `build_deliberation_prompt` helper shown in `SKILL.md`.
- A final call to the policy on the deliberation prompt.
Recommended sampling parameters from the paper: temperature=1.0, top_p=0.95, top_k=10.
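The three requirements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `Policy` is a hypothetical callable that maps a prompt to a completion (and is assumed to apply the sampling parameters itself), and `build_deliberation_prompt_sketch` is an illustrative stand-in for the real `build_deliberation_prompt` helper in `SKILL.md`, whose actual template is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical type: any function that turns a prompt into a completion.
Policy = Callable[[str], str]


def build_deliberation_prompt_sketch(query: str, trajectories: List[str]) -> str:
    """Illustrative stand-in for the helper in SKILL.md (not the paper's template)."""
    parts = [f"Problem:\n{query}\n"]
    for i, trajectory in enumerate(trajectories, start=1):
        parts.append(f"--- Trajectory {i} ---\n{trajectory}\n")
    parts.append(
        "Critically compare the trajectories above. Do not take a majority vote: "
        "verify each claim, and re-derive the answer if none is convincing."
    )
    return "\n".join(parts)


def heavy_think(policy: Policy, query: str, k: int = 3) -> str:
    # Stage 1: K independent calls on the same query, run concurrently.
    with ThreadPoolExecutor(max_workers=k) as pool:
        trajectories = list(pool.map(policy, [query] * k))
    # Stage 2: a single sequential call on the deliberation prompt.
    return policy(build_deliberation_prompt_sketch(query, trajectories))
```

Threads are used here only because policy calls are typically network-bound; an async client would serve equally well.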
Once installed, the skill auto-activates on problems matching its trigger description. You can also invoke it explicitly:
Activate heavy-thinking and solve: Find the number of ordered pairs $(a,b)$ of integers such that $|a + bi| \le 5$ and $a^2 + b^2$ is prime.
Behind the scenes:
- The main agent spawns K=3 parallel reasoning subagents via the `Agent` tool in a single message.
- Each subagent independently produces a complete reasoning chain.
- The main agent collects all three trajectories, runs the deliberation analysis itself (it does not delegate this step), and produces the final boxed answer.
Three scenarios that illustrate why deliberation outperforms voting — and why heavy thinking is more than triple-cost overhead. Each example is shown as it would unfold in a Claude Code session: three parallel trajectories, then the deliberation step that produces the final answer.
Problem. Determine all real numbers $\alpha$ such that, for every positive integer $n$, the sum $\lfloor \alpha \rfloor + \lfloor 2\alpha \rfloor + \cdots + \lfloor n\alpha \rfloor$ is divisible by $n$.
Stage 1 — three parallel trajectories.
| | Approach | Conclusion |
|---|---|---|
| Thinker #1 | "If …" | Any integer |
| Thinker #2 | Same intuition as #1, plus checks … | Any integer |
| Thinker #3 | Tests … | Even integers only |
Stage 2 — deliberation. Two thinkers agree on "any integer", but only Thinker #3 tested specific cases. The main agent verifies Thinker #3's counterexample, rejects the 2-vs-1 consensus, and adopts the minority answer.
Final answer. $\alpha$ is an even integer (i.e. $\alpha \in 2\mathbb{Z}$).
A majority-vote scheme would have returned the wrong answer. Deliberation catches the trap because it re-checks claims rather than aggregating them.
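The kind of re-check the deliberation step performs can be illustrated with a short brute-force scan. For an integer $\alpha$ the sum equals $\alpha \, n(n+1)/2$, which at $n = 2$ is $3\alpha$, so every odd integer already fails there. The sketch below, with a hypothetical helper `satisfies`, tests the divisibility condition for $n$ up to a finite bound; a finite scan is supporting evidence for the candidates it tests, not a proof over all real $\alpha$.

```python
from fractions import Fraction
from math import floor


def satisfies(alpha, n_max: int = 50) -> bool:
    """Check that floor(alpha) + ... + floor(n*alpha) is divisible by n for n = 1..n_max."""
    total = 0
    for n in range(1, n_max + 1):
        total += floor(n * alpha)
        if total % n != 0:
            return False
    return True


# Integers in [-6, 6] that pass the finite check.
passing_integers = [a for a in range(-6, 7) if satisfies(a)]

# A non-integer candidate, represented exactly to avoid float rounding.
half_passes = satisfies(Fraction(1, 2))
```

Only the even integers in the scanned range survive, and $\alpha = 1/2$ fails at $n = 2$, which is consistent with the final answer above.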
Problem. Compute $\displaystyle \sum_{k=1}^{\infty} \frac{1}{k^2 \, 2^k}$ in closed form.
Stage 1 — three parallel trajectories.
| | Approach | Steps |
|---|---|---|
| Thinker #1 | Power series. Starts from … | … |
| Thinker #2 | Integral representation. Writes … | A double integral over … |
| Thinker #3 | Special-function identity. Recognises the sum as … | Direct closed form … |
Stage 2 — deliberation. Three completely different machines (real analysis, double integrals, polylogarithm identities) yield the same closed form. The deliberator notes that triple cross-validation across independent frameworks is far stronger evidence than three majority-voting trials, and accepts the answer with high confidence.
Final answer. $\displaystyle \sum_{k=1}^{\infty} \frac{1}{k^2 \, 2^k} = \mathrm{Li}_2\!\left(\tfrac{1}{2}\right) = \frac{\pi^2}{12} - \frac{(\ln 2)^2}{2}.$
This is the textbook case where HeavySkill's HM@K approaches the P@K upper bound: any one of the three paths suffices, and their agreement makes the answer essentially certain.
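A fourth, purely numerical cross-check is cheap to add. The sketch below sums the first 60 terms of the series (the tail beyond that is smaller than $2^{-60}$) and compares the result with the closed form; `partial_sum` and `closed_form` are illustrative names, not part of the skill.

```python
import math


def partial_sum(terms: int = 60) -> float:
    """Partial sum of 1 / (k^2 * 2^k) for k = 1..terms."""
    return sum(1.0 / (k * k * 2.0**k) for k in range(1, terms + 1))


# Closed form from the final answer: pi^2 / 12 - (ln 2)^2 / 2.
closed_form = math.pi**2 / 12 - math.log(2) ** 2 / 2

# Absolute gap between the truncated series and the closed form.
difference = abs(partial_sum() - closed_form)
```

The two values agree to within floating-point precision, which matches the agreement among the three symbolic derivations.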
Problem. I tell you "I have two children. At least one is a boy born on a Tuesday." Assuming each child is independently a boy or girl with probability $1/2$ and born on each day of the week with probability $1/7$, what is the probability that both children are boys?
Stage 1 — three parallel trajectories.
| | Reasoning | Answer |
|---|---|---|
| Thinker #1 | "The day of the week is irrelevant. Given at least one boy, …" | … |
| Thinker #2 | "The two children are independent and the day is given for one of them, so …" | … |
| Thinker #3 | Sets up Bayes but miscounts the favorable outcomes (14 instead of 13). | … |
Stage 2 — deliberation. The deliberator notes that the three answers contradict each other — there is no majority and no obvious quality ranking from the trajectories alone. It refuses to vote and re-derives the answer from scratch.
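Enumerate ordered pairs (child 1, child 2) with each child labeled by sex and day, giving $2 \cdot 7 = 14$ equally likely labels per child and $14^2 = 196$ ordered pairs. Exactly $14 + 14 - 1 = 27$ pairs contain at least one Tuesday-born boy (the pair with two Tuesday-born boys must be counted once, not twice), and $7 + 7 - 1 = 13$ of those pairs are two boys.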
The deliberator confirms that Thinker #3 had the right framework but the wrong count; the right answer is almost its answer, off by exactly one outcome (the double-count).
Final answer. $\boxed{\dfrac{13}{27}}$.
This is the strongest case for heavy thinking: every individual trajectory was wrong, yet the deliberation step produced the correct answer. A single chain-of-thought (no matter how careful) cannot do this; a majority vote across the three trajectories cannot do this. The synthesis step is the value.
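The re-derivation in the deliberation step is small enough to enumerate exhaustively. The sketch below builds all ordered (sex, day) pairs under the problem's uniform, independent assumptions and counts the outcomes directly; the index chosen for Tuesday is arbitrary, and all names are illustrative.

```python
from fractions import Fraction
from itertools import product

SEXES = ("boy", "girl")
DAYS = range(7)   # seven equally likely days
TUESDAY = 1       # arbitrary index; any fixed day gives the same count

# Every equally likely label for one child, then every ordered pair of children.
children = list(product(SEXES, DAYS))
families = list(product(children, repeat=2))


def is_tuesday_boy(child) -> bool:
    return child == ("boy", TUESDAY)


# Condition on "at least one is a boy born on a Tuesday".
conditioned = [f for f in families if any(is_tuesday_boy(c) for c in f)]
both_boys = [f for f in conditioned if all(c[0] == "boy" for c in f)]

probability = Fraction(len(both_boys), len(conditioned))
```

Counting ordered pairs avoids the double-count that tripped up Thinker #3: the family with two Tuesday-born boys appears exactly once in `families`.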
| | Pattern | Why naive aggregation fails | What deliberation does |
|---|---|---|---|
| Ex. 1 | Minority correct | 2-vs-1 vote elects the wrong answer | Re-verifies the consensus claim, finds the counterexample, flips |
| Ex. 2 | Convergence | Voting is fine but provides no extra signal | Recognizes that independent derivations are stronger than repeated trials |
| Ex. 3 | All wrong | No majority; voting is undefined | Refuses the vote, re-derives, validates against the closest near-miss |
These are exactly the three failure modes of Best-of-N / majority voting that the HeavySkill paper sets out to fix. The skill is cheap in wall-clock terms: the K=3 agents run concurrently, so latency stays close to a single trajectory plus one deliberation pass, while token cost is roughly 3× a single trajectory plus the deliberation call. The gain is largest precisely on the problems where a single chain-of-thought would have failed.
The skill auto-triggers on:
- Competition math (AIME, HMMT, IMO, Putnam style)
- GPQA-style hard science / STEM questions
- Intricate logical deduction and proof problems
- Algorithmic / code-competition problems
- Any task where you feel uncertain about your initial approach
It deliberately does not trigger on:
- Simple factual lookups
- Casual conversation
- Straightforward code edits with an obvious solution
- Pure information-retrieval tasks
The paper's recommended settings, summarized:
| Parameter | Harness default | Workflow default | Notes |
|---|---|---|---|
| `K` (parallel trajectories) | 3 | 8 or 16 | Larger K raises P@K ceiling but costs more |
| `K⁽¹⁾` (deliberation outputs) | 1 | 4 | For HM@K / HP@K reporting |
| `N` (iterations) | 1 | 1 | Iterate 2–3× only if first pass is inconclusive |
| `temperature` | 1.0 | 1.0 | High diversity across trajectories |
| `top_p` | 0.95 | 0.95 | |
| `top_k` | 10 | 10 | |
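For a custom harness, the two columns of the table could be captured as a small configuration object. The sketch below is one possible encoding; the class name `HeavyThinkingConfig` and its field names are hypothetical and do not come from the paper or from `SKILL.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeavyThinkingConfig:
    """Hypothetical container for the settings in the table above."""
    k: int = 3                 # K: parallel trajectories
    k_deliberation: int = 1    # K^(1): deliberation outputs
    iterations: int = 1        # N: deliberation rounds
    temperature: float = 1.0
    top_p: float = 0.95
    top_k: int = 10


# Harness column of the table.
HARNESS_DEFAULT = HeavyThinkingConfig()

# Workflow column; the table lists K as "8 or 16", and 8 is chosen here arbitrarily.
WORKFLOW_DEFAULT = HeavyThinkingConfig(k=8, k_deliberation=4)
```

Freezing the dataclass keeps an evaluation run from mutating its settings midway, which makes reported metrics easier to attribute to one configuration.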
When running offline evaluations of HeavySkill quality, report:
- M@K — Mean accuracy across `K` independent trajectories
- P@K — Proportion of problems with ≥1 correct trajectory (upper bound)
- V@K — Majority-vote accuracy across `K` trajectories
- HM@K — Heavy-Mean accuracy: mean over `K⁽¹⁾` deliberation outputs
- HP@K — Heavy-Pass: proportion where ≥1 of `K⁽¹⁾` deliberation outputs is correct
The headline claim of the paper is that HM@K > V@K (deliberation beats voting) and HM@K ≈ P@K on frontier models (deliberation captures most of the reachable correct answers).
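The five metrics can be computed from per-problem answer lists as sketched below. This is an illustrative implementation of the definitions as listed, not the authors' evaluation code: the function name `heavy_metrics` is hypothetical, answers are compared by exact string match, and ties in the majority vote are broken in favor of the answer seen first, which is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence


def heavy_metrics(
    gold: Sequence[str],
    trajectories: Sequence[Sequence[str]],   # K answers per problem
    deliberations: Sequence[Sequence[str]],  # K^(1) answers per problem
) -> Dict[str, float]:
    n = len(gold)
    m = p = v = hm = hp = 0.0
    for answer, trajs, delibs in zip(gold, trajectories, deliberations):
        traj_hits = [t == answer for t in trajs]
        delib_hits = [d == answer for d in delibs]
        m += sum(traj_hits) / len(trajs)      # mean accuracy over trajectories
        p += any(traj_hits)                   # at least one correct trajectory
        # Majority vote; ties go to the earliest answer (assumption).
        v += Counter(trajs).most_common(1)[0][0] == answer
        hm += sum(delib_hits) / len(delibs)   # mean accuracy over deliberations
        hp += any(delib_hits)                 # at least one correct deliberation
    return {"M@K": m / n, "P@K": p / n, "V@K": v / n, "HM@K": hm / n, "HP@K": hp / n}


# Toy data: the second problem has a correct minority answer that voting misses.
scores = heavy_metrics(
    gold=["4", "13/27"],
    trajectories=[["4", "4", "5"], ["1/3", "13/27", "1/3"]],
    deliberations=[["4"], ["13/27"]],
)
```

On this toy data the vote is right on only one of the two problems, while the deliberation outputs are right on both.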
MIT — see LICENSE.
@article{heavyskill2026,
  title = {HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness},
  year = {2026},
  eprint = {2605.02396},
  archivePrefix = {arXiv}
}
Authors' official repository: github.com/wjn1996/HeavySkill.