SENNGAMI/Real_reason

HeavySkill — Divergent-Think for Claude Code

Cross-framing heavy reasoning for Claude Code — fixes the two failure modes (Appendix A "diversity collapse", Appendix B "iteration drift") documented in the HeavySkill paper. Ships as five composable skills with a single user-facing entry: /divergent-think.


Why this exists

The HeavySkill paper showed that parallel-reasoning + sequential-deliberation outperforms majority-voting / Best-of-N — but it also documented two limitations:

| Paper finding | Limitation | Fix this repo ships |
| --- | --- | --- |
| Appendix A | K parallel trajectories share the same prompt → Max-Diversity selection ≈ Random selection | auto-reframe generates K=6 axis-disjoint framings so each trajectory enters the problem through a structurally different conceptual lens |
| Appendix B | Iterative deliberation: HM@K rises but HP@K falls (in-frame noise accumulates) | frame-critic injects a fresh axis between iterations; sees only framings + summaries (never trajectory bodies), enforced by sub-agent context isolation |

The divergent-think orchestrator combines both into one end-to-end pipeline; the original single-frame heavyskill is kept as a baseline for already-canonical problems (competition math, well-posed STEM).


What you get

Two complementary ways to use this repo:

| Mode | For whom | Entry point |
| --- | --- | --- |
| A: Claude Code skill (recommended for interactive use) | Anyone using Claude Code as their primary harness | /divergent-think <query> |
| B: Python workflow (for paper repro & batch benchmarking) | Researchers running ablations against open-weight models | python scripts/run_divergent.py ... |

Both modes implement the same pipeline shape and the same hard limits (K=6 framings, K¹=4 deliberation samples, N_max=3 critic iterations) defined in the paper.


Architecture

                 ┌──────────────────────────────────────────────┐
   User query ──▶│           divergent-think (orchestrator)     │
                 └──────────────────────────────────────────────┘
                          │
                          ▼
           ┌──────────────────────────────────┐
   Stage 1 │ Reframe in-context               │  reads prompts/01-reframe.md
           │   → 6 framings, strict JSON      │
           │   (domain / abstraction / actor /│
           │    goal / scale / analogy)       │
           └──────────────────────────────────┘
                          │
                          ▼
           ┌──────────────────────────────────┐
   Stage 2 │ Spawn 6 Agents in parallel       │  reads prompts/02-worker.md
           │   each with ONE framing          │  ─▶ 6 trajectories
           │ Deliberate in-context (×4)       │  reads prompts/03-deliberation.md
           │   → summary[iter]                │
           └──────────────────────────────────┘
                          │
                          ▼
           ┌──────────────────────────────────┐
   Stage 3 │ Spawn 1 Agent (frame-critic)     │  reads prompts/04-critic.md
           │   sees ONLY framings + summaries │  ─▶ STOP | CONTINUE
           │   (NEVER trajectory bodies)      │
           └──────────────────────────────────┘
                          │
              ┌───────────┴──────────────┐
              │                          │
       STOP   ▼                          ▼  CONTINUE (≤ N_max=3)
       Final answer                Stage 4 — Partial re-run:
                                   spawn 1 Agent for the new
                                   framing only, re-deliberate,
                                   loop to Stage 3

The five skills

| Skill | Role | User-invocable? |
| --- | --- | --- |
| divergent-think | Orchestrator: only this skill executes the pipeline | ✅ Primary entry point |
| heavyskill | Single-frame baseline (K=3 parallel on the same prompt) | ✅ For canonical math/STEM |
| auto-reframe | Narrative reference for Stage 1 | ❌ Documentation only |
| heavy-think-divergent | Narrative reference for Stage 2 | ❌ Documentation only |
| frame-critic | Narrative reference for Stage 3 | ❌ Documentation only |

The four executable prompt templates

The orchestrator reads these at runtime — they are the single source of truth for prompt content:

.claude/skills/divergent-think/prompts/
├── 01-reframe.md         # Stage 1 protocol (6 axes + anti-anchoring + JSON schema)
├── 02-worker.md          # Stage 2 per-framing worker prompt
├── 03-deliberation.md    # Stage 2 cross-framing synthesis prompt
└── 04-critic.md          # Stage 3 STOP/CONTINUE prompt

Edit these files to tune behavior. Do not add prompt content to the SKILL.md files: as of v0.3.0 they contain no inlined prompts, and the orchestrator reads the templates above at each stage.


Quick start — Mode A (Claude Code skill)

Install the skill bundle

Download or copy the bundled tarball (dist/heavyskill-skill-v0.3.0.tar.gz) and extract it into either your project's .claude/skills/ or your user-level ~/.claude/skills/:

# Project-level install (recommended while iterating)
mkdir -p .claude/skills
tar -xzvf heavyskill-skill-v0.3.0.tar.gz -C .claude/skills/

# OR user-level install (available across all your projects)
mkdir -p ~/.claude/skills
tar -xzvf heavyskill-skill-v0.3.0.tar.gz -C ~/.claude/skills/

After extraction you should see five skill folders (divergent-think/, heavyskill/, auto-reframe/, heavy-think-divergent/, frame-critic/) and the divergent-think/prompts/ subdirectory with four .md files.
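The check described above can also be scripted. The following is a minimal sketch, assuming the folder and file names listed in this README; `missing_install_paths` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Names taken from the install notes in this README.
EXPECTED_SKILLS = ["divergent-think", "heavyskill", "auto-reframe",
                   "heavy-think-divergent", "frame-critic"]
EXPECTED_PROMPTS = ["01-reframe.md", "02-worker.md",
                    "03-deliberation.md", "04-critic.md"]


def missing_install_paths(skills_root: Path) -> list[str]:
    """Return the expected relative paths that are absent under skills_root."""
    missing = []
    for skill in EXPECTED_SKILLS:
        if not (skills_root / skill).is_dir():
            missing.append(skill + "/")
    for prompt in EXPECTED_PROMPTS:
        rel = Path("divergent-think") / "prompts" / prompt
        if not (skills_root / rel).is_file():
            missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    # Project-level install location; use Path.home() / ".claude/skills" for a user-level one.
    problems = missing_install_paths(Path(".claude/skills"))
    print("install looks complete" if not problems else f"missing: {problems}")
```

An empty list means all five skill folders and all four prompt files were found.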

Verify the install

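In a fresh Claude Code session inside the project, run a multi-vector query. The example below is in Chinese ("Design a feature that raises SaaS user retention without letting DAU drop") so that you can also check that the answer matches the query language: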

/divergent-think 設計一個讓 SaaS 用戶留存率上升的功能(不能讓 DAU 下降)

You should observe:

  1. Stage 1's 6-framing JSON emitted inline by the orchestrator
  2. Six parallel Agent tool calls in the terminal (one per axis) — this is the key visible signal that the pipeline is working
  3. A 7th Agent call for the frame-critic
  4. A clean Chinese-language final answer (matching query language), no meta-narration prefix

If you see only the Stage 1 JSON and no parallel Agent calls, the orchestrator isn't reading prompts/02-worker.md. Re-extract the tarball and confirm that the prompts/ directory landed inside divergent-think/.

Run

/divergent-think <multi-vector reasoning query>    # full divergent pipeline (1–3 min)
/heavyskill      <canonical math/STEM query>       # single-frame baseline (faster, cheaper)

The orchestrator auto-detects when a query is canonical and may hand off to heavyskill itself.


Quick start — Mode B (Python workflow, for paper repro)

git clone https://github.com/wjn1996/HeavySkill.git
cd HeavySkill
pip install -e .

Run the divergent pipeline (matches divergent-think skill semantics with K=6, K¹=4, N_max=3 defaults):

python scripts/run_divergent.py \
    --query "Find the number of paths of length 16 on an 8x8 grid that change direction exactly four times." \
    --model "deepseek-r1" \
    --api_base "http://localhost:8080" \
    --output "outputs/divergent_result.json" \
    --verbose

Run the single-frame baseline (matches heavyskill skill):

python scripts/run_heavyskill.py \
    --query "Your problem here" \
    --model "deepseek-r1" \
    --api_base "http://localhost:8080" \
    --reason_k 8 --summary_k 4 \
    --output "outputs/baseline_result.json"

Using a separate deliberation model:

python scripts/run_heavyskill.py \
    --query "..." \
    --model         "r1-distill-qwen-7b"   --api_base         "http://localhost:8080" \
    --summary_model "qwen3-32b"            --summary_api_base "http://localhost:8081" \
    --reason_k 16 --summary_k 4

Batch evaluation:

python scripts/run_heavyskill.py \
    --input_file "examples/example_math.json" \
    --model "deepseek-r1" --api_base "http://localhost:8080" \
    --output "outputs/batch_result.json"

The Python pipeline supports any OpenAI-compatible endpoint: vLLM, DeepSeek API, Together AI, OpenRouter, local Ollama, etc.


Pipeline detail — what each stage does

Stage 1 — Reframe (in-context)

The orchestrator reads prompts/01-reframe.md and executes the protocol in its own context. It produces 6 framings, one per required axis (domain, abstraction, actor, goal, scale, analogy). Anti-anchoring guards ensure that framing names reuse ≤ 30% of the query's content tokens, which rules out, for example, a "security audit" framing for a security-audit query.
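One way to picture the anti-anchoring guard is as a token-overlap check. The sketch below is an illustration under stated assumptions: it measures the share of a framing name's content tokens that also occur in the query, uses a tiny made-up stopword list, and every function name is hypothetical. The protocol in prompts/01-reframe.md may define "content tokens" and the ratio differently.

```python
import re

# Assumption: a small illustrative stopword list, not the project's own.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and", "or", "is"}


def content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def reuse_ratio(framing_name: str, query: str) -> float:
    """Share of the framing name's content tokens that also occur in the query."""
    name_tokens = content_tokens(framing_name)
    if not name_tokens:
        return 0.0
    return len(name_tokens & content_tokens(query)) / len(name_tokens)


def passes_anti_anchoring(framing_name: str, query: str, max_reuse: float = 0.30) -> bool:
    """True when the framing name stays at or under the reuse threshold."""
    return reuse_ratio(framing_name, query) <= max_reuse


query = "Run a security audit of the login service"
print(passes_anti_anchoring("security audit", query))          # False: both tokens reused
print(passes_anti_anchoring("immune system response", query))  # True: no tokens reused
```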

Stage 2 — Parallel framed reasoning (6 sub-agents)

Orchestrator reads prompts/02-worker.md, substitutes the 5 placeholders per framing, and dispatches 6 Agent tool calls in parallel in a single response. Each sub-agent reasons in its framing's vocabulary, then translates back to the original query. Trajectories return into the orchestrator's context.
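The fan-out shape of this stage can be sketched in Python with asyncio. This is not the orchestrator's code: in Mode A the dispatch happens through Agent tool calls, and the template text, placeholder names, and `run_worker` stub below are all hypothetical stand-ins (the real template has five placeholders, defined in prompts/02-worker.md).

```python
import asyncio
from string import Template

# Hypothetical template and placeholder names, for illustration only.
WORKER_TEMPLATE = Template("Axis: $axis\nFraming: $framing\nQuery: $query")


async def run_worker(prompt: str) -> str:
    """Stand-in for one sub-agent call; a real worker would call a model here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"trajectory for [{prompt.splitlines()[0]}]"


async def dispatch(query: str, framings: dict[str, str]) -> list[str]:
    """Fill the template once per framing and run all workers concurrently."""
    prompts = [
        WORKER_TEMPLATE.substitute(axis=axis, framing=text, query=query)
        for axis, text in framings.items()
    ]
    # gather preserves input order, so trajectories line up with framings.
    return await asyncio.gather(*(run_worker(p) for p in prompts))


if __name__ == "__main__":
    axes = ["domain", "abstraction", "actor", "goal", "scale", "analogy"]
    framings = {axis: f"example framing on the {axis} axis" for axis in axes}
    for trajectory in asyncio.run(dispatch("example query", framings)):
        print(trajectory)
```

All six coroutines are started together and their results are collected in framing order, which mirrors the "6 Agent tool calls in a single response" behavior described above.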

Stage 2 — Cross-framing deliberation (in-context, K¹=4 samples)

Orchestrator reads prompts/03-deliberation.md and runs the synthesis itself (must hold all 6 trajectories at once — cannot delegate). Produces 4 samples; picks the most internally consistent. Each summary explicitly names a cross-framing combination that no single framing produced alone, or honestly reports "no genuine combination found".

Stage 3 — Frame-critic (1 sub-agent, STOP/CONTINUE)

Orchestrator reads prompts/04-critic.md and spawns ONE Agent call with {query, framings, summaries_history, iterations_done, n_max, axes_unused}, never the trajectory bodies. Sub-agent context isolation enforces the "critic outside the deliberation frame" contract. Critic returns strict JSON: STOP, or CONTINUE + new framing + axis to evict.
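In a Python setting, the same contract could be approximated with an input whitelist plus strict reply validation. The sketch below uses the six input keys named above; the reply field names (`decision`, `new_framing`, `evict_axis`) and both helper functions are assumptions, since the actual JSON schema lives in prompts/04-critic.md.

```python
import json

# Input keys listed in the Stage 3 description.
CRITIC_INPUT_KEYS = {"query", "framings", "summaries_history",
                     "iterations_done", "n_max", "axes_unused"}


def build_critic_payload(state: dict) -> dict:
    """Copy only whitelisted keys, so trajectory bodies cannot leak through."""
    return {key: state[key] for key in CRITIC_INPUT_KEYS}


def parse_critic_reply(raw: str) -> dict:
    """Validate a critic reply; the field names here are hypothetical."""
    reply = json.loads(raw)
    if not isinstance(reply, dict) or reply.get("decision") not in ("STOP", "CONTINUE"):
        raise ValueError("decision must be STOP or CONTINUE")
    if reply["decision"] == "CONTINUE":
        for field in ("new_framing", "evict_axis"):
            if not reply.get(field):
                raise ValueError(f"CONTINUE reply is missing {field}")
    return reply
```

Rejecting malformed replies early keeps the outer loop from acting on a partial CONTINUE, for example one that names a new framing but no axis to evict.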

Stage 4 — Iteration (CONTINUE only)

Swap one (framing, trajectory) pair; spawn ONE Agent for the new framing's trajectory only; re-run Stage 2 deliberation over the updated 6-tuple; loop to Stage 3. Cost stays linear in N, not N × K.
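The swap-and-loop behavior can be sketched as a small driver with the model calls injected as plain functions. This is an illustration only: `run_outer_loop`, its parameters, and the verdict fields (`decision`, `evict_axis`, `new_axis`, `new_framing`) are hypothetical names, and the critic is given framings and summaries but never trajectories.

```python
from typing import Callable


def run_outer_loop(
    framings: dict[str, str],
    trajectories: dict[str, str],
    deliberate: Callable[[dict[str, str]], str],
    critic: Callable[[dict[str, str], list[str]], dict],
    run_worker: Callable[[str], str],
    n_max: int = 3,
) -> tuple[str, int]:
    """Return (final summary, number of CONTINUE iterations performed)."""
    framings, trajectories = dict(framings), dict(trajectories)  # leave inputs untouched
    summaries = [deliberate(trajectories)]
    verdict = critic(framings, summaries)
    iterations = 0
    while verdict["decision"] == "CONTINUE" and iterations < n_max:
        # Evict one (framing, trajectory) pair.
        evicted = verdict["evict_axis"]
        del framings[evicted]
        del trajectories[evicted]
        # Add the critic's new framing and run ONE worker for it, not K.
        framings[verdict["new_axis"]] = verdict["new_framing"]
        trajectories[verdict["new_axis"]] = run_worker(verdict["new_framing"])
        iterations += 1
        # Re-deliberate over the updated tuple, then ask the critic again.
        summaries.append(deliberate(trajectories))
        verdict = critic(framings, summaries)
    return summaries[-1], iterations
```

Each CONTINUE adds exactly one worker call and one critic call, which is why the cost grows with the number of iterations rather than with iterations times K.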

Stage 5 — Final answer

Match the original query's language and format conventions. No "after deep thinking..." preamble. The user sees an answer that reads as if written directly in response to the query.


Cost / latency

| Resource | Typical full run |
| --- | --- |
| Input + output tokens | ~200k – 400k (varies with query complexity) |
| Wall clock | 1–3 minutes (depends on sub-agent throughput) |
| Sub-agent calls | 6 (Stage 2 parallel) + 1 (Stage 3 critic) + up to N_max × (1 worker + 1 critic) on CONTINUE branches |
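The sub-agent call count above is simple arithmetic, worked through below. `sub_agent_calls` follows the formula as stated; `full_rerun_calls` is a hypothetical comparison (re-running all K workers on every CONTINUE) that this pipeline does not do and is shown only to make the saving concrete.

```python
def sub_agent_calls(continues: int, k: int = 6, n_max: int = 3) -> int:
    """K workers + 1 critic, plus (1 worker + 1 critic) per CONTINUE iteration."""
    if not 0 <= continues <= n_max:
        raise ValueError("continues must be between 0 and n_max")
    return k + 1 + continues * (1 + 1)


def full_rerun_calls(continues: int, k: int = 6) -> int:
    """Hypothetical alternative: every CONTINUE re-runs all K workers plus the critic."""
    return (k + 1) * (1 + continues)


print(sub_agent_calls(0))   # 7: critic says STOP on the first pass
print(sub_agent_calls(3))   # 13: worst case at N_max=3
print(full_rerun_calls(3))  # 28: what a full re-run per iteration would cost
```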

The orchestrator tells the user "Running divergent-think (~1–3 min)" before starting so they can interrupt if metered-API cost is a concern. The hard max_total_tokens budget guard (set in the Python config at 1.5M) fires an early-STOP and returns the best summary so far if exceeded.
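A budget guard of this kind might look like the sketch below. Only the name `max_total_tokens` and the 1.5M value come from the text above; the `TokenBudget` class, `run_with_budget`, and the choice to treat the most recent summary as the "best so far" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TokenBudget:
    max_total_tokens: int = 1_500_000
    used: int = 0

    def charge(self, tokens: int) -> None:
        """Record tokens spent by one pipeline step."""
        self.used += tokens

    @property
    def exceeded(self) -> bool:
        return self.used > self.max_total_tokens


def run_with_budget(
    steps: Iterable[tuple[str, int]], budget: TokenBudget
) -> tuple[Optional[str], bool]:
    """Consume (summary, tokens_spent) steps; stop early once the budget is exceeded.

    Returns (best summary so far, stopped_early).
    """
    best_summary = None
    for summary, tokens in steps:
        best_summary = summary  # assumption: the latest summary is the best one
        budget.charge(tokens)
        if budget.exceeded:
            return best_summary, True
    return best_summary, False
```

For example, with a budget of 100 tokens and steps costing 60, 60, and 10, the guard fires after the second step and returns that step's summary.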

For canonical math/STEM problems where the framing is already given, prefer /heavyskill, which runs K=3 parallel samples on a single prompt (no reframe, no critic), typically at about one third of the cost.


Repository layout

HeavySkill/
├── .claude/skills/                  # Mode A: Claude Code skill bundle
│   ├── divergent-think/
│   │   ├── SKILL.md                 # Orchestrator logic (v0.3.0)
│   │   └── prompts/                 # Executable prompt templates
│   │       ├── 01-reframe.md
│   │       ├── 02-worker.md
│   │       ├── 03-deliberation.md
│   │       └── 04-critic.md
│   ├── heavyskill/SKILL.md          # Single-frame baseline
│   ├── auto-reframe/SKILL.md        # Narrative reference for Stage 1
│   ├── heavy-think-divergent/SKILL.md  # Narrative reference for Stage 2
│   └── frame-critic/SKILL.md        # Narrative reference for Stage 3
│
├── workflow/                        # Mode B: Python pipeline
│   ├── config.py                    # HeavySkillConfig dataclass
│   ├── pipeline.py                  # Single-frame orchestration
│   ├── prompts.py                   # Prompt templates (general / STEM, CN / EN)
│   ├── divergent/                   # Divergent variant
│   │   ├── pipeline.py              # Critic-driven outer loop
│   │   ├── reframer.py              # auto-reframe equivalent
│   │   ├── cross_framing_deliberation.py
│   │   ├── frame_critic.py
│   │   ├── axes.py / distance.py / metrics.py / types.py
│   │   └── config.py                # K=6, K¹=4, N_max=3 defaults
│   └── agent/openai_compatible.py   # Async OpenAI-compatible client
│
├── scripts/
│   ├── run_heavyskill.py            # Mode B single-frame CLI
│   ├── run_divergent.py             # Mode B divergent CLI
│   ├── run_benchmark.py             # Batch benchmark harness
│   ├── judge_arena.py               # Arena-Hard auto-judge
│   └── evaluate.py                  # Accuracy evaluation utility
│
├── examples/
│   ├── example_math.json
│   ├── aime_hmmt_subset.json
│   ├── arena_hard_subset.json
│   └── ctf_seed_v0.json
│
├── paper/heavyskill.pdf             # Paper (arXiv:2605.02396)
├── tests/                           # pytest suite
├── dist/                            # Skill tarball releases (gitignored)
├── pyproject.toml
└── README.md

Contributing

Issues and PRs welcome. A few conventions to make review fast:

  • Prompt edits belong in .claude/skills/divergent-think/prompts/*.md, not in any SKILL.md. The SKILL files describe orchestration and concepts; prompt content is in the runtime files.
  • New axis or framing rule changes belong in prompts/01-reframe.md and also in the corresponding section of workflow/divergent/axes.py to keep the two modes consistent.
  • Smoke test before sending a PR: run /divergent-think <a short multi-vector query> and confirm the terminal shows 6 parallel Agent calls + 1 critic Agent call. If it doesn't, the pipeline is broken.
  • Hard limits (K=6, K¹=4, N_max=3) come from the paper — change them in the config file, not by hard-coding new numbers in prompts.

Citation

If you use this work, please cite the original paper:

@article{wang2026heavyskill,
  title={HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness},
  author={Wang, Jianing and Guo, Linsen and Chen, Zhengyu and Guo, Qi and Zang, Hongyu and Shi, Wenjie and Ma, Haoxiang and Xi, Xiangyu and Li, Xiaoyu and Wang, Wei and Cai, Xunliang},
  journal={arXiv preprint arXiv:2605.02396},
  year={2026},
  url={https://arxiv.org/abs/2605.02396}
}

The Claude Code skill bundle in .claude/skills/ is an independent implementation of the paper's divergent variant (and its single-frame baseline) packaged for the Claude Code agentic harness. Bug reports against the skill bundle are tracked separately from paper errata.


License

Apache-2.0
