
Six step pipeline #3

Open

drewsungg wants to merge 17 commits into Darrow8:main from drewsungg:six-step-pipeline

Conversation

drewsungg commented Apr 23, 2026

Six-Step Taxonomy Pipeline for Subproblem Generation

Problem

The naive pipeline (problem → skills → subproblems → dedupe) fails on problems
with high-confidence wrong attractors — where a model's first pattern-match
leads to a plausible but incorrect interpretation. For example, a problem about
"conics tangent to 5 conics" pattern-matches to the classical count of 3264,
but the actual question (average over F_p-points of a finite étale cover)
requires Chebotarev density and harmonic sums. Decomposing under the wrong
interpretation produces subproblems that test irrelevant skills.

Solution

Insert structural analysis and interpretation enumeration before skill
decomposition, and an adversarial coverage critic after. The full pipeline:

  1. Structural feature extraction — analyze the problem statement itself
    (answer type, operations, aggregation, regime, named objects, deviations
    from the simplest phrasing, easily-overlooked details). Forces the model to
    attend to structure before committing to any technique.

  2. Interpretation enumeration — produce 3+ distinct readings of the
    problem, including the surface-level one. Each interpretation commits to a
    specific answer type, technique, and justification grounded in structural
    features. An ignored_features field enforces honesty.

  3. Skill decomposition — produce a union skill set covering all
    interpretations. Each skill is tagged with which interpretations it serves
    and which structural features it addresses. Skills must be atomic,
    load-bearing prerequisites (not vague topic labels).

  4. Adversarial coverage critic — clause-by-clause check that every part of
    the problem statement is covered by some skill. Includes a stripping test
    (remove covered clauses, check what remains) and an inverse stripping test
    (could you reconstruct the problem from just the skills?). Loops back to
    step 3 if gaps are found (capped at 2 revision rounds).

  5. Per-skill subproblem generation — generates candidates one skill at a
    time with agreement-window filtering (configurable, default 0.60–0.80;
    a minimal sketch follows this list).
    Each candidate is solved N times (default 20) to get solver consensus.
    Includes memory/diversity feedback: every prior attempt (kept or skipped)
    is fed back to the generator so it actively diversifies. Top-up rounds
    retry under-producing skills with full prior-attempt memory carried over.
    Top-up can be dedupe-aware (counts deduped keeps, not raw) so post-dedupe
    counts meet the minimum.

  6. Embedding-based dedupe — BGE-large cosine similarity (default threshold
    0.95). Supports per-skill mode (dedupe within each skill independently) or
    global mode (across all skills).
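A minimal sketch of the step-5 agreement-window check (hypothetical names; the
real logic lives in Stage1/taxonomy_generation.py). It also shows how the
too_easy / too_hard skip reasons fall out of the window:

from collections import Counter

def agreement_rate(answers: list[str]) -> float:
    # Fraction of solver samples agreeing on the modal answer.
    if not answers:
        return 0.0
    _, modal = Counter(answers).most_common(1)[0]
    return modal / len(answers)

def classify(answers: list[str], low: float = 0.60, high: float = 0.80) -> str:
    # Keep a candidate only when solver consensus falls inside the window.
    rate = agreement_rate(answers)
    if rate < low:
        return "too_hard"   # solvers rarely agree: ambiguous or too difficult
    if rate > high:
        return "too_easy"   # near-unanimous: trivial for the solver model
    return "keep"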

Key design decisions

  • No answer access required. The pipeline never needs ground truth on the
    parent problem. Validation comes from multiple interpretations, coverage
    checking, and solver agreement filtering.

  • All outputs cached. Steps 1-4 produce features.json, interpretations.json,
    skills.json, and critic.json. These are reused on subsequent runs for the
    same target problem, so only step 5 (the expensive part) re-runs.

  • Incremental writes. keeps.json and skips.json are flushed after every
    candidate, so a crash mid-run doesn't lose progress.

  • Domain-agnostic. None of the prompts mention domain-specific vocabulary.
    They refer only to structural features, interpretations, skills, and
    coverage — concepts that apply to math, physics, biology, or any domain
    with well-defined problems.

  • Model hardcoded to openai/gpt-oss-120b-maas for all calls (generation,
    solving, structural analysis). No max_tokens, no client-side timeouts.

Files changed

  • Stage1/taxonomy_generation.py — main pipeline implementation (1778 lines)
  • Stage1/dedupe_subproblems.py — standalone dedupe module (also callable as
    library from the main pipeline)
  • subproblem_generation_pipeline.md — full pipeline spec with all prompts
  • docs/specs/2026-04-21-taxonomy-generation-design.md — design spec
  • docs/plans/2026-04-21-taxonomy-generation.md — implementation plan
  • tests/test_taxonomy_generation.py — unit tests for decomposition, skill
    serialization, and model constant

CLI usage

python -u Stage1/taxonomy_generation.py \
    --problem-path data/target-problems/conics.txt \
    --runs-subdir conics-tangent-5-v2 \
    --n-skills 10 --problems-per-skill 15 --max-candidates-per-skill 200 \
    --n-samples 20 --agree-low 0.60 --agree-high 0.80 \
    --gen-workers 4 --max-workers 32 \
    --min-per-skill 10 --max-topup-rounds 2 \
    --dedupe-threshold 0.95 --dedupe-mode per-skill

Output structure

runs/<subdir>/
  features.json          # Step 1: structural features (cached)
  interpretations.json   # Step 2: 3+ interpretations (cached)
  skills.json            # Step 3: union skill set (cached)
  critic.json            # Step 4: coverage critique (cached)
  stage1_taxonomy/<timestamp>/
    keeps.json           # Accepted subproblems with solver traces
    skips.json           # Rejected candidates with reasons
    per_skill_stats.json # Per-skill generation stats
    keeps_deduped.json   # Final deduplicated output

Skeleton only: module header, non-negotiable constants, Skill
dataclass. Decomposition and per-skill generation added in subsequent
tasks. Design: docs/superpowers/specs/2026-04-21-taxonomy-generation-design.md

decompose_target() makes one LLM call, parses the JSON skills list,
retries on parse failure, and returns Skill objects. save_skills /
load_skills persist the decomposition with target hash and model
metadata for reproducibility.
- save_skills: guard os.makedirs against bare-filename paths where
  dirname is '', which would raise FileNotFoundError on makedirs('')
  (sketched after this list).
- decompose_target: validate each skill entry has non-empty string
  fields for name/description/example_problem_hint before returning,
  and add TypeError to the caught retry exceptions so a malformed
  entry triggers a retry instead of slipping through with None
  values.
- hashlib: move import to module top (was deferred inside
  target_text_hash for no reason).
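A minimal sketch of the save_skills guard from the first bullet above (field
handling elided):

import json
import os

def save_skills(skills: list[dict], path: str) -> None:
    # os.makedirs("") raises FileNotFoundError, so guard against bare
    # filenames like "skills.json" whose dirname is the empty string.
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    with open(path, "w") as f:
        json.dump(skills, f, indent=2)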
generate_for_skill() pipelines candidates for one skill: generate,
solve n_samples times, and check the agreement window plus the
numeric-answer requirement. Stops at n_target keeps or max_candidates
attempts. Reuses the solver and answer-extraction infrastructure from
distinct_llm_prompting.py.
build_taxonomy_dataset() runs Phase 1 (with skills.json caching) then
iterates Phase 2 per skill, persisting keeps/skips/per_skill_stats
after each skill so a crash mid-run doesn't lose progress. CLI matches
the existing Stage 1 shape but hardcodes the model and accepts
--failed-solutions for symmetry (unused in v1).

Critical fix: the real solve_and_check_agreement in
distinct_llm_prompting.py unconditionally calls pool.submit(...), so
passing pool=None (the default) crashes on the first solve call.
Tests masked this because they monkeypatch the solve function. Now
build_taxonomy_dataset creates one ThreadPoolExecutor sized by
--max-workers and threads it through generate_for_skill down to the
real solve call.
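A condensed sketch of the fix (signatures heavily simplified; solve_once
stands in for the real solver call):

from concurrent.futures import ThreadPoolExecutor

def solve_once(candidate: str) -> str:
    return "42"  # stand-in for one real LLM solve call

def solve_and_check_agreement(candidate: str, n_samples: int,
                              pool: ThreadPoolExecutor) -> list[str]:
    # The real helper unconditionally calls pool.submit(...), so pool must
    # never be None by the time a solve happens.
    futures = [pool.submit(solve_once, candidate) for _ in range(n_samples)]
    return [f.result() for f in futures]

def build_taxonomy_dataset(candidates: list[str], n_samples: int = 20,
                           max_workers: int = 32) -> None:
    # One executor for the whole run, sized by --max-workers, threaded
    # down through generate_for_skill to every solve call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for cand in candidates:
            solve_and_check_agreement(cand, n_samples, pool)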

Also addresses the other final-review issues:
- build_taxonomy_dataset now calls _lazy_import_get_client() at its
  top so direct callers (not just main) don't hit TypeError on a
  None get_client.
- --max-workers is now actually used (was parsed and silently
  discarded) to size the solve pool.
- skip reasons split into 'too_easy' vs 'too_hard' so skips.json
  distinguishes the two; previously both collapsed to 'out_of_window'.
- _lazy_import_distinct_helpers split into narrower get_client /
  load_problem_from_txt variants so build_taxonomy_dataset doesn't
  eagerly pull in load_problem_from_txt (which only main() needs).
Previously generate_for_skill only printed a single summary line
'done: N/M passed' after all ~100 attempts finished, making each
skill look frozen for up to 30 minutes with no output. Now each
attempt prints one line showing: KEEP or skip reason, agreement
rate, extracted answer, and a running [kept/target] counter.
flush=True so the line appears immediately through tee pipes.
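Roughly what each per-attempt line looks like (format approximated from the
description above; exact wording in the real code may differ):

def log_attempt(verdict: str, agree: float, answer: str,
                kept: int, target: int) -> None:
    # flush=True so each line appears immediately even through tee pipes.
    print(f"[{kept}/{target}] {verdict} agree={agree:.2f} answer={answer!r}",
          flush=True)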
Removes the example_problem_hint field from the Skill dataclass,
DECOMPOSE_PROMPT, GENERATE_PROMPT, and all tests. Rationale: hints
in the decomposition prompt bias the generator toward a specific
problem framing, narrowing the diversity we're trying to produce.
name + description is enough context; the generator has more room
to find its own angle on the skill.

load_skills tolerates the legacy field so existing skills.json files
from the earlier run still load (new test pins that behavior).
…lation

Decomposition prompt now specifies *what makes a good decomposition*
instead of just listing requirements. Adds:
- Pairwise orthogonality check (could a student be strong at A but
  weak at B?)
- Anti-restatement: skills must be components, not the target
  restated at smaller scale
- Coverage of the single hardest insight: >=2 skills must build
  toward it explicitly
- Difficulty spread: skill 1 undergrad-level, skill N close to
  mastering the target; no flat middle
- Numerical-answer requirement pushed upstream so the decomposer
  factors it in rather than having downstream silently warp skills
  toward computational-only
- Self-audit pass before producing the JSON

Generation prompt now enforces:
- Isolation from the other 9 skills (full taxonomy shown in-context)
  with an UNISOLATABLE sentinel escape hatch when a skill can't be
  cleanly separated -- useful decomposition-quality signal
- No target leakage: no reuse of the target's setup, notation, or
  parameters
- Difficulty calibrated to position #i of N, not a flat baseline

Plumbing:
- generate_for_skill gets skill_index / n_skills / other_skill_names
- UNISOLATABLE candidates skip the solve pass (no compute spent on
  a sentinel) and land in skips.json with reason='unisolatable' and
  reason_detail from the model (sketched after this list)
- 20/20 tests pass, including new test_unisolatable_sentinel_skips_without_solving
  that pins the no-solve-on-sentinel behavior
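A minimal sketch of the no-solve-on-sentinel path (names assumed):

UNISOLATABLE = "UNISOLATABLE"

def handle_candidate(candidate_text: str, reason_detail: str, on_skip, solve):
    # A sentinel candidate is recorded as a skip without spending any
    # solver compute.
    if candidate_text.strip().startswith(UNISOLATABLE):
        on_skip({"reason": "unisolatable", "reason_detail": reason_detail})
        return None
    return solve(candidate_text)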
Previous prompt got the *shape* of the hard insight (skills
gesturing at excess intersection) but not its *content* (Chasles'
characteristic numbers on the space of complete conics). The model
also used 'difficulty spread' as license to pad with parametric
sweeps like 'tangent to 2 / 3 / 4 conics', which are all the same
Bezout skill applied at n=2,3,4.

Two prompt-level fixes:

1. Name-the-technique (load-bearing): for the 2-3 hardest skills
   (the ones covering the hard insight), the name and description
   must name a specific technique, theorem, or named object --
   'Chebotarev density', 'Veronese embedding', 'Chasles
   characteristic numbers', 'Lang-Weil'. Goal-only descriptions
   like 'compute the correction term' or 'handle the excess locus'
   are explicitly disallowed and flagged during self-audit.

2. Anti-padding: n=2/3/4 variants of the same technique collapse
   into ONE skill. Paired facts about the same object (codimension
   vs. degree of the same divisor) also collapse. Difficulty
   progression must come from genuinely different reasoning, not
   from incrementing a parameter.

Self-audit now runs 7 checks (was 5) covering both new rules.

Bug fix (critical, production-blocking): Vertex MaaS occasionally
returns a raw string instead of a ChatCompletion object. The
existing call_llm in distinct_llm_prompting.py guards against this
with a tenacity retry, but call_llm bakes in timeout=180s which
violates our no-timeouts constraint. New _robust_completion
helper does the same guards (raw-string detection, empty-choices,
empty-content) without any timeout, using hand-rolled jittered
exponential backoff up to 8 retries. Both call sites
(decompose_target, _generate_one_candidate) now route through it.
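The shape of the helper, assuming an OpenAI-compatible client (a sketch, not
the exact implementation):

import random
import time

def _robust_completion(client, model: str, messages: list, max_retries: int = 8):
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            if isinstance(resp, str):          # Vertex MaaS raw-string mode
                raise ValueError("raw string response")
            if not resp.choices:               # empty-choices guard
                raise ValueError("empty choices")
            content = resp.choices[0].message.content
            if not content:                    # empty-content guard
                raise ValueError("empty content")
            return content
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Jittered exponential backoff: 1s, 2s, 4s, ... plus noise;
            # no client-side timeout anywhere.
            time.sleep(2 ** attempt + random.uniform(0, 1))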

Prompt tightening from mid-run audit:
- Requirement 6 'No numbers in skill text': skill names and
  descriptions must not contain specific numerical answers,
  dimensions, degrees, or counts, with bad/good examples in the
  prompt. Mechanically checkable: reject any skill containing a
  digit (see the sketch after this list). Fixes the '2^5 vs 6^5'
  factual-error mode by forcing the model to describe the technique
  rather than state the value.
- Requirement 7 'No arithmetic-only skills': disallows
  postprocessing-only skills like 'compute floor(100*L)' or
  'subtract the correction from the naive count'. These are
  target-leakage dressed up as skills.
- Self-audit expanded from 7 to 9 checks covering both new rules.
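The digit check is simple enough to sketch in full (field names assumed):

import re

def violates_no_numbers_rule(skill: dict) -> bool:
    # Requirement 6: reject any skill whose name or description
    # contains a digit.
    text = skill.get("name", "") + " " + skill.get("description", "")
    return bool(re.search(r"\d", text))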

20/20 tests still pass (existing mocks happen to return valid
response objects so the new robust-completion path is exercised
transparently).

Pipeline changes:
- Subproblem generator now receives last 50 prior attempts (problem
  snippet + answer) so it actively avoids rewriting the same template;
  adds structural-diversity rule requiring varied problem shape.
- Decompose prompt gains a parametric-family rule so each skill admits
  multiple instantiations (degree d, dimension n, ...) rather than a
  single pinned instance.
- Incremental writes: keeps.json / skips.json flush after every
  candidate via on_keep / on_skip callbacks (sketched after this
  list) so the files grow in real time instead of only at end-of-skill.
- Vertex ADC token auto-refreshes on 401 inside both LLM call paths
  (_robust_completion, call_llm) so long-running jobs survive token
  expiry without a kill+restart.
- Added --start-from-skill N for mid-run resume after crashes.
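A minimal sketch of the on_keep / on_skip flush callbacks referenced above
(hypothetical helper; the real code threads these through generate_for_skill):

import json

def make_flushing_appender(path: str):
    records: list[dict] = []
    def append(record: dict) -> None:
        # Rewrite the whole file after every candidate so the JSON on
        # disk is always valid and current.
        records.append(record)
        with open(path, "w") as f:
            json.dump(records, f, indent=2)
    return append

on_keep = make_flushing_appender("keeps.json")
on_skip = make_flushing_appender("skips.json")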

New script:
- Stage1/dedupe_subproblems.py: BGE-large-en-v1.5 + cosine similarity
  with tunable threshold, global or per-skill modes.
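The core of the dedupe is embed-then-greedy-filter; a minimal sketch assuming
sentence-transformers is installed (the real module adds per-skill vs. global
modes and CLI plumbing):

import numpy as np
from sentence_transformers import SentenceTransformer

# Loaded once at module level and reused across calls.
_MODEL = SentenceTransformer("BAAI/bge-large-en-v1.5")

def greedy_dedupe(texts: list[str], threshold: float = 0.95) -> list[int]:
    # Return indices of texts kept after greedy cosine-similarity dedupe.
    emb = _MODEL.encode(texts, normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(texts)):
        # On unit vectors, cosine similarity is just a dot product.
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept):
            kept.append(i)
    return kept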

First dataset:
- datasets/conics-tangent-5/keeps_deduped_095.json: 86 subproblems
  across 10 skills, per-skill dedupe at cosine=0.95 (tuned for math-
  dense text where surface vocabulary overlap inflates similarity).
…generate + top-up

Implements the domain-agnostic pipeline in subproblem_generation_pipeline.md.
Adds four analysis steps before skill decomposition and a per-skill top-up
loop to reach a minimum keep count.

New prompts (spec §1-§5 verbatim where feasible):
- FEATURES_PROMPT          step 1: structural feature extraction -> q1..q7 JSON
- INTERPRETATIONS_PROMPT   step 2: >=3 readings of the problem, each with
                           answer_type / technique / justification / ignored_features
- DECOMPOSE_PROMPT         step 3: union skill set covering all interpretations
                           and structural features; skills carry serves_interpretations,
                           addresses_features, role
- CRITIC_PROMPT            step 4: adversarial coverage critic
- GENERATE_PROMPT          step 5 spec verbatim + ADDITIONAL CONSTRAINTS appendix
                           preserving no-answer-leakage / unambiguous-answer /
                           per-skill prior-attempts memory / structural diversity /
                           target-leakage guard / strict output format

New functions:
- extract_features           step 1 LLM call + parse
- enumerate_interpretations  step 2 LLM call + parse
- run_critic                 step 4 LLM call + parse (fails open on malformed)
- decompose_with_critic      step 3 + step 4 with revision loop (cap 2)
- _interpretations_for_skill / _features_for_skill
                             render per-skill context for the generate prompt

Schema:
- Skill dataclass gains optional serves_interpretations, addresses_features, role
- Legacy skills.json (name + description only) still loads
- Keep records include generator_metadata (answer/solution_sketch/why_relevant/
  failure_mode) as provenance; solver consensus remains authoritative
- <metadata>JSON</metadata> block parsed alongside existing <problem>...</problem>

Persisted artifacts in runs/<subdir>/:
- features.json, interpretations.json, skills.json, critic.json
- All cached; pipeline skips LLM calls on rerun when present

Top-up:
- --min-per-skill N: after the main loop, any skill below N is retried with
  full per-skill prior-attempts memory seeded in. Stall guard skips a skill
  that shows zero progress across two consecutive rounds.
- --max-topup-rounds N: outer cap on retry rounds (default 6).
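A simplified sketch of the top-up loop with the stall guard (generate_round
stands in for one per-skill generation pass; the real loop also seeds
prior-attempt memory):

def top_up(skills, counts, generate_round, min_per_skill=10, max_rounds=6):
    no_progress = {s: 0 for s in skills}
    for _ in range(max_rounds):
        todo = [s for s in skills
                if counts[s] < min_per_skill and no_progress[s] < 2]
        if not todo:
            break
        for skill in todo:
            gained = generate_round(skill)  # new keeps this round
            counts[skill] += gained
            # Stall guard: zero progress across two consecutive rounds
            # marks the skill stalled and it is skipped thereafter.
            no_progress[skill] = 0 if gained else no_progress[skill] + 1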

Other:
- Spec doc committed at subproblem_generation_pipeline.md
- build_taxonomy_dataset caches pipeline head; reruns skip steps 1-4 when
  outputs already exist on disk.

Brings in the dedupe PR and two follow-ups (removed timeout, added
experiment) from Darrow8/main. Our branch keeps its six-step pipeline
(features -> interpretations -> decompose -> critic -> generate +
top-up) and its BGE-embedding dedupe (Stage1/dedupe_subproblems.py),
which now coexists with main's char-5-gram-Jaccard dedupe at
pipeline_stages/dedupe.py.

Conflict: Stage1/distinct_llm_prompting.py call_llm._call()
  - main removed the `timeout` parameter
  - our side wrapped the request in try/except for ADC token refresh
  on 401
Resolved: take main's simpler no-timeout call, re-apply our token
refresh wrapper on top. _is_auth_error and _get_vertex_access_token
survive on both sides.

Previously the top-up loop counted RAW keeps, so "--min-per-skill 10" with
dedupe done as a separate post-processing step would routinely end with
fewer than 10 deduped keeps for skills whose raw output had heavy
near-duplication. Users who care about the post-dedupe count had to
manually re-run with --start-from-skill.

New default: after each top-up round the pipeline runs BGE-large dedupe
(Stage1.dedupe_subproblems.{_embed,_greedy_dedupe}) per-skill and uses
the deduped count for the loop continuation + stall-detection logic.
The embedding model is loaded once at module level and reused across
rounds. keeps_deduped.json is flushed mid-loop so callers can watch
progress; a final dedupe write lands at the end of build_taxonomy_dataset.

Raw keeps continue to be written incrementally as before; dedupe is
additive. The existing external Stage1.dedupe_subproblems CLI is
unchanged and still works on arbitrary keeps.json files.

New CLI flags (all defaults preserve correctness):
- --dedupe-threshold 0.95  cosine threshold; tuned for math-dense text
                           where surface vocabulary overlap inflates similarity
- --dedupe-mode per-skill  (default) or global
- --dedupe-model BAAI/bge-large-en-v1.5
- --raw-topup              opt back into the previous raw-count behavior
                           (skips loading sentence-transformers)

Stall guard uses deduped counts too: if a skill's deduped count fails
to increase across two consecutive top-up rounds, it is marked stalled
and skipped regardless of max_topup_rounds. Reported in the final
summary with both raw and deduped numbers.
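The continuation check, reduced to its core (greedy_dedupe as sketched
earlier; the "problem" field name is assumed):

def deduped_count(keeps: list[dict], threshold: float = 0.95) -> int:
    texts = [k["problem"] for k in keeps]
    return len(greedy_dedupe(texts, threshold=threshold))

def needs_topup(keeps: list[dict], min_per_skill: int) -> bool:
    # Compare the DEDUPED keep count against --min-per-skill,
    # not the raw count.
    return deduped_count(keeps) < min_per_skill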

Verified with a monkeypatched smoke test that exercises the full
control flow (LLM + embedder both faked) and confirms:
- top-up fires when deduped count < min
- stall guard triggers when dedupe surfaces no new unique
- keeps_deduped.json written with correct n_in/n_out metadata