
Six step pipeline #3

Open

drewsungg wants to merge 17 commits into Darrow8:main from drewsungg:six-step-pipeline

Conversation

drewsungg commented Apr 23, 2026

Six-Step Taxonomy Pipeline for Subproblem Generation

Problem

The naive pipeline (problem → skills → subproblems → dedupe) fails on problems
with high-confidence wrong attractors — where a model's first pattern-match
leads to a plausible but incorrect interpretation. For example, a problem about
"conics tangent to 5 conics" pattern-matches to the classical count of 3264,
but the actual question (average over F_p-points of a finite étale cover)
requires Chebotarev density and harmonic sums. Decomposing under the wrong
interpretation produces subproblems that test irrelevant skills.

Solution

Insert structural analysis and interpretation enumeration before skill
decomposition, and an adversarial coverage critic after. The full pipeline:

  1. Structural feature extraction — analyze the problem statement itself
    (answer type, operations, aggregation, regime, named objects, deviations
    from the simplest phrasing, easily-overlooked details). Forces the model to
    attend to structure before committing to any technique.

  2. Interpretation enumeration — produce 3+ distinct readings of the
    problem, including the surface-level one. Each interpretation commits to a
    specific answer type, technique, and justification grounded in structural
    features. An ignored_features field enforces honesty.

  3. Skill decomposition — produce a union skill set covering all
    interpretations. Each skill is tagged with which interpretations it serves
    and which structural features it addresses. Skills must be atomic,
    load-bearing prerequisites (not vague topic labels).

  4. Adversarial coverage critic — clause-by-clause check that every part of
    the problem statement is covered by some skill. Includes a stripping test
    (remove covered clauses, check what remains) and an inverse stripping test
    (could you reconstruct the problem from just the skills?). Loops back to
    step 3 if gaps are found (capped at 2 revision rounds).

  5. Per-skill subproblem generation — generates candidates one skill at a
    time with agreement-window filtering (configurable, default 0.60–0.80;
    a minimal sketch follows this list).
    Each candidate is solved N times (default 20) to get solver consensus.
    Includes memory/diversity feedback: every prior attempt (kept or skipped)
    is fed back to the generator so it actively diversifies. Top-up rounds
    retry under-producing skills with full prior-attempt memory carried over.
    Top-up can be dedupe-aware (counts deduped keeps, not raw) so post-dedupe
    counts meet the minimum.

  6. Embedding-based dedupe — BGE-large cosine similarity (default threshold
    0.95). Supports per-skill mode (dedupe within each skill independently) or
    global mode (across all skills).
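A minimal sketch of the step-5 agreement-window check (hypothetical names; the
real logic lives in Stage1/taxonomy_generation.py). It also shows how the
too_easy / too_hard skip reasons fall out of the window:

from collections import Counter

def agreement_rate(answers: list[str]) -> float:
    # Fraction of solver samples agreeing on the modal answer.
    if not answers:
        return 0.0
    _, modal = Counter(answers).most_common(1)[0]
    return modal / len(answers)

def classify(answers: list[str], low: float = 0.60, high: float = 0.80) -> str:
    # Keep a candidate only when solver consensus falls inside the window.
    rate = agreement_rate(answers)
    if rate < low:
        return "too_hard"   # solvers rarely agree: ambiguous or too difficult
    if rate > high:
        return "too_easy"   # near-unanimous: trivial for the solver model
    return "keep"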

Key design decisions

  • No answer access required. The pipeline never needs ground truth on the
    parent problem. Validation comes from multiple interpretations, coverage
    checking, and solver agreement filtering.

  • All outputs cached. Steps 1-4 produce features.json, interpretations.json,
    skills.json, and critic.json. These are reused on subsequent runs for the
    same target problem, so only step 5 (the expensive part) re-runs.

  • Incremental writes. keeps.json and skips.json are flushed after every
    candidate, so a crash mid-run doesn't lose progress.

  • Domain-agnostic. None of the prompts mention domain-specific vocabulary.
    They refer only to structural features, interpretations, skills, and
    coverage — concepts that apply to math, physics, biology, or any domain
    with well-defined problems.

  • Model hardcoded to openai/gpt-oss-120b-maas for all calls (generation,
    solving, structural analysis). No max_tokens, no client-side timeouts.

Files changed

  • Stage1/taxonomy_generation.py — main pipeline implementation (1778 lines)
  • Stage1/dedupe_subproblems.py — standalone dedupe module (also callable as
    library from the main pipeline)
  • subproblem_generation_pipeline.md — full pipeline spec with all prompts
  • docs/specs/2026-04-21-taxonomy-generation-design.md — design spec
  • docs/plans/2026-04-21-taxonomy-generation.md — implementation plan
  • tests/test_taxonomy_generation.py — unit tests for decomposition, skill
    serialization, and model constant

CLI usage

python -u Stage1/taxonomy_generation.py \
    --problem-path data/target-problems/conics.txt \
    --runs-subdir conics-tangent-5-v2 \
    --n-skills 10 --problems-per-skill 15 --max-candidates-per-skill 200 \
    --n-samples 20 --agree-low 0.60 --agree-high 0.80 \
    --gen-workers 4 --max-workers 32 \
    --min-per-skill 10 --max-topup-rounds 2 \
    --dedupe-threshold 0.95 --dedupe-mode per-skill

Output structure

runs/<subdir>/
  features.json          # Step 1: structural features (cached)
  interpretations.json   # Step 2: 3+ interpretations (cached)
  skills.json            # Step 3: union skill set (cached)
  critic.json            # Step 4: coverage critique (cached)
  stage1_taxonomy/<timestamp>/
    keeps.json           # Accepted subproblems with solver traces
    skips.json           # Rejected candidates with reasons
    per_skill_stats.json # Per-skill generation stats
    keeps_deduped.json   # Final deduplicated output

Skeleton only: module header, non-negotiable constants, Skill
dataclass. Decomposition and per-skill generation added in subsequent
tasks. Design: docs/superpowers/specs/2026-04-21-taxonomy-generation-design.md

decompose_target() makes one LLM call, parses the JSON skills list,
retries on parse failure, and returns Skill objects. save_skills /
load_skills persist the decomposition with target hash and model
metadata for reproducibility.
- save_skills: guard os.makedirs against bare-filename paths where
  dirname is '', which would raise FileNotFoundError on makedirs('')
  (sketched after this list).
- decompose_target: validate each skill entry has non-empty string
  fields for name/description/example_problem_hint before returning,
  and add TypeError to the caught retry exceptions so a malformed
  entry triggers a retry instead of slipping through with None
  values.
- hashlib: move import to module top (was deferred inside
  target_text_hash for no reason).
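A minimal sketch of the save_skills guard from the first bullet above (field
handling elided):

import json
import os

def save_skills(skills: list[dict], path: str) -> None:
    # os.makedirs("") raises FileNotFoundError, so guard against bare
    # filenames like "skills.json" whose dirname is the empty string.
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    with open(path, "w") as f:
        json.dump(skills, f, indent=2)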
generate_for_skill() pipelines candidates for one skill: generate,
solve n_samples times, and check the agreement window plus the
numeric-answer requirement. Stops at n_target keeps or max_candidates
attempts. Reuses the solver and answer-extraction infrastructure from
distinct_llm_prompting.py.
build_taxonomy_dataset() runs Phase 1 (with skills.json caching) then
iterates Phase 2 per skill, persisting keeps/skips/per_skill_stats
after each skill so a crash mid-run doesn't lose progress. CLI matches
the existing Stage 1 shape but hardcodes the model and accepts
--failed-solutions for symmetry (unused in v1).

Critical fix: the real solve_and_check_agreement in
distinct_llm_prompting.py unconditionally calls pool.submit(...), so
passing pool=None (the default) crashes on the first solve call.
Tests masked this because they monkeypatch the solve function. Now
build_taxonomy_dataset creates one ThreadPoolExecutor sized by
--max-workers and threads it through generate_for_skill down to the
real solve call.
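A condensed sketch of the fix (signatures heavily simplified; solve_once
stands in for the real solver call):

from concurrent.futures import ThreadPoolExecutor

def solve_once(candidate: str) -> str:
    return "42"  # stand-in for one real LLM solve call

def solve_and_check_agreement(candidate: str, n_samples: int,
                              pool: ThreadPoolExecutor) -> list[str]:
    # The real helper unconditionally calls pool.submit(...), so pool must
    # never be None by the time a solve happens.
    futures = [pool.submit(solve_once, candidate) for _ in range(n_samples)]
    return [f.result() for f in futures]

def build_taxonomy_dataset(candidates: list[str], n_samples: int = 20,
                           max_workers: int = 32) -> None:
    # One executor for the whole run, sized by --max-workers, threaded
    # down through generate_for_skill to every solve call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for cand in candidates:
            solve_and_check_agreement(cand, n_samples, pool)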

Also addresses the other final-review issues:
- build_taxonomy_dataset now calls _lazy_import_get_client() at its
  top so direct callers (not just main) don't hit TypeError on a
  None get_client.
- --max-workers is now actually used (was parsed and silently
  discarded) to size the solve pool.
- skip reasons split into 'too_easy' vs 'too_hard' so skips.json
  distinguishes the two; previously both collapsed to 'out_of_window'.
- _lazy_import_distinct_helpers split into narrower get_client /
  load_problem_from_txt variants so build_taxonomy_dataset doesn't
  eagerly pull in load_problem_from_txt (which only main() needs).
Previously generate_for_skill only printed a single summary line
'done: N/M passed' after all ~100 attempts finished, making each
skill look frozen for up to 30 minutes with no output. Now each
attempt prints one line showing: KEEP or skip reason, agreement
rate, extracted answer, and a running [kept/target] counter.
flush=True so the line appears immediately through tee pipes.
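Roughly what each per-attempt line looks like (format approximated from the
description above; exact wording in the real code may differ):

def log_attempt(verdict: str, agree: float, answer: str,
                kept: int, target: int) -> None:
    # flush=True so each line appears immediately even through tee pipes.
    print(f"[{kept}/{target}] {verdict} agree={agree:.2f} answer={answer!r}",
          flush=True)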
Removes the example_problem_hint field from the Skill dataclass,
DECOMPOSE_PROMPT, GENERATE_PROMPT, and all tests. Rationale: hints
in the decomposition prompt bias the generator toward a specific
problem framing, narrowing the diversity we're trying to produce.
name + description is enough context; the generator has more room
to find its own angle on the skill.

load_skills tolerates the legacy field so existing skills.json files
from the earlier run still load (new test pins that behavior).
…lation

Decomposition prompt now specifies *what makes a good decomposition*
instead of just listing requirements. Adds:
- Pairwise orthogonality check (could a student be strong at A but
  weak at B?)
- Anti-restatement: skills must be components, not the target
  restated at smaller scale
- Coverage of the single hardest insight: >=2 skills must build
  toward it explicitly
- Difficulty spread: skill 1 undergrad-level, skill N close to
  mastering the target; no flat middle
- Numerical-answer requirement pushed upstream so the decomposer
  factors it in rather than having downstream silently warp skills
  toward computational-only
- Self-audit pass before producing the JSON

Generation prompt now enforces:
- Isolation from the other 9 skills (full taxonomy shown in-context)
  with an UNISOLATABLE sentinel escape hatch when a skill can't be
  cleanly separated -- useful decomposition-quality signal
- No target leakage: no reuse of the target's setup, notation, or
  parameters
- Difficulty calibrated to position #i of N, not a flat baseline

Plumbing:
- generate_for_skill gets skill_index / n_skills / other_skill_names
- UNISOLATABLE candidates skip the solve pass (no compute spent on
  a sentinel) and land in skips.json with reason='unisolatable' and
  reason_detail from the model (sketched after this list)
- 20/20 tests pass, including new test_unisolatable_sentinel_skips_without_solving
  that pins the no-solve-on-sentinel behavior
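A minimal sketch of the no-solve-on-sentinel path (names assumed):

UNISOLATABLE = "UNISOLATABLE"

def handle_candidate(candidate_text: str, reason_detail: str, on_skip, solve):
    # A sentinel candidate is recorded as a skip without spending any
    # solver compute.
    if candidate_text.strip().startswith(UNISOLATABLE):
        on_skip({"reason": "unisolatable", "reason_detail": reason_detail})
        return None
    return solve(candidate_text)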
Previous prompt got the *shape* of the hard insight (skills
gesturing at excess intersection) but not its *content* (Chasles'
characteristic numbers on the space of complete conics). The model
also used 'difficulty spread' as license to pad with parametric
sweeps like 'tangent to 2 / 3 / 4 conics', which are all the same
Bezout skill applied at n=2,3,4.

Two prompt-level fixes:

1. Name-the-technique (load-bearing): for the 2-3 hardest skills
   (the ones covering the hard insight), the name and description
   must name a specific technique, theorem, or named object --
   'Chebotarev density', 'Veronese embedding', 'Chasles
   characteristic numbers', 'Lang-Weil'. Goal-only descriptions
   like 'compute the correction term' or 'handle the excess locus'
   are explicitly disallowed and flagged during self-audit.

2. Anti-padding: n=2/3/4 variants of the same technique collapse
   into ONE skill. Paired facts about the same object (codimension
   vs. degree of the same divisor) also collapse. Difficulty
   progression must come from genuinely different reasoning, not
   from incrementing a parameter.

Self-audit now runs 7 checks (was 5) covering both new rules.

Bug fix (critical, production-blocking): Vertex MaaS occasionally
returns a raw string instead of a ChatCompletion object. The
existing call_llm in distinct_llm_prompting.py guards against this
with a tenacity retry, but call_llm bakes in timeout=180s which
violates our no-timeouts constraint. New _robust_completion
helper does the same guards (raw-string detection, empty-choices,
empty-content) without any timeout, using hand-rolled jittered
exponential backoff up to 8 retries. Both call sites
(decompose_target, _generate_one_candidate) now route through it.
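The shape of the helper, assuming an OpenAI-compatible client (a sketch, not
the exact implementation):

import random
import time

def _robust_completion(client, model: str, messages: list, max_retries: int = 8):
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            if isinstance(resp, str):          # Vertex MaaS raw-string mode
                raise ValueError("raw string response")
            if not resp.choices:               # empty-choices guard
                raise ValueError("empty choices")
            content = resp.choices[0].message.content
            if not content:                    # empty-content guard
                raise ValueError("empty content")
            return content
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Jittered exponential backoff: 1s, 2s, 4s, ... plus noise;
            # no client-side timeout anywhere.
            time.sleep(2 ** attempt + random.uniform(0, 1))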

Prompt tightening from mid-run audit:
- Requirement 6 'No numbers in skill text': skill names and
  descriptions must not contain specific numerical answers,
  dimensions, degrees, or counts, with bad/good examples in the
  prompt. Mechanically checkable: reject any skill containing a
  digit (see the sketch after this list). Fixes the '2^5 vs 6^5'
  factual-error mode by forcing the model to describe the technique
  rather than state the value.
- Requirement 7 'No arithmetic-only skills': disallows
  postprocessing-only skills like 'compute floor(100*L)' or
  'subtract the correction from the naive count'. These are
  target-leakage dressed up as skills.
- Self-audit expanded from 7 to 9 checks covering both new rules.
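The digit check is simple enough to sketch in full (field names assumed):

import re

def violates_no_numbers_rule(skill: dict) -> bool:
    # Requirement 6: reject any skill whose name or description
    # contains a digit.
    text = skill.get("name", "") + " " + skill.get("description", "")
    return bool(re.search(r"\d", text))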

20/20 tests still pass (existing mocks happen to return valid
response objects so the new robust-completion path is exercised
transparently).

Pipeline changes:
- Subproblem generator now receives last 50 prior attempts (problem
  snippet + answer) so it actively avoids rewriting the same template;
  adds structural-diversity rule requiring varied problem shape.
- Decompose prompt gains a parametric-family rule so each skill admits
  multiple instantiations (degree d, dimension n, ...) rather than a
  single pinned instance.
- Incremental writes: keeps.json / skips.json flush after every
  candidate via on_keep / on_skip callbacks (sketched after this
  list) so the files grow in real time instead of only at end-of-skill.
- Vertex ADC token auto-refreshes on 401 inside both LLM call paths
  (_robust_completion, call_llm) so long-running jobs survive token
  expiry without a kill+restart.
- Added --start-from-skill N for mid-run resume after crashes.
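A minimal sketch of the on_keep / on_skip flush callbacks referenced above
(hypothetical helper; the real code threads these through generate_for_skill):

import json

def make_flushing_appender(path: str):
    records: list[dict] = []
    def append(record: dict) -> None:
        # Rewrite the whole file after every candidate so the JSON on
        # disk is always valid and current.
        records.append(record)
        with open(path, "w") as f:
            json.dump(records, f, indent=2)
    return append

on_keep = make_flushing_appender("keeps.json")
on_skip = make_flushing_appender("skips.json")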

New script:
- Stage1/dedupe_subproblems.py: BGE-large-en-v1.5 + cosine similarity
  with tunable threshold, global or per-skill modes.
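The core of the dedupe is embed-then-greedy-filter; a minimal sketch assuming
sentence-transformers is installed (the real module adds per-skill vs. global
modes and CLI plumbing):

import numpy as np
from sentence_transformers import SentenceTransformer

# Loaded once at module level and reused across calls.
_MODEL = SentenceTransformer("BAAI/bge-large-en-v1.5")

def greedy_dedupe(texts: list[str], threshold: float = 0.95) -> list[int]:
    # Return indices of texts kept after greedy cosine-similarity dedupe.
    emb = _MODEL.encode(texts, normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(texts)):
        # On unit vectors, cosine similarity is just a dot product.
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept):
            kept.append(i)
    return kept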

First dataset:
- datasets/conics-tangent-5/keeps_deduped_095.json: 86 subproblems
  across 10 skills, per-skill dedupe at cosine=0.95 (tuned for math-
  dense text where surface vocabulary overlap inflates similarity).
…generate + top-up

Implements the domain-agnostic pipeline in subproblem_generation_pipeline.md.
Adds four analysis steps before skill decomposition and a per-skill top-up
loop to reach a minimum keep count.

New prompts (spec §1-§5 verbatim where feasible):
- FEATURES_PROMPT          step 1: structural feature extraction -> q1..q7 JSON
- INTERPRETATIONS_PROMPT   step 2: >=3 readings of the problem, each with
                           answer_type / technique / justification / ignored_features
- DECOMPOSE_PROMPT         step 3: union skill set covering all interpretations
                           and structural features; skills carry serves_interpretations,
                           addresses_features, role
- CRITIC_PROMPT            step 4: adversarial coverage critic
- GENERATE_PROMPT          step 5 spec verbatim + ADDITIONAL CONSTRAINTS appendix
                           preserving no-answer-leakage / unambiguous-answer /
                           per-skill prior-attempts memory / structural diversity /
                           target-leakage guard / strict output format

New functions:
- extract_features           step 1 LLM call + parse
- enumerate_interpretations  step 2 LLM call + parse
- run_critic                 step 4 LLM call + parse (fails open on malformed)
- decompose_with_critic      step 3 + step 4 with revision loop (cap 2)
- _interpretations_for_skill / _features_for_skill
                             render per-skill context for the generate prompt

Schema:
- Skill dataclass gains optional serves_interpretations, addresses_features, role
- Legacy skills.json (name + description only) still loads
- Keep records include generator_metadata (answer/solution_sketch/why_relevant/
  failure_mode) as provenance; solver consensus remains authoritative
- <metadata>JSON</metadata> block parsed alongside existing <problem>...</problem>

Persisted artifacts in runs/<subdir>/:
- features.json, interpretations.json, skills.json, critic.json
- All cached; pipeline skips LLM calls on rerun when present

Top-up:
- --min-per-skill N: after the main loop, any skill below N is retried with
  full per-skill prior-attempts memory seeded in. Stall guard skips a skill
  that shows zero progress across two consecutive rounds.
- --max-topup-rounds N: outer cap on retry rounds (default 6).
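A simplified sketch of the top-up loop with the stall guard (generate_round
stands in for one per-skill generation pass; the real loop also seeds
prior-attempt memory):

def top_up(skills, counts, generate_round, min_per_skill=10, max_rounds=6):
    no_progress = {s: 0 for s in skills}
    for _ in range(max_rounds):
        todo = [s for s in skills
                if counts[s] < min_per_skill and no_progress[s] < 2]
        if not todo:
            break
        for skill in todo:
            gained = generate_round(skill)  # new keeps this round
            counts[skill] += gained
            # Stall guard: zero progress across two consecutive rounds
            # marks the skill stalled and it is skipped thereafter.
            no_progress[skill] = 0 if gained else no_progress[skill] + 1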

Other:
- Spec doc committed at subproblem_generation_pipeline.md
- build_taxonomy_dataset caches pipeline head; reruns skip steps 1-4 when
  outputs already exist on disk.

Brings in the dedupe PR and two follow-ups (removed timeout, added
experiment) from Darrow8/main. Our branch keeps its six-step pipeline
(features -> interpretations -> decompose -> critic -> generate +
top-up) and its BGE-embedding dedupe (Stage1/dedupe_subproblems.py),
which now coexists with main's char-5-gram-Jaccard dedupe at
pipeline_stages/dedupe.py.

Conflict: Stage1/distinct_llm_prompting.py call_llm._call()
  - main removed the `timeout` parameter
  - our side wrapped the request in try/except for ADC token refresh
  on 401
Resolved: take main's simpler no-timeout call, re-apply our token
refresh wrapper on top. _is_auth_error and _get_vertex_access_token
survive on both sides.

Previously the top-up loop counted RAW keeps, so "--min-per-skill 10" with
dedupe done as a separate post-processing step would routinely end with
fewer than 10 deduped keeps for skills whose raw output had heavy
near-duplication. Users who care about the post-dedupe count had to
manually re-run with --start-from-skill.

New default: after each top-up round the pipeline runs BGE-large dedupe
(Stage1.dedupe_subproblems.{_embed,_greedy_dedupe}) per-skill and uses
the deduped count for the loop continuation + stall-detection logic.
The embedding model is loaded once at module level and reused across
rounds. keeps_deduped.json is flushed mid-loop so callers can watch
progress; a final dedupe write lands at the end of build_taxonomy_dataset.

Raw keeps continue to be written incrementally as before; dedupe is
additive. The existing external Stage1.dedupe_subproblems CLI is
unchanged and still works on arbitrary keeps.json files.

New CLI flags (all defaults preserve correctness):
- --dedupe-threshold 0.95  cosine threshold; tuned for math-dense text
                           where surface vocabulary overlap inflates similarity
- --dedupe-mode per-skill  (default) or global
- --dedupe-model BAAI/bge-large-en-v1.5
- --raw-topup              opt back into the previous raw-count behavior
                           (skips loading sentence-transformers)

Stall guard uses deduped counts too: if a skill's deduped count fails
to increase across two consecutive top-up rounds, it is marked stalled
and skipped regardless of max_topup_rounds. Reported in the final
summary with both raw and deduped numbers.
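The continuation check, reduced to its core (greedy_dedupe as sketched
earlier; the "problem" field name is assumed):

def deduped_count(keeps: list[dict], threshold: float = 0.95) -> int:
    texts = [k["problem"] for k in keeps]
    return len(greedy_dedupe(texts, threshold=threshold))

def needs_topup(keeps: list[dict], min_per_skill: int) -> bool:
    # Compare the DEDUPED keep count against --min-per-skill,
    # not the raw count.
    return deduped_count(keeps) < min_per_skill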

Verified with a monkeypatched smoke test that exercises the full
control flow (LLM + embedder both faked) and confirms:
- top-up fires when deduped count < min
- stall guard triggers when dedupe surfaces no new unique
- keeps_deduped.json written with correct n_in/n_out metadata