Taxonomy gen #2

Open
drewsungg wants to merge 14 commits into Darrow8:main from drewsungg:taxonomy-gen

Conversation

@drewsungg (Collaborator)

Summary

Reworks the taxonomy-gen subproblem pipeline to produce usable GRPO training data and commits the first curated dataset for conics-tangent-5.

The pre-existing pipeline hit three dead ends in practice:

  1. Template lock-in — the generator kept rewriting the same problem with only the numerical parameter changed, so the agreement-band filter rejected 100% of attempts on easy skills (e.g. "dimension of degree-$d$ plane curves" always landed at agreement=1.00 → too_easy).
  2. Long-run auth expiry — the in-memory Vertex ADC token baked into the OpenAI client would expire mid-run, forcing a kill + manual re-auth every ~hour and losing skill-level progress.
  3. No dedupe — even when keeps accumulated, the filter couldn't catch near-duplicate subproblems that would give bad GRPO signal.

This PR addresses all three and ships a first usable dataset.

What changed

Generator diversity (Stage1/taxonomy_generation.py)

  • Memory trick (§3.2 of arXiv:2409.04109). Each generate_for_skill call now threads the last 50 prior attempts (truncated problem text + final answer) back into GENERATE_PROMPT. The generator is explicitly told to produce non-paraphrases with different numerical answers. (A sketch follows this list.)
  • Structural-diversity rule. New load-bearing clause in GENERATE_PROMPT: a problem that reads as a find-and-replace of a prior attempt (same template, different number) is a failure. The prompt enumerates concrete axes of variation (object, algebraic incarnation, question direction, ambient setting).
  • Parametric-family rule in DECOMPOSE_PROMPT. Rule 6b rejects skill descriptions pinned to a single instance — every description must name a free parameter ($d$, $n$, $k$) or point to a technique that naturally applies to many configurations. Self-audit step (f2) checks "could 10+ non-identical subproblems exist for this skill?".
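A minimal sketch of the memory mechanism; GENERATE_PROMPT and _PRIOR_ATTEMPTS_MEMORY are names from this PR, while the helper and the prompt field below are illustrative:

```python
# Sketch only: format_prior_attempts and the prior_attempts prompt field are
# illustrative; GENERATE_PROMPT and _PRIOR_ATTEMPTS_MEMORY are from the PR.
_PRIOR_ATTEMPTS_MEMORY = 50

def format_prior_attempts(attempts: list[dict], max_chars: int = 250) -> str:
    """Render the last N attempts (truncated problem text + final answer)
    as a block the generator is told not to paraphrase."""
    lines = []
    for a in attempts[-_PRIOR_ATTEMPTS_MEMORY:]:
        snippet = a["problem"][:max_chars].replace("\n", " ")
        lines.append(f"- {snippet} => answer: {a['answer']}")
    return "\n".join(lines)

# prompt = GENERATE_PROMPT.format(skill=skill.description,
#                                 prior_attempts=format_prior_attempts(history))
```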

Real-time observability

  • on_keep / on_skip callbacks flush keeps.json and skips.json atomically after every accepted/rejected candidate, not just at end-of-skill, so you can jq .n_problems at any time during the run (see the sketch after this list).
  • per_skill_stats.json refreshes on the same cadence.
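
The atomic flush is presumably the standard write-temp-then-rename pattern; a minimal sketch (atomic_write_json is an illustrative name, not necessarily the helper in the diff):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload) -> None:
    """Write to a temp file in the target directory, then rename over the
    destination, so a concurrent reader never sees a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp, path)  # atomic rename
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the rename is atomic, jq .n_problems keeps.json mid-run always reads a complete document.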

Resume + auth resilience

  • --start-from-skill N CLI flag skips the first $N{-}1$ skills — useful after crashes or auth expiries. The other_skill_names isolation list still uses the full taxonomy; only generation is skipped.
  • Vertex ADC auto-refresh. Both LLM call paths (_robust_completion in taxonomy_generation, call_llm in distinct_llm_prompting) now detect 401/AuthenticationError, call _get_vertex_access_token() to re-read ADC from disk, and mutate client.api_key in place before the next tenacity retry. A shared _is_auth_error(exc) helper classifies by exception-type name + message substring (sketched after this list).
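
A sketch consistent with the helpers named above; the surrounding control flow shown in the comment is an assumption (the real code hooks into the tenacity retry):

```python
def _is_auth_error(exc: Exception) -> bool:
    """Classify by exception-type name + message substring, so no
    SDK-specific exception classes need to be imported."""
    name = type(exc).__name__.lower()
    msg = str(exc).lower()
    return "authenticationerror" in name or "401" in msg or "unauthorized" in msg

# Inside each LLM call path, before letting the retry fire:
#     except Exception as exc:
#         if _is_auth_error(exc):
#             client.api_key = _get_vertex_access_token()  # re-read ADC from disk
#         raise
```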

Dedupe (Stage1/dedupe_subproblems.py, new)

  • BGE-large-en-v1.5 (2023, 335M params, 1024-dim) — chosen over the paper's MiniLM-L6-v2 because math-dense text with high surface-vocabulary overlap smears under smaller embedders.
  • Cosine similarity, greedy first-seen, configurable threshold (sketched after this list).
  • Two modes: --per-skill (dedupe inside each skill only; the right choice for skill-isolated training data) or global (the script's default).
  • Atomic writes, preserves all fields of the input keeps payload.
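
A minimal sketch of the greedy first-seen pass under these choices (the real script adds the CLI, per-skill grouping, and atomic writes):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe_indices(problems: list[str], threshold: float = 0.95) -> list[int]:
    """Greedy first-seen dedupe: keep a candidate only if its cosine
    similarity to every already-kept problem is below the threshold."""
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    emb = model.encode(problems, normalize_embeddings=True)  # unit vectors
    kept: list[int] = []
    for i in range(len(problems)):
        # on normalized embeddings, cosine similarity is a plain dot product
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept):
            kept.append(i)
    return kept
```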

First dataset: datasets/conics-tangent-5/keeps_deduped_095.json

86 subproblems across 10 skills, per-skill dedupe at cosine threshold 0.95.

Skill                                      Kept (of 15)  Notes
Dimension of plane curve linear system     5             Solver error floor — only borderline arithmetic slips land in-band
Degree of discriminant hypersurface        9
Divisor class of tangency                  10
Naive Bézout count                         1             Solver error floor — multiplication is mastered content
Fulton excess-intersection correction      13            Strongest signal; clean isolation
Chasles' characteristic numbers            10            Watch for target overlap on $c \ge 1$ problems
Veronese embedding → hyperplane            12            ⚠️ see known issues
Lang–Weil point-count estimate             9             ⚠️ see known issues
Chebotarev density                         6
Degree of étale map (punchline)            11

Threshold tuning. At 0.88 (the paper's default, but paired here with the stronger embedder): 23/123 survived — too aggressive; BGE on math-dense text collapses genuinely distinct problems that share LaTeX and domain boilerplate. At 0.92: 47/123. At 0.95: 86/123. 0.95 only catches near-identical rewrites (the same problem with a different number), which is the right semantic for training data.

Known issues

Both warrant a manual pass before training:

  • Skill 7 (Veronese), sample 1. Conflates "$Q$ passes through a point $P$" (legitimately linear under the Veronese embedding) with "$Q$ is tangent to a conic $C_i$" (actually a degree-6 divisor, class $6H$). The generated problem asks for "conics tangent to 5 smooth conics" via the Veronese translation and gets answer 1 at agreement=0.8 — but the correct answer is Chasles' 3264. The solver majority-voted on the false premise; the filter can't catch this class of error.
  • Skill 8 (Lang–Weil), off-by-one variant. The "smallest $M$ in $|N(p) - p| \le M\sqrt{p}$ for smooth degree-7 curve" problem lands at answer 31 with agreement=0.9, but the Weil bound is $M = 2g = 30$ for $g = 15$. Likely a systematic solver error from including the Lang–Weil error term without projective-vs-affine disambiguation. (Genus arithmetic after this list.)
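
For reference, the genus arithmetic behind the claimed $M = 30$: for a smooth plane curve of degree $d = 7$, the Weil bound on the projective point count gives

$$g = \frac{(d-1)(d-2)}{2} = \frac{6 \cdot 5}{2} = 15, \qquad |N(p) - (p+1)| \le 2g\sqrt{p} = 30\sqrt{p}.$$

Counting affine rather than projective points shifts $N(p)$ by the points at infinity, which is consistent with the projective-vs-affine ambiguity flagged above.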

Hand-inspect skill 7's 12 keeps and skill 8's 9 keeps before feeding to GRPO. 15–20 minutes of manual review will catch the canaries.

Production notes

  • Three separate pipeline runs produced these keeps (skills 1–2 → skills 3 → skills 4–10) because of auth expiries before the auto-refresh patch. The raw per-run keeps files live under the symlinked runs/ dir (not in this repo); the merged + deduped output is committed to datasets/.
  • Token auto-refresh survived a full 3h24m run without intervention after the patch.
  • _PRIOR_ATTEMPTS_MEMORY = 50 was tuned after observing 12-entry cycling; at 50 entries × ~250 tokens the memory block is ~12K tokens, comfortably within context.

Test plan

  • Manually review skill 7 (Veronese) keeps for the tangency-as-linear conflation
  • Manually review skill 8 (Lang–Weil) keeps for the off-by-one M bound
  • Spot-check 5 random keeps per skill for answer leakage in the problem statement
  • Sanity-run GRPO on a 10-problem subset before committing to full 86
  • If dataset count needs to be larger, re-run skills 9 and 10 with a different random seed / temperature and re-merge

Commits

Skeleton only: module header, non-negotiable constants, Skill
dataclass. Decomposition and per-skill generation added in subsequent
tasks. Design: docs/superpowers/specs/2026-04-21-taxonomy-generation-design.md
decompose_target() makes one LLM call, parses the JSON skills list,
retries on parse failure, and returns Skill objects. save_skills /
load_skills persist the decomposition with target hash and model
metadata for reproducibility.
- save_skills: guard os.makedirs against bare-filename paths where
  dirname is '', which would raise FileNotFoundError on makedirs('')
  (sketched after this list).
- decompose_target: validate each skill entry has non-empty string
  fields for name/description/example_problem_hint before returning,
  and add TypeError to the caught retry exceptions so a malformed
  entry triggers a retry instead of slipping through with None
  values.
- hashlib: move import to module top (was deferred inside
  target_text_hash for no reason).
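
A sketch of the dirname guard from the first bullet above (body illustrative; per the surrounding notes, the real save_skills also records the target hash and model metadata):

```python
import json
import os

def save_skills(path: str, payload: dict) -> None:
    d = os.path.dirname(path)
    if d:  # bare filenames have dirname '' and os.makedirs('') raises
        os.makedirs(d, exist_ok=True)
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```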
generate_for_skill() pipelines candidates for one skill: generate,
solve n_samples times, check agreement window + numeric. Stops at
n_target keeps or max_candidates attempts. Reuses the solver and
answer-extraction infrastructure from distinct_llm_prompting.py.
build_taxonomy_dataset() runs Phase 1 (with skills.json caching) then
iterates Phase 2 per skill, persisting keeps/skips/per_skill_stats
after each skill so a crash mid-run doesn't lose progress. CLI matches
the existing Stage 1 shape but hardcodes the model and accepts
--failed-solutions for symmetry (unused in v1).
Critical fix: the real solve_and_check_agreement in
distinct_llm_prompting.py unconditionally calls pool.submit(...), so
passing pool=None (the default) crashes on the first solve call.
Tests masked this because they monkeypatch the solve function. Now
build_taxonomy_dataset creates one ThreadPoolExecutor sized by
--max-workers and threads it through generate_for_skill down to the
real solve call.
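
The shape of the fix, as a sketch (signatures illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def build_taxonomy_dataset(skills, max_workers: int) -> None:
    # one shared solve pool for the whole run, sized by --max-workers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for skill in skills:
            # pool is threaded all the way down to solve_and_check_agreement,
            # which unconditionally calls pool.submit(...)
            generate_for_skill(skill, pool=pool)
```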

Also addresses the other final-review issues:
- build_taxonomy_dataset now calls _lazy_import_get_client() at its
  top so direct callers (not just main) don't hit TypeError on a
  None get_client.
- --max-workers is now actually used (was parsed and silently
  discarded) to size the solve pool.
- skip reasons split into 'too_easy' vs 'too_hard' so skips.json
  distinguishes the two; previously both collapsed to 'out_of_window'.
- _lazy_import_distinct_helpers split into narrower get_client /
  load_problem_from_txt variants so build_taxonomy_dataset doesn't
  eagerly pull in load_problem_from_txt (which only main() needs).
Previously generate_for_skill only printed a single summary line
'done: N/M passed' after all ~100 attempts finished, making each
skill look frozen for up to 30 minutes with no output. Now each
attempt prints one line showing: KEEP or skip reason, agreement
rate, extracted answer, and a running [kept/target] counter.
flush=True so the line appears immediately through tee pipes.
Removes the example_problem_hint field from the Skill dataclass,
DECOMPOSE_PROMPT, GENERATE_PROMPT, and all tests. Rationale: hints
in the decomposition prompt bias the generator toward a specific
problem framing, narrowing the diversity we're trying to produce.
name + description is enough context; the generator has more room
to find its own angle on the skill.

load_skills tolerates the legacy field so existing skills.json files
from the earlier run still load (new test pins that behavior).
…lation

Decomposition prompt now specifies *what makes a good decomposition*
instead of just listing requirements. Adds:
- Pairwise orthogonality check (could a student be strong at A but
  weak at B?)
- Anti-restatement: skills must be components, not the target
  restated at smaller scale
- Coverage of the single hardest insight: >=2 skills must build
  toward it explicitly
- Difficulty spread: skill 1 undergrad-level, skill N close to
  mastering the target; no flat middle
- Numerical-answer requirement pushed upstream so the decomposer
  factors it in rather than having downstream silently warp skills
  toward computational-only
- Self-audit pass before producing the JSON

Generation prompt now enforces:
- Isolation from the other 9 skills (full taxonomy shown in-context)
  with an UNISOLATABLE sentinel escape hatch when a skill can't be
  cleanly separated -- useful decomposition-quality signal
- No target leakage: no reuse of the target's setup, notation, or
  parameters
- Difficulty calibrated to position #i of N, not a flat baseline

Plumbing:
- generate_for_skill gets skill_index / n_skills / other_skill_names
- UNISOLATABLE candidates skip the solve pass (no compute spent on
  a sentinel), land in skips.json with reason='unisolatable' and
  reason_detail from the model
- 20/20 tests pass, including new test_unisolatable_sentinel_skips_without_solving
  that pins the no-solve-on-sentinel behavior
Previous prompt got the *shape* of the hard insight (skills
gesturing at excess intersection) but not its *content* (Chasles'
characteristic numbers on the space of complete conics). The model
also used 'difficulty spread' as license to pad with parametric
sweeps like 'tangent to 2 / 3 / 4 conics', which are all the same
Bézout skill applied at n=2,3,4.

Two prompt-level fixes:

1. Name-the-technique (load-bearing): for the 2-3 hardest skills
   (the ones covering the hard insight), the name and description
   must name a specific technique, theorem, or named object --
   'Chebotarev density', 'Veronese embedding', 'Chasles
   characteristic numbers', 'Lang-Weil'. Goal-only descriptions
   like 'compute the correction term' or 'handle the excess locus'
   are explicitly disallowed and flagged during self-audit.

2. Anti-padding: n=2/3/4 variants of the same technique collapse
   into ONE skill. Paired facts about the same object (codimension
   vs. degree of the same divisor) also collapse. Difficulty
   progression must come from genuinely different reasoning, not
   from incrementing a parameter.

Self-audit now runs 7 checks (was 5) covering both new rules.
Bug fix (critical, production-blocking): Vertex MaaS occasionally
returns a raw string instead of a ChatCompletion object. The
existing call_llm in distinct_llm_prompting.py guards against this
with a tenacity retry, but call_llm bakes in timeout=180s which
violates our no-timeouts constraint. New _robust_completion
helper does the same guards (raw-string detection, empty-choices,
empty-content) without any timeout, using hand-rolled jittered
exponential backoff up to 8 retries. Both call sites
(decompose_target, _generate_one_candidate) now route through it.
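
A sketch of the retry shape described above (the guards are the three listed; the exact backoff constants are assumptions):

```python
import random
import time

def _robust_completion(client, max_retries: int = 8, **kwargs):
    """No-timeout completion with raw-string / empty-choices / empty-content
    guards and hand-rolled jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(**kwargs)  # note: no timeout
            if isinstance(resp, str):
                raise ValueError("Vertex MaaS returned a raw string")
            if not resp.choices or not resp.choices[0].message.content:
                raise ValueError("empty choices or empty content")
            return resp
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(min(60, 2 ** attempt) * (0.5 + random.random()))  # jitter
```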

Prompt tightening from mid-run audit:
- Requirement 6 'No numbers in skill text': skill names and
  descriptions must not contain specific numerical answers,
  dimensions, degrees, or counts. With bad/good examples in the
  prompt. Mechanically checkable: reject any skill containing a
  digit. Fixes the '2^5 vs 6^5' factual-error mode by forcing
  the model to describe the technique rather than state the
  value. (The digit check is sketched after this list.)
- Requirement 7 'No arithmetic-only skills': disallows
  postprocessing-only skills like 'compute floor(100*L)' or
  'subtract the correction from the naive count'. These are
  target-leakage dressed up as skills.
- Self-audit expanded from 7 to 9 checks covering both new rules.
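
Requirement 6's mechanical check is a one-liner; a sketch (helper name illustrative):

```python
import re

def skill_text_contains_digit(skill: dict) -> bool:
    """Reject any skill whose name or description contains a digit."""
    return bool(re.search(r"\d", skill["name"] + " " + skill["description"]))
```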

20/20 tests still pass (existing mocks happen to return valid
response objects so the new robust-completion path is exercised
transparently).
Pipeline changes:
- Subproblem generator now receives last 50 prior attempts (problem
  snippet + answer) so it actively avoids rewriting the same template;
  adds structural-diversity rule requiring varied problem shape.
- Decompose prompt gains a parametric-family rule so each skill admits
  multiple instantiations (degree d, dimension n, ...) rather than a
  single pinned instance.
- Incremental writes: keeps.json / skips.json flush after every
  candidate via on_keep / on_skip callbacks so the files grow in real
  time instead of only at end-of-skill.
- Vertex ADC token auto-refreshes on 401 inside both LLM call paths
  (_robust_completion, call_llm) so long-running jobs survive token
  expiry without a kill+restart.
- Added --start-from-skill N for mid-run resume after crashes.

New script:
- Stage1/dedupe_subproblems.py: BGE-large-en-v1.5 + cosine similarity
  with tunable threshold, global or per-skill modes.

First dataset:
- datasets/conics-tangent-5/keeps_deduped_095.json: 86 subproblems
  across 10 skills, per-skill dedupe at cosine=0.95 (tuned for math-dense
  text where surface vocabulary overlap inflates similarity).
