Taxonomy gen #2

Open
drewsungg wants to merge 14 commits into Darrow8:main from drewsungg:taxonomy-gen

Conversation

@drewsungg (Collaborator)

Summary

Reworks the taxonomy-gen subproblem pipeline to produce usable GRPO training data and commits the first curated dataset for conics-tangent-5.

The pre-existing pipeline hit three dead ends in practice:

  1. Template lock-in — the generator kept rewriting the same problem with only the numerical parameter changed, so the agreement-band filter rejected 100% of attempts on easy skills (e.g. "dimension of degree-$d$ plane curves" always landed at agreement=1.00 → too_easy).
  2. Long-run auth expiry — the in-memory Vertex ADC token baked into the OpenAI client would expire mid-run, forcing a kill + manual re-auth every ~hour and losing skill-level progress.
  3. No dedupe — even when keeps accumulated, the filter couldn't catch near-duplicate subproblems that would give bad GRPO signal.

This PR addresses all three and ships a first usable dataset.

What changed

Generator diversity (Stage1/taxonomy_generation.py)

  • Memory trick (§3.2 of arXiv:2409.04109). Each generate_for_skill call now threads the last 50 prior attempts (truncated problem text + final answer) back into GENERATE_PROMPT. The generator is explicitly told to produce non-paraphrases with different numerical answers. (A sketch follows this list.)
  • Structural-diversity rule. New load-bearing clause in GENERATE_PROMPT: a problem that reads as a find-and-replace of a prior attempt (same template, different number) is a failure. The prompt enumerates concrete axes of variation (object, algebraic incarnation, question direction, ambient setting).
  • Parametric-family rule in DECOMPOSE_PROMPT. Rule 6b rejects skill descriptions pinned to a single instance — every description must name a free parameter ($d$, $n$, $k$) or point to a technique that naturally applies to many configurations. Self-audit step (f2) checks "could 10+ non-identical subproblems exist for this skill?".
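A minimal sketch of the memory mechanism; GENERATE_PROMPT and _PRIOR_ATTEMPTS_MEMORY are names from this PR, while the helper and the prompt field below are illustrative:

```python
# Sketch only: format_prior_attempts and the prior_attempts prompt field are
# illustrative; GENERATE_PROMPT and _PRIOR_ATTEMPTS_MEMORY are from the PR.
_PRIOR_ATTEMPTS_MEMORY = 50

def format_prior_attempts(attempts: list[dict], max_chars: int = 250) -> str:
    """Render the last N attempts (truncated problem text + final answer)
    as a block the generator is told not to paraphrase."""
    lines = []
    for a in attempts[-_PRIOR_ATTEMPTS_MEMORY:]:
        snippet = a["problem"][:max_chars].replace("\n", " ")
        lines.append(f"- {snippet} => answer: {a['answer']}")
    return "\n".join(lines)

# prompt = GENERATE_PROMPT.format(skill=skill.description,
#                                 prior_attempts=format_prior_attempts(history))
```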

Real-time observability

  • on_keep / on_skip callbacks flush keeps.json and skips.json atomically after every accepted/rejected candidate, not just at end-of-skill, so you can jq .n_problems at any time during the run (see the sketch after this list).
  • per_skill_stats.json refreshes on the same cadence.
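
The atomic flush is presumably the standard write-temp-then-rename pattern; a minimal sketch (atomic_write_json is an illustrative name, not necessarily the helper in the diff):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload) -> None:
    """Write to a temp file in the target directory, then rename over the
    destination, so a concurrent reader never sees a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp, path)  # atomic rename
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the rename is atomic, jq .n_problems keeps.json mid-run always reads a complete document.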

Resume + auth resilience

  • --start-from-skill N CLI flag skips the first $N{-}1$ skills — useful after crashes or auth expiries. The other_skill_names isolation list still uses the full taxonomy; only generation is skipped.
  • Vertex ADC auto-refresh. Both LLM call paths (_robust_completion in taxonomy_generation, call_llm in distinct_llm_prompting) now detect 401/AuthenticationError, call _get_vertex_access_token() to re-read ADC from disk, and mutate client.api_key in place before the next tenacity retry. A shared _is_auth_error(exc) helper classifies by exception-type name + message substring (sketched after this list).
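
A sketch consistent with the helpers named above; the surrounding control flow shown in the comment is an assumption (the real code hooks into the tenacity retry):

```python
def _is_auth_error(exc: Exception) -> bool:
    """Classify by exception-type name + message substring, so no
    SDK-specific exception classes need to be imported."""
    name = type(exc).__name__.lower()
    msg = str(exc).lower()
    return "authenticationerror" in name or "401" in msg or "unauthorized" in msg

# Inside each LLM call path, before letting the retry fire:
#     except Exception as exc:
#         if _is_auth_error(exc):
#             client.api_key = _get_vertex_access_token()  # re-read ADC from disk
#         raise
```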

Dedupe (Stage1/dedupe_subproblems.py, new)

  • BGE-large-en-v1.5 (2023, 335M params, 1024-dim) — chosen over the paper's MiniLM-L6-v2 because math-dense text with high surface-vocabulary overlap smears under smaller embedders.
  • Cosine similarity, greedy first-seen, configurable threshold (sketched after this list).
  • Two modes: --per-skill (dedupe inside each skill only; the right choice for skill-isolated training data) or global (the script's default).
  • Atomic writes, preserves all fields of the input keeps payload.
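
A minimal sketch of the greedy first-seen pass under these choices (the real script adds the CLI, per-skill grouping, and atomic writes):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe_indices(problems: list[str], threshold: float = 0.95) -> list[int]:
    """Greedy first-seen dedupe: keep a candidate only if its cosine
    similarity to every already-kept problem is below the threshold."""
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    emb = model.encode(problems, normalize_embeddings=True)  # unit vectors
    kept: list[int] = []
    for i in range(len(problems)):
        # on normalized embeddings, cosine similarity is a plain dot product
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept):
            kept.append(i)
    return kept
```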

First dataset: datasets/conics-tangent-5/keeps_deduped_095.json

86 subproblems across 10 skills, per-skill dedupe at cosine threshold 0.95.

Skill                                      Kept (of 15)  Notes
Dimension of plane curve linear system     5             Solver error floor — only borderline arithmetic slips land in-band
Degree of discriminant hypersurface        9
Divisor class of tangency                  10
Naive Bézout count                         1             Solver error floor — multiplication is mastered content
Fulton excess-intersection correction      13            Strongest signal; clean isolation
Chasles' characteristic numbers            10            Watch for target overlap on $c \ge 1$ problems
Veronese embedding → hyperplane            12            ⚠️ see known issues
Lang–Weil point-count estimate             9             ⚠️ see known issues
Chebotarev density                         6
Degree of étale map (punchline)            11

Threshold tuning. At 0.88 (the paper's default, but paired here with the stronger embedder): 23/123 survived — too aggressive; BGE on math-dense text collapses genuinely distinct problems that share LaTeX and domain boilerplate. At 0.92: 47/123. At 0.95: 86/123. 0.95 only catches near-identical rewrites (the same problem with a different number), which is the right semantic for training data.

Known issues

Both warrant a manual pass before training:

  • Skill 7 (Veronese), sample 1. Conflates "$Q$ passes through a point $P$" (legitimately linear under the Veronese embedding) with "$Q$ is tangent to a conic $C_i$" (actually a degree-6 divisor, class $6H$). The generated problem asks for "conics tangent to 5 smooth conics" via the Veronese translation and gets answer 1 at agreement=0.8 — but the correct answer is Chasles' 3264. The solver majority-voted on the false premise; the filter can't catch this class of error.
  • Skill 8 (Lang–Weil), off-by-one variant. The "smallest $M$ in $|N(p) - p| \le M\sqrt{p}$ for smooth degree-7 curve" problem lands at answer 31 with agreement=0.9, but the Weil bound is $M = 2g = 30$ for $g = 15$. Likely a systematic solver error from including the Lang–Weil error term without projective-vs-affine disambiguation. (Genus arithmetic after this list.)
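
For reference, the genus arithmetic behind the claimed $M = 30$: for a smooth plane curve of degree $d = 7$, the Weil bound on the projective point count gives

$$g = \frac{(d-1)(d-2)}{2} = \frac{6 \cdot 5}{2} = 15, \qquad |N(p) - (p+1)| \le 2g\sqrt{p} = 30\sqrt{p}.$$

Counting affine rather than projective points shifts $N(p)$ by the points at infinity, which is consistent with the projective-vs-affine ambiguity flagged above.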

Hand-inspect skill 7's 12 keeps and skill 8's 9 keeps before feeding to GRPO. 15–20 minutes of manual review will catch the canaries.

Production notes

  • Three separate pipeline runs produced these keeps (skills 1–2 → skills 3 → skills 4–10) because of auth expiries before the auto-refresh patch. The raw per-run keeps files live under the symlinked runs/ dir (not in this repo); the merged + deduped output is committed to datasets/.
  • Token auto-refresh survived a full 3h24m run without intervention after the patch.
  • _PRIOR_ATTEMPTS_MEMORY = 50 was tuned after observing 12-entry cycling; at 50 entries × ~250 tokens the memory block is ~12K tokens, comfortably within context.

Test plan

  • Manually review skill 7 (Veronese) keeps for the tangency-as-linear conflation
  • Manually review skill 8 (Lang–Weil) keeps for the off-by-one M bound
  • Spot-check 5 random keeps per skill for answer leakage in the problem statement
  • Sanity-run GRPO on a 10-problem subset before committing to full 86
  • If dataset count needs to be larger, re-run skills 9 and 10 with a different random seed / temperature and re-merge

Commits

Skeleton only: module header, non-negotiable constants, Skill
dataclass. Decomposition and per-skill generation added in subsequent
tasks. Design: docs/superpowers/specs/2026-04-21-taxonomy-generation-design.md
decompose_target() makes one LLM call, parses the JSON skills list,
retries on parse failure, and returns Skill objects. save_skills /
load_skills persist the decomposition with target hash and model
metadata for reproducibility.
- save_skills: guard os.makedirs against bare-filename paths where
  dirname is '', which would raise FileNotFoundError on makedirs('')
  (sketched after this list).
- decompose_target: validate each skill entry has non-empty string
  fields for name/description/example_problem_hint before returning,
  and add TypeError to the caught retry exceptions so a malformed
  entry triggers a retry instead of slipping through with None
  values.
- hashlib: move import to module top (was deferred inside
  target_text_hash for no reason).
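
A sketch of the dirname guard from the first bullet above (body illustrative; per the surrounding notes, the real save_skills also records the target hash and model metadata):

```python
import json
import os

def save_skills(path: str, payload: dict) -> None:
    d = os.path.dirname(path)
    if d:  # bare filenames have dirname '' and os.makedirs('') raises
        os.makedirs(d, exist_ok=True)
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```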
generate_for_skill() pipelines candidates for one skill: generate,
solve n_samples times, check agreement window + numeric. Stops at
n_target keeps or max_candidates attempts. Reuses the solver and
answer-extraction infrastructure from distinct_llm_prompting.py.
build_taxonomy_dataset() runs Phase 1 (with skills.json caching) then
iterates Phase 2 per skill, persisting keeps/skips/per_skill_stats
after each skill so a crash mid-run doesn't lose progress. CLI matches
the existing Stage 1 shape but hardcodes the model and accepts
--failed-solutions for symmetry (unused in v1).
Critical fix: the real solve_and_check_agreement in
distinct_llm_prompting.py unconditionally calls pool.submit(...), so
passing pool=None (the default) crashes on the first solve call.
Tests masked this because they monkeypatch the solve function. Now
build_taxonomy_dataset creates one ThreadPoolExecutor sized by
--max-workers and threads it through generate_for_skill down to the
real solve call.
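
The shape of the fix, as a sketch (signatures illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def build_taxonomy_dataset(skills, max_workers: int) -> None:
    # one shared solve pool for the whole run, sized by --max-workers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for skill in skills:
            # pool is threaded all the way down to solve_and_check_agreement,
            # which unconditionally calls pool.submit(...)
            generate_for_skill(skill, pool=pool)
```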

Also addresses the other final-review issues:
- build_taxonomy_dataset now calls _lazy_import_get_client() at its
  top so direct callers (not just main) don't hit TypeError on a
  None get_client.
- --max-workers is now actually used (was parsed and silently
  discarded) to size the solve pool.
- skip reasons split into 'too_easy' vs 'too_hard' so skips.json
  distinguishes the two; previously both collapsed to 'out_of_window'.
- _lazy_import_distinct_helpers split into narrower get_client /
  load_problem_from_txt variants so build_taxonomy_dataset doesn't
  eagerly pull in load_problem_from_txt (which only main() needs).
Previously generate_for_skill only printed a single summary line
'done: N/M passed' after all ~100 attempts finished, making each
skill look frozen for up to 30 minutes with no output. Now each
attempt prints one line showing: KEEP or skip reason, agreement
rate, extracted answer, and a running [kept/target] counter.
flush=True so the line appears immediately through tee pipes.
Removes the example_problem_hint field from the Skill dataclass,
DECOMPOSE_PROMPT, GENERATE_PROMPT, and all tests. Rationale: hints
in the decomposition prompt bias the generator toward a specific
problem framing, narrowing the diversity we're trying to produce.
name + description is enough context; the generator has more room
to find its own angle on the skill.

load_skills tolerates the legacy field so existing skills.json files
from the earlier run still load (new test pins that behavior).
…lation

Decomposition prompt now specifies *what makes a good decomposition*
instead of just listing requirements. Adds:
- Pairwise orthogonality check (could a student be strong at A but
  weak at B?)
- Anti-restatement: skills must be components, not the target
  restated at smaller scale
- Coverage of the single hardest insight: >=2 skills must build
  toward it explicitly
- Difficulty spread: skill 1 undergrad-level, skill N close to
  mastering the target; no flat middle
- Numerical-answer requirement pushed upstream so the decomposer
  factors it in rather than having downstream silently warp skills
  toward computational-only
- Self-audit pass before producing the JSON

Generation prompt now enforces:
- Isolation from the other 9 skills (full taxonomy shown in-context)
  with an UNISOLATABLE sentinel escape hatch when a skill can't be
  cleanly separated -- useful decomposition-quality signal
- No target leakage: no reuse of the target's setup, notation, or
  parameters
- Difficulty calibrated to position #i of N, not a flat baseline

Plumbing:
- generate_for_skill gets skill_index / n_skills / other_skill_names
- UNISOLATABLE candidates skip the solve pass (no compute spent on
  a sentinel), land in skips.json with reason='unisolatable' and
  reason_detail from the model
- 20/20 tests pass, including new test_unisolatable_sentinel_skips_without_solving
  that pins the no-solve-on-sentinel behavior
Previous prompt got the *shape* of the hard insight (skills
gesturing at excess intersection) but not its *content* (Chasles'
characteristic numbers on the space of complete conics). The model
also used 'difficulty spread' as license to pad with parametric
sweeps like 'tangent to 2 / 3 / 4 conics', which are all the same
Bézout skill applied at n=2,3,4.

Two prompt-level fixes:

1. Name-the-technique (load-bearing): for the 2-3 hardest skills
   (the ones covering the hard insight), the name and description
   must name a specific technique, theorem, or named object --
   'Chebotarev density', 'Veronese embedding', 'Chasles
   characteristic numbers', 'Lang-Weil'. Goal-only descriptions
   like 'compute the correction term' or 'handle the excess locus'
   are explicitly disallowed and flagged during self-audit.

2. Anti-padding: n=2/3/4 variants of the same technique collapse
   into ONE skill. Paired facts about the same object (codimension
   vs. degree of the same divisor) also collapse. Difficulty
   progression must come from genuinely different reasoning, not
   from incrementing a parameter.

Self-audit now runs 7 checks (was 5) covering both new rules.
Bug fix (critical, production-blocking): Vertex MaaS occasionally
returns a raw string instead of a ChatCompletion object. The
existing call_llm in distinct_llm_prompting.py guards against this
with a tenacity retry, but call_llm bakes in timeout=180s which
violates our no-timeouts constraint. New _robust_completion
helper does the same guards (raw-string detection, empty-choices,
empty-content) without any timeout, using hand-rolled jittered
exponential backoff up to 8 retries. Both call sites
(decompose_target, _generate_one_candidate) now route through it.
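
A sketch of the retry shape described above (the guards are the three listed; the exact backoff constants are assumptions):

```python
import random
import time

def _robust_completion(client, max_retries: int = 8, **kwargs):
    """No-timeout completion with raw-string / empty-choices / empty-content
    guards and hand-rolled jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(**kwargs)  # note: no timeout
            if isinstance(resp, str):
                raise ValueError("Vertex MaaS returned a raw string")
            if not resp.choices or not resp.choices[0].message.content:
                raise ValueError("empty choices or empty content")
            return resp
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(min(60, 2 ** attempt) * (0.5 + random.random()))  # jitter
```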

Prompt tightening from mid-run audit:
- Requirement 6 'No numbers in skill text': skill names and
  descriptions must not contain specific numerical answers,
  dimensions, degrees, or counts. With bad/good examples in the
  prompt. Mechanically checkable: reject any skill containing a
  digit. Fixes the '2^5 vs 6^5' factual-error mode by forcing
  the model to describe the technique rather than state the
  value. (The digit check is sketched after this list.)
- Requirement 7 'No arithmetic-only skills': disallows
  postprocessing-only skills like 'compute floor(100*L)' or
  'subtract the correction from the naive count'. These are
  target-leakage dressed up as skills.
- Self-audit expanded from 7 to 9 checks covering both new rules.
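
Requirement 6's mechanical check is a one-liner; a sketch (helper name illustrative):

```python
import re

def skill_text_contains_digit(skill: dict) -> bool:
    """Reject any skill whose name or description contains a digit."""
    return bool(re.search(r"\d", skill["name"] + " " + skill["description"]))
```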

20/20 tests still pass (existing mocks happen to return valid
response objects so the new robust-completion path is exercised
transparently).
Pipeline changes:
- Subproblem generator now receives last 50 prior attempts (problem
  snippet + answer) so it actively avoids rewriting the same template;
  adds structural-diversity rule requiring varied problem shape.
- Decompose prompt gains a parametric-family rule so each skill admits
  multiple instantiations (degree d, dimension n, ...) rather than a
  single pinned instance.
- Incremental writes: keeps.json / skips.json flush after every
  candidate via on_keep / on_skip callbacks so the files grow in real
  time instead of only at end-of-skill.
- Vertex ADC token auto-refreshes on 401 inside both LLM call paths
  (_robust_completion, call_llm) so long-running jobs survive token
  expiry without a kill+restart.
- Added --start-from-skill N for mid-run resume after crashes.

New script:
- Stage1/dedupe_subproblems.py: BGE-large-en-v1.5 + cosine similarity
  with tunable threshold, global or per-skill modes.

First dataset:
- datasets/conics-tangent-5/keeps_deduped_095.json: 86 subproblems
  across 10 skills, per-skill dedupe at cosine=0.95 (tuned for math-dense
  text where surface vocabulary overlap inflates similarity).
