Stage 2 (ADR 0004): repr-align data-collection contract layer #23
Draft
FluffyAIcode wants to merge 3 commits into
Make training/repr_align/__init__.py expose ReprAlignedSurgery and
SurgeryConfig via __getattr__ instead of an eager top-of-module
import. The eager import paid a torch + transformers import cost
on every submodule access, which:
- blocked importing torch-free siblings (incoming
training.repr_align.data_collection package)
- made unit-test collection segfault on this VM under
coverage instrumentation (torch's PyMethodDef + coverage
tracer disagree on Python 3.12 METH flags)
Lazy-import via PEP 562 keeps the public API stable
(training.repr_align.ReprAlignedSurgery still works) while letting
new torch-free subpackages be imported without paying the heavy
import cost.
Verified:
python3 -c "import training.repr_align.data_collection.schema; import sys; assert 'torch' not in sys.modules"
→ torch not loaded
python3 -c "import training.repr_align as ra; ra.ReprAlignedSurgery"
→ lazy attribute resolves correctly
pytest tests/training/repr_align/test_proposer_surgery.py
→ 48 passed (no regression)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Ship the contract layer for representation-alignment training data
collection. This is the foundation that downstream PRs (rollout
worker + per-domain configs) plug into.
Scope intentionally narrow:
- schema — single source of truth for the per-token row +
per-shard meta + pyarrow Schema. SCHEMA_VERSION = 1.
- prompt_pool — multi-domain composition with quotas, length
filter, language tagging, dedup. Pluggable
protocols (LanguageDetector, Deduper) with
dependency-free reference impls.
- parquet_writer — atomic versioned shard writer at
'data/alignment/<verifier_id>/<verifier_dtype>/
<schema_version>/shard_NNNNN.parquet' per ADR
0004 §2.5. Writes meta sidecar + parquet
atomically (tmp+rename) so readers never see
a half-written shard.
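The tmp+rename idea behind parquet_writer can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the PR's implementation: the helper names (shard_path, atomic_write_bytes, write_shard), the ".meta.json" sidecar name, the use of str(schema_version) as the directory name, and the order in which sidecar and shard are written are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def shard_path(root: Path, verifier_id: str, verifier_dtype: str,
               schema_version: int, index: int) -> Path:
    """Build a shard path following the layout quoted above.

    Assumption: the schema-version directory is just str(schema_version).
    """
    return (root / "data" / "alignment" / verifier_id / verifier_dtype
            / str(schema_version) / f"shard_{index:05d}.parquet")


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload so readers see either no file or the complete file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is created next to the target so that os.replace stays
    # on one filesystem; a same-filesystem rename is atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the leftover temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def write_shard(target: Path, parquet_bytes: bytes, meta: dict) -> None:
    """Write the meta sidecar and the shard, each via tmp+rename."""
    sidecar = target.with_suffix(".meta.json")  # hypothetical sidecar name
    atomic_write_bytes(sidecar, json.dumps(meta, sort_keys=True).encode("utf-8"))
    atomic_write_bytes(target, parquet_bytes)
```

Each file is individually atomic in this sketch; whether the real writer gives any stronger guarantee across the parquet/sidecar pair is not stated in the commit message.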
Out of scope (next PR in this work line):
- rollout_worker.py (loads real verifier, drives generation,
captures hidden states) — needs torch + transformers and is
where the heavy testing surface lives.
- configs/*.yaml (7 per-domain quota / source configs).
- post_filter.py (low-confidence + repetition + EOS-fail filters).
Quality bars (per project policy):
- 100% line coverage on the new module:
schema.py 115/115 lines
prompt_pool.py 155/155 lines
parquet_writer.py 129/129 lines
__init__.py 4/4 lines
- 97 unit tests, all real concrete classes (no mocks). Filter,
dedup and pool flows are tested with deterministic test
doubles that implement the public Protocol interfaces.
- No torch dependency: data_collection imports cleanly without
pulling in training.repr_align.proposer_surgery (relies on
the PEP 562 lazy-import shipped in the previous commit).
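The "deterministic test doubles that implement the public Protocol interfaces" approach can be illustrated roughly as follows. Everything here is a hypothetical sketch: the is_duplicate method name, the constructor parameters, and the class names ending in Sketch are assumptions made for illustration, and the real Deduper Protocol in prompt_pool.py may look different.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Deduper(Protocol):
    """Assumed interface: True means the text duplicates something seen earlier."""

    def is_duplicate(self, text: str) -> bool: ...


class ShingleJaccardSketch:
    """Near-duplicate check using Jaccard similarity over character shingles."""

    def __init__(self, shingle_size: int = 5, threshold: float = 0.8) -> None:
        self.shingle_size = shingle_size
        self.threshold = threshold
        self._seen: list[frozenset[str]] = []

    def _shingles(self, text: str) -> frozenset[str]:
        # Lowercase and collapse whitespace so trivial variants compare equal.
        normalized = " ".join(text.lower().split())
        k = self.shingle_size
        if len(normalized) <= k:
            return frozenset({normalized})
        return frozenset(normalized[i:i + k] for i in range(len(normalized) - k + 1))

    def is_duplicate(self, text: str) -> bool:
        candidate = self._shingles(text)
        for previous in self._seen:
            union = candidate | previous
            if union and len(candidate & previous) / len(union) >= self.threshold:
                return True
        # Only texts that were kept are remembered for later comparisons.
        self._seen.append(candidate)
        return False


class AlternatingRejectSketch:
    """Deterministic double: flags every second call, ignoring the content."""

    def __init__(self) -> None:
        self._calls = 0

    def is_duplicate(self, text: str) -> bool:
        self._calls += 1
        return self._calls % 2 == 0


def filter_prompts(prompts: list[str], deduper: Deduper) -> list[str]:
    """Keep the prompts the deduper does not flag, preserving order."""
    return [p for p in prompts if not deduper.is_duplicate(p)]
```

Because Protocols are matched structurally, neither class inherits from Deduper, and a test can swap the alternating double in wherever the pool expects a deduper without any mocking library.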
Deps:
- pyarrow>=15,<25 added to requirements.txt for the fixed-size-
list schema enforcement.
References:
- docs/adr/0004-alignment-training-data-preparation-policy.md
§2.1 (prompt pool composition), §2.2 (capture spec),
§2.5 (verifier-specific data isolation + path layout)
- training/repr_align/__init__.py (Stage roadmap)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
CI on PR #23 reported 99.68% coverage with 5 statements missing on
training/repr_align/__init__.py. The lazy __getattr__ + __dir__ hooks are
unreachable from the existing test_proposer_surgery.py because that file
imports proposer_surgery directly and never goes through the package-level
lazy attribute machinery.
Added 4 tests that exercise the contract:
- lazy attr resolves to ReprAlignedSurgery (and is the same object as a
direct import from proposer_surgery)
- lazy attr resolves to SurgeryConfig
- unknown attr raises AttributeError with the documented message
- dir() lists both public symbols and ordinary module attrs
Verified locally: tests/training/ runs 149 tests, all passing.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
First PR of the v0.3.x representation-alignment development line that opens
right after v0.3.0-rc1. Ships the data-collection contract layer: the schema,
prompt pool, and atomic versioned-Parquet writer that downstream
rollout-worker / trainer / eval code will plug into.
This PR is intentionally narrow. It is the foundation, not the implementation
of training itself. Subsequent PRs in this line:
- rollout_worker.py (loads real verifier, drives generation, captures hidden states) + post_filter.py (low-confidence / repetition / EOS-fail rules per ADR 0004 §2.2).
- … long-context / multi-turn / tool-call) per ADR 0004 §2.1 quotas.
- trainer.py consuming the schema this PR locks in.
- eval.py reporting the §2.7 / §2.8 metrics.
What's new
training/repr_align/data_collection/ (4 modules, 763 lines)
- schema.py: RolloutMeta, RolloutRow, pyarrow Schema builder, SCHEMA_VERSION = 1
- prompt_pool.py: LanguageDetector / Deduper Protocols with dependency-free reference impls (CharRatioLanguageDetector, ShingleJaccardDeduper)
- parquet_writer.py: data/alignment/<verifier_id>/<verifier_dtype>/<schema_version>/shard_NNNNN.parquet per ADR 0004 §2.5. Tmp+rename for parquet AND meta sidecar
- __init__.py
The capture row schema mirrors ADR 0004 §2.2 exactly:
top_token_ids, top_probs and hidden_state are fixed-size pyarrow
lists (sizes from per-shard meta) so the trainer can vector-load batches
without per-row shape checks.
tests/training/repr_align/data_collection/ (3 test files, 1050 lines)
- test_schema.py
- test_prompt_pool.py
- test_parquet_writer.py
training/repr_align/__init__.py (PEP 562 lazy import)
Refactored to expose ReprAlignedSurgery / SurgeryConfig via __getattr__
instead of an eager import. Reasons:
- proposer_surgery pulls in torch + transformers (heavy). data_collection subpackage is intentionally torch-free. Without lazy imports, importing training.repr_align.data_collection.schema would still pay the full torch import cost via the parent package.
- … (SystemError: bad call flags from torch's PyMethodDef under coverage's tracer). Lazy import lets the new tests run with
coverage cleanly.
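The PEP 562 mechanism can be sketched in a self-contained way by attaching __getattr__ and __dir__ to a module object built at runtime. This is an illustration only: make_lazy_package and the _LAZY_EXPORTS table are hypothetical, and stdlib targets stand in for the heavy imports. In the PR's __init__.py the entries would presumably be ReprAlignedSurgery and SurgeryConfig, resolved from proposer_surgery.

```python
import importlib
import types

# Public name -> module that really defines it. Stdlib stand-ins are used so
# the sketch runs anywhere; they play the role of the torch-backed symbols.
_LAZY_EXPORTS = {
    "Fraction": "fractions",
    "Decimal": "decimal",
}


def make_lazy_package(name: str) -> types.ModuleType:
    """Create a module whose exports are imported on first attribute access."""
    package = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_path = _LAZY_EXPORTS[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_path), attr)
        # Cache on the module so later lookups bypass __getattr__ entirely.
        setattr(package, attr, value)
        return value

    def __dir__():
        # Advertise lazy names alongside the ordinary module attributes.
        return sorted(set(vars(package)) | set(_LAZY_EXPORTS))

    # PEP 562 looks these hooks up in the module's own namespace.
    package.__getattr__ = __getattr__
    package.__dir__ = __dir__
    return package
```

In a real package the two functions would simply be defined at the top level of __init__.py; the factory above exists only to keep the example runnable as a single file.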
Public API unchanged: training.repr_align.ReprAlignedSurgery still works.
requirements.txt
Added pyarrow>=15,<25 for the fixed-size-list schema enforcement.
Verification
Coverage runs use --confcutdir=tests/training to skip the torch-importing
top-level tests/conftest.py (this is a local-VM workaround for a torch …
the standard pytest invocation).
Quality bars
- … AlternatingRejectDeduper, ShingleJaccardDeduper, CharRatioLanguageDetector, real tmp_path filesystem). The few "test doubles" implement the published Protocol interfaces structurally, exactly as ADR 0001 §3 specifies for the proposer/verifier doubles.
- … defaulting (verifier_id must match org/name, dtype must be a literal, schema_version mismatch refuses to load, etc.).
- … sizes, atomic file appearance, quota counts, RNG reproducibility),
not specific token/prob values that would couple to upstream changes.
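The validation rules named above (verifier_id shaped like org/name, dtype restricted to a literal set, schema-version mismatch refused) could look roughly like the sketch below. The MetaSketch class, its field names, the regular expression, and the allowed dtype strings are all assumptions for illustration; the real RolloutMeta may differ on every one of them.

```python
import re
from dataclasses import dataclass
from typing import Literal, get_args

SCHEMA_VERSION = 1  # value stated in the PR

# Assumed set of accepted dtype strings; the PR only says "a literal".
VerifierDtype = Literal["bf16", "fp16", "fp32"]

# Assumed pattern for "org/name": exactly one slash, no empty parts.
_VERIFIER_ID = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


@dataclass(frozen=True)
class MetaSketch:
    """Hypothetical per-shard meta that validates itself on construction."""

    verifier_id: str
    verifier_dtype: str
    schema_version: int = SCHEMA_VERSION

    def __post_init__(self) -> None:
        if not _VERIFIER_ID.match(self.verifier_id):
            raise ValueError(f"verifier_id must look like org/name: {self.verifier_id!r}")
        if self.verifier_dtype not in get_args(VerifierDtype):
            raise ValueError(f"unsupported verifier_dtype: {self.verifier_dtype!r}")
        if self.schema_version != SCHEMA_VERSION:
            # Refuse to load shards written under a different schema version.
            raise ValueError(
                f"schema_version {self.schema_version} != {SCHEMA_VERSION}"
            )
```

Failing in __post_init__ means an invalid meta object can never exist, so downstream code does not need to re-check these fields.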
Out of scope explicitly
- … personal layer.
References
- docs/adr/0001-proposer-sizing-and-alignment.md §4 (Stage roadmap)
- docs/adr/0004-alignment-training-data-preparation-policy.md §2.1 (prompt pool), §2.2 (capture spec), §2.5 (path layout)
- git show v0.3.0-rc1