Background
In feature 001-ms-tcm-impl, the original plan included a seeded synthetic dataset generator (per-storyline cluster-center feature vectors, configurable K/events/m/trials, bit-exact Parquet output across macOS/Ubuntu/Windows). After /speckit.clarify, we decided to replace the synthetic worked example with a real dataset (the category condition of Manning et al. 2023's FRFR study) so model fitting could target real behavior instead of simulated data.
Why we might want to come back to this
- The FRFR dataset is a single real experiment with K=4 categories per list. It cannot exercise MS-TCM's distinguishing predictions at parameter regimes that aren't present in the data (e.g., large K, varied m, the §4.4 numerical regime at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3).
- A synthetic generator lets us run the §4.4 numerical anchor end-to-end (as an integration test, not just a unit test of the equations).
- Fast offline CI without network dependencies.
- Parameter-recovery experiments: generate data at known parameters, fit, verify the MLE recovers them — the single best diagnostic of whether the fitter works.
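The parameter-recovery idea above can be sketched with a deliberately simple stand-in model. The Gaussian below is hypothetical, used only to show the generate-at-known-parameters / fit / compare pattern; the real test would generate recall sequences from MS-TCM at known (β_G, β_S, w_G, γ, λ) and maximize the model likelihood numerically instead of using a closed-form estimator.

```python
import numpy as np

rng = np.random.Generator(np.random.PCG64(42))

# Hypothetical stand-in model: a Gaussian with known parameters.
# In the real test this would be MS-TCM simulating recall data.
true_mu, true_sigma = 1.5, 0.4
data = rng.normal(true_mu, true_sigma, size=10_000)

# Closed-form Gaussian MLE; for MS-TCM this step would be a
# numerical likelihood maximization over (beta_G, beta_S, w_G, gamma, lam).
mu_hat = data.mean()
sigma_hat = data.std()  # ddof=0 is the MLE

# Recovery check: estimates should land close to the generating values
# (the real test would use a bootstrap-SE tolerance instead of a constant).
recovered = abs(mu_hat - true_mu) < 0.02 and abs(sigma_hat - true_sigma) < 0.02
```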
Proposed scope (when we revisit)
- Seeded `numpy.random.Generator(PCG64)`-based generator with configurable K, events/storyline, bridge m-range, trials/condition, feature dimensionality, jitter σ.
- Output in the same Parquet schema used for FRFR-category; validation routine accepts both with no special-casing.
- Unit tests pinning the §4.4 numerical anchor (composite similarity 0.866 vs 0.806 at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3) to four decimals.
- Parameter-recovery test: generate at a known (β_G, β_S, w_G, γ, λ), fit, assert recovery within two bootstrap SEs.
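A minimal sketch of the seeded generator, assuming the per-storyline cluster-center design described above. All names (`GenConfig`, `generate`) and the flat output shape are illustrative, not from the spec, and the bridge m-range handling is omitted for brevity:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GenConfig:
    """Illustrative config; field names are placeholders, not the spec's."""
    k: int = 4                    # storylines (clusters) per list
    events_per_storyline: int = 5
    trials: int = 10              # trials per condition
    dims: int = 16                # feature dimensionality
    jitter_sigma: float = 0.1
    seed: int = 0


def generate(cfg: GenConfig) -> np.ndarray:
    """Draw per-storyline cluster centers, then jittered event vectors.

    Returns shape (trials, k * events_per_storyline, dims).
    Bridge/m structure is omitted from this sketch.
    """
    # Explicit PCG64 bit generator so the stream is pinned by the seed.
    rng = np.random.Generator(np.random.PCG64(cfg.seed))
    centers = rng.standard_normal((cfg.k, cfg.dims))
    out = np.empty((cfg.trials, cfg.k * cfg.events_per_storyline, cfg.dims))
    for t in range(cfg.trials):
        for i in range(cfg.k):
            lo = i * cfg.events_per_storyline
            hi = lo + cfg.events_per_storyline
            # Each event = its storyline's center plus Gaussian jitter.
            out[t, lo:hi] = centers[i] + cfg.jitter_sigma * rng.standard_normal(
                (cfg.events_per_storyline, cfg.dims)
            )
    return out
```

Because the seed fully determines the PCG64 stream, two calls with the same config produce identical arrays, which is the property the bit-exact Parquet requirement builds on.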
Design artifacts already written
The spec/plan/tasks in specs/001-ms-tcm-impl/ (before the FRFR switch) contain detailed decisions on Parquet conventions, bit-exactness across platforms (zstd level 1, no dictionary, no statistics, sorted rows), and the §4.4 numerical anchor. When we revisit this issue, those decisions can be lifted with minimal rework.
Priority
Follow-on. Not blocking for the current FRFR-category feature.