Revisit: synthetic multi-stream dataset generator

## Background

In feature [001-ms-tcm-impl](../tree/main/specs/001-ms-tcm-impl), the original plan included a seeded synthetic dataset generator (per-storyline cluster-center feature vectors, configurable K/events/m/trials, bit-exact Parquet output across macOS/Ubuntu/Windows). After `/speckit.clarify`, we decided to replace the synthetic worked example with a real dataset (the category condition of Manning et al. 2023's FRFR study) so model fitting could target real behavior instead of simulated data.

## Why we might want to come back to this

- The FRFR dataset is a single real experiment with K=4 categories per list. It cannot exercise MS-TCM's distinguishing predictions at parameter regimes that aren't present in the data (e.g., large K, varied m, the §4.4 numerical regime at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3).
- A synthetic generator lets us run the §4.4 numerical anchor end-to-end (as an integration test, not just a unit test of the equations).
- A synthetic generator gives us fast offline CI with no network dependencies.
- Parameter-recovery experiments: generate data at known parameters, fit, and verify that the MLE recovers them. This is the most direct diagnostic of whether the fitter works.
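
To make the parameter-recovery idea concrete, here is a minimal sketch using a toy Bernoulli model in place of MS-TCM. The model, the function names (`simulate`, `fit_mle`, `bootstrap_se`, `recovery_check`), and the default values are all hypothetical; only the generate/fit/compare loop and the two-SE criterion come from this issue.

```python
import math
import random


def simulate(p_true, n, rng):
    """Draw n Bernoulli(p_true) observations (toy stand-in for model simulation)."""
    return [1 if rng.random() < p_true else 0 for _ in range(n)]


def fit_mle(data):
    """MLE of a Bernoulli rate is the sample mean (stand-in for the real fitter)."""
    return sum(data) / len(data)


def bootstrap_se(data, n_boot, rng):
    """Standard deviation of the MLE over resampled-with-replacement datasets."""
    n = len(data)
    estimates = [fit_mle(rng.choices(data, k=n)) for _ in range(n_boot)]
    mean = sum(estimates) / n_boot
    var = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return math.sqrt(var)


def recovery_check(p_true=0.3, n=2000, n_boot=200, seed=0):
    """Generate at a known parameter, fit, and compare against two bootstrap SEs."""
    rng = random.Random(seed)
    data = simulate(p_true, n, rng)
    p_hat = fit_mle(data)
    se = bootstrap_se(data, n_boot, rng)
    return p_hat, se, abs(p_hat - p_true) <= 2 * se
```

A two-SE band will occasionally reject a correct fitter by chance, so a real test would probably pin the seed or aggregate over several seeds rather than rely on a single draw.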

## Proposed scope (when we revisit)

- A seeded generator built on `numpy.random.Generator(PCG64)`, with configurable K, events/storyline, bridge m-range, trials/condition, feature dimensionality, and jitter σ.
- Output in the same Parquet schema used for FRFR-category, so that the validation routine accepts both with no special-casing.
- Unit tests pinning the §4.4 numerical anchor (composite similarity 0.866 vs 0.806 at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3) to four decimals.
- Parameter-recovery test: generate at a known (β_G, β_S, w_G, γ, λ), fit, assert recovery within two bootstrap SEs.
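
As a rough illustration of the first bullet, the sketch below draws per-storyline cluster centers and jittered event features from a seeded `PCG64` stream. The function name, argument names, defaults, and returned keys are assumptions made for illustration. Bridge events, trials/conditions, and the Parquet schema are left out.

```python
import numpy as np


def generate_events(seed, n_storylines=4, events_per_storyline=6,
                    n_features=8, jitter_sigma=0.1):
    """Hypothetical sketch: cluster-center features plus Gaussian jitter."""
    # An explicit PCG64 bit generator keeps the stream independent of global state.
    rng = np.random.Generator(np.random.PCG64(seed))

    # One unit-norm center per storyline (K centers in total).
    centers = rng.standard_normal((n_storylines, n_features))
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)

    # Storyline label and within-storyline position for every event.
    storyline = np.repeat(np.arange(n_storylines), events_per_storyline)
    position = np.tile(np.arange(events_per_storyline), n_storylines)

    # Each event is its storyline's center plus isotropic jitter.
    noise = rng.normal(0.0, jitter_sigma, size=(storyline.size, n_features))
    features = centers[storyline] + noise

    return {"storyline": storyline, "position": position, "features": features}
```

Calling `generate_events(0)` twice in the same environment yields identical arrays. Whether the floats also match bit-for-bit across macOS/Ubuntu/Windows would still need to be checked in CI.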

## Design artifacts already written

The spec/plan/tasks in `specs/001-ms-tcm-impl/` (before the FRFR switch) contain detailed decisions on Parquet conventions, bit-exactness across platforms (zstd level 1, no dictionary, no statistics, sorted rows), and the §4.4 numerical anchor. When we revisit this issue, those decisions can be lifted with minimal rework.

## Priority

Follow-on. Not blocking for the current FRFR-category feature.
