
Revisit: synthetic multi-stream dataset generator #1

@jeremymanning

Background

In feature 001-ms-tcm-impl, the original plan included a seeded synthetic dataset generator (per-storyline cluster-center feature vectors, configurable K/events/m/trials, bit-exact Parquet output across macOS/Ubuntu/Windows). After /speckit.clarify, we decided to replace the synthetic worked example with a real dataset (the category condition of Manning et al. 2023's FRFR study) so model fitting could target real behavior instead of simulated data.

Why we might want to come back to this

  • The FRFR dataset is a single real experiment with K=4 categories per list. It cannot exercise MS-TCM's distinguishing predictions at parameter regimes that aren't present in the data (e.g., large K, varied m, the §4.4 numerical regime at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3).
  • A synthetic generator lets us run the §4.4 numerical anchor end-to-end (as an integration test, not just a unit test of the equations).
  • Synthetic data can be generated locally, so CI stays fast and runs offline with no network dependencies.
  • Parameter-recovery experiments: generate data at known parameters, fit, verify the MLE recovers them — the single best diagnostic of whether the fitter works.
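
As a rough illustration of the parameter-recovery idea, here is a minimal sketch that uses a toy exponential model as a stand-in for MS-TCM. The function name `recovery_check`, the exponential likelihood, and all default values are assumptions made for illustration only; the real test would swap in the MS-TCM simulator and fitter.

```python
import numpy as np


def recovery_check(true_rate=2.0, n=5000, n_boot=200, seed=0):
    """Toy parameter-recovery check (hypothetical; not the MS-TCM fitter).

    Generates exponential data at a known rate, fits the rate by maximum
    likelihood, and compares the error against two bootstrap standard errors.
    """
    rng = np.random.Generator(np.random.PCG64(seed))

    # 1. Generate data at a known parameter value.
    data = rng.exponential(scale=1.0 / true_rate, size=n)

    # 2. Fit: the MLE of an exponential rate is 1 / sample mean.
    estimate = 1.0 / data.mean()

    # 3. Bootstrap SE: refit on resampled data and take the spread.
    boot = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(data, size=n, replace=True)
        boot[i] = 1.0 / resample.mean()
    se = boot.std(ddof=1)

    # 4. Recovery criterion: estimate within two bootstrap SEs of the truth.
    recovered = abs(estimate - true_rate) <= 2.0 * se
    return estimate, se, recovered


estimate, se, recovered = recovery_check()
print(f"estimate={estimate:.3f}, se={se:.3f}, recovered={recovered}")
```

Even a correct fitter will miss a two-SE criterion on a small fraction of seeds, so a real test would probably pin the seed or aggregate over several runs.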

Proposed scope (when we revisit)

  • A generator built on a seeded numpy.random.Generator(PCG64), with configurable K, events/storyline, bridge m-range, trials/condition, feature dimensionality, and jitter σ.
  • Output in the same Parquet schema used for FRFR-category; validation routine accepts both with no special-casing.
  • Unit tests pinning the §4.4 numerical anchor (composite similarity 0.866 vs 0.806 at β_G = β_S = 0.5, w_G = 0.2, w_S = 0.8, m = 3) to four decimals.
  • Parameter-recovery test: generate at a known (β_G, β_S, w_G, γ, λ), fit, assert recovery within two bootstrap SEs.
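
A minimal sketch of the seeded core of such a generator might look like the following. The function name `generate_synthetic`, its parameter names, and the choice of unit-norm Gaussian cluster centers are all assumptions for illustration; bridge events, trials/condition, and Parquet output are left out.

```python
import numpy as np


def generate_synthetic(seed, K=4, events_per_storyline=5,
                       n_features=8, jitter_sigma=0.1):
    """Hypothetical sketch: one cluster center per storyline plus jitter."""
    # Explicit PCG64 bit generator so the stream is fixed by the seed.
    rng = np.random.Generator(np.random.PCG64(seed))

    # Assumption: centers are standard-normal draws scaled to unit norm.
    centers = rng.standard_normal((K, n_features))
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)

    # Storyline index of each event, grouped by storyline.
    storyline = np.repeat(np.arange(K), events_per_storyline)

    # Event features: the storyline's center plus Gaussian jitter.
    noise = rng.standard_normal((K * events_per_storyline, n_features))
    features = centers[storyline] + jitter_sigma * noise
    return storyline, features


labels_a, feats_a = generate_synthetic(seed=123)
labels_b, feats_b = generate_synthetic(seed=123)
print(feats_a.shape, np.array_equal(feats_a, feats_b))
```

Reusing the seed reproduces the arrays within one environment. Bit-exact output across platforms and NumPy versions is a stronger property that this sketch does not establish.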

Design artifacts already written

The spec/plan/tasks in specs/001-ms-tcm-impl/ (before the FRFR switch) contain detailed decisions on Parquet conventions, bit-exactness across platforms (zstd level 1, no dictionary, no statistics, sorted rows), and the §4.4 numerical anchor. When we revisit this issue, those decisions can be lifted with minimal rework.

Priority

Follow-on. Not blocking for the current FRFR-category feature.
