Skip to content

Releases: PoHsuanLai/gsem

v0.2.0

12 Jun 19:00
d6e71e3

Choose a tag to compare

Added

  • Ported read_fusion (gsem::io::fusion_reader + CLI gsem read-fusion). Reads raw FUSION TWAS .dat association files and converts
    each trait's TWAS.Z into a standardized gene-expression effect/SE (binary
    traits via the liability conversion effect/√(effect²·HSQ + π²/3),
    continuous via Z/√(N·HSQ)), supports the permutation-Z path, then
    inner-joins traits on Gene+Panel into the merged TWAS format that
    twas_reader / multiGene / userGWAS(TWAS=TRUE) consume. This is the
    on-ramp from raw FUSION output into the Rust TWAS path (previously only
    pre-merged files could be read). Validated against live R read_fusion.

  • Ported subSV (gsem_matrix::vech::subset_sv). Subsets vech(S) and
    the V sampling-covariance block by a set of 1-based vech positions, for
    both the with-diagonal (TYPE="S"/"S_Stand") and off-diagonal
    (TYPE="R") numbering. Validated against R to 1e-12. (R's matrix-input
    path has an undefined-RMATRIX bug; the port matches the bug-free
    LDSC_OBJECT path.)

  • Ported summaryGLSbands numeric core
    (gsem::stats::gls::summary_gls_bands). GLS fit plus the confidence-band
    data — predictor grid, fitted line, and ±BAND_SIZE·SE envelope at
    INTERVALS points, with INTERCEPT/QUAD/CONTROLVARS support.
    Validated against R to 1e-9. The ggplot rendering is not ported.

  • Per-option R-equivalence coverage for the drop-in surface. New
    real-package fixtures and tests for previously-untested option branches:
    userGWAS estimation="ML" / GC={conserv,none} / Q_SNP / std.lv;
    ldsc stand=TRUE (S_Stand/V_Stand) / select="ODD" / chisq.max /
    liability-scale; sumstats ambig=TRUE (full numeric parity); a
    non-degenerate misspecified-model CFI/chisq fixture; the paLDSC diag
    branch; and bindings parity suites (R testthat + Python pytest).

  • Coverage instrumentation in CI. A cargo-llvm-cov coverage job with a
    ratcheting floor, Codecov upload + README badge, and an advisory R-side
    covr job.

  • R documentation site (pkgdown). A published docs site for the gsemr
    binding (reference index + Compatibility/Architecture articles generated
    from the repo-root docs), deployed to GitHub Pages, with a navbar link to
    the upstream GenomicSEM project.

Fixed

  • std.lv left the factor scale unidentified. parse_model(std_lv=true)
    freed the auto-added latent variance instead of fixing it to 1 (lavaan's
    std.lv=TRUE convention: auto.fix.first=FALSE + latent variances fixed
    to 1). Any std.lv model — userGWAS / usermodel / rgmodel(std_lv=)
    estimated an unidentified factor scale and was ~12% off R. Now the
    auto-add fixes the latent variance under std_lv. (crates/gsem-sem/src/syntax.rs)

  • Q_SNP heterogeneity statistic + LDSC intercept floor. compute_q_snp
    computed the wrong quadratic form, and the LDSC intercept diagonal was not
    floored at 1.0 before building the per-SNP V (R's userGWAS.R:156).
    Both fixed; Q_SNP now matches R to ~1e-9.
    (crates/gsem/src/gwas/{q_snp,gc_correction,user_gwas}.rs)


The following entries accumulated on master after 0.1.3 and ship in 0.2.0:

Added

  • R-equivalence coverage for multiSNP, multiGene, simLDSC, and
    hdl.
    New live-R fixtures and tests validate the four previously
    untested functions, closing the last function-coverage gaps. multiSNP
    and multiGene reproduce R's augmented S_Full/V_Full matrices to
    1e-12/1e-14; simLDSC's deterministic per-SNP Z covariance matches R's
    exact varZ/covZ algebra to 1e-9; hdl reproduces R's genetic-
    covariance matrix S to optimiser-level precision on a synthetic LD
    panel built in R's exact .rda/.bim format.

Changed

  • hdl evaluated in the eigenspace (match R). gsem's HDL likelihood
    was fed raw LD scores as lam and the raw per-SNP bhat, whereas R
    GenomicSEM/HDL evaluates the likelihood in the LD-block eigenspace:
    lam = eigenvalues of the block correlation matrix and
    bstar = Vᵀ·bhat (projection onto eigenvectors). LdPiece now carries
    the per-piece eigenvalues/eigenvectors (read from the .rda panel, or
    the new chr*.eigen.tsv text file emitted by convert_hdl_panels), the
    per-trait reference N uses R's median(N), the likelihood floor matches
    R's exp(-18), and the per-piece optimiser is a projected-gradient
    method mirroring R's optim(L-BFGS-B). (crates/gsem-ldsc/src/hdl.rs)

Fixed

  • simLDSC phenotypic-overlap term. The off-diagonal environmental
    contribution to the per-SNP Z covariance was
    rPheno·n_overlap·√(NᵢNⱼ)/n_snps, carrying a spurious √(NᵢNⱼ)/n_snps
    factor. R GenomicSEM uses rPheno_ij · N_ij/√(NᵢNⱼ) (= rPheno·n_overlap
    for scalar sample overlap). Fixed and exposed as the deterministic
    per_snp_z_cov, validated against R's construction.
    (crates/gsem/src/stats/simulation.rs)

  • multiSNP/multiGene V_Full construction. run_multi_snp built
    V_Full as a diagonal approximation that ignored the LDSC intercept
    matrix. The new build_multi_snp_sv reproduces R's full sampling
    covariance — SNP-trait variances (SE·I_diag·varSNP)², cross-trait
    within-SNP, cross-SNP within-trait, and cross-SNP cross-trait blocks,
    plus the trait-trait V_LD block — matching R bit-for-bit.
    (crates/gsem/src/gwas/multi_snp.rs)

Documented R bugs (gsem implements the correct behavior)

  • multiSNP/multiGene constant cross-SNP cross-trait LD. R weights
    every cross-SNP cross-trait sampling-covariance cell by a single
    constant LD value (LD2[(f²−f)/2], the last lower-triangle entry)
    rather than the actual SNP-pair LD. gsem uses the correct per-pair
    LD[a,b]; the two coincide when the off-diagonal LD is constant (as in
    the equivalence fixtures). A unit test locks gsem's corrected path.

  • multiGene aborts for k ≥ 2 traits. R's multiGene assigns into
    V_SNP[y,x] — a variable that does not exist in multiGene (a typo for
    V_Gene) — so the function errors with object 'V_SNP' not found
    whenever there is more than one trait. gsem implements the corrected
    algorithm (identical to multiSNP with gene heritabilities as the
    "variances"); the reference is generated from a minimally-patched
    multiGene.

  • hdl intercept weak identification. The per-piece HDL intercept
    enters the likelihood only as int·lam/N and is weakly identified on
    small panels; R's optim drifts on some LD blocks while gsem's gradient
    method recovers the construction-true intercept. The genetic-covariance
    matrix S (HDL's primary output) is robustly identified and matches R.

  • sumstats beta standardization. gsem's sumstats wrote the raw
    input betas/SEs instead of standardizing them like R GenomicSEM. Since
    userGWAS/commonfactorGWAS build cov(SNP, trait) = varSNP · beta,
    this propagated a sqrt(varSNP) scale error into per-SNP GWAS output.
    Now implements all of R's modes — OLS (beta = Z/√(N·varSNP)),
    linprob, se.logit, the default logistic "none" transform, and
    betas pass-through — reading the P and N columns and deriving Z from
    P. Validated formula-for-formula against R for every mode.
    (crates/gsem/src/sumstats.rs)

  • sumstats ambiguous-SNP default. R GenomicSEM removes
    strand-ambiguous SNPs (A/T, C/G) only when ambig=TRUE; its default
    keeps them. The R and Python bindings (and the CLI) mapped
    keep_ambig = ambig, so their default dropped ~⅓ of SNPs where R keeps
    them. Fixed to keep_ambig = !ambig across the R binding, Python
    binding, and CLI (which gains an --ambig flag), matching R's default.

  • rgmodel V_R dimension. rgmodel returned V_R as the full
    kstar × kstar sampling covariance of vech(R) (including the fixed
    diagonal-1 correlations, which have ~zero variance). R GenomicSEM returns
    V_R for the off-diagonal correlations only (k(k−1)/2 square). Fixed to
    extract the off-diagonal vech block, matching R's shape and ordering (so
    the bindings now return R's shape too). (crates/gsem-sem/src/rgmodel.rs)

  • enrich model-based functional enrichment. gsem's enrich
    (enrichment_test) computed the classic LDSC heritability-enrichment ratio
    (prop_h²/prop_SNPs, the original Finucane statistic), not R GenomicSEM's
    model-based enrich. R fits a SEM to a baseline annotation, fixes the
    regressions/loadings (per fix) to their baseline estimates, re-fits each
    annotation freeing the remaining parameters, and reports per-parameter
    enrichment = (est_annot/est_baseline)/Prop with SE and a 1-sided p-value.
    Added gsem_sem::enrich_model::model_enrichment (+ a shared gsem_sem::fit
    helper) implementing this with the regressions/covariances/variances fix
    modes, wired through the R binding (enrich(s_covstruc, model, params, fix))
    and the Python binding (model_enrichment). Validated against live R
    GenomicSEM on a factor-variance covstruc. gsem's classic ratio is retained
    as enrichment_test. (crates/gsem-sem/src/enrich_model.rs)

  • s_ldsc overlap-weighted partitioned heritability. Stratified LDSC
    returned per-annotation S as the raw coefficient contribution tau · M
    (R's S_Tau), missing R's overlap weighting. R computes each category's
    partitioned (co)heritability as S = overlap · (tau · M), where
    overlap[f,a] = M(f,a)/M_a and M(f,a) is the number of SNPs in both
    annotations f and a (from the .annot.gz membership + .frq MAF
    filter). These agree only for disjoint annotations and diverge for the
    realistic overlapping baselineLD model. Added a .annot.gz/.frq
    cross-product reader and applied the overlap transform to both S and the
    jackknife-based V; the result now exposes s_annot/v_annot (R's
    S/V) and s_tau/v_tau (R's S_Tau/V_Tau). Threaded through the R
    binding, Python binding, and CLI (new --frq flag). Validated against R
    on synt...

Read more

v0.1.3

12 Apr 20:13

Choose a tag to compare

Critical fix for cross-trait LDSC

Fixed

  • LDSC cross-trait allele alignment. Z-scores were not flipped when A1 differed between traits, producing wrong-sign genetic covariance and ~10x inflated cross-trait intercepts. Diagonal (heritability) estimates were unaffected. All users running multi-trait LDSC should upgrade.
  • LDSC per-trait chi² filtering now matches R GenomicSEM's pre-filter-then-merge behavior. Cross-trait SNP counts now agree exactly with R.

Full Changelog: v0.1.2...v0.1.3

v0.1.2

12 Apr 18:29

Choose a tag to compare

What's Changed

Fixed

  • Equality constraints (a*V1 + a*V2 syntax) now correctly share a single free parameter. Previously each labelled term got its own independent parameter, silently ignoring the constraint.
  • Labelled first-indicator loadings (e.g. F1 =~ a*V1 + a*V2) are now free, not fixed to 1. Matches lavaan semantics.
  • Observed variable detection for covariance-only models (RL ~~ MDD) no longer fails with "0 observed variables".
  • L-BFGS convergence flag: when the line search exhausts all step sizes, the optimizer is at a stationary point — now correctly reports converged = true.

Added

  • Heywood case warnings: negative variance estimates are detected after every SEM fit and logged at WARN level via the log crate. Surfaces in R, Python, and CLI.
  • New tests for equality constraints (shared params, value propagation, implied covariance, Jacobian) and negative variance detection.

Full Changelog: v0.1.1...v0.1.2

v0.1.1

11 Apr 12:09

Choose a tag to compare

Second release of GenomicSEM-rs. Patch release on top of v0.1.0 with
pre-built R binaries, Windows R support, the CHANGELOG.md file, and
a handful of R CMD check fixes.

Full diff: v0.1.0...v0.1.1

Added

  • Pre-built R binary packages for Linux, macOS, and Windows attached
    to every GitHub release. Users no longer need a Rust toolchain to
    install gsemr — pick the platform-native file from the release page
    and run install.packages(<url>, repos = NULL). Source tarball +
    remotes::install_github remain as the "from source" fallback for
    unmatched R versions and dev builds.
  • R package now ships with full Windows build support: configure.win,
    src/Makevars.win, and -Wl,--export-all-symbols in the MinGW link
    line so R .Call("wrap__*_rust", ...) resolves at runtime on
    Windows. R CMD check is green on ubuntu / macos / windows.
  • workflow_dispatch trigger on publish.yml with a release_tag
    input, so the R-only job path can be manually re-run against an
    existing release (to ship a late R binary or fix without burning a
    crates.io / PyPI version slot).
  • readme = "README.md" and [project.urls] in
    bindings/python/pyproject.toml so the PyPI project page renders a
    description with links to the repo, issue tracker, and architecture
    docs. (The 0.1.0 wheels shipped before this landed — the PyPI 0.1.0
    page will continue to show "no project description" until 0.1.1 is
    uploaded.)

Changed

  • bindings/python/README.md: absolute GitHub URLs instead of
    relative ../../API_COMPAT.md / ../../ARCHITECTURE.md links
    (which don't render on PyPI); function list expanded from 10 to all
    17 exports and grouped as Core pipeline vs Advanced.
  • Root README.md: R install section restructured — binary
    packages as the primary path, source install demoted to a "From
    source" subsection. New Rust MSRV (1.88+) badge.
  • R package internal cleanup:
    • Explicit importFrom(stats, pnorm, setNames) /
      importFrom(utils, read.table, write.table) in NAMESPACE,
      clearing the "no visible global function definition" notes.
    • @param out added to sumstats.R + regenerated sumstats.Rd
      (fixes an "Undocumented arguments in Rd file" WARNING).
    • {chr} in ldsc.R's @param ld description escaped to
      \code{<chr>.l2.ldscore.gz} (fixes "Lost braces" NOTE).
    • configure / configure.win print rustc --version / cargo --version before the build (fixes "No rustc version reported"
      WARNING).
    • configure / configure.win now rm -rf src/rust/target after
      copying out libgsemr.a, so R CMD check doesn't scan a duplicate
      rust/target/release/libgsemr.a and double-report the Rust
      stdlib's _exit / abort / exit symbols.

Removed

  • PyLdscResult.to_json() and gsem.LdscResult.from_json(s) — the
    json: String field and the conversions::json_to_ldsc helper that
    backed them. These were dead compat from the pre-NumPy era; every hot
    path has been reading s / v / i_mat through NumPy getters for a
    while. This is the only user-visible break in 0.1.1. Callers who
    were serializing LDSC results to disk should use pickle or pass the
    result object directly into downstream functions (all 17 exported
    functions accept it).

Fixed

  • read_sumstats aborted on NA tokens, so .sumstats.gz files that
    worked in GenomicSEM::ldsc failed in gsemr::ldsc with a cryptic
    gsemr::ldsc error: invalid N long before sample.prev /
    population.prev were even evaluated. GenomicSEM tolerates these
    files because its reader calls na.omit(read_delim(...)); gsemr now
    matches that behaviour — rows whose N or Z field is an NA token
    (empty, ., NA, NaN, N/A, NULL; case-insensitive) are
    silently dropped, with a one-line INFO log naming the file and the
    dropped-row count. Genuine parse failures now report file path, line
    number, and the offending value, and load_trait_data wraps the
    error with the failing file path so multi-trait runs tell you which
    input went bad. (crates/gsem/src/io/gwas_reader.rs)
  • publish.yml skip cascade: check-r-package and
    upload-r-to-release no longer inherit the implicit if: success()
    gate, which was evaluating to false on workflow_dispatch runs
    (because publish-crates is skipped there). They now use
    if: always() && needs.<upstream>.result == 'success' and proceed
    as long as their direct needs are green.
  • Versioned workspace deps: root Cargo.toml internal deps now
    carry version = "0.1.0" alongside path, required by
    cargo publish for inter-crate references.
  • R binding Cargo patch: bindings/r/src/rust/Cargo.toml declares
    internal crates at crates.io versions and uses [patch.crates-io]
    to redirect to the in-repo sources when available. configure
    strips the patch block when the local crates aren't reachable
    (tarball install), so cargo falls back to the published crates
    cleanly.

v0.1.0

11 Apr 07:41

Choose a tag to compare

First tagged release of GenomicSEM-rs.

A Rust rewrite of R GenomicSEM, shipped as a 6-crate workspace with
R (gsemr) and Python (genomicsem) bindings plus a standalone CLI
(gsem).

What's in this release

Core engines (crates.io)

  • gsem-matrix — matrix utilities (nearest PD, half-vec, PSD smoothing)
  • gsem-ldsc — LD Score Regression with block jackknife
  • gsem-sem — SEM engine (DWLS/ML, lavaan syntax, sandwich SEs)
  • gsem — pipeline + CLI binary (munge, ldsc, s_ldsc, hdl, sumstats,
    commonfactor, usermodel, userGWAS, commonfactorGWAS, rgmodel,
    write.model, paLDSC, enrich, multiSNP, simLDSC, summaryGLS)

Bindings

  • gsemr — R package (source tarball attached below)
  • genomicsem — Python package (PyPI wheels for Linux/macOS/Windows)

Highlights

  • End-to-end drop-in compatibility with R GenomicSEM's public API on
    all 18 user-facing functions, with the same argument shapes, output
    layouts, and file formats.
  • Significant wall-clock speedups on the PGC benchmark (see README
    comparison table).
  • Every user-facing function on every surface (R, Python, CLI) ships
    with a worked Examples block in its help output.

See README.md for the installation and usage walk-through.