Skip to content

v0.2.0

Latest

Choose a tag to compare

@PoHsuanLai PoHsuanLai released this 12 Jun 19:00
· 5 commits to master since this release
d6e71e3

Added

  • Ported read_fusion (gsem::io::fusion_reader + CLI gsem read-fusion). Reads raw FUSION TWAS .dat association files and converts
    each trait's TWAS.Z into a standardized gene-expression effect/SE (binary
    traits via the liability conversion effect/√(effect²·HSQ + π²/3),
    continuous via Z/√(N·HSQ)), supports the permutation-Z path, then
    inner-joins traits on Gene+Panel into the merged TWAS format that
    twas_reader / multiGene / userGWAS(TWAS=TRUE) consume. This is the
    on-ramp from raw FUSION output into the Rust TWAS path (previously only
    pre-merged files could be read). Validated against live R read_fusion.

  • Ported subSV (gsem_matrix::vech::subset_sv). Subsets vech(S) and
    the V sampling-covariance block by a set of 1-based vech positions, for
    both the with-diagonal (TYPE="S"/"S_Stand") and off-diagonal
    (TYPE="R") numbering. Validated against R to 1e-12. (R's matrix-input
    path has an undefined-RMATRIX bug; the port matches the bug-free
    LDSC_OBJECT path.)

  • Ported summaryGLSbands numeric core
    (gsem::stats::gls::summary_gls_bands). GLS fit plus the confidence-band
    data — predictor grid, fitted line, and ±BAND_SIZE·SE envelope at
    INTERVALS points, with INTERCEPT/QUAD/CONTROLVARS support.
    Validated against R to 1e-9. The ggplot rendering is not ported.

  • Per-option R-equivalence coverage for the drop-in surface. New
    real-package fixtures and tests for previously-untested option branches:
    userGWAS estimation="ML" / GC={conserv,none} / Q_SNP / std.lv;
    ldsc stand=TRUE (S_Stand/V_Stand) / select="ODD" / chisq.max /
    liability-scale; sumstats ambig=TRUE (full numeric parity); a
    non-degenerate misspecified-model CFI/chisq fixture; the paLDSC diag
    branch; and bindings parity suites (R testthat + Python pytest).

  • Coverage instrumentation in CI. A cargo-llvm-cov coverage job with a
    ratcheting floor, Codecov upload + README badge, and an advisory R-side
    covr job.

  • R documentation site (pkgdown). A published docs site for the gsemr
    binding (reference index + Compatibility/Architecture articles generated
    from the repo-root docs), deployed to GitHub Pages, with a navbar link to
    the upstream GenomicSEM project.

Fixed

  • std.lv left the factor scale unidentified. parse_model(std_lv=true)
    freed the auto-added latent variance instead of fixing it to 1 (lavaan's
    std.lv=TRUE convention: auto.fix.first=FALSE + latent variances fixed
    to 1). Any std.lv model — userGWAS / usermodel / rgmodel(std_lv=)
    estimated an unidentified factor scale and was ~12% off R. Now the
    auto-add fixes the latent variance under std_lv. (crates/gsem-sem/src/syntax.rs)

  • Q_SNP heterogeneity statistic + LDSC intercept floor. compute_q_snp
    computed the wrong quadratic form, and the LDSC intercept diagonal was not
    floored at 1.0 before building the per-SNP V (R's userGWAS.R:156).
    Both fixed; Q_SNP now matches R to ~1e-9.
    (crates/gsem/src/gwas/{q_snp,gc_correction,user_gwas}.rs)


The following entries accumulated on master after 0.1.3 and ship in 0.2.0:

Added

  • R-equivalence coverage for multiSNP, multiGene, simLDSC, and
    hdl.
    New live-R fixtures and tests validate the four previously
    untested functions, closing the last function-coverage gaps. multiSNP
    and multiGene reproduce R's augmented S_Full/V_Full matrices to
    1e-12/1e-14; simLDSC's deterministic per-SNP Z covariance matches R's
    exact varZ/covZ algebra to 1e-9; hdl reproduces R's genetic-
    covariance matrix S to optimiser-level precision on a synthetic LD
    panel built in R's exact .rda/.bim format.

Changed

  • hdl evaluated in the eigenspace (match R). gsem's HDL likelihood
    was fed raw LD scores as lam and the raw per-SNP bhat, whereas R
    GenomicSEM/HDL evaluates the likelihood in the LD-block eigenspace:
    lam = eigenvalues of the block correlation matrix and
    bstar = Vᵀ·bhat (projection onto eigenvectors). LdPiece now carries
    the per-piece eigenvalues/eigenvectors (read from the .rda panel, or
    the new chr*.eigen.tsv text file emitted by convert_hdl_panels), the
    per-trait reference N uses R's median(N), the likelihood floor matches
    R's exp(-18), and the per-piece optimiser is a projected-gradient
    method mirroring R's optim(L-BFGS-B). (crates/gsem-ldsc/src/hdl.rs)

Fixed

  • simLDSC phenotypic-overlap term. The off-diagonal environmental
    contribution to the per-SNP Z covariance was
    rPheno·n_overlap·√(NᵢNⱼ)/n_snps, carrying a spurious √(NᵢNⱼ)/n_snps
    factor. R GenomicSEM uses rPheno_ij · N_ij/√(NᵢNⱼ) (= rPheno·n_overlap
    for scalar sample overlap). Fixed and exposed as the deterministic
    per_snp_z_cov, validated against R's construction.
    (crates/gsem/src/stats/simulation.rs)

  • multiSNP/multiGene V_Full construction. run_multi_snp built
    V_Full as a diagonal approximation that ignored the LDSC intercept
    matrix. The new build_multi_snp_sv reproduces R's full sampling
    covariance — SNP-trait variances (SE·I_diag·varSNP)², cross-trait
    within-SNP, cross-SNP within-trait, and cross-SNP cross-trait blocks,
    plus the trait-trait V_LD block — matching R bit-for-bit.
    (crates/gsem/src/gwas/multi_snp.rs)

Documented R bugs (gsem implements the correct behavior)

  • multiSNP/multiGene constant cross-SNP cross-trait LD. R weights
    every cross-SNP cross-trait sampling-covariance cell by a single
    constant LD value (LD2[(f²−f)/2], the last lower-triangle entry)
    rather than the actual SNP-pair LD. gsem uses the correct per-pair
    LD[a,b]; the two coincide when the off-diagonal LD is constant (as in
    the equivalence fixtures). A unit test locks gsem's corrected path.

  • multiGene aborts for k ≥ 2 traits. R's multiGene assigns into
    V_SNP[y,x] — a variable that does not exist in multiGene (a typo for
    V_Gene) — so the function errors with object 'V_SNP' not found
    whenever there is more than one trait. gsem implements the corrected
    algorithm (identical to multiSNP with gene heritabilities as the
    "variances"); the reference is generated from a minimally-patched
    multiGene.

  • hdl intercept weak identification. The per-piece HDL intercept
    enters the likelihood only as int·lam/N and is weakly identified on
    small panels; R's optim drifts on some LD blocks while gsem's gradient
    method recovers the construction-true intercept. The genetic-covariance
    matrix S (HDL's primary output) is robustly identified and matches R.

  • sumstats beta standardization. gsem's sumstats wrote the raw
    input betas/SEs instead of standardizing them like R GenomicSEM. Since
    userGWAS/commonfactorGWAS build cov(SNP, trait) = varSNP · beta,
    this propagated a sqrt(varSNP) scale error into per-SNP GWAS output.
    Now implements all of R's modes — OLS (beta = Z/√(N·varSNP)),
    linprob, se.logit, the default logistic "none" transform, and
    betas pass-through — reading the P and N columns and deriving Z from
    P. Validated formula-for-formula against R for every mode.
    (crates/gsem/src/sumstats.rs)

  • sumstats ambiguous-SNP default. R GenomicSEM removes
    strand-ambiguous SNPs (A/T, C/G) only when ambig=TRUE; its default
    keeps them. The R and Python bindings (and the CLI) mapped
    keep_ambig = ambig, so their default dropped ~⅓ of SNPs where R keeps
    them. Fixed to keep_ambig = !ambig across the R binding, Python
    binding, and CLI (which gains an --ambig flag), matching R's default.

  • rgmodel V_R dimension. rgmodel returned V_R as the full
    kstar × kstar sampling covariance of vech(R) (including the fixed
    diagonal-1 correlations, which have ~zero variance). R GenomicSEM returns
    V_R for the off-diagonal correlations only (k(k−1)/2 square). Fixed to
    extract the off-diagonal vech block, matching R's shape and ordering (so
    the bindings now return R's shape too). (crates/gsem-sem/src/rgmodel.rs)

  • enrich model-based functional enrichment. gsem's enrich
    (enrichment_test) computed the classic LDSC heritability-enrichment ratio
    (prop_h²/prop_SNPs, the original Finucane statistic), not R GenomicSEM's
    model-based enrich. R fits a SEM to a baseline annotation, fixes the
    regressions/loadings (per fix) to their baseline estimates, re-fits each
    annotation freeing the remaining parameters, and reports per-parameter
    enrichment = (est_annot/est_baseline)/Prop with SE and a 1-sided p-value.
    Added gsem_sem::enrich_model::model_enrichment (+ a shared gsem_sem::fit
    helper) implementing this with the regressions/covariances/variances fix
    modes, wired through the R binding (enrich(s_covstruc, model, params, fix))
    and the Python binding (model_enrichment). Validated against live R
    GenomicSEM on a factor-variance covstruc. gsem's classic ratio is retained
    as enrichment_test. (crates/gsem-sem/src/enrich_model.rs)

  • s_ldsc overlap-weighted partitioned heritability. Stratified LDSC
    returned per-annotation S as the raw coefficient contribution tau · M
    (R's S_Tau), missing R's overlap weighting. R computes each category's
    partitioned (co)heritability as S = overlap · (tau · M), where
    overlap[f,a] = M(f,a)/M_a and M(f,a) is the number of SNPs in both
    annotations f and a (from the .annot.gz membership + .frq MAF
    filter). These agree only for disjoint annotations and diverge for the
    realistic overlapping baselineLD model. Added a .annot.gz/.frq
    cross-product reader and applied the overlap transform to both S and the
    jackknife-based V; the result now exposes s_annot/v_annot (R's
    S/V) and s_tau/v_tau (R's S_Tau/V_Tau). Threaded through the R
    binding, Python binding, and CLI (new --frq flag). Validated against R
    on synthetic overlapping annotations.
    (crates/gsem-ldsc/src/stratified.rs, crates/gsem-ldsc/src/annot_reader.rs)

Added

  • In-CI R-equivalence coverage for the LDSC half (ldsc, munge,
    sumstats) via small synthetic GWAS + LD-score fixtures generated by
    running R GenomicSEM (tests/generate_synthetic_reference.R).
    Previously only the data-dependent bench script covered these.

  • R-equivalence coverage for summaryGLS, paLDSC (observed eigenvalue
    spectrum), rgmodel, and write.model
    (factor→indicator structure)
    via tests/generate_covstruc_reference.R, plus ML-estimator and
    3-factor SEM
    cases — broadening the function, estimator, and model-size
    combinations checked against R.

  • Tightened R-validation tolerances to observed precision (SEM /
    commonfactor estimates now checked at 1e-6 vs the prior 5e-2; fit
    indices at 1e-10), with a MEASURE=1 env-gated max-diff tracker.