Added
-
Ported
read_fusion(gsem::io::fusion_reader+ CLIgsem read-fusion). Reads raw FUSION TWAS.datassociation files and converts
each trait'sTWAS.Zinto a standardized gene-expression effect/SE (binary
traits via the liability conversioneffect/√(effect²·HSQ + π²/3),
continuous viaZ/√(N·HSQ)), supports the permutation-Z path, then
inner-joins traits on Gene+Panel into the merged TWAS format that
twas_reader/multiGene/userGWAS(TWAS=TRUE)consume. This is the
on-ramp from raw FUSION output into the Rust TWAS path (previously only
pre-merged files could be read). Validated against live Rread_fusion. -
Ported
subSV(gsem_matrix::vech::subset_sv). Subsetsvech(S)and
theVsampling-covariance block by a set of 1-based vech positions, for
both the with-diagonal (TYPE="S"/"S_Stand") and off-diagonal
(TYPE="R") numbering. Validated against R to 1e-12. (R's matrix-input
path has an undefined-RMATRIXbug; the port matches the bug-free
LDSC_OBJECTpath.) -
Ported
summaryGLSbandsnumeric core
(gsem::stats::gls::summary_gls_bands). GLS fit plus the confidence-band
data — predictor grid, fitted line, and ±BAND_SIZE·SE envelope at
INTERVALSpoints, withINTERCEPT/QUAD/CONTROLVARSsupport.
Validated against R to 1e-9. The ggplot rendering is not ported. -
Per-option R-equivalence coverage for the drop-in surface. New
real-package fixtures and tests for previously-untested option branches:
userGWASestimation="ML"/GC={conserv,none}/Q_SNP/std.lv;
ldscstand=TRUE(S_Stand/V_Stand) /select="ODD"/chisq.max/
liability-scale;sumstatsambig=TRUE(full numeric parity); a
non-degenerate misspecified-model CFI/chisq fixture; thepaLDSCdiag
branch; and bindings parity suites (R testthat + Python pytest). -
Coverage instrumentation in CI. A
cargo-llvm-covcoverage job with a
ratcheting floor, Codecov upload + README badge, and an advisory R-side
covr job. -
R documentation site (pkgdown). A published docs site for the
gsemr
binding (reference index + Compatibility/Architecture articles generated
from the repo-root docs), deployed to GitHub Pages, with a navbar link to
the upstream GenomicSEM project.
Fixed
-
std.lvleft the factor scale unidentified.parse_model(std_lv=true)
freed the auto-added latent variance instead of fixing it to 1 (lavaan's
std.lv=TRUEconvention:auto.fix.first=FALSE+ latent variances fixed
to 1). Any std.lv model —userGWAS/usermodel/rgmodel(std_lv=)—
estimated an unidentified factor scale and was ~12% off R. Now the
auto-add fixes the latent variance understd_lv. (crates/gsem-sem/src/syntax.rs) -
Q_SNP heterogeneity statistic + LDSC intercept floor.
compute_q_snp
computed the wrong quadratic form, and the LDSC intercept diagonal was not
floored at 1.0 before building the per-SNPV(R'suserGWAS.R:156).
Both fixed; Q_SNP now matches R to ~1e-9.
(crates/gsem/src/gwas/{q_snp,gc_correction,user_gwas}.rs)
The following entries accumulated on master after 0.1.3 and ship in 0.2.0:
Added
- R-equivalence coverage for
multiSNP,multiGene,simLDSC, and
hdl. New live-R fixtures and tests validate the four previously
untested functions, closing the last function-coverage gaps.multiSNP
andmultiGenereproduce R's augmentedS_Full/V_Fullmatrices to
1e-12/1e-14;simLDSC's deterministic per-SNP Z covariance matches R's
exactvarZ/covZalgebra to 1e-9;hdlreproduces R's genetic-
covariance matrixSto optimiser-level precision on a synthetic LD
panel built in R's exact.rda/.bimformat.
Changed
hdlevaluated in the eigenspace (match R). gsem's HDL likelihood
was fed raw LD scores aslamand the raw per-SNPbhat, whereas R
GenomicSEM/HDL evaluates the likelihood in the LD-block eigenspace:
lam= eigenvalues of the block correlation matrix and
bstar = Vᵀ·bhat(projection onto eigenvectors).LdPiecenow carries
the per-piece eigenvalues/eigenvectors (read from the.rdapanel, or
the newchr*.eigen.tsvtext file emitted byconvert_hdl_panels), the
per-trait reference N uses R'smedian(N), the likelihood floor matches
R'sexp(-18), and the per-piece optimiser is a projected-gradient
method mirroring R'soptim(L-BFGS-B). (crates/gsem-ldsc/src/hdl.rs)
Fixed
-
simLDSCphenotypic-overlap term. The off-diagonal environmental
contribution to the per-SNP Z covariance was
rPheno·n_overlap·√(NᵢNⱼ)/n_snps, carrying a spurious√(NᵢNⱼ)/n_snps
factor. R GenomicSEM usesrPheno_ij · N_ij/√(NᵢNⱼ)(=rPheno·n_overlap
for scalar sample overlap). Fixed and exposed as the deterministic
per_snp_z_cov, validated against R's construction.
(crates/gsem/src/stats/simulation.rs) -
multiSNP/multiGeneV_Fullconstruction.run_multi_snpbuilt
V_Fullas a diagonal approximation that ignored the LDSC intercept
matrix. The newbuild_multi_snp_svreproduces R's full sampling
covariance — SNP-trait variances(SE·I_diag·varSNP)², cross-trait
within-SNP, cross-SNP within-trait, and cross-SNP cross-trait blocks,
plus the trait-traitV_LDblock — matching R bit-for-bit.
(crates/gsem/src/gwas/multi_snp.rs)
Documented R bugs (gsem implements the correct behavior)
-
multiSNP/multiGeneconstant cross-SNP cross-trait LD. R weights
every cross-SNP cross-trait sampling-covariance cell by a single
constant LD value (LD2[(f²−f)/2], the last lower-triangle entry)
rather than the actual SNP-pair LD. gsem uses the correct per-pair
LD[a,b]; the two coincide when the off-diagonal LD is constant (as in
the equivalence fixtures). A unit test locks gsem's corrected path. -
multiGeneaborts for k ≥ 2 traits. R'smultiGeneassigns into
V_SNP[y,x]— a variable that does not exist inmultiGene(a typo for
V_Gene) — so the function errors withobject 'V_SNP' not found
whenever there is more than one trait. gsem implements the corrected
algorithm (identical tomultiSNPwith gene heritabilities as the
"variances"); the reference is generated from a minimally-patched
multiGene. -
hdlintercept weak identification. The per-piece HDL intercept
enters the likelihood only asint·lam/Nand is weakly identified on
small panels; R'soptimdrifts on some LD blocks while gsem's gradient
method recovers the construction-true intercept. The genetic-covariance
matrixS(HDL's primary output) is robustly identified and matches R. -
sumstatsbeta standardization. gsem'ssumstatswrote the raw
input betas/SEs instead of standardizing them like R GenomicSEM. Since
userGWAS/commonfactorGWASbuildcov(SNP, trait) = varSNP · beta,
this propagated asqrt(varSNP)scale error into per-SNP GWAS output.
Now implements all of R's modes —OLS(beta = Z/√(N·varSNP)),
linprob,se.logit, the default logistic "none" transform, and
betaspass-through — reading the P and N columns and deriving Z from
P. Validated formula-for-formula against R for every mode.
(crates/gsem/src/sumstats.rs) -
sumstatsambiguous-SNP default. R GenomicSEM removes
strand-ambiguous SNPs (A/T, C/G) only whenambig=TRUE; its default
keeps them. The R and Python bindings (and the CLI) mapped
keep_ambig = ambig, so their default dropped ~⅓ of SNPs where R keeps
them. Fixed tokeep_ambig = !ambigacross the R binding, Python
binding, and CLI (which gains an--ambigflag), matching R's default. -
rgmodelV_R dimension.rgmodelreturnedV_Ras the full
kstar × kstarsampling covariance ofvech(R)(including the fixed
diagonal-1 correlations, which have ~zero variance). R GenomicSEM returns
V_Rfor the off-diagonal correlations only (k(k−1)/2square). Fixed to
extract the off-diagonal vech block, matching R's shape and ordering (so
the bindings now return R's shape too). (crates/gsem-sem/src/rgmodel.rs) -
enrichmodel-based functional enrichment. gsem'senrich
(enrichment_test) computed the classic LDSC heritability-enrichment ratio
(prop_h²/prop_SNPs, the original Finucane statistic), not R GenomicSEM's
model-basedenrich. R fits a SEM to a baseline annotation, fixes the
regressions/loadings (perfix) to their baseline estimates, re-fits each
annotation freeing the remaining parameters, and reports per-parameter
enrichment = (est_annot/est_baseline)/Propwith SE and a 1-sided p-value.
Addedgsem_sem::enrich_model::model_enrichment(+ a sharedgsem_sem::fit
helper) implementing this with the regressions/covariances/variances fix
modes, wired through the R binding (enrich(s_covstruc, model, params, fix))
and the Python binding (model_enrichment). Validated against live R
GenomicSEM on a factor-variance covstruc. gsem's classic ratio is retained
asenrichment_test. (crates/gsem-sem/src/enrich_model.rs) -
s_ldscoverlap-weighted partitioned heritability. Stratified LDSC
returned per-annotationSas the raw coefficient contributiontau · M
(R'sS_Tau), missing R's overlap weighting. R computes each category's
partitioned (co)heritability asS = overlap · (tau · M), where
overlap[f,a] = M(f,a)/M_aandM(f,a)is the number of SNPs in both
annotationsfanda(from the.annot.gzmembership +.frqMAF
filter). These agree only for disjoint annotations and diverge for the
realistic overlapping baselineLD model. Added a.annot.gz/.frq
cross-product reader and applied the overlap transform to bothSand the
jackknife-basedV; the result now exposess_annot/v_annot(R's
S/V) ands_tau/v_tau(R'sS_Tau/V_Tau). Threaded through the R
binding, Python binding, and CLI (new--frqflag). Validated against R
on synthetic overlapping annotations.
(crates/gsem-ldsc/src/stratified.rs,crates/gsem-ldsc/src/annot_reader.rs)
Added
-
In-CI R-equivalence coverage for the LDSC half (
ldsc,munge,
sumstats) via small synthetic GWAS + LD-score fixtures generated by
running R GenomicSEM (tests/generate_synthetic_reference.R).
Previously only the data-dependent bench script covered these. -
R-equivalence coverage for
summaryGLS,paLDSC(observed eigenvalue
spectrum),rgmodel, andwrite.model(factor→indicator structure)
viatests/generate_covstruc_reference.R, plus ML-estimator and
3-factor SEM cases — broadening the function, estimator, and model-size
combinations checked against R. -
Tightened R-validation tolerances to observed precision (SEM /
commonfactor estimates now checked at 1e-6 vs the prior 5e-2; fit
indices at 1e-10), with aMEASURE=1env-gated max-diff tracker.