A synthetic genome resource containing variants across different genomic contexts, designed for benchmarking and method development.
🪸 VARium: a synthetic genome collection for multi-platform SV benchmarking

Table of Contents

Overview
Dataset tiers
Data generation workflow
File types
SV placement
SV size
Read characteristics
Reproduce VARium data

Overview

VARium is an extensive suite of synthetic genomes designed to systematically assess the performance of structural variant (SV) discovery tools as a function of key domain parameters and confounders. The collection provides multi-platform, multi-coverage simulated WGS datasets with SVs stratified by type, size, genomic context, and genome complexity.

VARium enables rigorous, reproducible, and fine-grained stress-testing of recall and precision across diverse biological and technical conditions, while establishing an interpretable upper bound on achievable method performance.

All datasets are publicly accessible via Google Cloud: VARium Google Bucket

🧬 Dataset tiers

VARium currently comprises 119 synthetic genomes organized into two tiers:

  • Tier 1 (Recall-focused) - 79 genomes designed to evaluate simple SV recall
    • 4 variant types: DEL, DUP, INS, INV
    • 7 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp, 20k-100kbp, and 100k-1Mbp
    • 6 genomic contexts: MAPPABLE, NONUNIQUE, SEGDUP*, Alu**, L1HS**, TR** (*only contain variants up to 20k; **DEL only)
    • 3 platforms: Illumina, PacBio, ONT
    • 5 coverages: 0.5x, 5x, 10x, 15x, 30x
  • Tier 2 (Precision-focused) – 40 genomes designed to evaluate simple SV precision in the presence of complex variants
    • 4 complex SV types: dDUP, nrTRA, INV_dDUP, INV_nrTRA
    • 5 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp
    • 2 dispersion distances: 10–50kbp, 1–10Mbp
    • 3 platforms: Illumina, PacBio, ONT
    • 1 coverage: 30x

*(Figure: simulated SV event types)*

🖥 Data generation workflow

Each genome is generated by simulating specific SVs and performing context-aware SV placement into the hg38 reference genome using insilicoSV, followed by platform-specific read simulation, alignment, and assembly. The overall workflow is illustrated below:

*(Figure: VARium data generation workflow)*
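Under the assumption that the insilicoSV CLI takes a YAML config and that the workflow scripts accept the haplotype FASTAs as arguments (both are assumptions here, not documented interfaces), the per-genome pipeline can be sketched as:

```shell
# High-level sketch of the per-genome workflow; command names and arguments
# are illustrative assumptions -- the actual scripts live in workflows/.
insilicosv sv_config.yaml                          # SV simulation + context-aware placement into hg38
bash reads/pbsim_hifi.sh sim.hapA.fa sim.hapB.fa   # platform-specific read simulation (HiFi shown)
bash alignment_and_assembly/process_hifi.sh        # alignment, downsampling, assembly
```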

📦 File types

Each simulated genome includes:

| Data type | Description | File name | Size |
|---|---|---|---|
| Genome sequence | Modified hg38 reference genome with embedded SVs | `sim.hap[AB].fa` | 100M-1.6G |
| SVs | Truth set of simulated variants | `sim.sorted.vcf.gz` | 19k-146M |
| Illumina reads | Short-read alignment (0.5x-30x) | `sim.ilmn.minimap2.sorted.bam` | 2.6G-36G (30x) |
| HiFi reads | HiFi alignment (0.5x-30x) | `sim.hifi.minimap2.sorted.bam` | 2.7G-35G (30x) |
| ONT reads | ONT alignment (0.5x-30x) | `sim.ont.minimap2.sorted.bam` | 2.5G-33G (30x) |
| HiFi de novo assembly | Assembly from HiFi reads (30x or 15x) | `sim.hifi.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz` | 35M-486M |
| ONT de novo assembly | Assembly from ONT reads (30x or 15x) | `sim.ont.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz` | 38M-503M |
| Metadata | insilicoSV design parameters | `*.yaml` | 290-363 |

📌 SV placement

In Tier 1, SVs are placed within predefined genomic contexts to reflect real-world mapping challenges. insilicoSV configuration files used for SV simulation and placement are provided in the workflows/variants folder. Genomic contexts are defined as follows.

  1. **MAPPABLE**: regions outside RepeatMasker annotations, where mapping is relatively easy. In this regime, no SV breakpoint is allowed to overlap RMSK intervals.
  2. **NONUNIQUE**: regions from the GIAB mappability stratification group `GRCh38_nonunique_l250_m0_e0.bed.gz`. Intervals ≥150bp were retained to increase contextual difficulty. SVs are placed such that at least one breakpoint overlaps a NONUNIQUE interval.
  3. **SEGDUP**: regions from the GIAB Segmental Duplications stratification group `GRCh38_segdups.bed.gz`. SVs are placed such that all breakpoints are fully contained within a single segmental duplication interval.

Example: a 150-500bp homozygous deletion placed in different genomic contexts across three sequencing platforms *(samplot visualization)*.
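The MAPPABLE placement rule above (no SV breakpoint may overlap an RMSK interval) can be illustrated with a toy check on made-up intervals; the data and file names below are purely illustrative, and per-chromosome matching is omitted for brevity:

```shell
# Made-up repeat intervals and candidate SVs (whitespace-delimited for simplicity).
cat > rmsk.bed <<'EOF'
chr1 1000 2000
chr1 5000 6000
EOF
cat > svs.bed <<'EOF'
chr1 2500 3000 DEL_ok
chr1 1500 3000 DEL_bad
EOF

# Pass 1 loads the repeat intervals; pass 2 tests each SV's two breakpoints
# and rejects the SV if either one lands inside a repeat interval.
awk 'NR==FNR { lo[NR]=$2; hi[NR]=$3; n=NR; next }
     { bad = 0
       for (i = 1; i <= n; i++)
         if (($2 >= lo[i] && $2 < hi[i]) || ($3 >= lo[i] && $3 < hi[i])) bad = 1
       print $4, (bad ? "REJECT" : "PASS") }' rmsk.bed svs.bed
# prints:
#   DEL_ok PASS
#   DEL_bad REJECT
```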

📊 SV size

SV size distributions for Tier 1 and Tier 2 are shown below:

*(Figures: Tier 1 and Tier 2 SV size distributions)*

🔍 Read characteristics

Our simulated reads aim to approximate empirical sequencing characteristics.

Long-read datasets

Reads were simulated using pbsim3, sampling from publicly available datasets.

A read length comparison between the real data (used for sampling) and the simulated data at different coverages is shown below *(figure)*.

Illumina datasets

Fragment characteristics were modeled using Broad Clinical Labs sequencing runs on NovaSeq X:

  • Read length: 151 bp
  • Mean fragment length: 440 bp
  • Fragment standard deviation: 200 bp
  • Error rate: 0.001 (≈ Q30)
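For reference, these parameters map onto a dwgsim invocation along the following lines; this is a sketch, the coverage, seed, and file names are placeholders, and the authoritative command is in `workflows/reads/dwgsim_ilmn.sh`:

```shell
# -1/-2: read length per mate (151 bp); -d/-s: fragment mean/sd (440/200 bp);
# -e/-E: per-base error rate (0.001, ~Q30); -C: mean coverage; -z: random seed.
# Input FASTA and output prefix are placeholders.
dwgsim -1 151 -2 151 -d 440 -s 200 -e 0.001 -E 0.001 -C 15 -z 100 \
  sim.hapA.fa sim.hapA.ilmn
```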

⚙️ Reproduce VARium data

All resources required to reproduce the simulated reads, alignments, and de novo assemblies included in VARium are provided in the workflows/ folder.

1. Read simulation

Reads were simulated using scripts in:

```
reads/
├── dwgsim_ilmn.sh   # Illumina short-read simulation
├── pbsim_hifi.sh    # PacBio HiFi simulation
└── pbsim_ont.sh     # ONT simulation
```

All simulators use seed=100 to ensure reproducibility. Each script:

  • Takes the modified reference genome generated by insilicoSV as input
  • Simulates reads at 15x per haplotype to achieve overall 30x coverage
  • Produces FASTQ files for downstream alignment or de novo assembly
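A per-haplotype pbsim3 call consistent with these conventions might look like the following (a sketch only; the error-model file and output prefixes are assumptions, and the real invocation is in `reads/pbsim_hifi.sh`):

```shell
# pbsim3 installs a binary named `pbsim`. seed=100 and 15x per haplotype,
# as described above; the QSHMM model file and prefixes are assumptions.
for hap in A B; do
  pbsim --strategy wgs --method qshmm --qshmm QSHMM-RSII.model \
        --depth 15 --seed 100 \
        --genome sim.hap${hap}.fa --prefix sim.hap${hap}.hifi
done
```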

2. Alignment, assembly, and downsampling

Reads are processed using scripts in:

```
alignment_and_assembly/
├── process_hifi.sh   # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
├── process_ilmn.sh   # Alignment (minimap2), downsampling (rasusa)
└── process_ont.sh    # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
```
  • minimap2 presets are set per data type (HiFi, ONT, Illumina).
  • rasusa performs coverage downsampling using seed=100.
  • De novo assemblies are generated with hifiasm using 15× and 30× HiFi and ONT datasets.
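Taken together, the HiFi branch of this stage can be sketched as follows (paths, thread counts, and the genome-size value are placeholders; the actual commands are in `process_hifi.sh`):

```shell
# Align with minimap2's HiFi preset and coordinate-sort (placeholder paths).
minimap2 -ax map-hifi -t 16 hg38.fa sim.hifi.fastq.gz \
  | samtools sort -o sim.hifi.minimap2.sorted.bam
samtools index sim.hifi.minimap2.sorted.bam

# Downsample reads to a target coverage with a fixed seed.
rasusa --input sim.hifi.fastq.gz --coverage 15 --genome-size 3.1gb \
       --seed 100 --output sim.hifi.15x.fastq.gz

# De novo assembly with hifiasm.
hifiasm -o sim.hifi -t 16 sim.hifi.fastq.gz
```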

3. Regeneration Overview

To fully regenerate the VARium dataset:

  1. Select the simulated genome FASTA from the VARium Google bucket
  2. Run the appropriate script in reads/
  3. Process reads using the corresponding script in alignment_and_assembly/
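As a concrete sketch of these three steps (the bucket path below is a placeholder for the real VARium bucket URL, and the script argument conventions are assumptions):

```shell
# 1. Fetch a simulated genome's haplotype FASTAs (placeholder bucket path).
gsutil -m cp "gs://<varium-bucket>/<genome>/sim.hap*.fa" .
# 2. Simulate reads for the desired platform.
bash reads/pbsim_hifi.sh
# 3. Align, downsample, and assemble.
bash alignment_and_assembly/process_hifi.sh
```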
