A synthetic genome resource containing variants across different genomic contexts, designed for benchmarking and method development.
🪸 VARium: a synthetic genome collection for multi-platform SV benchmarking

Table of Contents

Overview
Dataset tiers
Data generation workflow
File types
SV placement
SV size
Read characteristics
Reproduce VARium data

Overview

VARium is an extensive suite of synthetic genomes designed to systematically assess the performance of structural variant (SV) discovery tools as a function of key domain parameters and confounders. The collection provides multi-platform, multi-coverage simulated WGS datasets with SVs stratified by type, size, genomic context, and genome complexity.

VARium enables rigorous, reproducible, and fine-grained stress-testing of recall and precision across diverse biological and technical conditions, while establishing an interpretable upper bound on achievable method performance.

All datasets are publicly accessible via Google Cloud: VARium Google Bucket

🧬 Dataset tiers

VARium currently comprises 119 synthetic genomes organized into two tiers:

  • Tier 1 (Recall-focused) - 79 genomes designed to evaluate simple SV recall
    • 4 variant types: DEL, DUP, INS, INV
    • 7 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp, 20k-100kbp, and 100k-1Mbp
    • 6 genomic contexts: MAPPABLE, NONUNIQUE, SEGDUP*, Alu**, L1HS**, TR** (*only contain variants up to 20k; **DEL only)
    • 3 platforms: Illumina, PacBio, ONT
    • 5 coverages: 0.5x, 5x, 10x, 15x, 30x
  • Tier 2 (Precision-focused) – 40 genomes designed to evaluate simple SV precision in the presence of complex variants
    • 4 complex SV types: dDUP, nrTRA, INV_dDUP, INV_nrTRA
    • 5 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp
    • 2 dispersion distances: 10–50kbp, 1–10Mbp
    • 3 platforms: Illumina, PacBio, ONT
    • 1 coverage: 30x

*(Figure: simulated SV event types)*

🖥 Data generation workflow

Each genome is generated by simulating specific SVs and performing context-aware SV placement into the hg38 reference genome using insilicoSV, followed by platform-specific read simulation, alignment, and assembly. The overall workflow is illustrated below:

*(Figure: VARium data generation workflow)*
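Under the assumption that the insilicoSV CLI takes a YAML config and that the workflow scripts accept the haplotype FASTAs as arguments (both are assumptions here, not documented interfaces), the per-genome pipeline can be sketched as:

```shell
# High-level sketch of the per-genome workflow; command names and arguments
# are illustrative assumptions -- the actual scripts live in workflows/.
insilicosv sv_config.yaml                          # SV simulation + context-aware placement into hg38
bash reads/pbsim_hifi.sh sim.hapA.fa sim.hapB.fa   # platform-specific read simulation (HiFi shown)
bash alignment_and_assembly/process_hifi.sh        # alignment, downsampling, assembly
```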

📦 File types

Each simulated genome includes:

| Data type | Description | File name | Size |
|---|---|---|---|
| Genome sequence | Modified hg38 reference genome with embedded SVs | `sim.hap[AB].fa` | 100M-1.6G |
| SVs | Truth set of simulated variants | `sim.sorted.vcf.gz` | 19k-146M |
| Illumina reads | Short-read alignment (0.5x-30x) | `sim.ilmn.minimap2.sorted.bam` | 2.6G-36G (30x) |
| HiFi reads | HiFi alignment (0.5x-30x) | `sim.hifi.minimap2.sorted.bam` | 2.7G-35G (30x) |
| ONT reads | ONT alignment (0.5x-30x) | `sim.ont.minimap2.sorted.bam` | 2.5G-33G (30x) |
| HiFi de novo assembly | Assembly from HiFi reads (30x or 15x) | `sim.hifi.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz` | 35M-486M |
| ONT de novo assembly | Assembly from ONT reads (30x or 15x) | `sim.ont.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz` | 38M-503M |
| Metadata | insilicoSV design parameters | `*.yaml` | 290-363 |

📌 SV placement

In Tier 1, SVs are placed within predefined genomic contexts to reflect real-world mapping challenges. insilicoSV configuration files used for SV simulation and placement are provided in the workflows/variants folder. Genomic contexts are defined as follows.

  1. **MAPPABLE**: regions outside RepeatMasker annotations, where mapping is relatively easy. In this regime, no SV breakpoint is allowed to overlap RMSK intervals.
  2. **NONUNIQUE**: regions from the GIAB mappability stratification group `GRCh38_nonunique_l250_m0_e0.bed.gz`. Intervals ≥150bp were retained to increase contextual difficulty. SVs are placed such that at least one breakpoint overlaps a NONUNIQUE interval.
  3. **SEGDUP**: regions from the GIAB Segmental Duplications stratification group `GRCh38_segdups.bed.gz`. SVs are placed such that all breakpoints are fully contained within a single segmental duplication interval.

Example: a 150-500bp homozygous deletion placed in different genomic contexts across three sequencing platforms *(samplot visualization)*.
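The MAPPABLE placement rule above (no SV breakpoint may overlap an RMSK interval) can be illustrated with a toy check on made-up intervals; the data and file names below are purely illustrative, and per-chromosome matching is omitted for brevity:

```shell
# Made-up repeat intervals and candidate SVs (whitespace-delimited for simplicity).
cat > rmsk.bed <<'EOF'
chr1 1000 2000
chr1 5000 6000
EOF
cat > svs.bed <<'EOF'
chr1 2500 3000 DEL_ok
chr1 1500 3000 DEL_bad
EOF

# Pass 1 loads the repeat intervals; pass 2 tests each SV's two breakpoints
# and rejects the SV if either one lands inside a repeat interval.
awk 'NR==FNR { lo[NR]=$2; hi[NR]=$3; n=NR; next }
     { bad = 0
       for (i = 1; i <= n; i++)
         if (($2 >= lo[i] && $2 < hi[i]) || ($3 >= lo[i] && $3 < hi[i])) bad = 1
       print $4, (bad ? "REJECT" : "PASS") }' rmsk.bed svs.bed
# prints:
#   DEL_ok PASS
#   DEL_bad REJECT
```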

📊 SV size

SV size distributions for Tier 1 and Tier 2 are shown below:

*(Figures: Tier 1 and Tier 2 SV size distributions)*

🔍 Read characteristics

Our simulated reads aim to approximate empirical sequencing characteristics.

Long-read datasets

Reads were simulated using pbsim3, sampling from publicly available datasets.

A read length comparison between the real data (used for sampling) and the simulated data at different coverages is shown below *(figure)*.

Illumina datasets

Fragment characteristics were modeled using Broad Clinical Labs sequencing runs on NovaSeq X:

  • Read length: 151 bp
  • Mean fragment length: 440 bp
  • Fragment standard deviation: 200 bp
  • Error rate: 0.001 (≈ Q30)
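For reference, these parameters map onto a dwgsim invocation along the following lines; this is a sketch, the coverage, seed, and file names are placeholders, and the authoritative command is in `workflows/reads/dwgsim_ilmn.sh`:

```shell
# -1/-2: read length per mate (151 bp); -d/-s: fragment mean/sd (440/200 bp);
# -e/-E: per-base error rate (0.001, ~Q30); -C: mean coverage; -z: random seed.
# Input FASTA and output prefix are placeholders.
dwgsim -1 151 -2 151 -d 440 -s 200 -e 0.001 -E 0.001 -C 15 -z 100 \
  sim.hapA.fa sim.hapA.ilmn
```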

⚙️ Reproduce VARium data

All resources required to reproduce the simulated reads, alignments, and de novo assemblies included in VARium are provided in the workflows/ folder.

1. Read simulation

Reads were simulated using scripts in:

```
reads/
├── dwgsim_ilmn.sh   # Illumina short-read simulation
├── pbsim_hifi.sh    # PacBio HiFi simulation
└── pbsim_ont.sh     # ONT simulation
```

All simulators use seed=100 to ensure reproducibility. Each script:

  • Takes the modified reference genome generated by insilicoSV as input
  • Simulates reads at 15x per haplotype to achieve overall 30x coverage
  • Produces FASTQ files for downstream alignment or de novo assembly
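A per-haplotype pbsim3 call consistent with these conventions might look like the following (a sketch only; the error-model file and output prefixes are assumptions, and the real invocation is in `reads/pbsim_hifi.sh`):

```shell
# pbsim3 installs a binary named `pbsim`. seed=100 and 15x per haplotype,
# as described above; the QSHMM model file and prefixes are assumptions.
for hap in A B; do
  pbsim --strategy wgs --method qshmm --qshmm QSHMM-RSII.model \
        --depth 15 --seed 100 \
        --genome sim.hap${hap}.fa --prefix sim.hap${hap}.hifi
done
```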

2. Alignment, assembly, and downsampling

Reads are processed using scripts in:

```
alignment_and_assembly/
├── process_hifi.sh   # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
├── process_ilmn.sh   # Alignment (minimap2), downsampling (rasusa)
└── process_ont.sh    # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
```
  • minimap2 presets are set per data type (HiFi, ONT, Illumina).
  • rasusa performs coverage downsampling using seed=100.
  • De novo assemblies are generated with hifiasm using 15× and 30× HiFi and ONT datasets.
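Taken together, the HiFi branch of this stage can be sketched as follows (paths, thread counts, and the genome-size value are placeholders; the actual commands are in `process_hifi.sh`):

```shell
# Align with minimap2's HiFi preset and coordinate-sort (placeholder paths).
minimap2 -ax map-hifi -t 16 hg38.fa sim.hifi.fastq.gz \
  | samtools sort -o sim.hifi.minimap2.sorted.bam
samtools index sim.hifi.minimap2.sorted.bam

# Downsample reads to a target coverage with a fixed seed.
rasusa --input sim.hifi.fastq.gz --coverage 15 --genome-size 3.1gb \
       --seed 100 --output sim.hifi.15x.fastq.gz

# De novo assembly with hifiasm.
hifiasm -o sim.hifi -t 16 sim.hifi.fastq.gz
```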

3. Regeneration Overview

To fully regenerate the VARium dataset:

  1. Select the simulated genome FASTA from the VARium Google bucket
  2. Run the appropriate script in reads/
  3. Process reads using the corresponding script in alignment_and_assembly/
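As a concrete sketch of these three steps (the bucket path below is a placeholder for the real VARium bucket URL, and the script argument conventions are assumptions):

```shell
# 1. Fetch a simulated genome's haplotype FASTAs (placeholder bucket path).
gsutil -m cp "gs://<varium-bucket>/<genome>/sim.hap*.fa" .
# 2. Simulate reads for the desired platform.
bash reads/pbsim_hifi.sh
# 3. Align, downsample, and assemble.
bash alignment_and_assembly/process_hifi.sh
```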
