
This is a work in progress, and shouldn't be used in production. Any feedback or contributions are greatly appreciated!

genomicframe-core → GenomicFrame


High-performance, memory-efficient genomics I/O and interoperability layer written entirely in Rust.

Overview

genomicframe-core is the foundation for GenomicFrame — a high-performance genomics analytics library written in Rust and exposed to Python via PyO3. The end goal is a first-class Python library for use in Jupyter notebooks and Nextflow pipelines, with Rust-level performance and none of CPython's bottlenecks. This project addresses the limitations of existing Python-based bioinformatics libraries by providing:

  • Blazing-fast I/O for VCF, BAM, SAM, FASTA, FASTQ, BED, and GFF/GTF formats
  • Memory efficiency through zero-copy streaming and O(1) memory regardless of file size
  • Composable pipelines with a LogicalPlan query API and lazy evaluation
  • Rust-level performance with safe, concurrent execution via Rayon

Performance

These numbers are measured — not extrapolated. See BENCHMARKS.md for full methodology and reproduction steps.

VCF Processing (Criterion.rs benchmarks)

| Task | genomicframe-core | PyVCF | cyvcf2 | scikit-allel |
|------|-------------------|-------|--------|--------------|
| Parse 10K variants | 4.5 ms | 850 ms | 65 ms | 120 ms |
| Compute full statistics | 6.1 ms | 1,200 ms | N/A | 200 ms |
| Peak memory | ~100 KB | ~50 MB | ~25 MB | ~80 MB |

Throughput: ~1.6 million variants/second with full statistics (Ts/Tv, variant classification, quality, filter distribution) computed in a single streaming pass — no reopen, no second loop.
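The single-pass design can be sketched in miniature. The struct below is a hypothetical simplification (the crate's actual VcfStats tracks many more counters), showing how Ts/Tv classification folds into the same loop that consumes the stream:

```rust
// Sketch of the single-pass statistics idea (not the crate's actual types):
// every counter is updated while the record stream is consumed exactly once.

#[derive(Default, Debug)]
struct StreamingStats {
    total: u64,
    transitions: u64,   // A<->G, C<->T substitutions
    transversions: u64, // all other single-base substitutions
}

impl StreamingStats {
    fn update(&mut self, reference: &str, alt: &str) {
        self.total += 1;
        if reference.len() == 1 && alt.len() == 1 {
            match (reference.as_bytes()[0], alt.as_bytes()[0]) {
                (b'A', b'G') | (b'G', b'A') | (b'C', b'T') | (b'T', b'C') => {
                    self.transitions += 1
                }
                _ => self.transversions += 1,
            }
        }
    }

    fn ts_tv_ratio(&self) -> Option<f64> {
        (self.transversions > 0)
            .then(|| self.transitions as f64 / self.transversions as f64)
    }
}

fn main() {
    let mut stats = StreamingStats::default();
    // One pass over (REF, ALT) pairs: no buffering, no second loop.
    for (r, a) in [("A", "G"), ("C", "A"), ("T", "C"), ("G", "T")] {
        stats.update(r, a);
    }
    println!("{} variants, Ts/Tv = {:?}", stats.total, stats.ts_tv_ratio());
}
```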

Scaling to Production Files

| File Size | Variants | genomicframe-core (streaming) | PyVCF | scikit-allel |
|-----------|----------|-------------------------------|-------|--------------|
| 10 MB | 10,000 | 10 KB | 50 MB | 80 MB |
| 1 GB | 1,000,000 | 10 KB | 5 GB | 8 GB |
| 10 GB | 10,000,000 | 10 KB | OOM | OOM |

Memory usage is flat: the Python libraries scale linearly with file size and eventually crash, while the streaming reader processes a 10 GB file with the same footprint as a 1 MB file.
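The flat-memory behavior comes down to a standard streaming pattern: one reusable line buffer, no matter how many lines pass through it. A minimal illustration (not the crate's internals):

```rust
use std::io::{BufRead, BufReader, Cursor};

// Illustration of the flat-memory principle: one line buffer allocated once
// and reused, so a 10 GB file and a 10 KB file share the same working set.
fn count_records<R: BufRead>(mut input: R) -> std::io::Result<u64> {
    let mut line = String::new(); // allocated once, reused per record
    let mut count = 0;
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            break; // EOF
        }
        if !line.starts_with('#') {
            count += 1; // a VCF data line; header lines are skipped
        }
    }
    Ok(count)
}

fn main() -> std::io::Result<()> {
    let vcf = "##fileformat=VCFv4.2\n#CHROM\tPOS\n22\t100\n22\t200\n";
    let n = count_records(BufReader::new(Cursor::new(vcf)))?;
    println!("{n} records"); // → 2 records
    Ok(())
}
```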

Key advantages:

  • 10–200x faster than Python equivalents
  • 250–800x less memory via streaming architecture
  • O(1) memory regardless of file size
  • Process 1M+ variants in under 1 second

Format Support

| Format | Extensions | Status |
|--------|------------|--------|
| VCF | `.vcf`, `.vcf.gz` | Production |
| BAM | `.bam` (BGZF) | Production |
| SAM | `.sam` | Production |
| BED | `.bed`, `.bed.gz` | Production |
| FASTQ | `.fastq`, `.fq`, `.gz` | Production |
| FASTA | `.fasta`, `.fa`, `.gz` | Production |
| GFF3/GTF | `.gff`, `.gff3`, `.gtf`, `.gz` | Production |

All formats share the same GenomicRecordIterator trait and convert to a unified GenomicInterval coordinate system.
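The crate's actual trait definition isn't reproduced in this README; the sketch below shows a hypothetical shape consistent with the `next_record()?` calls used throughout it:

```rust
// Hypothetical shape of a shared record-iterator trait, inferred from the
// `reader.next_record()?` usage in this README — not the crate's definition.
trait GenomicRecordIterator {
    type Record;
    /// Returns Ok(None) at end of input, mirroring a fallible iterator.
    fn next_record(&mut self) -> Result<Option<Self::Record>, String>;
}

// A toy in-memory reader implementing the trait.
struct VecReader {
    records: Vec<(String, u64)>, // (chrom, pos)
    idx: usize,
}

impl GenomicRecordIterator for VecReader {
    type Record = (String, u64);
    fn next_record(&mut self) -> Result<Option<Self::Record>, String> {
        let rec = self.records.get(self.idx).cloned();
        self.idx += 1;
        Ok(rec)
    }
}

fn main() -> Result<(), String> {
    let mut reader = VecReader {
        records: vec![("chr22".into(), 100), ("chr22".into(), 250)],
        idx: 0,
    };
    // The same while-let loop works for any format implementing the trait.
    while let Some((chrom, pos)) = reader.next_record()? {
        println!("{chrom}:{pos}");
    }
    Ok(())
}
```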


Core API

Streaming (simple, direct)

Every format follows the same pattern with O(1) memory:

```rust
use genomicframe_core::core::GenomicRecordIterator;
use genomicframe_core::formats::vcf::VcfReader;

let mut reader = VcfReader::from_path("variants.vcf.gz")?;
while let Some(record) = reader.next_record()? {
    if record.is_pass() && record.qual.unwrap_or(0.0) > 30.0 {
        println!("{}\t{}\t{} -> {:?}", record.chrom, record.pos, record.reference, record.alt);
    }
}
```

LogicalPlan (composable, ergonomic)

The LogicalPlan API is the central interface for filtered queries. It compiles filter expressions into format-native operations — no intermediate allocations.

```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::schema::FileFormat;
use genomicframe_core::execution::execute;

// Build a lazy query — nothing runs yet
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(100_000)
    .filter(
        col("qual").gte(lit(30.0))
            .and(col("chrom").eq(lit("chr22")))
    );

// Execute — streaming, filtered at read time
let result = execute(plan)?;
```

Statistics

```rust
use genomicframe_core::formats::vcf::{VcfReader, VcfStats};

let mut reader = VcfReader::from_path("variants.vcf.gz")?;

// All computed in one streaming pass
let stats = VcfStats::compute(&mut reader)?;
stats.print_summary();

println!("Ts/Tv ratio:  {:.3}", stats.ts_tv_ratio().unwrap());
println!("SNP count:    {}", stats.snp_count());
println!("Mean quality: {:.2}", stats.mean_quality().unwrap());
```

For FASTQ, parallel computation is available:

```rust
use genomicframe_core::formats::fastq::FastqStats;

// Multi-threaded, O(threads) memory
let stats = FastqStats::par_compute("reads.fastq.gz", None)?;
println!("Q30+ reads: {} ({:.1}%)", stats.high_quality_reads_q30,
    (stats.high_quality_reads_q30 as f64 / stats.total_reads as f64) * 100.0);
```
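The idea behind parallel computation here (split the input into chunks, reduce each chunk independently, then merge partial results) can be sketched with the standard library. The crate itself uses Rayon; the function and Q30 threshold below are illustrative stand-ins:

```rust
use std::thread;

// Chunked-parallel sketch: each thread reduces its own chunk, and only the
// partial counts are merged, so memory is O(threads) rather than O(reads).
fn count_q30(qualities: &[u8]) -> u64 {
    qualities.iter().filter(|&&q| q >= 30).count() as u64
}

fn par_count_q30(qualities: &[u8], threads: usize) -> u64 {
    let chunk = (qualities.len() + threads - 1) / threads.max(1);
    thread::scope(|s| {
        qualities
            .chunks(chunk.max(1))
            .map(|c| s.spawn(move || count_q30(c)))
            .collect::<Vec<_>>() // spawn every worker before joining any
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    })
}

fn main() {
    let quals: Vec<u8> = (0..40).collect(); // per-read mean qualities 0..39
    println!("Q30+ reads: {}", par_count_q30(&quals, 4)); // → 10 (30..=39)
}
```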

Real-World Pipeline Example

The following is drawn from internal_examples/complete_ecosystem.rs — a full four-stage genomics pipeline demonstrating how the API composes across formats.

Stage 1: Pre-Alignment QC

```rust
use genomicframe_core::formats::fastq::FastqStats;
use genomicframe_core::formats::fasta::reader::FastaReader;
use genomicframe_core::core::GenomicRecordIterator;

// FASTQ quality control (parallel, O(threads) memory)
let stats = FastqStats::par_compute("reads.fastq.gz", None)?;
println!("Mean quality: {:.1}", stats.mean_quality().unwrap_or(0.0));
println!("Q30+ reads:   {:.1}%",
    (stats.high_quality_reads_q30 as f64 / stats.total_reads as f64) * 100.0);

// Reference FASTA statistics (streaming)
let mut reader = FastaReader::from_path("reference.fa")?;
while let Some(record) = reader.next_record()? {
    // Compute GC%, N-content per sequence
}
```

Stage 2: Build Filtered Annotation Databases

LogicalPlan acts as the primary interface for loading annotations. Three lines from query to indexed annotation database:

```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::schema::FileFormat;
use genomicframe_core::execution::execute;

// Large genes only (>= 1000 bp) — filtered during load
let genes = execute(
    LogicalPlan::scan("genes.bed", FileFormat::Bed)
        .filter(col("length").gte(lit(1000)))
)?.to_annotation_index(|r| r.name.clone().unwrap_or_else(|| "unknown".to_string()))?;

// Chr22 exons only — compound filter with .and()
let exons = execute(
    LogicalPlan::scan("exons.bed", FileFormat::Bed)
        .filter(col("chrom").eq(lit("chr22")).and(col("length").gte(lit(50))))
)?.to_annotation_index(|r| format!("exon_{}", r.name.as_deref().unwrap_or("unknown")))?;

println!("Loaded {} genes, {} exons", genes.len(), exons.len());
```

No manual iteration. No intermediate Vec. The filter executes at read time.
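Read-time filtering can be pictured as an iterator adapter whose predicate runs inside the pull loop, so rejected records never reach a collection. A simplified stand-in (not the crate's actual filtered-reader type):

```rust
// Sketch of filter-at-read-time: the predicate runs as records are pulled,
// so rejected records are dropped before any intermediate Vec exists.
struct FilteredReader<I, F> {
    inner: I,
    predicate: F,
}

impl<I, F, T> Iterator for FilteredReader<I, F>
where
    I: Iterator<Item = T>,
    F: FnMut(&T) -> bool,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        // Keep pulling from the source until a record passes the filter.
        let Self { inner, predicate } = self;
        inner.find(|r| predicate(r))
    }
}

fn main() {
    // (chrom, length) records streamed from a hypothetical BED source.
    let source = vec![("chr1", 500u32), ("chr22", 1500), ("chr22", 80)].into_iter();
    let filtered = FilteredReader {
        inner: source,
        predicate: |&(chrom, len): &(&str, u32)| chrom == "chr22" && len >= 1000,
    };
    let hits: Vec<_> = filtered.collect();
    println!("{hits:?}"); // → [("chr22", 1500)]
}
```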

Stage 3: BAM Filtering + Annotation

```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::execution::execute;
use genomicframe_core::schema::FileFormat;
use genomicframe_core::expression::{col, lit};

// Filter high-quality reads and annotate with gene/exon overlaps
let result = execute(
    LogicalPlan::scan("alignments.bam", FileFormat::Bam)
        .filter(
            col("mapq").gte(lit(30))
                .and(col("refid").eq(lit("22")))
        )
)?.annotate_with_genes(&genes, &exons)?;

result.print_summary();
```

Stage 4: VCF Filtering + Annotation

```rust
// Scan first 50K records, filter high-quality chr22 variants, annotate
let result = execute(
    LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
        .max_scan(50_000)
        .filter(
            col("qual").gte(lit(30.0))
                .and(col("chrom").eq(lit("22")))
        )
)?.annotate_with_genes(&genes, &exons)?;

result.print_summary();
// → genic variants, exonic variants, intergenic counts
```

The same annotate_with_genes pattern works across BAM and VCF — GenomicInterval unifies the coordinate systems.


Unified Interval System

All formats convert to GenomicInterval for cross-format operations. Coordinate system conversions (1-based ↔ 0-based half-open) happen at the conversion boundary, not in application code:

// Each conversion handles its format's coordinate convention
let vcf_interval: GenomicInterval = (&vcf_record).into(); // 1-based → 0-based
let bam_interval: GenomicInterval = (&bam_record).into(); // 0-based + CIGAR length
let bed_interval: GenomicInterval = (&bed_record).into(); // 0-based, direct
let gff_interval: GenomicInterval = (&gff_record).into(); // 1-based inclusive → 0-based half-open

// Cross-format overlap queries
if vcf_interval.overlaps(&bed_interval) {
    println!("Variant falls in annotated region");
}

AnnotationIndex uses an interval tree for O(log n) overlap queries, built from any BED or GFF source.
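As an illustration of the half-open convention, here is a minimal `overlaps` check plus a VCF (1-based) conversion; the types are simplified stand-ins for the crate's own:

```rust
// Sketch of the unified 0-based half-open interval and its overlap test.
// Simplified stand-in types; the crate's AnnotationIndex layers an
// interval tree over intervals like these for O(log n) queries.
#[derive(Debug, Clone, PartialEq)]
struct GenomicInterval {
    chrom: String,
    start: u64, // 0-based, inclusive
    end: u64,   // exclusive
}

impl GenomicInterval {
    fn overlaps(&self, other: &GenomicInterval) -> bool {
        self.chrom == other.chrom && self.start < other.end && other.start < self.end
    }
}

// A VCF record is 1-based: POS 100 becomes the half-open span [99, 100).
fn from_vcf(chrom: &str, pos_1based: u64, ref_len: u64) -> GenomicInterval {
    GenomicInterval {
        chrom: chrom.to_string(),
        start: pos_1based - 1,
        end: pos_1based - 1 + ref_len,
    }
}

fn main() {
    let exon = GenomicInterval { chrom: "chr22".into(), start: 90, end: 120 };
    let variant = from_vcf("chr22", 100, 1); // [99, 100)
    assert!(variant.overlaps(&exon));
    println!("Variant falls in annotated region");
}
```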


Path to GenomicFrame

genomicframe-core is intentionally structured as two separable tiers:

```
genomicframe-core (this crate)        genomicframe (future crate)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━       ━━━━━━━━━━━━━━━━━━━━━━━━━━━
Format readers (VCF, BAM, ...)   →   GenomicFrame::scan_vcf(...)
LogicalPlan + expression AST     →   .filter(col("qual") > 30)
ExecutionResult + annotation     →   .annotate_with_genes(...)
AnnotationIndex (interval tree)  →   .join_by_overlap(exons)
Statistics (VcfStats, BamStats)  →   .vcf_stats() / .bam_stats()
                                      ↓
                                      .collect_polars() / .collect_arrow()
```

The LogicalPlan and expression modules in this crate are the foundation. genomicframe will add:

  • Query optimization — predicate pushdown, filter reordering
  • Parallel execution — chunk-based Rayon dispatch
  • Index-aware scans — tabix/BAI random access when available
  • Arrow/Polars output — for downstream data science tooling
  • Auto-detection — GenomicFrame::scan("file.vcf.gz") detects format from path

When a LogicalPlan is handed to genomicframe, it can inspect the expression tree, detect indexed files, push filters to the read layer, and dispatch chunks across threads — without changing the query authoring API.
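A toy version of that inspection, with a hypothetical four-variant AST standing in for the crate's expression module:

```rust
// Hypothetical mini expression AST, illustrating how an executor can walk
// the tree to find pushable per-column predicates. Not the crate's types.
#[derive(Debug)]
enum Expr {
    Col(&'static str),
    Lit(f64),
    Gte(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Collect columns that appear in simple `col >= lit` comparisons,
/// i.e. the candidates a reader could evaluate during the scan itself.
fn pushable_columns(expr: &Expr, out: &mut Vec<&'static str>) {
    match expr {
        Expr::And(l, r) => {
            pushable_columns(l, out);
            pushable_columns(r, out);
        }
        Expr::Gte(l, _) => {
            if let Expr::Col(name) = **l {
                out.push(name);
            }
        }
        _ => {}
    }
}

fn main() {
    // qual >= 30 AND mapq >= 20
    let expr = Expr::And(
        Box::new(Expr::Gte(Box::new(Expr::Col("qual")), Box::new(Expr::Lit(30.0)))),
        Box::new(Expr::Gte(Box::new(Expr::Col("mapq")), Box::new(Expr::Lit(20.0)))),
    );
    let mut cols = Vec::new();
    pushable_columns(&expr, &mut cols);
    println!("push down: {cols:?}"); // → push down: ["qual", "mapq"]
}
```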

Target Environments

genomicframe is primarily a Python library compiled from Rust via PyO3. The intended deployment targets are:

  • Jupyter notebooks — interactive genomics analysis with Polars/Pandas interop for visualisation and ML
  • Nextflow pipelines — drop-in replacement for Python bioinformatics tools in DSL2 process blocks, with no GIL, no OOM risk, and predictable latency
```python
# Future genomicframe Python API
import genomicframe as gf

# Nextflow process: annotate variants
variants = (
    gf.scan("variants.vcf.gz")          # auto-detects VCF
      .filter("qual >= 30 AND chrom == '22'")
      .annotate(gf.scan("exons.bed"))   # interval overlap join
      .to_polars()                      # hand off to Polars for reporting
)
```

The Rust core handles all I/O and compute. Python only touches results.


Installation

```toml
[dependencies]
genomicframe-core = "0.1"
```

Optional Features

```toml
[dependencies]
genomicframe-core = { version = "0.1", features = ["async", "cloud"] }
```
  • async — Enable async I/O with Tokio
  • cloud — S3/GCS/Azure cloud storage via object_store

Project Structure

```
genomicframe-core/
├── src/
│   ├── lib.rs              # Crate entry point
│   ├── core.rs             # Core traits (GenomicRecordIterator, etc.)
│   ├── error.rs            # Unified error handling
│   ├── expression.rs       # Expression AST (col, lit, eq, gte, and, ...)
│   ├── plan.rs             # LogicalPlan builder
│   ├── execution.rs        # Plan executor → ExecutionResult
│   ├── schema.rs           # FileFormat, schema metadata
│   ├── filters.rs          # Filter traits and combinators
│   ├── filtered_reader.rs  # Filtered iterator wrapper
│   ├── interval/           # GenomicInterval + AnnotationIndex
│   ├── parallel/           # Rayon-based parallel processing
│   ├── stats.rs            # Shared statistics types
│   └── formats/
│       ├── vcf/            # VCF reader, stats, filters, expressions
│       ├── bam/            # BAM reader, stats, filters, expressions
│       ├── fasta/          # FASTA reader
│       ├── fastq/          # FASTQ reader, stats (parallel)
│       ├── bed/            # BED reader, stats
│       └── gff/            # GFF3/GTF reader
├── internal_examples/
│   └── complete_ecosystem.rs  # Full four-stage pipeline demo
├── docs/
│   └── bench/
│       └── BENCHMARKS.md
└── Cargo.toml
```

Roadmap

| Phase | Focus | Status |
|-------|-------|--------|
| 1a | Format I/O core (VCF, BAM, FASTQ, FASTA, BED, GFF) | ✅ Complete |
| 1b | LogicalPlan + expression + execution engine | 🚧 In progress |
| 1c | Writers, validation, indexed readers (tabix/BAI) | 📋 Planned |
| 2 | genomicframe — lazy engine, optimizer, Arrow/Polars output | 📋 Planned |
| 3 | Python bindings via PyO3 (notebook + Nextflow targets) | 📋 Planned |
| 4 | Cloud integration (S3/GCS/Azure) | 📋 Planned |
| 5 | Query optimization (predicate pushdown, parallel joins) | 📋 Planned |

Design Principles

  1. Performance First — Everything measurable should be optimized
  2. Predictable Memory — No unexpected growth or hidden buffering
  3. Composable Design — Pipelines compose easily across file types
  4. Transparency — Execution plans and memory profiles are inspectable
  5. Open Science — Code clarity > cleverness. Benchmarks are reproducible

Contributing

Contributions welcome. This is an early-stage research project. Please open an issue to discuss major changes before submitting a PR.

License

Dual-licensed under MIT OR Apache-2.0.

Author

Ryan Duffy, 2025. Performance-driven genomics for the modern bioinformatics era.
