Performance comparison with Rust #40

@jonathanBieler

I was wondering how fast this package is, so I made a small comparison with Rust's fastq library. On v1.6, Rust is about 3-4 times faster for fastq files, while for fastq.gz it's only 1.6x faster (I guess decompression becomes more of a bottleneck). I think I've used every trick to make the Julia code as fast as possible; have I missed something?

While I think the performance is pretty good (1.6x doesn't matter that much), improvements are always welcome. The Julia version still seems to allocate a lot of memory (1.07 M allocations: 225.188 MiB, 0.11% gc time, for a 100MB fastq.gz file). It seems to me that the Rust library is not really validating or converting anything: it just loads the data into memory and goes through it. While having nicely converted records is nice, maybe a more "bare-bone" API could be provided for cases where performance is really crucial. What do you think?

My Rust code (it ends up being simpler than the Julia code):

use fastq::{parse_path, Record};

extern crate fastq;

fn main() {
    let filename = String::from("test.fastq.gz");

    let path = Some(filename);

    let mut total: u64 = 0;
    let mut tot_qual: u64 = 0; 
    parse_path(path, |parser| {
        parser.each(|record| {

            let s = record.seq();
            let q = record.qual();

            for (is, iq) in s.iter().zip(q.iter()) {
                if *is == b'A'{
                    total += 1;
                    tot_qual += (*iq -0x21) as u64;
                }
            }
            true
        }).expect("Invalid fastq file");
    }).expect("Invalid compression");
    println!("{:.15}", tot_qual as f64 / total as f64);
}
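
For reference, the per-read work done inside the closure above can be isolated as a small function on raw byte slices. This is only a sketch: the name `count_a_quality` is made up for illustration, and it assumes Phred+33 (Sanger) quality encoding, which is what the `- 0x21` in the code above implies.

```rust
/// Count the 'A' bases in one read and sum their Phred quality scores.
/// Hypothetical helper for illustration; assumes Phred+33 (Sanger) encoding,
/// i.e. quality score = ASCII byte value - 0x21.
fn count_a_quality(seq: &[u8], qual: &[u8]) -> (u64, u64) {
    let mut total: u64 = 0;
    let mut tot_qual: u64 = 0;
    // zip stops at the shorter slice, so mismatched lengths cannot go out of bounds
    for (&base, &q) in seq.iter().zip(qual.iter()) {
        if base == b'A' {
            total += 1;
            // saturating_sub avoids an overflow panic on bytes below '!' (0x21)
            tot_qual += u64::from(q.saturating_sub(0x21));
        }
    }
    (total, tot_qual)
}
```

For example, the read `ACGA` with quality string `I!!5` has two `A` bases with scores 40 and 20, so the function returns `(2, 60)`.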

And the Julia code:

using FASTX, CodecZlib, BioSymbols, BioSequences

function count_reads(fasta)
    #reader = FASTQ.Reader(open(fasta))
    reader = FASTQ.Reader(GzipDecompressorStream(open(fasta)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    seq = LongDNASeq(200)
    
    while !eof(reader)
        read!(reader, record)
        q = FASTQ.quality(record)
        copyto!(seq, record)

        @inbounds for i=1:length(q)
            if seq[i] == DNA_A
                N += 1
                tot_qual += q[i]
            end
        end
    end
    close(reader)
    tot_qual/N
end

fasta = "test.fastq.gz"

@time count_reads(fasta)
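
To make the "bare-bone" idea above a bit more concrete, here is a rough sketch of what I mean, written in Rust since that is what I compared against. It is not the API of either library: `RawRecord` and `raw_records` are hypothetical names, and the sketch assumes the whole file is already decompressed in memory, uses `\n` line endings, and has exactly four lines per record. Nothing is validated, decoded, or copied; each record is just a set of slices into the input buffer.

```rust
/// Borrowed view of one FASTQ record; every field points into the input buffer.
/// Hypothetical type for illustration only.
struct RawRecord<'a> {
    head: &'a [u8],
    seq: &'a [u8],
    qual: &'a [u8],
}

/// Split an in-memory FASTQ buffer into borrowed records.
/// Assumes '\n' line endings and exactly four lines per record;
/// performs no validation and no allocation per record.
fn raw_records<'a>(buf: &'a [u8]) -> impl Iterator<Item = RawRecord<'a>> + 'a {
    let mut lines = buf.split(|&b| b == b'\n');
    std::iter::from_fn(move || {
        let head = lines.next()?;
        // An empty header line means we reached the trailing newline at end of input.
        if head.is_empty() {
            return None;
        }
        let seq = lines.next()?;
        let _plus = lines.next()?; // separator line, ignored
        let qual = lines.next()?;
        Some(RawRecord { head, seq, qual })
    })
}
```

The trade-off is the one described above: the caller gets raw bytes and has to do any decoding (for example subtracting the quality offset) and error checking itself.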
