Project: Compare LEON and GZIP for FastQ compression

Table of Contents

- Just a little story
- What is currently done?
- What is LEON?
- Comparison of FastQ compression by GZIP and LEON
- Does LEON compression have an impact on SNP/indel calling?
- Related publications

Just a little story

In the 1970s, Sanger, Maxam, Gilbert and colleagues developed rapid methods to sequence DNA. Twenty years later, Sanger sequencing had become the most common approach and made possible the first whole-genome sequencing, of Haemophilus influenzae, in 1995. In 2004, almost thirty years after Sanger developed his method, the Human Genome Project delivered the first whole human genome. Since then, sequencing methods have changed and Next Generation Sequencing (NGS) has emerged. In ten years, the cost and time needed to sequence a whole genome have decreased considerably, and NGS technologies make it possible to sequence a large number of samples routinely. As a result, the amount of data generated by NGS has increased substantially over the last decade, and the storage and transmission of these data are now a major concern.

Graph from SRA (http://www.ncbi.nlm.nih.gov/Traces/sra/) 2016-08-08

What is currently done?

GZIP

Currently, the most common way to compress these data is the GZIP format. GZIP is based on the Deflate algorithm, which is the combination of Huffman coding and the LZ77 algorithm (more explanation here). This algorithm was developed to compress text data, that is, data with a large set of characters.
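
As a toy illustration of how Deflate behaves on FastQ text, here is a minimal Python sketch that compresses a made-up FastQ record with zlib (the same LZ77 + Huffman coding scheme gzip uses) at levels 6 and 9, the two levels benchmarked below. The record and its repetition count are placeholders, not data from this repository.

```python
import zlib

# A toy FastQ record repeated many times; DEFLATE (LZ77 + Huffman coding,
# the scheme behind gzip) works on any byte stream, and repetitive text
# such as reads compresses reasonably well.
record = (b"@read_1\n"
          b"ACGTACGTACGTTTGACCAGTACGTACGTACGTTTGACCAGT\n"
          b"+\n"
          b"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n") * 1000

for level in (6, 9):  # gzip's default and maximum levels
    compressed = zlib.compress(record, level)
    print(f"level {level}: {len(record)} -> {len(compressed)} bytes")
```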

What is LEON?

LEON is a new piece of software for compressing data produced by NGS (FASTA and FastQ). LEON shares similarities with approaches that use a reference genome to compress files, but unlike those algorithms it builds the reference de novo, as a de Bruijn graph whose building blocks are k-mers. The de Bruijn graph is heavy and has to be stored alongside the compressed data, so its size could be a problem. To deal with this, the de Bruijn graph needs good parametrization, and the implementation relies on a probabilistic data structure to reduce its size: built on Bloom filters, the de Bruijn graph can store large data efficiently.
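
To make the idea of a probabilistic k-mer structure concrete, here is a toy Bloom filter in Python. It is only a sketch of the concept (fixed bit array, ad hoc hashing), not LEON's actual GATB-based implementation; the k-mer size and the read are made up.

```python
import hashlib

class KmerBloomFilter:
    """Toy Bloom filter for k-mer membership: can return false positives,
    never false negatives, and stores only a fixed-size bit array instead
    of the k-mers themselves."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions from independent hashes of the k-mer.
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))

# Index the k-mers of one (made-up) read, then query membership.
k = 31
read = "ACGT" * 20
bf = KmerBloomFilter()
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
print(read[:k] in bf)   # True: this k-mer was inserted
print("T" * k in bf)    # almost certainly False
```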

LEON method overview (from: Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph)

Comparison of FastQ compression by GZIP and LEON

With this little magic script, we produce some awesome graphs comparing the efficiency of GZIP and LEON. To compare the two tools, we look at the overall compression ratio, the compression ratio as a function of the size of the initial FastQ, and the compression/decompression time. We use FastQ files from human data, with sizes between 100 MB and 26 GB.
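
As a rough illustration of these two metrics, here is a small Python sketch that times a compression command and reports the space saved, taken here as 1 - compressed/original (an assumption about how the ratio is defined; the repository's own script may compute it differently). The file name is a placeholder, and the same helper would simply be pointed at the LEON command line for the LEON runs.

```python
import os
import subprocess
import time

def bench(cmd, original, compressed):
    """Run a compression command, return (percent space saved, wall-clock seconds)."""
    t0 = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - t0
    saved = 100 * (1 - os.path.getsize(compressed) / os.path.getsize(original))
    return saved, elapsed

# 'sample.fastq' is a placeholder; -k (keep input) needs GNU gzip >= 1.6.
for level in ("-6", "-9"):
    result = bench(["gzip", "-k", "-f", level, "sample.fastq"],
                   "sample.fastq", "sample.fastq.gz")
    print(f"gzip {level}: {result[0]:.1f}% saved in {result[1]:.1f} s")
```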

Compression ratio for each tool

We can see that the compression ratio of LEON is better than that of GZIP, regardless of the size of the FastQ. In addition, the "lossy" mode of LEON reaches a ratio between 90 and 95% in every case, almost 15% more than the other tools. There is no significant difference between GZIP levels 6 and 9, but both show wider variation.

Boxplot comparing the compression ratios of GZIP and LEON with different options

Compression ratio depending on the size of the FastQ

Now we focus on the compression ratio as a function of the size of the original FastQ. We notice a peak at 18 GB, corresponding to FastQ files with longer reads (125 bp vs 100 bp). However, this peak does not change the analysis, because all tools show the same effect. Given these results, we can say that LEON is more efficient than GZIP, especially its "lossy" mode, which is very stable in every case.

Compression ratio as a function of the original FastQ size

Time of compression/decompression

The compression and decompression times depend on the size of the initial file. LEON's "lossy" and "lossless" modes and GZIP level 6 have similar compression times, while GZIP level 9 takes longer to compress (in some cases more than twice as long). LEON is less efficient for decompression, and all GZIP levels have almost the same decompression time.

Compression time as a function of the original FastQ size

Does LEON compression have an impact on SNP/indel calling?

To study the impact of compression, we used a set of 12 FastQ files from human data. All FastQ files were compressed with LEON "lossy", LEON "lossless" and GZIP (default: level 6), then decompressed and recompressed with GZIP (default). This last recompression is needed in order to run the variant calling with Nenufaar v2.3 (pipeline by David BAUX).
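
A rough sketch of that preparation for one condition (LEON "lossless") is shown below. The LEON flags (-file/-c/-d/-lossless) and the output file names are assumptions taken from LEON's help text and should be checked with `leon -h`; the Nenufaar invocation is not shown.

```python
import subprocess

def run(*cmd):
    """Thin wrapper: each step is a plain command-line call."""
    subprocess.run(list(cmd), check=True)

fastq = "sample.fastq"  # placeholder; the benchmark used 12 such files

# LEON round-trip (flags and the .leon output name are assumptions).
run("leon", "-file", fastq, "-c", "-lossless")   # compress -> sample.fastq.leon (assumed)
run("leon", "-file", fastq + ".leon", "-d")      # decompress back to FastQ (restored name may differ)

# Recompress the restored FastQ with GZIP (default level 6) so that every
# condition enters the Nenufaar v2.3 variant-calling pipeline in the same format.
run("gzip", "-f", fastq)
```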

The next table shows the number of SNVs and indels called for each VCF obtained from the three compression methods. We notice that LEON "lossy" mode yields 23 SNVs and indels that differ from the others. Indeed, 12 point mutations are found only with GZIP and LEON "lossless" mode, and 11 only with LEON "lossy". However, most of these 23 mutations lie in repeated sequences, and the difference may be caused by a shift of a few nucleotides.

| VCF             | PASS | OTHER | NA | TOTAL (PASS+OTHER) |
|-----------------|------|-------|----|--------------------|
| GZIP            | 759  | 110   | 12 | 869                |
| LEON "lossless" | 759  | 110   | 12 | 869                |
| LEON "lossy"    | 767  | 103   | 11 | 870                |

The following chart shows the differences in AB (allelic balance) for each SNV/indel, relative to the AB from the GZIP VCF (VCF1). We notice that there is no difference between GZIP and LEON "lossless" (VCF2). With LEON "lossy" mode there are some differences: the AB of 84 indels, out of 192 indels in total, is different, but most of them (66) have an AB that differs by less than 2%. We reach the same conclusion for SNVs, with 344 differing out of 666 SNVs in total, and 318 of those differing by less than 2%.
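
The sketch below shows how such an AB comparison can be done in Python, assuming AB is derived as alt / (ref + alt) from the AD field of the first sample in each VCF (an assumption about how AB was computed in this analysis); the file names are placeholders.

```python
def allelic_balance(vcf_path):
    """Allelic balance per variant, computed as alt / (ref + alt)
    from the first sample's AD field (assumed to be present)."""
    ab = {}
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _, ref, alt = fields[:5]
            sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
            ref_d, alt_d = (int(x) for x in sample["AD"].split(",")[:2])
            if ref_d + alt_d > 0:
                ab[(chrom, pos, ref, alt)] = alt_d / (ref_d + alt_d)
    return ab

# Compare the GZIP-derived VCF (baseline) with a LEON-derived VCF.
ab_gzip = allelic_balance("gzip.vcf")
ab_lossy = allelic_balance("leon_lossy.vcf")
shared = ab_gzip.keys() & ab_lossy.keys()
diffs = [abs(ab_gzip[k] - ab_lossy[k]) for k in shared]
print(f"{sum(d > 0 for d in diffs)} / {len(shared)} shared variants have a different AB")
print(f"{sum(0 < d <= 0.02 for d in diffs)} of them differ by no more than 2%")
```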

Differences in variant calling between the GZIP, LEON lossless and LEON lossy files

Related publications
