Giant FASTQ support in stats #33

Open
GoogleCodeExporter opened this issue Aug 3, 2015 · 3 comments

Comments

@GoogleCodeExporter

Some stats programs have features such as k-mer reports (with -K) and probe-id 
counting (with -D).

On very large files (> 200 mil reads), these programs can consume a lot of RAM 
(>10GB), even with the highly efficient sparsehash library.

A disk-backed key-value store like LevelDB could perform decently, much like a 
hash, while also allowing growth past available RAM. I'm thinking that the code 
should switch to a DB-backed store at the 200 mil record level. This would slow 
things down by about 3x (from 1 mil writes/sec to 300k writes/sec), but would 
also allow unbounded growth. Enabling a large LRU cache could make it perform 
so similarly that the sparse hash can be abandoned, especially if the DB 
remains an insignificant fraction of the stats collection process.
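
To make the idea above concrete, here is a minimal sketch of a counter that 
grows past RAM by merging into an on-disk table. It is only an illustration, 
not the project's code: sqlite3 stands in for LevelDB because it ships with 
Python, the class name `SpillingCounter` and the `max_in_memory_keys` limit are 
hypothetical, and it batches writes to disk rather than switching over once at 
a fixed record count.

```python
import sqlite3
from collections import Counter


class SpillingCounter:
    """Count keys in RAM and merge them into an on-disk table when full.

    Assumption: sqlite3 is a stand-in for a LevelDB-style store, and the
    limit is counted in distinct in-memory keys, not in total records.
    """

    def __init__(self, db_path, max_in_memory_keys):
        self.max_in_memory_keys = max_in_memory_keys
        self.mem = Counter()
        self.spilled = False
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(k TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def add(self, key):
        self.mem[key] += 1
        if len(self.mem) >= self.max_in_memory_keys:
            self._flush()

    def _flush(self):
        # Make sure every key has a row, then add the buffered counts to it.
        items = list(self.mem.items())
        self.db.executemany(
            "INSERT OR IGNORE INTO counts (k, n) VALUES (?, 0)",
            [(k,) for k, _ in items],
        )
        self.db.executemany(
            "UPDATE counts SET n = n + ? WHERE k = ?",
            [(n, k) for k, n in items],
        )
        self.db.commit()
        self.mem.clear()  # free the in-memory table
        self.spilled = True

    def count(self, key):
        # Total = what is still buffered in RAM + what is already on disk.
        row = self.db.execute(
            "SELECT n FROM counts WHERE k = ?", (key,)
        ).fetchone()
        return self.mem[key] + (row[0] if row else 0)
```

Buffering in RAM and flushing in batches plays a role similar to the large 
cache mentioned above: most increments never touch the disk, so the per-write 
cost of the store matters less.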

Original issue reported on code.google.com by earone...@gmail.com on 9 Jul 2014 at 2:26

@GoogleCodeExporter
Author

Original comment by earone...@gmail.com on 9 Jul 2014 at 2:27

  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect

@GoogleCodeExporter
Author

LevelDB was 50x slower. So sad. Some optimizations were done to reduce 
memory use. Need to look at more options.

Original comment by earone...@gmail.com on 20 Aug 2014 at 5:09

@GoogleCodeExporter
Author

Going to do this by a) detecting input that is pre-sorted by probe-id when run 
with -D. If detected, RAM is freed and duplicate detection proceeds without 
the need for a hash. Other hashes (like k-mers) can be switched to some sort 
of counting Bloom filter.
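
The sorted-input path can be sketched roughly as follows. This is an 
illustration under assumptions, not the tool's implementation: the function 
names are hypothetical, ids are compared as plain strings, and the sketch 
reports unsorted input with an error instead of falling back automatically. 
When ids arrive sorted, equal ids are adjacent, so only the previous id has to 
be kept in memory.

```python
def count_duplicates_sorted(ids):
    """Return (distinct, duplicates) for ids sorted in ascending order.

    Only the previous id is remembered. Raises ValueError if an id is
    smaller than its predecessor, because adjacent comparison is only
    valid on sorted input.
    """
    distinct = 0
    duplicates = 0
    prev = None
    for current in ids:
        if prev is not None and current < prev:
            raise ValueError(f"input not sorted: {current!r} after {prev!r}")
        if current == prev:
            duplicates += 1
        else:
            distinct += 1
        prev = current
    return distinct, duplicates


def count_duplicates_hashed(ids):
    """Same result for any input order, but memory grows with distinct ids."""
    seen = set()
    duplicates = 0
    for current in ids:
        if current in seen:
            duplicates += 1
        else:
            seen.add(current)
    return len(seen), duplicates
```

Both functions return the same counts on sorted input; the difference is that 
the first one needs constant memory while the second one holds every distinct 
id.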

Original comment by earone...@gmail.com on 8 Sep 2014 at 8:16
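
For the k-mer side of the comment above, a counting Bloom filter might look 
something like the sketch below. Everything here is an assumption made for 
illustration: the class name, the 8-bit saturating counters, the BLAKE2b-based 
hashing, and the use of the minimum counter as the count estimate (as in a 
count-min sketch). Estimates never undercount, but hash collisions can make 
them overcount.

```python
import hashlib


class CountingBloomFilter:
    """Approximate counter with fixed memory, whatever the number of keys."""

    def __init__(self, num_counters, num_hashes):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = bytearray(num_counters)  # one 8-bit counter per slot

    def _slots(self, key):
        # Derive num_hashes slot indexes from salted BLAKE2b digests.
        data = key.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_counters

    def add(self, key):
        for slot in self._slots(key):
            if self.counters[slot] < 255:  # saturate instead of wrapping
                self.counters[slot] += 1

    def estimate(self, key):
        # The smallest counter is the one least inflated by collisions.
        return min(self.counters[slot] for slot in self._slots(key))
```

Memory use is fixed by `num_counters` up front, which is the trade being made: 
exact counts in a growing hash are exchanged for approximate counts in a 
bounded array.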
