Genome preparation fails in multi-thread mode for extremely large genomes #251

Closed
FelixKrueger opened this issue Apr 14, 2019 · 3 comments

@FelixKrueger
Owner

When using the Bowtie 2 indexer on a very large genome, e.g. the Axolotl genome (~32GB), the auto-detection of small/large genome sequences doesn't seem to work as expected:

...
Building a SMALL index
Reading reference sizes
Building a SMALL index
Reading reference sizes
Error: Reference sequence has more than 2^32-1 characters!  Please build a large index by passing the --large-index option to bowtie2-build
  Time reading reference sizes: 00:00:57
Total time for call to driver() for forward index: 00:01:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f --threads 2 genome_mfa.GA_conversion.fa BS_GA 
Deleting "BS_GA.3.bt2" file written during aborted indexing attempt.
Deleting "BS_GA.4.bt2" file written during aborted indexing attempt.
Error: Reference sequence has more than 2^32-1 characters!  Please build a large index by passing the --large-index option to bowtie2-build
  Time reading reference sizes: 00:00:57
Total time for call to driver() for forward index: 00:01:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f --threads 2 genome_mfa.CT_conversion.fa BS_CT 
Deleting "BS_CT.3.bt2" file written during aborted indexing attempt.
Deleting "BS_CT.4.bt2" file written during aborted indexing attempt.
Parent process: Failed to build index

It appears that we need to allow passing the indexing option --large-index on to bowtie2-build to make this work.

PS: It works in default (single-core) indexing mode, i.e. it detects the genome size and automatically generates a large index. That run took roughly 2d 6h of wall-clock time and ~150GB of RAM.
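
For illustration, the kind of check that the auto-detection would need to get right can be sketched as below. This is only a sketch under assumptions: the helper names are hypothetical, the limit is taken from the error message quoted above, and the heuristic that bowtie2-build or hisat2-build actually use to choose between a small and a large index may differ.

```python
# Hypothetical helpers; not part of Bismark, Bowtie 2 or HISAT2.
SMALL_INDEX_LIMIT = 2**32 - 1  # limit quoted in the error message above


def reference_length(fasta_path):
    """Count sequence characters in a FASTA file, ignoring header lines."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def needs_large_index(fasta_path, limit=SMALL_INDEX_LIMIT):
    """Return True if the reference has more characters than the limit."""
    return reference_length(fasta_path) > limit
```

A caller could run `needs_large_index("genome_mfa.CT_conversion.fa")` before indexing and only request a large index when it returns True.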

@FelixKrueger FelixKrueger self-assigned this Apr 14, 2019
@FelixKrueger
Owner Author

It appears that HISAT2 is also failing, and with the very same Bowtie 2 error message... 📦

...Reading reference sizes
Reading reference sizes
Error: Reference sequence has more than 2^32-1 characters!  Please build a large index by passing the --large-index option to bowtie2-build
  Time reading reference sizes: 00:00:53
Total time for call to driver() for forward index: 00:00:57
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -f --threads 2 genome_mfa.CT_conversion.fa BS_CT 
Deleting "BS_CT.1.ht2" file written during aborted indexing attempt.
Deleting "BS_CT.2.ht2" file written during aborted indexing attempt.
Deleting "BS_CT.3.ht2" file written during aborted indexing attempt.
Deleting "BS_CT.4.ht2" file written during aborted indexing attempt.
Parent process: Failed to build index

@FelixKrueger
Owner Author

I have now added a new option --large-index to bismark_genome_preparation, which should hopefully fix the auto-detection problem. Tests are currently under way, but will probably take a day or two to complete. We should consider reporting this to the Bowtie 2 and HISAT2 developers. Added here: 5de68d5.

@FelixKrueger
Owner Author

I can confirm that indexing with both Bowtie 2 and HISAT2 now works with multi-core support when --large-index is specified explicitly.

This also brings the time for indexing the Axolotl genome down to ~18-20 hours when using --parallel 4 (8 cores total, and ~183GB RAM usage).
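
As a rough sketch of how the options mentioned in this thread fit together, the snippet below assembles a bismark_genome_preparation call with --parallel and --large-index. The function name and the genome folder path are hypothetical, and the snippet only builds the argument list; it does not run the indexer.

```python
# Hypothetical helper that builds the argument list for bismark_genome_preparation.
def genome_preparation_command(genome_dir, aligner="bowtie2", parallel=4, large_index=True):
    """Return the command as a list of arguments, without executing it."""
    if aligner not in ("bowtie2", "hisat2"):
        raise ValueError("aligner must be 'bowtie2' or 'hisat2'")
    cmd = ["bismark_genome_preparation", f"--{aligner}", "--parallel", str(parallel)]
    if large_index:
        # Force a large index instead of relying on auto-detection.
        cmd.append("--large-index")
    cmd.append(genome_dir)
    return cmd


# Example with a placeholder genome folder:
print(" ".join(genome_preparation_command("/path/to/axolotl_genome/")))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where Bismark and the chosen aligner are installed.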
