Optimise parameters for minimap2 alignments #446

Closed
FelixKrueger opened this issue Aug 2, 2021 · 6 comments

Comments

@FelixKrueger
Owner

Some of the long read alignments take a very long time indeed, so I am sure there is plenty of scope for improvement.

@FelixKrueger
Owner Author

One likely contender is the minibatch option -K:

-K NUM
Number of bases loaded into memory to process in a mini-batch [500M]. Similar to option -I, K/M/G/k/m/g suffix is accepted. A large NUM helps load balancing in the multi-threading mode, at the cost of increased memory.

So in default mode, -K is 500,000,000 bases. If we assume a mean read length of 5,000 bp, this would be ~100,000 sequences per minibatch. For a non-directional run within Bismark, it would take 4 times the time (once per strand) to align ~100K sequences each before sequences can start being compared and processed by the Bismark methylation calling process.
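As a rough check of that estimate, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and the decimal suffix handling (K = 1,000, M = 1,000,000, G = 1,000,000,000) is an assumption based on how this thread reads 500M as 500,000,000; it is not copied from minimap2's own parser.

```python
# Decimal suffixes as this thread reads them (500M == 500,000,000).
# Assumption for illustration, not a copy of minimap2's parser.
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_bases(value: str) -> int:
    """Turn a size such as '500M' or '250K' into a number of bases."""
    value = value.strip()
    multiplier = _SUFFIX.get(value[-1].lower())
    if multiplier is None:
        # No suffix: the value is already a plain number of bases.
        return int(value)
    return int(float(value[:-1]) * multiplier)


def reads_per_minibatch(batch: str, mean_read_length: int) -> int:
    """Approximate number of reads loaded per mini-batch."""
    return parse_bases(batch) // mean_read_length


# Mean read length of 5,000 bp, as assumed in the comment above.
for batch in ("500M", "250K"):
    print(batch, reads_per_minibatch(batch, 5000))
```

With 5,000 bp reads, -K 500M holds about 100,000 reads per minibatch, whereas a value such as -K 250K holds about 50, so results can reach the downstream steps much sooner.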

I'll do a few tests with this later, but first I'll focus on k-mer size.

@FelixKrueger
Owner Author

Heng Li suggested: "Due to the reduced alphabet, you could probably increase the k-mer size such that 'mid_occ' in stderr is about several hundred."

I have done a few tests now with 10K and 100K PacBio EM-seq test reads (5kb), and it seems pretty obvious that we can achieve a decent speed-up by moving away from the default -k 15:

Screenshot 2021-08-03 at 15 29 39

For the time being, I will change the k-mer size to -k 20 for the Bismark genome preparation.

@FelixKrueger
Owner Author

FelixKrueger commented Aug 4, 2021

Using a k-mer length of 20 as a sweet spot, I tested a few different variations of the minibatch size (-K).

For our specific application, the default of -K 500M appears to be the least ideal option, especially in combination with -k 15...

Screenshot 2021-08-04 at 08 26 08

Going forward, I think we'll settle on a combination of -k 20 -K 250K, unless someone has a better idea?

@FelixKrueger
Owner Author

FelixKrueger commented Aug 5, 2021

Test on 100,000 reads:

            default [-k 15 -K 500M]    [-k 20 -K 250K]
Run time    509 min                    16 min
Memory      > 120GB                    40GB

This is a speed-up of ~32-fold (509 min / 16 min), along with an at least 3-fold reduction in memory. Pretty happy with these results.
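For reference, dividing the figures reported for the 100,000-read test gives the ratios directly. This is only a sketch of the arithmetic; `fold_change` is a hypothetical helper, and the 120 GB figure is a lower bound, so the memory ratio is a minimum.

```python
def fold_change(before: float, after: float) -> float:
    """Ratio of the old value to the new one."""
    return before / after


# Run time in minutes: default [-k 15 -K 500M] vs [-k 20 -K 250K].
runtime_fold = fold_change(509, 16)

# Memory in GB; the default run used MORE than 120 GB, so this is a minimum.
memory_fold = fold_change(120, 40)

print(f"run time: {runtime_fold:.1f}-fold faster")
print(f"memory: at least {memory_fold:.0f}-fold lower")
```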

@FelixKrueger
Owner Author

I looked at the possibility of improving the alignment speed via parallelisation within Bismark.

The default mode (i.e. --parallel 1) uses:

  • 1 core for Bismark
  • 4 instances of minimap2 for non-directional alignments, with 3 threads (-t 3) each
  • some gzip I/O streams, which typically don't take very much CPU

So one should probably reserve 14 cores and 40G RAM for each level of --parallel (the genome was human GRCh38 + Lambda + pUC19).
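The core budget can be sketched as a small calculation. The function names are hypothetical, and counting one core for the gzip I/O streams is my assumption to reconcile the breakdown (1 + 4 × 3 = 13) with the ~14 cores suggested here. The 40 GB per level comes from the test genome above (human GRCh38 + Lambda + pUC19) and may well differ for other genomes.

```python
def cores_per_parallel_level(threads_per_instance: int,
                             instances: int = 4,
                             bismark_cores: int = 1,
                             io_cores: int = 1) -> int:
    """Estimate cores needed for one level of --parallel.

    instances=4 corresponds to a non-directional run (one minimap2
    instance per strand). io_cores=1 is an assumed allowance for gzip.
    """
    return bismark_cores + instances * threads_per_instance + io_cores


def total_resources(parallel: int, threads_per_instance: int,
                    ram_gb_per_level: int = 40):
    """Return (cores, RAM in GB) for a given --parallel setting."""
    cores = parallel * cores_per_parallel_level(threads_per_instance)
    return cores, parallel * ram_gb_per_level


print(cores_per_parallel_level(3))   # cores for one level with -t 3
print(total_resources(4, 3))         # (cores, GB) for --parallel 4 with -t 3
```

Under these assumptions, -t 3 needs 14 cores per level, and the same formula gives 10 cores per level for -t 2.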

Screenshot 2021-08-06 at 11 39 34

The results show that there is a decent level of additional speed-up to be had if the resources are plentiful, even though there might be diminishing returns at some point.

@FelixKrueger
Owner Author

Lastly, I looked at the multi-threading capacity within each minimap2 invocation:

Screenshot 2021-08-07 at 16 01 40

Using at least -t 2 for each minimap2 instance seems to make a real difference, but there is not much gain beyond that.

For the time being, I would like to settle on the following parameters as the default: -t 2 -k 20 -K 250K, which would bring a default run (non-directional, human genome) to ~10 cores + 40GB of RAM per level of --parallel.
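To illustrate where each of these options would go, here is a minimal sketch that only builds the two minimap2 command lines and does not run anything. It is not Bismark's actual invocation: the file names and helper functions are hypothetical, and Bismark may pass further options. The sketch puts -k at indexing time because the k-mer size is set during genome preparation; as far as I know, minimap2 does not change it when mapping against a prebuilt index.

```python
import shlex


def index_command(genome_fasta, index_path, kmer=20):
    """Build a minimap2 indexing command (hypothetical file names).

    -k sets the k-mer size, -d writes the index to a file.
    """
    return ["minimap2", "-k", str(kmer), "-d", index_path, genome_fasta]


def align_command(index_path, reads, threads=2, minibatch="250K"):
    """Build a minimap2 alignment command against a prebuilt index.

    -a requests SAM output, -t sets the thread count,
    -K sets the mini-batch size.
    """
    return ["minimap2", "-a", "-t", str(threads), "-K", minibatch,
            index_path, reads]


# Print the commands as shell-quoted strings; nothing is executed.
print(shlex.join(index_command("genome.fa", "genome.mmi")))
print(shlex.join(align_command("genome.mmi", "reads.fastq.gz")))
```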
