
Resource contention from independently run, unrelated hmmscan processes #240

Closed

GabeAl opened this issue Apr 29, 2021 · 4 comments

@GabeAl

GabeAl commented Apr 29, 2021

Hi all, I'm noticing a performance problem with hmmscan on some simple bacterial genomic protein sets. With one thread it does fine, but even when the runs are completely independent processes, on independent threads, launched from independent shells... this happens:

[htop screenshot: 64 hmmscan --cpu 1 processes, CPU time mostly in "system"]

I have 256 hardware threads (128 cores) and am running only 64 hmmscan processes (each with --cpu 1). These are all separate, independent instances that should not be talking to each other or even be aware of each other's existence. The screenshot above is from htop showing the resource utilization. When running multiple instances like this, I see very low CPU use in the "user" category and very high use in the "system" (i.e. wasted) category -- that's where the red bars in the htop view are coming from: wasted processing.
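For reference, this is roughly how I'm launching them (the genome file names and database path are placeholders, not my actual layout):

```bash
# Sketch of the launch pattern: 64 independent hmmscan runs, one protein set each.
# File names and the Pfam-A.hmm path are placeholders.
for faa in genomes/genome_{01..64}.faa; do
    hmmscan --cpu 1 --tblout "${faa%.faa}.tbl" Pfam-A.hmm "$faa" > /dev/null &
done
wait
```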

Some numbers:

  • On one genome, hmmscan takes about 4 minutes to run.
  • But scale up to 64 separate instances, and each takes 60 minutes!

That's just weird and seems like it shouldn't be happening. What's going on? Is hmmscan locking something on the system? If so, couldn't it copy what it needs to its own local thread and process it there? It seems one instance is somehow negatively affecting other instances.

Or does it already do this, and we're simply starved for RAM bandwidth in general here? (That is, does it take each query individually through the entire 1.5 GB database before returning for the next query?)

Maybe unrelated, but what is the I/O thread actually doing? I see each instance using more than 100% CPU, but what benefit does it actually bring? Why can't we just fread() the file into RAM, or mmap() it? Honestly, these aren't big files -- Pfam-A is only 1.5 GB, and an entire genome's worth of queries is a couple of MB.

I'd be eager to figure out what's happening here. (I would be even more eager to switch to hmmsearch, but as I understand it, the E-values it produces are assigned not per sequence but in the context of the whole genome's protein set. That is a huge no-no when trying to build a deterministic model around the presence/absence of certain annotations at a universal E-value cutoff, because it produces different E-values for the exact same protein in different contexts. That in turn makes it impossible to subselect genes beforehand, or even to directly compare annotations between a complete and an incomplete genome of the identical strain, or between related strains with differing pangenomic content. Maybe the easiest solution is to add hmmscan's table output to hmmsearch as an option, e.g. "--query-centric", and make hmmscan an alias to it.)

Thanks for any insights and/or discussion!

PS: For reference, here are 128 instances of hmmsearch --cpu 1:
[htop screenshot: 128 hmmsearch --cpu 1 instances, CPU time mostly in "user"]
Obviously a lot nicer-looking and much less red, despite using twice as many threads!

@npcarter
Member

npcarter commented Apr 29, 2021 via email

@GabeAl
Author

GabeAl commented Apr 29, 2021

Thanks, your understanding is correct!

I should mention that in the hmmscan case, I'm using the binary version of the hmm file (converted via hmmpress) and storing it on a ramdisk, so there is no disk I/O involved, nor any text-parsing overhead.
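For concreteness, the preparation looks roughly like this (/dev/shm is just an example tmpfs mount; substitute whatever ramdisk you use):

```bash
# Press the profile database into HMMER's binary format (.h3m/.h3i/.h3f/.h3p)
hmmpress Pfam-A.hmm
# Stage the files on a tmpfs/ramdisk so the searches do no disk I/O
cp Pfam-A.hmm Pfam-A.hmm.h3? /dev/shm/
```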

That said, your comment about the -Z flag is very much appreciated. Pfam-A v34.0 contains 19,179 entries according to the website (http://pfam.xfam.org/), so I'll try using that number (unless their definition of an "entry" differs from our definition of an "HMM", i.e. it's not a 1:1 mapping).
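If I've understood the suggestion correctly, that would look something like this (a sketch only; the file names are placeholders and 19,179 is just the entry count quoted above):

```bash
# Sketch: fix the effective target count used for E-value calculation at the
# Pfam-A v34.0 entry count, so reported E-values no longer depend on how many
# proteins happen to be in a given genome's sequence set.
hmmsearch --cpu 1 -Z 19179 --tblout genome.tbl Pfam-A.hmm genome.faa > /dev/null
```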

I am currently running hmmsearch as-is (no splitting, no -Z) and am seeing roughly 2,000 genomes annotated per hour (about one genome fully annotated every 2 seconds). I'd be quite eager to push that to 1 genome per second, since I have over a million prokaryotic genomes to get through. In your experience, should I recompile with the I/O thread disabled (i.e. no threading at all) and run an instance on every hyperthread, or keep the default I/O thread and run half as many instances (i.e. one per physical core)?
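(For the second option, I'd probably drive it with GNU parallel, roughly as below; GNU parallel and the file layout are my own assumptions, not anything from HMMER itself.)

```bash
# Sketch: cap simultaneous hmmsearch jobs at the physical core count (128 here),
# leaving the sibling hyperthreads free for each instance's I/O thread.
parallel --jobs 128 \
    'hmmsearch --cpu 1 --tblout {.}.tbl Pfam-A.hmm {} > /dev/null' \
    ::: proteins/*.faa
```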

I'll also try the splitting trick, but I have a question there: how can I split a pre-existing hmm database? Is this in the manual? I assume that once I do, I'll need to set -Z to whatever number of HMMs each chunk contains, and that merging the results is then as simple as concatenating the --tblout files (after removing the comment lines). Something like the sketch below is what I had in mind.
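(Unvalidated sketch -- the hmmstat/hmmfetch usage and the -Z value are my guesses, so please correct me:)

```bash
# Guess at a splitting workflow: pull model names, cut them into 4 chunks,
# fetch each chunk into its own hmm file, search, then merge the tables.
hmmstat Pfam-A.hmm | awk '!/^#/ {print $2}' > names.txt   # 2nd column = model name
split -n l/4 names.txt chunk_                             # -> chunk_aa .. chunk_ad
hmmfetch --index Pfam-A.hmm                               # optional; speeds up retrieval

Z=19179   # placeholder -- per-chunk count or full-database total, per your advice
for c in chunk_*; do
    hmmfetch -f Pfam-A.hmm "$c" > "Pfam-A.$c.hmm"
    hmmsearch --cpu 1 -Z "$Z" --tblout "results.$c.tbl" \
        "Pfam-A.$c.hmm" genome.faa > /dev/null &
done
wait

grep -hv '^#' results.chunk_*.tbl > results.merged.tbl    # strip comments, concatenate
```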

Thanks,
Gabe

@npcarter
Member

npcarter commented Apr 29, 2021 via email

@GabeAl
Author

GabeAl commented Apr 29, 2021

Excellent, thanks. Yes, the Pfam docs have a name for each entry. The manual (section 8, page 107) also details the ASCII format of the existing hmm files, which can be parsed for the names to feed to hmmfetch (or, if you're particularly bold, you could split the ASCII hmm file directly yourself!).
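e.g., if I'm reading the format description right, pulling the names out is a one-liner (every profile record carries a NAME line):

```bash
# Extract model names from the NAME lines of the ASCII HMM file, for hmmfetch -f
awk '$1 == "NAME" {print $2}' Pfam-A.hmm > names.txt
```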
