Resource contention from independently run, unrelated hmmscan processes #240
To make sure I understand correctly, you're running 64 hmmscan jobs where
the HMM database (Pfam-A) is the same for all jobs, and the sequence file
varies from run to run. Is that correct?
If that's the case, it's fairly likely that you're being I/O limited by
some combination of the time required to read the HMM database from disk
and the time required to parse it into the data structures we use
internally. Hmmscan reads the HMM database from disk once for each
sequence in the sequence file, so the total amount of data read from disk
in a run is something like sequence file size + HMM file size * # of
sequences in file. Also, converting a text-format HMM into our data
structures takes a moderate amount of computation, making things worse.
(Yes, this method is inefficient in the case where a run searches many
sequences, but it lets us minimize memory use in the common case where each
run targets a single sequence.)
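As a rough back-of-envelope illustration of that formula (the ~1.5 GB Pfam-A size comes up later in this thread; the sequence file size and sequence count here are assumptions):

```bash
# Illustrative only; sizes are assumptions, not measurements.
hmm_bytes=$((1500 * 1024 * 1024))   # ~1.5 GB HMM database (text form)
seq_bytes=$((2 * 1024 * 1024))      # ~2 MB protein sequence file
n_seqs=4000                         # assumed proteins per bacterial genome
total=$(( seq_bytes + hmm_bytes * n_seqs ))
echo "~$(( total / 1024**4 )) TiB read and parsed per hmmscan run"
```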
For something like what you're doing, you should see better performance if
you turn things around and use hmmsearch, as you mentioned. Hmmsearch
reads the target sequence database once for each HMM in the query file,
and re-reading sequences is much cheaper than re-reading HMMs: an HMM takes
about 200x the disk space of a sequence of the same length, and more
computation per element to convert into internal data structures. As you
mention, the challenge is making the program calculate the correct
E-values. You can do that with the -Z flag to hmmsearch, which sets the
number of comparisons used when calculating E-values.
For the hmmscan runs you're doing, the number of comparisons equals the
number of HMMs in the Pfam-A database, and each sequence is treated as an
independent search. So if you switch to hmmsearch and use -Z to set the
number of comparisons to that value, you'll get the same E-values as you
do with hmmscan.
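As a concrete sketch of that, with placeholder file names and the 19,179-model count for Pfam-A 34.0 that comes up later in the thread:

```bash
# Search all Pfam-A models against one proteome, with hmmscan-style E-values.
# -Z asserts the number of comparisons (here, the number of HMMs in Pfam-A).
hmmsearch -Z 19179 --tblout genome1.tbl Pfam-A.hmm genome1.faa
```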
Given all that, you can probably minimize the bandwidth bottleneck by
merging all of your sequences into one sequence database, splitting your
HMM database into 64 pieces, and doing hmmsearch searches of each piece of
your HMM database against your complete sequence database.
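A sketch of that arrangement, assuming the HMM database has already been split into 64 pieces (the hmmfetch approach discussed further down is one way to do that); all file names are placeholders:

```bash
# Merge all query proteomes into one target sequence database.
cat genomes/*.faa > all_proteins.faa

# One hmmsearch job per HMM piece, all against the same sequence database.
# -Z asserts the number of comparisons (19,179 models in Pfam-A 34.0, per
# this thread), so E-values stay calibrated to the full database.
for i in $(seq -w 1 64); do
    hmmsearch --cpu 1 -Z 19179 --tblout piece_${i}.tbl \
        pfam_piece_${i}.hmm all_proteins.faa &
done
wait
```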
Hope that helps!
-Nick
Thanks, your understanding is correct! I should mention that in the hmmscan case, I'm using the binary version of the hmm file (converted via hmmpress) and storing it on ramdisk, so there is no disk I/O involved, nor the overhead of parsing text.
That said, your comment about the -Z flag is absolutely and incredibly appreciated. Pfam-A supposedly contains 19,179 entries (v34.0) according to the web site (http://pfam.xfam.org/). I'll try using this number (unless their definition of an "entry" differs from our definition of an "HMM", i.e. not a 1:1 mapping).
I am currently running hmmsearch standard (no splitting or -Z) and am seeing a speed of roughly 2,000 genomes annotated per hour (roughly 1 genome fully annotated every 2 seconds). I would be quite eager to improve this speed to reach 1 genome per second, since I have over a million prokaryotic genomes to get through. In your experience, should I recompile to disable the I/O thread (i.e. disable all threading) and run on all hyperthreads, or keep the default I/O thread as-is and run half the number of instances (i.e. one run per physical core)?
I'll also try the splitting trick, but have a question there: how can I split a pre-existing hmm database? Is this in the manual? I assume once I do, I'll need to set -Z to whatever number of HMMs each piece contains. Then merging the results should be as simple as concatenating the --tblout files (after removing the comment lines).
Thanks,
Gabe
I don't have any experience with disabling the I/O thread, so I can't
comment much there.
For splitting HMM files, the hmmfetch utility may help. If you can
construct text files containing the names of the HMMs in each split, you
can then pass each list to hmmfetch to extract those HMMs into a separate file.
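For illustration, the fetch step might look like this (piece01_names.txt is a hypothetical file listing one model name per line; indexing first is a safe default):

```bash
# Index the HMM file once (creates Pfam-A.hmm.ssi), then fetch by name list.
hmmfetch --index Pfam-A.hmm
hmmfetch -f Pfam-A.hmm piece01_names.txt > pfam_piece_01.hmm

# Sanity check: count the models that ended up in the piece.
grep -c "^NAME " pfam_piece_01.hmm
```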
-Nick
Excellent, thanks. Yes, the Pfam docs list a name for each entry. The manual (section 8, page 107) also details the ASCII format of the existing hmm files, which can be parsed for the names to feed to hmmfetch (or, if you're particularly bold, you could split by parsing the ASCII hmm file itself!).
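For illustration, the name extraction, 64-way split, and final table merge discussed in this thread might be sketched as follows (GNU split assumed; all file names are placeholders):

```bash
# Pull the model names out of the ASCII Pfam-A file (one NAME line per model).
grep "^NAME " Pfam-A.hmm | awk '{print $2}' > all_names.txt

# Split the name list into 64 roughly equal pieces without splitting lines
# (GNU split syntax); the pieces feed hmmfetch -f as shown above.
split -n l/64 -d all_names.txt piece_names_

# After the searches finish, merge the per-piece --tblout files,
# dropping the comment lines as described above.
grep -vh "^#" piece_*.tbl > merged.tbl
```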
Hi all, I'm noticing a performance problem with hmmscan run on some simple bacterial genomic protein sets. With one thread it does fine. But even when given independent processes on independent threads from independent shells... this happens:
[screenshot: htop resource utilization, https://user-images.githubusercontent.com/8050438/116487157-4fd95480-a85d-11eb-987e-d29e3234a62b.png]
I have 256 threads (128 cores), and am using only 64 (--cpu 1) processes of hmmscan. These are all different, separate, independent instances that should not be talking to each other or even aware of each other's existence. This is a screenshot from htop showing the resource utilization. When running multiples like this I see very low CPU use in the "user" category and very high use in the "system" (aka wasted) category. That's where the red lines are coming from in the htop view I posted -- that's wasted processing.
Some numbers:
- On one genome, hmmscan takes about 4 minutes to run.
- But scale up to 64 separate instances, and each takes 60 minutes!
That's just weird and seems like it shouldn't be happening. What's going on? Is hmmscan locking something on the system? If so, couldn't it copy what it needs to its own local thread and process it there? It seems one instance is somehow negatively affecting other instances.
Or does it already do this and we are just RAM bandwidth starved in general here? (Like, does it take each query and separately, individually, take that one query through the whole entire 1.5GB database and then return for the next query?)
Maybe unrelated, but what is the I/O thread actually doing? I see each instance eating up more than 100% CPU, but what benefits does it actually bring? Why can't we just "fread" the file into RAM? Or mmap() it? Honestly these aren't big files -- Pfam-A is only 1.5 gigs, and each entire genome full of queries is a couple of MBs.
Would be eager to figure out what's happening here. (I would be even more eager to switch to hmmsearch, but from what I understand, the E-values it produces are not assigned per sequence but in some nebulous context of the whole genome. That's a huge no-no when trying to build a deterministic model around the presence/absence of certain annotations at a universal E-value cutoff: it produces different E-values for the same exact protein in different contexts, which in turn makes it impossible to subselect genes beforehand, or even to directly compare annotations across a complete and an incomplete genome of the identical strain, or across related strains with differing pangenomic content, etc. Maybe the easiest solution is to just add the hmmscan table output to the hmmsearch program as an option, e.g. "--query-centric", and make hmmscan an alias to it.)
Thanks for any insights and/or discussion!
PS: for reference, here are 128 instances of hmmsearch --cpu 1:
[screenshot: htop view of 128 hmmsearch instances]
Obviously a lot nicer-looking and much less red, despite double the threads used!