-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop phmmer search after N significant hits have been found. #264
Comments
I don't think we'll want to do it that way, because the subsample you'd get (from early stopping) would be biased by the arbitrary order of the target sequence database. We do plan to have HMMER4 do downsampling in some other ways, though. |
Dear professor, my idea was to pre-shuffle the target sequence database once and perform the searches with my queries on these sequences, stopping early if the threshold is reached. I know this is not the same as drafting a new subsample from the output MSA for each individual query (e.g., two similar query sequences will likely share the first several hits and thus yield the same MSA if the threshold is low), but the resulting MSAs would still be sufficiently random for my application. Unfortunately, I now spend most of my time aligning sequences which I end up discarding. David |
The problem with stopping a search early is that HMMER can't determine
which hits are significant until it knows how big the database is, which it
can only (currently) do by searching the database and counting the number
of the sequences. This is a limitation of the FASTA file format, so we
can't do much about it. When searching a database, HMMER computes a
bit-score for each sequence, and then converts that into a preliminary
e-value using the number of sequences searched so far as the database size.
After that pass, it goes through and re-computes the e-value of each
sequence/HMM comparison using the full database size. The effect is that
sequences early in the database have a very easy time meeting the e-value
threshold to be reported during the first pass, but often do not make the
reporting threshold when the full database size is used to compute the
e-value.
If we took your suggestion and stopped searching after finding the first N
hits, we'd have to use the preliminary e-values, and there'd be a very high
chance that many of those N hits would have dropped below the reporting
threshold during the final e-value computation, so we'd be returning bad
hits to the user.
Parsing large numbers of hits can be challenging. One suggestion would be
to use the --tblout argument to generate a tabular version of the results,
process that to determine which hits you want to include in the alignment,
and then extract the details of those hits from the full output.
Hope this helps,
…-Nick
On Thu, Nov 18, 2021 at 2:16 PM David Jakubec ***@***.***> wrote:
Dear professor,
my idea was to pre-shuffle the target sequence database once and perform
the searches with my queries on these sequences, stopping early if the
threshold is reached. I know this is not the same as drafting a new
subsample from the output MSA for each individual query (*e.g.*, two
similar query sequences will likely share the first several hits and thus
yield the same MSA if the threshold is low), but the resulting MSAs would
still be sufficiently random for my application. Unfortunately, I now spend
most of my time aligning sequences which I end up discarding.
David
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#264 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDJBZGQ7T2J7TRVIJHMKRTUMVGKRANCNFSM5IJD2NDA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Thank you for the detailed answer. I thought that the size of the target database (which I know) could be specified with the Parsing the MSAs is really not much of an issue; I use the Easel tools to do the actual file content manipulation, and my Python scripts to do the rest. |
Dear HMMER team,
I'm wondering if it would be possible or practical to add an option to
phmmer
, which would stop the search after a specified number of significant hitsN
have been found. This would be very useful when one is working with large target databases (e.g., UniProtKB), but subsamples the resulting MSAs downstream in their analyses. Unfortunately, I don't know C or the HMMER codebase well enough to tell this from looking atsrc/phmmer.c
, but perhaps you might know quickly whether this could work.Best,
David
The text was updated successfully, but these errors were encountered: