Stop phmmer search after N significant hits have been found. #264

Closed
davidjakubec opened this issue Nov 18, 2021 · 4 comments

@davidjakubec

Dear HMMER team,

I'm wondering whether it would be possible or practical to add an option to phmmer that stops the search once a specified number N of significant hits has been found. This would be very useful when working with large target databases (e.g., UniProtKB) and subsampling the resulting MSAs downstream. Unfortunately, I don't know C or the HMMER codebase well enough to judge this from src/phmmer.c, but perhaps you can tell quickly whether this could work.

Best,
David

@cryptogenomicon
Member

I don't think we'll want to do it that way, because the subsample you'd get (from early stopping) would be biased by the arbitrary order of the target sequence database. We do plan to have HMMER4 do downsampling in some other ways, though.
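To illustrate the ordering concern: one standard way to draw a subsample that does not depend on the order of the input is reservoir sampling. The sketch below is only an illustration of that general technique, not HMMER code and not necessarily what HMMER4 will do; `reservoir_sample` and its parameters are hypothetical names. Note that it still has to see every hit, so it removes the ordering bias but does not shorten the search itself.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items drawn uniformly at random from an iterable (Algorithm R).

    Every item has the same chance of being kept, regardless of its
    position in the stream, and only k items are held in memory at once.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical hit names standing in for a stream of significant hits.
    hits = (f"hit_{i}" for i in range(1000))
    print(reservoir_sample(hits, k=5, seed=42))
```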

@davidjakubec
Author

Dear professor,

My idea was to pre-shuffle the target sequence database once and run the searches for all my queries against the shuffled sequences, stopping early once the threshold is reached. I know this is not the same as drawing a new subsample from the output MSA for each individual query (e.g., two similar query sequences will likely share their first several hits and thus yield the same MSA if the threshold is low), but the resulting MSAs would still be sufficiently random for my application. Unfortunately, I currently spend most of my time aligning sequences that I end up discarding.
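A minimal Python sketch of what I have in mind, outside of HMMER itself: shuffle the FASTA records once with a fixed seed, then consume hits in order and stop after N significant ones. All function names here are hypothetical, the hit stream is assumed to be (name, E-value) pairs produced by some other tool, and the E-value cutoff is an arbitrary placeholder. Shuffling in memory like this is only practical for databases that fit in RAM.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shuffle_records(records, seed=0):
    """Return the records in a random order that is reproducible for a given seed."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def first_n_significant(hits, n, evalue_cutoff=1e-3):
    """Consume (name, evalue) pairs lazily and stop after n significant hits.

    The cutoff is a placeholder assumption, not a HMMER default.
    """
    kept = []
    for name, evalue in hits:
        if evalue <= evalue_cutoff:
            kept.append(name)
            if len(kept) == n:
                break  # early stop: the rest of the stream is never read
    return kept


if __name__ == "__main__":
    fasta = [">seq1 example", "MKT", "AYI", ">seq2", "GSH", ">seq3", "PLV"]
    for header, seq in shuffle_records(read_fasta(fasta), seed=1):
        print(header, seq)
```

With a single shuffled copy of the database, every query sees the same target order, which is exactly the caveat about similar queries sharing their first hits.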

David

@npcarter
Member

npcarter commented Nov 18, 2021 via email

@davidjakubec
Author

Thank you for the detailed answer. I thought that the size of the target database (which I know) could be specified with the -Z option, but it seems that I misunderstood the argument. What I had in mind is probably more complicated than I thought. Please feel free to close the issue if you feel nothing more can be done.

Parsing the MSAs is really not much of an issue; I use the Easel tools to do the actual file content manipulation, and my Python scripts to do the rest.

3 participants