Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for rare "Invalid alphabet type" nhmmer error #252

Merged

Conversation

traviswheeler
Copy link
Contributor

The following error can arise in edge case inputs:

% nhmmer --dna A.fa B.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

An example input that will produce this error is:

% cat A.fa

seq1
AA

% cat B.fa

seq1
AC


Note: The --dna flag tells nhmmer that the QUERY is in DNA format.
It does not assert anything about the target, and there isn't
a flag that does.

There is a check in the first step of reading the target that aims
to stop a user from providing a protein target sequence to nhmmer:

  esl_sqfile_GuessAlphabet(dbfp, &q_type);
  if (! (q_type == eslDNA || q_type == eslRNA))
    p7_Fail("Invalid alphabet type in target for nhmmer. Expect DNA or RNA.\n");

The problem arises when esl_sqfile_GuessAlphabet() is unable to
guess the target alphabet. This can happen when the target is
not long/diverse enough to provide the guesser with enough
information. Two examples beyond A.fa and B.fa above:

>seq1 - too short  (needs to be >10 characters)
ACGTACGTAC

>seq1 - does not include all 4 nucleotides.
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

These kind of inputs seem likely to be sample inputs to
test that nhmmer runs, not true sequences. In any event,
we'd like nhmmer to run.

We can overcome this error by encoding the approach "if the
guesser can't figure it out, we'll assume that the query
format is the guide". This is acheived by simply allowing
eslUNKNOWN as the result of a guess:

      if (! (q_type == eslDNA || q_type == eslRNA || q_type == eslUNKNOWN))

If the input sequence is guessed as a protein, it'll
still correctly induce the error.

That's the change found in this commit.

Note: it is possible for a user to still sneak past this
test, e.g. with the input:

target_seq
EE

It has illegal letters for DNA, but is too short for the
guesser to guess. As committed, nhmmer will just breeze
past that sequence with no match. An input with >10 letters
containing any illegal DNA character will produce the
previous message:

% cat A.fa

seq1
ACGTACGTAC

% cat E.fa

aa_seq
AAAAAAAAAAE

% nhmmer --dna A.fa E.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.


I'm of the opinion that this "short amino sequences don't raise
and error" issue is ok, since that's deep in user error territory
(and such a short sequence would never produce enough score to
yield a match anyway). That said, if we want to be more robust
about invalid amino acid letters in too-short inputs, we could
create an easel function that tests if a supposed string is a
legal match for an input type. To my knowledge, this function
doesn't currently exist. It seems too heavyweight for the problem
at hand, but I'm open to going that route, if you think it's
necessary.

The following error can arise in edge case inputs:

% nhmmer --dna A.fa B.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

An example input that will produce this error is:

% cat A.fa
>seq1
AA

% cat B.fa
>seq1
AC

~~~

Note: The --dna flag tells nhmmer that the QUERY is in DNA format.
It does not assert anything about the target, and there isn't
a flag that does.

There is a check in the first step of reading the target that aims
to stop a user from providing a protein target sequence to nhmmer:

  esl_sqfile_GuessAlphabet(dbfp, &q_type);
  if (! (q_type == eslDNA || q_type == eslRNA))
    p7_Fail("Invalid alphabet type in target for nhmmer. Expect DNA or RNA.\n");

The problem arises when esl_sqfile_GuessAlphabet() is unable to
guess the target alphabet. This can happen when the target is
not long/diverse enough to provide the guesser with enough
information. Two examples beyond A.fa and B.fa above:

>seq1 - too short  (needs to be >10 characters)
ACGTACGTAC

>seq1 - does not include all 4 nucleotides.
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

These kind of inputs seem likely to be sample inputs to
test that nhmmer runs, not true sequences. In any event,
we'd like nhmmer to run.

We can overcome this error by encoding the approach "if the
guesser can't figure it out, we'll assume that the query
format is the guide". This is acheived by simply allowing
eslUNKNOWN as the result of a guess:

      if (! (q_type == eslDNA || q_type == eslRNA || q_type == eslUNKNOWN))

If the input sequence is guessed as a protein, it'll
still correctly induce the error.

That's the change found in this commit.

~~~

Note: it is possible for a user to still sneak past this
test, e.g. with the input:

>target_seq
EE

It has illegal letters for DNA, but is too short for the
guesser to guess. As committed, nhmmer will just breeze
past that sequence with no match. An input with >10 letters
containing any illegal DNA character will produce the
previous message:

% cat A.fa
>seq1
ACGTACGTAC

% cat E.fa
>aa_seq
AAAAAAAAAAE

% nhmmer --dna A.fa E.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

~~~

I'm of the opinion that this "short amino sequences don't raise
and error" issue is ok, since that's deep in user error territory
(and such a short sequence would never produce enough score to
yield a match anyway). That said, if we want to be more robust
about invalid amino acid letters in too-short inputs, we could
create an easel function that tests if a supposed string is a
legal match for an input type. To my knowledge, this function
doesn't currently exist. It seems too heavyweight for the problem
at hand, but I'm open to going that route, if you think it's
necessary.
@cryptogenomicon
Copy link
Member

Looks good - thanks!

@jksull
Copy link

jksull commented Apr 11, 2022

still getting this error even with the develop branch (which was supposedly fixed a few days ago). When manually encoding the fix on the main branch, the error subsides but instead am met with a segmentation fault error for all the libraries where the above 'invalid alphabet' error was previously present.

@traviswheeler
Copy link
Contributor Author

I have pulled the updates into my local clone of the develop branch, and I do not receive an error when I run a command like:
nhmmer --dna query.fa target-10k.fa

You haven't provided much detail, so it's hard to be sure exactly what problem you're running into. Can you please do the following:

  • run 'git pull' to ensure that you have the fix.
  • if the error persists:
    -- provide us with the exact command you are running, and the exact error you are receiving.
    -- confirm that you are using the files provided in the original issue; if not, tell us that files you are using.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants