Fix for rare "Invalid alphabet type" nhmmer error #252

traviswheeler · 2021-08-16T23:32:49Z

The following error can arise in edge case inputs:

% nhmmer --dna A.fa B.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

An example input that will produce this error is:

% cat A.fa
>seq1
AA

% cat B.fa
>seq1
AC


Note: The --dna flag tells nhmmer that the QUERY is in DNA format.
It does not assert anything about the target, and there isn't
a flag that does.

There is a check in the first step of reading the target that aims
to stop a user from providing a protein target sequence to nhmmer:

  esl_sqfile_GuessAlphabet(dbfp, &q_type);
  if (! (q_type == eslDNA || q_type == eslRNA))
    p7_Fail("Invalid alphabet type in target for nhmmer. Expect DNA or RNA.\n");

The problem arises when esl_sqfile_GuessAlphabet() is unable to
guess the target alphabet. This can happen when the target is
not long/diverse enough to provide the guesser with enough
information. Two examples beyond A.fa and B.fa above:

>seq1 - too short  (needs to be >10 characters)
ACGTACGTAC

>seq1 - does not include all 4 nucleotides.
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

These kinds of inputs seem likely to be sample inputs to
test that nhmmer runs, not true sequences. In any event,
we'd like nhmmer to run.

We can overcome this error by encoding the approach "if the
guesser can't figure it out, we'll assume that the query
format is the guide". This is achieved by simply allowing
eslUNKNOWN as the result of a guess:

      if (! (q_type == eslDNA || q_type == eslRNA || q_type == eslUNKNOWN))

If the input sequence is guessed as a protein, it'll
still correctly induce the error.

That's the change found in this commit.
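
The decision the check makes after this change can be sketched in isolation. This is only an illustration, not the nhmmer source: the enum and the function name target_alphabet_ok are hypothetical stand-ins for Easel's eslUNKNOWN/eslRNA/eslDNA/eslAMINO codes, and their numeric values are not meant to match Easel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Easel's alphabet type codes. The real
 * constants live in Easel's headers; these values are illustrative only. */
enum alph_guess { GUESS_UNKNOWN, GUESS_RNA, GUESS_DNA, GUESS_AMINO };

/* Returns true when the target may be searched.
 * An inconclusive guess is accepted, so the query alphabet acts as the
 * guide; a confident protein guess is still rejected. */
bool target_alphabet_ok(enum alph_guess guess)
{
    /* Precondition: guess is one of the four codes defined above. */
    assert(guess >= GUESS_UNKNOWN && guess <= GUESS_AMINO);

    return guess == GUESS_DNA
        || guess == GUESS_RNA
        || guess == GUESS_UNKNOWN;
}
```

With this shape, only a target that the guesser positively identifies as protein reaches the p7_Fail() branch.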

Note: it is possible for a user to still sneak past this
test, e.g. with the input:

>target_seq
EE

It has illegal letters for DNA, but is too short for the
guesser to guess. As committed, nhmmer will just breeze
past that sequence with no match. An input with >10 letters
containing any illegal DNA character will produce the
previous message:

% cat A.fa
>seq1
ACGTACGTAC

% cat E.fa
>aa_seq
AAAAAAAAAAE

% nhmmer --dna A.fa E.fa
Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.


I'm of the opinion that this "short amino sequences don't raise
an error" issue is ok, since that's deep in user error territory
(and such a short sequence would never produce enough score to
yield a match anyway). That said, if we want to be more robust
about invalid amino acid letters in too-short inputs, we could
create an easel function that tests if a supposed string is a
legal match for an input type. To my knowledge, this function
doesn't currently exist. It seems too heavyweight for the problem
at hand, but I'm open to going that route, if you think it's
necessary.
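
For reference, the kind of validity test described above could look roughly like the sketch below. It is a hypothetical helper, not an existing Easel function: the name seq_is_legal_nucleic and the set of accepted symbols (canonical bases plus IUPAC degeneracy codes, case-insensitive) are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Easel: returns true when every
 * character of seq is a symbol a nucleotide alphabet might use. */
bool seq_is_legal_nucleic(const char *seq)
{
    /* Assumed symbol set: A, C, G, T, U plus IUPAC degeneracy codes. */
    static const char allowed[] = "ACGTURYSWKMBDHVN";

    assert(seq != NULL);

    for (const char *p = seq; *p != '\0'; p++) {
        int c = toupper((unsigned char)*p);
        if (strchr(allowed, c) == NULL)
            return false;   /* e.g. 'E' is not a nucleotide symbol */
    }
    return true;            /* an empty string is vacuously legal */
}
```

Applied to the examples in this description, such a test would accept AC and reject both EE and AAAAAAAAAAE, regardless of length.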


cryptogenomicon · 2021-08-17T12:33:10Z

Looks good - thanks!

jksull · 2022-04-11T02:51:53Z

Still getting this error even with the develop branch (which was supposedly fixed a few days ago). When I manually apply the fix on the main branch, the error goes away, but I am instead met with a segmentation fault for all the libraries where the above 'invalid alphabet' error was previously present.

traviswheeler · 2022-04-11T03:12:36Z

I have pulled the updates into my local clone of the develop branch, and I do not receive an error when I run a command like:
nhmmer --dna query.fa target-10k.fa

You haven't provided much detail, so it's hard to be sure exactly what problem you're running into. Can you please do the following:

run 'git pull' to ensure that you have the fix.
if the error persists:
-- provide us with the exact command you are running, and the exact error you are receiving.
-- confirm that you are using the files provided in the original issue; if not, tell us which files you are using.

Thanks

cryptogenomicon merged commit 387fb90 into EddyRivasLab:develop Aug 17, 2021

traviswheeler deleted the nhmmer-invalid-alph branch August 20, 2021 22:31

This was referenced Mar 17, 2022

nhmmer failed to run - Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA tseemann/barrnap#54

Open

Error: Invalid alphabet type in target / expected FASTA to start with > #271

Closed
