unable to identify sequence in sequence set by name #167

Open
0xaf1f opened this issue Mar 17, 2021 · 12 comments

@0xaf1f

0xaf1f commented Mar 17, 2021

I'm trying to run alitv.pl with an existing maf alignment file and I'm getting an error about a sequence that isn't found that is definitely there.

INFO - MAF input file and buggy BioPerl detected... Therefore, workaround for revcom issue activated
FATAL - Unable to identify sequence in sequence set by name '4_0010' at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV.pm line 99.

Unable to identify sequence in sequence set by name '4_0010' at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV.pm line 99.

Also, I replaced all occurrences of the hyphen in the original sequence name (4-0010) with an underscore, but it'd be nice to not have to do that.

@iimog
Member

iimog commented Mar 19, 2021

Hmm, I can't tell what's going wrong there. Can you provide a minimal example of a maf file and the exact call to alitv.pl so I can reproduce and debug it?

I agree that it would be more convenient if manually replacing characters in IDs were not necessary. Maybe @greatfireball has an idea how to fix that?

@0xaf1f
Author

0xaf1f commented Mar 22, 2021

Well, I'm running with a single maf file representing a multiple sequence alignment for about 100 samples. Each sample has a tsv feature file. I looked into your test data directory to see if I could modify one of your existing test cases to look like mine, but I didn't find a maf file and a tsv feature file that could be used together.

My yaml file looks like

genomes:
    -
         name: 1_0021
         sequence_files:
              - renamed-seqs/1_0021.fasta
         feature_files:
              blocks:
                  - 500/features.tsv

...
...

alignment:
    program: importer
    parameter:
        - "alignment.manglednames.maf"

where each fasta file holds a single sequence whose name matches what I put in the name field, so the header of renamed-seqs/1_0021.fasta is >1_0021

There is nothing special about my command-line invocation. I am just using /path/to/AliTV-perl-interface/bin/alitv.pl --project tb --overwrite alitv.yaml

@iimog
Member

iimog commented Mar 25, 2021

Ok, after looking some more into this, I think our support for importing pre-calculated alignments is really rudimentary. The IDs you use might get replaced even if all characters are letters, digits, and underscores. This can happen, e.g., if they are too long (more than 8 characters). When you run your command, there should be a file called tb.map. When you look into that, are IDs being renamed? If so, you might need to replace the IDs in the maf files with the new names (you can keep the old names in the fasta files). If you can verify that this fixes the issue, we should really work on automating this process at our end.

@0xaf1f
Author

0xaf1f commented Mar 25, 2021

I do actually get this in the logging output:

INFO - Sequence names are longer then maximum allowed length (8 characters) and will be replaced by unique sequence names. Failing sequence names are:  ... ... ...

but none of the names listed there (there are only about 8 of them) is the one that it claims to be unable to find when it crashes afterwards.

And...looking at the .map file, yes, it looks like everything was renamed:

#genome old_name        new_name
1_0006  1_0006  seq0
1_0009  1_0009  seq1
1_0013  1_0013  seq2
1_0017  1_0017  seq3
1_0021  1_0021  seq4
1_0028  1_0028  seq5
... ...
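
A minimal sketch of the renaming step discussed above, in case it helps anyone hitting the same error. It is not part of AliTV. It assumes the .map file has three whitespace-separated columns as in the excerpt (genome, old_name, new_name), that `#` starts a comment line, and that the source field of each MAF `s` line equals the old name exactly. The function names `load_id_map` and `rename_maf_line` are made up for this example.

```python
import re


def load_id_map(map_lines):
    """Parse 'genome old_name new_name' rows into {old_name: new_name}."""
    mapping = {}
    for line in map_lines:
        line = line.strip()
        # Skip blank lines and the '#genome old_name new_name' header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: well-formed rows have exactly 3 columns
        _genome, old_name, new_name = fields
        mapping[old_name] = new_name
    return mapping


def rename_maf_line(line, mapping):
    """Rewrite the src field of a MAF 's' line; return other lines unchanged."""
    match = re.match(r"^(s\s+)(\S+)(.*)$", line, flags=re.DOTALL)
    if match is None:
        return line
    prefix, src, rest = match.groups()
    # Names without an entry in the map are left as they are.
    return prefix + mapping.get(src, src) + rest


def rename_maf(maf_lines, map_lines):
    """Apply the map to every line of a MAF file given as an iterable of lines."""
    mapping = load_id_map(map_lines)
    return [rename_maf_line(line, mapping) for line in maf_lines]
```

To use it on files, pass open file handles for the maf and map files to `rename_maf` and write the returned lines to a new maf file, then point the `parameter` entry in the yaml file at that new file.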

@0xaf1f
Author

0xaf1f commented Mar 25, 2021

Ok, replacing the names in the maf file gets me past the error, but I encounter other ones later:

...
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107641.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107692.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107789.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107842.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107893.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107946.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 107997.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108048.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108099.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108151.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108200.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108251.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108301.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108398.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108448.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108500.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108550.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108602.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108653.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108702.
Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108751.
FATAL - unable to create features at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 305.

unable to create features at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 305.

@iimog
Member

iimog commented Mar 25, 2021

Okay, getting closer. The most likely cause of this issue is that there are sequences in the maf file that are not in the fasta files. However, the error message does not help us much. So I created a new branch: improve-maf-import. For now, it only makes the last error message more meaningful by printing the failing sequence IDs. If you check out that branch and re-run, you should see which IDs in the maf are causing this error.
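
Independently of that branch, the suspected mismatch can be checked with a short script. This is a rough sketch, not AliTV code: it assumes the sequence name is the second whitespace-separated field of each MAF `s` line and the first word after `>` in each fasta header. The helper names are hypothetical.

```python
def maf_sources(maf_lines):
    """Collect the src field (second column) of every MAF 's' line."""
    sources = set()
    for line in maf_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "s":
            sources.add(fields[1])
    return sources


def fasta_ids(fasta_lines):
    """Collect the first word after '>' from every fasta header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def missing_from_fasta(maf_lines, fasta_lines):
    """Return the sorted names used in the maf that no fasta header provides."""
    return sorted(maf_sources(maf_lines) - fasta_ids(fasta_lines))
```

With many fasta files, the header lines of all of them can be chained into a single iterable before calling `missing_from_fasta`. An empty result would suggest that the names are not the cause of the failure.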

@0xaf1f
Author

0xaf1f commented Mar 29, 2021

Looks like all of them:

FATAL - unable to create features for seqs: seq0 seq1 seq10 seq11 seq12 seq12 seq13 seq14 seq15 seq15 seq16 seq17 seq18 seq19 seq2 seq20 seq21 seq22 seq22 seq23 seq24 seq25 seq26 seq27 seq28 seq29 seq3 seq30 seq31 seq32 seq33 seq33 seq34 seq34 seq35 seq35 seq36 seq36 seq37 seq37 seq38 seq39 seq4 seq40 seq40 seq41 seq41 seq42 seq43 seq43 seq44 seq45 seq45 seq46 seq47 seq47 seq48 seq48 seq49 seq49 seq5 seq50 seq50 seq51 seq51 seq52 seq52 seq53 seq53 seq54 seq55 seq55 seq56 seq57 seq57 seq58 seq59 seq59 seq6 seq6 seq60 seq60 seq61 seq62 seq62 seq63 seq63 seq64 seq64 seq65 seq65 seq66 seq66 seq67 seq67 seq68 seq68 seq69 seq7 seq70 seq70 seq71 seq71 seq72 seq72 seq73 seq73 seq74 seq74 seq75 seq76 seq77 seq78 seq78 seq79 seq79 seq8 seq8 seq80 seq80 seq81 seq81 seq82 seq82 seq83 seq83 seq84 seq85 seq85 seq86 seq87 seq87 seq88 seq88 seq89 seq89 seq9 seq90 seq90 seq91 seq91 seq92 seq92 seq93 seq94 seq94 at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 305.

unable to create features for seqs: seq0 seq1 seq10 seq11 seq12 seq12 seq13 seq14 seq15 seq15 seq16 seq17 seq18 seq19 seq2 seq20 seq21 seq22 seq22 seq23 seq24 seq25 seq26 seq27 seq28 seq29 seq3 seq30 seq31 seq32 seq33 seq33 seq34 seq34 seq35 seq35 seq36 seq36 seq37 seq37 seq38 seq39 seq4 seq40 seq40 seq41 seq41 seq42 seq43 seq43 seq44 seq45 seq45 seq46 seq47 seq47 seq48 seq48 seq49 seq49 seq5 seq50 seq50 seq51 seq51 seq52 seq52 seq53 seq53 seq54 seq55 seq55 seq56 seq57 seq57 seq58 seq59 seq59 seq6 seq6 seq60 seq60 seq61 seq62 seq62 seq63 seq63 seq64 seq64 seq65 seq65 seq66 seq66 seq67 seq67 seq68 seq68 seq69 seq7 seq70 seq70 seq71 seq71 seq72 seq72 seq73 seq73 seq74 seq74 seq75 seq76 seq77 seq78 seq78 seq79 seq79 seq8 seq8 seq80 seq80 seq81 seq81 seq82 seq82 seq83 seq83 seq84 seq85 seq85 seq86 seq87 seq87 seq88 seq88 seq89 seq89 seq9 seq90 seq90 seq91 seq91 seq92 seq92 seq93 seq94 seq94 at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 305.

Presumably I need to change the names in the yaml file, too.

@0xaf1f
Author

0xaf1f commented Mar 29, 2021

So now I've changed the names in the yaml file, the sequence fasta headers, and the features.tsv file and I'm still seeing this error, so maybe that doesn't have to do with the names after all? The only place where the original names persist is in the sequence file name, but I doubt that matters.

For features on the minus strand, is it a problem if the start position is greater than the stop position? I wonder if that's the problem. The only other potential issue I see is that my feature names are just numbers between 1 and ~700.

@iimog
Member

iimog commented Mar 30, 2021

I did not expect such a long list of sequences. I know MAF supports multiple alignments, which AliTV breaks down into pairwise alignments: #33
To me it looks like something is going wrong there, rather than an issue with naming the sequences/features. @greatfireball I'm afraid I need your input here.

@greatfireball
Member

Thank you very much for your efforts @iimog

I will try to follow your discussion here today and answer asap.

Maybe we have a restriction that feature names must start with a letter rather than consist only of digits, as suggested by @0xaf1f? I need to check this.

@0xaf1f
Author

0xaf1f commented Mar 30, 2021

I prepended f to all feature names and tried again with no difference in the outcome. I also tried swapping start and stop positions where start>stop (features on minus strand) and that didn't help either.

These warnings:

Use of uninitialized value in subroutine entry at /path/to/utils/AliTV-perl-interface/bin/../lib/AliTV/Alignment.pm line 262, <GEN95> line 108751.

look to be due to the fact that my maf file doesn't have scores for the alignments. But looking at the format specification, the score is an optional field so perhaps its existence should be checked before trying to take the md5. I'm not sure if this is related to the ultimate failure of the run, though.
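
To illustrate the kind of check meant here, a small sketch follows. It is written in Python for illustration and does not reproduce what Alignment.pm actually does at line 262. It assumes the block header is a MAF `a` line with optional `key=value` pairs and that some identifier is derived from the sequence names plus the score. The names `parse_block_header` and `block_key` are invented for this example.

```python
import hashlib


def parse_block_header(line):
    """Return the key=value pairs of a MAF 'a' line as a dict (may be empty)."""
    fields = line.split()
    if not fields or fields[0] != "a":
        raise ValueError("not a MAF 'a' line")
    pairs = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # ignore tokens that are not key=value
            pairs[key] = value
    return pairs


def block_key(header_line, sources):
    """Hash the sequence names, adding the score only if the header has one."""
    score = parse_block_header(header_line).get("score")  # None when absent
    parts = list(sources)
    if score is not None:
        parts.append(score)
    return hashlib.md5("\t".join(parts).encode("utf-8")).hexdigest()
```

The point is only that a missing score is handled explicitly before hashing instead of being passed on as an undefined value.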

@iimog
Member

iimog commented May 7, 2021

I just pushed a change to the branch https://github.com/AliTVTeam/AliTV-perl-interface/tree/improve-maf-import where you can call alitv.pl --keepids [...] which will prevent the mapping of fasta IDs to unique names. Unfortunately, this will not solve the score problem yet. I'll have to dig deeper and probably need help from @greatfireball.
Note: not renaming IDs might be problematic if you have non-unique IDs (across files) or IDs containing special characters.
