unable to identify sequence in sequence set by name #167
Comments
Hmm, I can't tell what's going wrong there. Can you provide a minimal example of a maf file and the exact call to I agree that it would be more convenient if manually replacing characters in IDs would not be necessary. Maybe @greatfireball has an idea how to fix that?
Well, so I'm running using a single maf file representing a multiple sequence alignment for about 100 samples. Each sample has a tsv feature file. I was looking into your test data directory to see if I could modify one of your existing test cases to look like mine, but I didn't find a maf file and tsv feature file that could be used together. My yaml file looks like
where each fasta file has a single sequence and whose name matches what I put in the name field, so the header of There is nothing special about my command-line invocation. I am just using
Ok, after looking some more into this, I think our support for importing pre-calculated alignments is really rudimentary. The ids you use might get replaced even if all characters are letters, digits and underscores. This can happen, e.g., if they are too long (more than 8 characters). When you run your command there should be a file called
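As a rough illustration of the conditions mentioned above, here is a minimal sketch that flags ids which look likely to be replaced on import. It assumes only the two criteria named in this comment (characters outside letters, digits and underscores, or a length above 8); the function names are hypothetical and AliTV's actual renaming logic may apply further rules.

```python
import re

# Assumption taken from the comment above: ids longer than 8 characters,
# or containing anything other than letters, digits and underscores,
# are candidates for replacement. The real rules in AliTV may differ.
MAX_ID_LENGTH = 8
SAFE_ID = re.compile(r"[A-Za-z0-9_]+")


def may_be_renamed(seq_id: str) -> bool:
    """Return True if seq_id looks likely to be replaced on import."""
    too_long = len(seq_id) > MAX_ID_LENGTH
    has_unsafe_chars = SAFE_ID.fullmatch(seq_id) is None
    return too_long or has_unsafe_chars


def risky_ids(seq_ids):
    """Return the subset of seq_ids that may be renamed, in input order."""
    return [seq_id for seq_id in seq_ids if may_be_renamed(seq_id)]
```

Running `risky_ids` over the names used in the yaml file before starting the import gives a quick list of names worth checking against the generated mapping.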
I do actually get this in the logging output:
but none of the names listed there (they're only about 8) are the one that it claims to be unable to find when it crashes afterwards. And...looking at the .map file, yes, it looks like everything was renamed:
Ok, replacing the names in the maf file gets me past the error, but I encounter other ones later:
Okay, getting closer. The most likely cause of this issue is that there are sequences in the maf file that are not in the fasta files. However, the error message does not help us much, so I created a new branch: improve-maf-import. Currently it only makes the last error message more meaningful by printing the failing sequence ids. If you check out that branch and re-run, you should see which ids in the maf are causing this error.
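Independently of the branch, the same check can be done by hand. The following is a minimal sketch, not AliTV code: it collects the source names from the `s` lines of a MAF file and the first word of each FASTA header, then reports names that appear only in the MAF. The helper names are hypothetical, and it assumes the FASTA id is everything between `>` and the first whitespace.

```python
def maf_sequence_names(maf_lines):
    """Collect the source names from the 's' lines of a MAF file."""
    names = set()
    for line in maf_lines:
        fields = line.split()
        # MAF sequence lines look like: s src start size strand srcSize text
        if len(fields) >= 2 and fields[0] == "s":
            names.add(fields[1])
    return names


def fasta_sequence_names(fasta_lines):
    """Collect sequence ids (first word after '>') from FASTA headers."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def missing_from_fasta(maf_lines, fasta_lines):
    """Return the sorted MAF names that have no matching FASTA header."""
    return sorted(maf_sequence_names(maf_lines) - fasta_sequence_names(fasta_lines))
```

Both functions accept any iterable of lines, so an open file handle works as well as a list; with several FASTA files, the header lines can simply be concatenated before the comparison.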
Looks like all of them:
Presumably I need to change the names in the yaml file, too.
So now I've changed the names in the yaml file, the sequence fasta headers, and the features.tsv file and I'm still seeing this error, so maybe that doesn't have to do with the names after all? The only place where the original names persist is in the sequence file name, but I doubt that matters. For features on the minus strand, is it a problem if the start position is greater than the stop position? I wonder if that's the problem. The only other potential issue I see is that my feature names are just numbers between 1 and ~700.
I did not expect such a long list of sequences. I know MAF supports multiple alignments, which AliTV breaks down into pairwise alignments: #33
I prepended These warnings:
look to be due to the fact that my maf file doesn't have scores for the alignments. But looking at the format specification, the score is an optional field, so perhaps its existence should be checked before trying to take the md5. I'm not sure if this is related to the ultimate failure of the run, though.
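To make the suggestion concrete, here is a minimal sketch of checking for the score before hashing it. It is not how AliTV handles this; the function names are hypothetical, and exactly what AliTV feeds into the md5 is an assumption made for illustration. It relies only on the MAF convention that an `a` line carries optional `key=value` pairs.

```python
import hashlib


def parse_alignment_header(line):
    """Parse a MAF 'a' line into a dict of its key=value pairs."""
    fields = line.split()
    if not fields or fields[0] != "a":
        raise ValueError("not a MAF alignment header: %r" % line)
    pairs = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            pairs[key] = value
    return pairs


def score_digest(line):
    """Return the md5 hex digest of the score, or None if no score is given.

    Hashing only the score is an assumption for this sketch.
    """
    score = parse_alignment_header(line).get("score")
    if score is None:
        return None
    return hashlib.md5(score.encode("utf-8")).hexdigest()
```

Returning `None` for a missing score lets the caller decide whether to skip the digest or fall back to some other key, instead of hashing an undefined value.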
I just pushed a change to the branch https://github.com/AliTVTeam/AliTV-perl-interface/tree/improve-maf-import where you can call
I'm trying to run alitv.pl with an existing maf alignment file and I'm getting an error about a sequence that isn't found but is definitely there. Also, I replaced all occurrences of the hyphen in the original sequence name (4-0010) with an underscore, but it'd be nice not to have to do that.