Salmon fails to match the transcript name between Gencode reference and annotation files #15

Closed
nicolasstransky opened this issue Sep 30, 2015 · 8 comments

@nicolasstransky

The transcript names in Gencode's reference sequence fasta files have the following format:
ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|

In the .gtf gene annotation files, only the transcript name appears:
ENST00000257408.4

As a consequence, salmon fails to match them and does not report the correct values in quant.genes.sf. Values in quant.sf seem to be correct though.

Nico
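
The mismatch can be reproduced in a few lines. This is only an illustrative sketch, not salmon's code: it assumes the indexed name is whatever precedes the first whitespace in the FASTA header (the behavior described later in this thread), and the helper names `name_as_indexed` and `name_up_to_pipe` are hypothetical.

```python
def name_as_indexed(header: str) -> str:
    """Return the header text up to the first whitespace (assumed current behavior)."""
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def name_up_to_pipe(header: str) -> str:
    """Additionally cut the name at the first '|' character."""
    return name_as_indexed(header).split("|", 1)[0]


# Header and GTF identifier copied from the report above.
header = (">ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|"
          "OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|")
gtf_transcript_id = "ENST00000257408.4"

# The full header never equals the GTF identifier, so the lookup fails.
print(name_as_indexed(header) == gtf_transcript_id)
# Cutting at the first '|' recovers the identifier used in the GTF.
print(name_up_to_pipe(header) == gtf_transcript_id)
```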

@rob-p rob-p self-assigned this Sep 30, 2015
@rob-p
Collaborator

rob-p commented Sep 30, 2015

Hi @nicolasstransky --- thanks for reporting this. Now the question is, how should this be handled? I see at least 2 obvious possibilities:

  1. Assume that the transcript name should be split at the first whitespace character or |. Currently,
    it is only split at the first whitespace.
  2. If a gtf is provided for gene-level quantification, ensure that some non-trivial number of genes (e.g.
    more than half?) have at least 1 transcript in the index corresponding to them. If not, then complain.

Of course, there are also potentially other, better solutions, so I'm open to suggestions. The problem with 1 is that de-novo assemblers may produce transcript names that are not unique up to the first |, so that the whole name needs to be taken into account. The problem with 2 is that it alerts the user to this potential issue, but doesn't resolve it. In the latter case, the user could provide the transcript-to-gene mapping using the provided transcript names in the "simple" format, i.e.

a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs separated by a tab

which is also accepted by the --geneMap option. I sort of lean toward 2, but, as I said, am happy to consider other suggestions.
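
As a rough sketch of that workaround for Gencode-style headers, the "simple" mapping could be derived from the FASTA headers themselves. This assumes the second `|`-delimited field is the gene identifier, as in the example header in the report; the function name `write_simple_gene_map` is hypothetical and not part of salmon.

```python
import io


def write_simple_gene_map(fasta_lines, out):
    """Write 'transcript<TAB>gene' lines for Gencode-style headers.

    The transcript column keeps the full name (up to the first whitespace),
    so it matches the names as they appear in the FASTA file.
    Returns the number of mappings written.
    """
    written = 0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split(None, 1)
        if not parts:
            continue  # empty header
        name = parts[0]
        fields = name.split("|")
        if len(fields) < 2 or not fields[1]:
            continue  # no gene field present (assumption: field 2 is the gene)
        out.write(f"{name}\t{fields[1]}\n")
        written += 1
    return written


# Small in-memory example; a real run would iterate over an open FASTA file.
example = [">ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|\n", "ACGT\n"]
buffer = io.StringIO()
write_simple_gene_map(example, buffer)
print(buffer.getvalue(), end="")
```

The resulting file would then be the kind of input described above for --geneMap.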

@nicolasstransky
Author

Fair points. There are potentially a lot of special cases, but since Gencode is widely used, it would be great to have a way to handle its format natively (i.e., consider | in addition to whitespace).
I think the problem with 2 is not a real problem, because if you can't match transcript names in the gtf file that is provided, it's likely that there is a problem with the input.

@mdshw5
Contributor

mdshw5 commented Sep 30, 2015

This issue reminds me to ask: what is the best way to ingest a GTF plus reference FASTA file and produce a transcript FASTA file ready for salmon indexing? I see that there may be some issues with using cufflinks gtf-to-fasta tool: https://groups.google.com/forum/#!msg/sailfish-users/oNVLlxJzgv4/nQYt9m4BBOcJ

@rob-p
Collaborator

rob-p commented Sep 30, 2015

@nicolasstransky --- Ok, so, while I'm generally reluctant to adopt special cases, Gencode may warrant one. Or, a more general solution would be to allow the user to specify a list of "separator" characters while indexing (which defaults to \s+). I think that, so far, I actually like this option the best. Also, this isn't mutually exclusive with 2. The ideal thing would be to (1) allow arbitrary separators defined by the user and (2) warn the user if many genes seem to have no transcripts in the index.
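
The combined idea might look roughly like the following. It is a sketch under stated assumptions rather than salmon's implementation: `make_name_parser`, `check_coverage`, and the 0.5 threshold (taken from the "more than half?" suggestion earlier in the thread) are all hypothetical.

```python
import re
import warnings


def make_name_parser(separators=None):
    """Build a parser that cuts a name at the first separator.

    `separators` is a string of extra literal characters; whitespace always
    counts as a separator, which mirrors the default described above.
    """
    extra = re.escape(separators) if separators else ""
    pattern = re.compile(r"[\s" + extra + r"]")

    def parse(header):
        return pattern.split(header.lstrip(">").strip(), maxsplit=1)[0]

    return parse


def check_coverage(gene_to_transcripts, indexed_names, min_fraction=0.5):
    """Warn if too few genes have at least one transcript among the indexed names."""
    indexed = set(indexed_names)
    total = len(gene_to_transcripts)
    covered = sum(1 for txs in gene_to_transcripts.values() if indexed.intersection(txs))
    ok = total == 0 or covered / total > min_fraction
    if not ok:
        warnings.warn(f"only {covered} of {total} genes have a transcript in the index")
    return ok


header = ">ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|"
genes = {"ENSG00000134962.6": ["ENST00000257408.4"]}

default_parse = make_name_parser()      # whitespace only
pipe_parse = make_name_parser("|")      # whitespace or '|'

print(check_coverage(genes, [default_parse(header)]))  # emits a warning
print(check_coverage(genes, [pipe_parse(header)]))
```

Keeping the separator set user-defined leaves de-novo assemblies, whose names may only be unique as a whole, on the whitespace-only default.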

@rob-p
Collaborator

rob-p commented Sep 30, 2015

@mdshw5, the best option I've found so far is actually rsem-prepare-reference. It's a bit slower than gtf-to-fasta, but, so far, seems to do a better job producing a usable transcriptome in the general case.

@nicolasstransky
Author

@rob-p Using a list of "separator" characters is a nice idea. I think that's the best solution so far. However, it would also be good if Gencode files worked "out of the box", since they are so commonly used.

@mdshw5
Contributor

mdshw5 commented Sep 30, 2015

Thanks, @rob-p. In the same vein, have you considered taking a GTF + FASTA for salmon index? It seems this might even solve @nicolasstransky's issue here.

rob-p added a commit that referenced this issue Aug 18, 2016
Addresses #15
Add --gencode option to salmon indexer

@rob-p
Collaborator

rob-p commented Aug 21, 2016

The gencode option behaves as described above and is implemented as of commit d44df88, so it should make it into the next tagged release.
