Error with Salmon build: It removes identical transcript sequences #214

Closed
kvittingseerup opened this issue Apr 16, 2018 · 17 comments

Comments

@kvittingseerup

We just discovered that Salmon's index build removes/collapses identical transcripts. This is very problematic, as many genes are duplicated throughout the genome. When identical sequences are collapsed in the index, one of them is arbitrarily selected (and the others removed), meaning all downstream analysis will assume all expression originates from one genomic region instead of many.

In the most recent Gencode mouse release this problem affects 1563 sequences annotated as 13812 and covers all transcript types (incl. 840 protein coding, although the major ones are lincRNA (n=3658) and snoRNAs (n=2622)).

We strongly believe that if one wants to analyse these duplicated regions jointly, this should be done just like one would sum all transcripts from a particular gene to get the gene expression.
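
To gauge how many annotated transcripts share an identical sequence before indexing, something like the following could be used. It is a minimal sketch, not Salmon's own duplicate detection: it compares raw sequences case-insensitively and does not model any preprocessing the indexer applies. The function names and the toy FASTA records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def duplicate_groups(fasta_text: str) -> List[List[str]]:
    """Return groups (size > 1) of record names sharing an identical sequence."""
    by_seq: Dict[str, List[str]] = defaultdict(list)
    for name, seq in read_fasta(fasta_text):
        by_seq[seq].append(name)
    return [names for names in by_seq.values() if len(names) > 1]


# Hypothetical toy input: tx1 and tx2 carry the same sequence.
example = ">tx1\nACGTACGT\n>tx2\nACGT\nACGT\n>tx3\nTTTTACGT\n"
print(duplicate_groups(example))  # [['tx1', 'tx2']]
```

Each returned group is a set of annotated transcripts that would be indistinguishable to the quantifier; summing their abundances afterwards is the "analyse jointly" option described above.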

@rob-p
Collaborator

rob-p commented Apr 16, 2018

Hi Kristoffer,

The duplicate transcript issue is a frustrating one. It came to our attention when we noticed that Ensembl often annotated transcripts on patch / haplotype contigs that were identical to, and unlikely to be different in any way from, more "canonical" transcripts. Further, these transcripts are indistinguishable from the quantification perspective.

That being said, the removal of sequence-duplicate transcripts is optional in Salmon. If you pass --keepDuplicates to the indexer, it won't remove them.

Also, Salmon does record, in the index directory, the "collapsing map". Specifically, there is a tsv file that records, for every collapsed transcript, the transcript that was sequence identical and retained in the index. You can use this map to recover the abundances for the collapsed transcripts: since they are all sequence identical, they should all have an abundance of x / num duplicates (where x is the abundance of the retained transcript).

I hope this info helps. Let me know if there is anything else I can clarify or help with.

Best,
Rob
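
The recovery step described above (splitting x evenly over a duplicate cluster) might look roughly like this. It is a sketch under stated assumptions, not Salmon code: the column names `RetainedRef` and `DuplicateRef` are an assumption about the collapsing map (commonly a file named `duplicate_clusters.tsv` in the index directory, which should be checked against your own index), "num duplicates" is taken here to be the cluster size including the retained transcript, and the transcript IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict


def expand_collapsed(quant: Dict[str, float], map_tsv: str) -> Dict[str, float]:
    """Split each retained transcript's abundance evenly across its duplicate cluster.

    quant   : abundance per retained transcript (e.g. TPM from quant.sf)
    map_tsv : text of the collapsing map; ASSUMED to be tab-separated with a
              header row containing RetainedRef and DuplicateRef
    """
    clusters = defaultdict(list)
    reader = csv.DictReader(io.StringIO(map_tsv), delimiter="\t")
    for row in reader:
        clusters[row["RetainedRef"]].append(row["DuplicateRef"])

    expanded = dict(quant)
    for retained, dups in clusters.items():
        if retained not in quant:
            continue  # retained transcript absent from this quantification
        # Cluster size = the retained transcript plus its removed duplicates.
        share = quant[retained] / (len(dups) + 1)
        expanded[retained] = share
        for dup in dups:
            expanded[dup] = share
    return expanded


# Hypothetical IDs and values, for illustration only.
quant = {"ENST_A": 9.0, "ENST_B": 2.0}
collapse_map = "RetainedRef\tDuplicateRef\nENST_A\tENST_A_dup1\nENST_A\tENST_A_dup2\n"
print(expand_collapsed(quant, collapse_map))
```

The total abundance is preserved: the cluster members sum back to the value originally reported for the retained transcript.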

@kvittingseerup
Author

That is frustrating. But I have to agree the haplotype problem is probably the larger one of the two...

@rob-p
Collaborator

rob-p commented Apr 16, 2018

Yea. Both are frustrating, which is why we spam warning messages to the console when we remove duplicates. Sorry if this default behavior caused you any trouble, but hopefully it's easy to recover these quants without rerunning anything, using the map of collapsed transcripts.

@kvittingseerup
Author

Actually it's fairly easy with GENCODE annotation as they do not have haplotypes in the general annotation - so we can just use the --keepDuplicates :-)

@rbenel

rbenel commented Aug 5, 2018

Hi,
I am writing here because I think this issue is relevant to both @rob-p and @kvittingseerup. I ran my salmon analysis twice with the most recent GENCODE annotation https://www.gencodegenes.org/releases/current.html -> PRI: once with the --keepDuplicates option in the indexing and once without (because I read this post late..).
When loading the data into IsoformSwitchAnalyzeR the first time (w/o --keepDuplicates), I received the following warning message: "The annotation (count matrix and isoform annotation) contain differences in which isoforms are analyzed... 875 more isoforms than the count matrix...". Following the run with --keepDuplicates, I now receive "67 more isoforms than the count matrix". If I am using the --keepDuplicates option, what exactly are these 67 isoforms?

@rob-p
Collaborator

rob-p commented Aug 5, 2018

Hi @rbenel,

Can you post here the output of salmon's indexing phase? Does it mention discarding anything? Presumably, we can just do a set difference on the isoform sets to see what's happening.

--Rob
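
A set difference of this kind could be sketched as below. It assumes the annotation is a GTF with `transcript` feature rows carrying a `transcript_id` attribute, and that the first column of quant.sf (`Name`) holds the transcript ID, possibly followed by `|`-separated GENCODE fields. The helper names and the toy records are hypothetical.

```python
import re
from typing import List, Set

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_text: str) -> Set[str]:
    """Collect transcript_id values from the 'transcript' rows of a GTF."""
    ids = set()
    for line in gtf_text.splitlines():
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def quant_names(quant_sf_text: str) -> Set[str]:
    """Collect transcript names from quant.sf, dropping any '|'-separated suffix."""
    lines = quant_sf_text.splitlines()
    # The first line is the column header (Name, Length, ...), so skip it.
    return {ln.split("\t")[0].split("|")[0] for ln in lines[1:] if ln.strip()}


def missing_from_quant(gtf_text: str, quant_sf_text: str) -> List[str]:
    """Transcripts present in the annotation but absent from the quantification."""
    return sorted(gtf_transcript_ids(gtf_text) - quant_names(quant_sf_text))


# Hypothetical toy records, for illustration only.
gtf = (
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1.1";\n'
    'GL000009.2\tENSEMBL\ttranscript\t1\t50\t.\t-\t.\tgene_id "G2"; transcript_id "T2.1";\n'
)
quant = "Name\tLength\tEffectiveLength\tTPM\tNumReads\nT1.1|G1|\t100\t50.0\t1.0\t3.0\n"
print(missing_from_quant(gtf, quant))  # ['T2.1']
```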

@rbenel

rbenel commented Aug 6, 2018

Hi @rob-p,
Sure, I am posting the output of the indexing phase with the --keepDuplicates option.

[Step 1 of 4] : counting k-mers
counted k-mers for 40000 transcripts[2018-08-02 16:23:28.827] [jointLog] [warning] Entry with header [ENST00000473810.1|ENSG00000239255.1|OTTHUMG00000157482.1|OTTHUMT00000348942.1|RP11-145M9.2-001|RP11-145M9.2|25|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:28.909] [jointLog] [warning] Entry with header [ENST00000603775.1|ENSG00000271544.1|OTTHUMG00000184300.1|OTTHUMT00000468575.1|AC006499.9-001|AC006499.9|23|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 80000 transcripts[2018-08-02 16:23:29.870] [jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602.2|OTTHUMT00000485301.2|RP11-520H11.10-001|TRBD1|12|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 120000 transcripts[2018-08-02 16:23:31.098] [jointLog] [warning] Entry with header [ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|RP11-210L7.2-001|HELLPAR|205012|macro_lncRNA|] was longer than 200000 nucleotides.  Are you certain that we are indexing a transcriptome and not a genome?
[2018-08-02 16:23:31.151] [jointLog] [warning] Entry with header [ENST00000543745.1|ENSG00000255972.1|OTTHUMG00000168883.1|OTTHUMT00000401485.1|RP11-324E6.8-001|RP11-324E6.8|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 130000 transcripts[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844.2|OTTHUMT00000410670.2|AE000661.52-001|TRDD1|8|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845.2|OTTHUMT00000410671.2|AE000661.53-001|TRDD2|9|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846.2|OTTHUMT00000410672.2|AE000661.54-001|TRDD3|13|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435.2|OTTHUMT00000326213.2|AL122127.38-001|IGHD7-27|11|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429.2|OTTHUMT00000326207.2|AL122127.37-001|IGHD1-26|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436.2|OTTHUMT00000326214.2|AL122127.36-001|IGHD6-25|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427.2|OTTHUMT00000326205.2|AL122127.35-001|IGHD5-24|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000437320.1|ENSG00000227196.1|OTTHUMG00000152438.2|OTTHUMT00000326216.2|AL122127.34-001|IGHD4-23|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390572.1|ENSG00000211912.1|OTTHUMG00000152428.2|OTTHUMT00000326206.2|AL122127.32-001|IGHD2-21|28|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432.2|OTTHUMT00000326210.2|AL122127.31-001|IGHD1-20|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431.2|OTTHUMT00000326209.2|AL122127.30-001|IGHD6-19|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433.2|OTTHUMT00000326211.2|AL122127.29-001|IGHD5-18|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000431870.1|ENSG00000227800.1|OTTHUMG00000152437.2|OTTHUMT00000326215.2|AL122127.28-001|IGHD4-17|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369.2|OTTHUMT00000326003.2|AB019441.47-001|IGHD1-14|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370.2|OTTHUMT00000326004.2|AB019441.46-001|IGHD6-13|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367.2|OTTHUMT00000326001.2|AB019441.45-001|IGHD5-12|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000431440.2|ENSG00000232543.2|OTTHUMG00000152368.2|OTTHUMT00000326002.2|AB019441.44-001|IGHD4-11|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357.2|OTTHUMT00000325963.2|AB019441.40-001|IGHD1-7|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353.2|OTTHUMT00000325959.2|AB019441.39-001|IGHD6-6|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360.2|OTTHUMT00000325966.2|AB019441.38-001|IGHD5-5|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000414852.1|ENSG00000233655.1|OTTHUMG00000152355.2|OTTHUMT00000325961.2|AB019441.37-001|IGHD4-4|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359.2|OTTHUMT00000325965.2|AB019441.34-001|IGHD1-1|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.546] [jointLog] [warning] Entry with header [ENST00000518246.1|ENSG00000254045.1|OTTHUMG00000152060.1|OTTHUMT00000325154.1|AB019439.71-001|IGHVIII-22-2|28|IG_V_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622.2|OTTHUMT00000468982.2|RP11-1360M22.8-001|IGHD5OR15-5A|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000603326.1|ENSG00000271317.1|OTTHUMG00000184621.3|OTTHUMT00000468981.3|RP11-1360M22.7-001|IGHD4OR15-4A|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580.2|OTTHUMT00000468908.2|RP11-1360M22.3-001|IGHD1OR15-1A|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624.2|OTTHUMT00000468984.2|RP11-810K23.15-001|IGHD5OR15-5B|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000603693.1|ENSG00000270451.1|OTTHUMG00000184611.3|OTTHUMT00000468945.3|RP11-810K23.14-001|IGHD4OR15-4B|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585.2|OTTHUMT00000468915.2|RP11-1360M22.4-001|IGHD1OR15-1B|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 150000 transcripts[2018-08-02 16:23:32.097] [jointLog] [warning] Entry with header [ENST00000579054.1|ENSG00000266416.1|OTTHUMG00000179204.1|OTTHUMT00000445280.1|RP1-66C13.2-001|RP1-66C13.2|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 170000 transcripts[2018-08-02 16:23:32.554] [jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|RP11-157B13.10-001|RP11-157B13.10|28|unprocessed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 200000 transcriptsElapsed time: 5.76935s

[2018-08-02 16:23:33.248] [jointLog] [warning] There were 808 transcripts that would need to be removed to avoid duplicates.
Replaced 4 non-ATCG nucleotides
Clipped poly-A tails from 1586 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0169059s
Writing sequence data to file . . . done
Elapsed time: 0.13359s
[info] Building 32-bit suffix array (length of generalized text is 309778559)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 6.96499s
done
Elapsed time: 33.5821s
processed 309000000 positions
khash had 130317526 keys
saving hash to disk . . . done
Elapsed time: 34.8185s
[2018-08-02 16:26:58.153] [jLog] [info] done building index

I reproduced the warnings from the initial run w/o the --keepDuplicates argument.

[Step 1 of 4] : counting k-mers
[2018-08-06 09:29:02.061] [jointLog] [warning] Entry with header [ENST00000473810.1|ENSG00000239255.1|OTTHUMG00000157482.1|OTTHUMT00000348942.1|RP11-145M9.2-001|RP11-145M9.2|25|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:02.143] [jointLog] [warning] Entry with header [ENST00000603775.1|ENSG00000271544.1|OTTHUMG00000184300.1|OTTHUMT00000468575.1|AC006499.9-001|AC006499.9|23|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:03.084] [jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602.2|OTTHUMT00000485301.2|RP11-520H11.10-001|TRBD1|12|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.306] [jointLog] [warning] Entry with header [ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|RP11-210L7.2-001|HELLPAR|205012|macro_lncRNA|] was longer than 200000 nucleotides.  Are you certain that we are indexing a transcriptome and not a genome?
[2018-08-06 09:29:04.359] [jointLog] [warning] Entry with header [ENST00000543745.1|ENSG00000255972.1|OTTHUMG00000168883.1|OTTHUMT00000401485.1|RP11-324E6.8-001|RP11-324E6.8|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844.2|OTTHUMT00000410670.2|AE000661.52-001|TRDD1|8|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845.2|OTTHUMT00000410671.2|AE000661.53-001|TRDD2|9|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846.2|OTTHUMT00000410672.2|AE000661.54-001|TRDD3|13|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435.2|OTTHUMT00000326213.2|AL122127.38-001|IGHD7-27|11|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429.2|OTTHUMT00000326207.2|AL122127.37-001|IGHD1-26|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436.2|OTTHUMT00000326214.2|AL122127.36-001|IGHD6-25|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427.2|OTTHUMT00000326205.2|AL122127.35-001|IGHD5-24|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000437320.1|ENSG00000227196.1|OTTHUMG00000152438.2|OTTHUMT00000326216.2|AL122127.34-001|IGHD4-23|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390572.1|ENSG00000211912.1|OTTHUMG00000152428.2|OTTHUMT00000326206.2|AL122127.32-001|IGHD2-21|28|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432.2|OTTHUMT00000326210.2|AL122127.31-001|IGHD1-20|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431.2|OTTHUMT00000326209.2|AL122127.30-001|IGHD6-19|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433.2|OTTHUMT00000326211.2|AL122127.29-001|IGHD5-18|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000431870.1|ENSG00000227800.1|OTTHUMG00000152437.2|OTTHUMT00000326215.2|AL122127.28-001|IGHD4-17|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369.2|OTTHUMT00000326003.2|AB019441.47-001|IGHD1-14|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370.2|OTTHUMT00000326004.2|AB019441.46-001|IGHD6-13|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367.2|OTTHUMT00000326001.2|AB019441.45-001|IGHD5-12|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000431440.2|ENSG00000232543.2|OTTHUMG00000152368.2|OTTHUMT00000326002.2|AB019441.44-001|IGHD4-11|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357.2|OTTHUMT00000325963.2|AB019441.40-001|IGHD1-7|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353.2|OTTHUMT00000325959.2|AB019441.39-001|IGHD6-6|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360.2|OTTHUMT00000325966.2|AB019441.38-001|IGHD5-5|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000414852.1|ENSG00000233655.1|OTTHUMG00000152355.2|OTTHUMT00000325961.2|AB019441.37-001|IGHD4-4|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359.2|OTTHUMT00000325965.2|AB019441.34-001|IGHD1-1|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000518246.1|ENSG00000254045.1|OTTHUMG00000152060.1|OTTHUMT00000325154.1|AB019439.71-001|IGHVIII-22-2|28|IG_V_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622.2|OTTHUMT00000468982.2|RP11-1360M22.8-001|IGHD5OR15-5A|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000603326.1|ENSG00000271317.1|OTTHUMG00000184621.3|OTTHUMT00000468981.3|RP11-1360M22.7-001|IGHD4OR15-4A|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580.2|OTTHUMT00000468908.2|RP11-1360M22.3-001|IGHD1OR15-1A|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624.2|OTTHUMT00000468984.2|RP11-810K23.15-001|IGHD5OR15-5B|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000603693.1|ENSG00000270451.1|OTTHUMG00000184611.3|OTTHUMT00000468945.3|RP11-810K23.14-001|IGHD4OR15-4B|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585.2|OTTHUMT00000468915.2|RP11-1360M22.4-001|IGHD1OR15-1B|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:05.304] [jointLog] [warning] Entry with header [ENST00000579054.1|ENSG00000266416.1|OTTHUMG00000179204.1|OTTHUMT00000445280.1|RP1-66C13.2-001|RP1-66C13.2|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:05.761] [jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|RP11-157B13.10-001|RP11-157B13.10|28|unprocessed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
Elapsed time: 5.65811s

[2018-08-06 09:29:06.451] [jointLog] [warning] Removed 808 transcripts that were sequence duplicates of indexed transcripts.
[2018-08-06 09:29:06.451] [jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
Replaced 4 non-ATCG nucleotides
Clipped poly-A tails from 1586 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0178594s
Writing sequence data to file . . . done
Elapsed time: 0.702003s
[info] Building 32-bit suffix array (length of generalized text is 308972089)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 8.62493s
done
Elapsed time: 35.9517s
processed 308000000 positions
khash had 130317526 keys
saving hash to disk . . . done
Elapsed time: 29.414s
[2018-08-06 09:34:12.370] [jLog] [info] done building index
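
To summarize the "had length less than the k-mer length" warnings from logs like the two above, a small parser along these lines may help. It is only a sketch that relies on the warning wording and the `|`-separated GENCODE header layout visible in these logs (length as the seventh field, biotype as the eighth); other Salmon versions or header formats may differ. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches the bracketed FASTA header inside a "too short" warning line.
_SHORT = re.compile(r"Entry with header \[([^\]]+)\], had length less than")


def short_entries(log_text: str) -> List[Tuple[str, int, str]]:
    """Return (transcript_id, length, biotype) for every too-short warning."""
    entries = []
    for match in _SHORT.finditer(log_text):
        fields = match.group(1).split("|")
        # ASSUMED layout: fields[6] is the length, fields[7] the biotype.
        if len(fields) < 8 or not fields[6].isdigit():
            continue
        entries.append((fields[0], int(fields[6]), fields[7]))
    return entries


def biotype_counts(log_text: str) -> Counter:
    """Count too-short entries per biotype."""
    return Counter(biotype for _, _, biotype in short_entries(log_text))
```

These short entries are a separate matter from duplicate removal; counting them per biotype only shows which kinds of annotation records triggered the warning.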

@kvittingseerup
Author

Could you also specify exactly which of the GENCODE files you are using?

@rbenel

rbenel commented Aug 6, 2018

Yes, it is in the previous post.. https://www.gencodegenes.org/releases/current.html -> PRI.

@rob-p
Collaborator

rob-p commented Aug 6, 2018

Could you post one of your output quant.sf files? I can investigate.

@rbenel

rbenel commented Aug 6, 2018

Hi,

Here is a link to Dropbox: https://www.dropbox.com/s/herbw9te1g9sgv2/quant.sf?dl=0

@rob-p
Collaborator

rob-p commented Aug 6, 2018

Hi @rbenel,

This is quite interesting. So I downloaded both the Gencode transcriptome (all transcript sequences) and the annotation you point out (PRI --- comprehensive gene annotation). There are a few transcripts present in the latter but not the former:

-ENST00000618686.1
-ENST00000613230.1
-ENST00000400754.4
-ENST00000618679.1
-ENST00000612465.1
-ENST00000611619.1
-ENST00000620032.1
-ENST00000621382.1
-ENST00000616049.4
-ENST00000616157.1
-ENST00000616468.1
-ENST00000611062.1
-ENST00000612565.1
-ENST00000612919.1
-ENST00000619317.1
-ENST00000611446.1
-ENST00000614535.1
-ENST00000619779.1
-ENST00000621409.1
-ENST00000611690.1
-ENST00000620265.1
-ENST00000614336.4
-ENST00000612640.4
-ENST00000612721.4
-ENST00000616361.1
-ENST00000619109.1
-ENST00000618083.1
-ENST00000612315.1
-ENST00000601199.2
-ENST00000612848.1
-ENST00000612801.1
-ENST00000617089.1
-ENST00000614351.1
-ENST00000619729.1
-ENST00000618003.1
-ENST00000615005.1
-ENST00000516246.2
-ENST00000621137.1
-ENST00000614604.4
-ENST00000620810.1
-ENST00000613373.1
-ENST00000612882.1
-ENST00000622674.1
-ENST00000616048.1
-ENST00000616638.1
-ENST00000618201.1
-ENST00000621028.1
-ENST00000619806.1
-ENST00000611339.1
-ENST00000613216.4
-ENST00000619130.1
-ENST00000612243.1
-ENST00000614110.1
-ENST00000611746.1
-ENST00000619792.1
-ENST00000620795.1
-ENST00000618675.1
-ENST00000616292.1
-ENST00000615130.1
-ENST00000618998.1
-ENST00000615362.1
-ENST00000617983.1
-ENST00000613204.1
-ENST00000615165.1
-ENST00000621424.4
-ENST00000616830.1
-ENST00000612925.1

Specifically, these are not dropped by salmon. They are not in the input reference transcriptome file. So it looks like Gencode includes these in the GTF, but not in the transcriptome fasta. I looked at the first few, and nothing immediately jumps out as to why Gencode would have dropped them from the fasta file. Do these transcript names have any special significance to you?

If you really want to include them, one option would be to use the GTF + the genome, and a tool like gffread to extract the transcriptome sequences from the genome and annotation. However, I might first try to investigate what these transcripts are, and if they are something that you want to quantify / consider.
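
One way to start investigating such transcripts is to tally which contig each of them is annotated on. The sketch below assumes a GTF whose `transcript` rows carry a `transcript_id` attribute; the function name and the toy records are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def contigs_of(gtf_text: str, transcript_ids: Iterable[str]) -> Counter:
    """Count, per contig (GTF column 1), how many of the given transcripts lie on it."""
    wanted = set(transcript_ids)
    tally = Counter()
    for line in gtf_text.splitlines():
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match and match.group(1) in wanted:
            tally[fields[0]] += 1
    return tally


# Hypothetical toy records, for illustration only.
gtf = (
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1.1";\n'
    'GL000009.2\tENSEMBL\ttranscript\t1\t50\t.\t-\t.\tgene_id "G2"; transcript_id "T2.1";\n'
)
print(contigs_of(gtf, ["T2.1"]))
```

If all of the missing transcripts fall on a small set of non-chromosomal contigs, that points to a mismatch between the region sets covered by the FASTA and the GTF rather than to anything Salmon dropped.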

@kvittingseerup
Author

GENCODE provides one FASTA file called "Transcript sequences", which "only" contains the "CHR" (chromosomal) regions.

GENCODE provides many GTF files (specifically 9). The GTF file corresponding to the FASTA file mentioned above is the "Comprehensive gene annotation" from the "CHR" regions (aka chromosomal regions), which is the first on their list.

You have downloaded "Pri" (the third entry), which covers the normal chromosomes (Chr) as well as scaffolds, which explains the 68 extra transcripts. Specifically, the scaffolds included in "Pri" but not in "Chr" are:

"GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000213.1"
"GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000225.1"
"KI270442.1" "KI270711.1" "KI270713.1" "KI270721.1" "KI270726.1"
"KI270727.1" "KI270728.1" "KI270731.1" "KI270733.1" "KI270734.1"
"KI270744.1" "KI270750.1"

So the solution is as @rob-p suggested:

  1. Use gffread to make your own fasta file
  2. Remove those extra transcripts (or use the "Chr" GTF file)
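
For the second option, dropping the scaffold records from the "Pri" GTF could be sketched as follows. It assumes, as the scaffold names listed above suggest, that chromosome records have a sequence name starting with "chr" while scaffolds (GL..., KI...) do not; the function name is hypothetical, and downloading the "Chr" GTF directly is likely the simpler route.

```python
from typing import Iterable, Iterator


def keep_chromosomal(gtf_lines: Iterable[str], prefix: str = "chr") -> Iterator[str]:
    """Yield GTF comment lines and records whose sequence name starts with `prefix`."""
    for line in gtf_lines:
        # Column 1 of a GTF record is the sequence (contig) name.
        if line.startswith("#") or line.split("\t", 1)[0].startswith(prefix):
            yield line


# Hypothetical toy records, for illustration only.
records = [
    "##description: toy annotation",
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\ttranscript_id "T1.1";',
    'KI270442.1\tENSEMBL\ttranscript\t1\t50\t.\t-\t.\ttranscript_id "T2.1";',
]
for kept in keep_chromosomal(records):
    print(kept)
```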

Cheers
Kristoffer

@rbenel

rbenel commented Aug 8, 2018

Thank you both! I need to look into those transcripts, to see if anything looks important.

@Tima-Ze

Tima-Ze commented Dec 26, 2020

Hi all,
Just an update:
I also got the same warning message (as @rbenel describes here) when creating an index along with decoy sequences. I took @kvittingseerup's advice and made a transcripts.fa file with the gffread command. Here are my input files and commands:
All GTF and genome references were downloaded from GENCODE: GRCh38.primary_assembly.genome.fa.gz, gencode.v36.annotation.gtf (CHR) and gencode.v36.transcripts.fa.gz.
commands:
grep "^>" <(gunzip -c GRCh38.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
cat salmon_transcripts.fa.gz GRCh38.primary_assembly.genome.fa.gz > gentrome.fa.gz
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon-decoy-sa-index --gencode
warnings:

[Step 1 of 4] : counting k-mers
[2020-12-26 10:47:50.823] [puff::index::jointLog] [warning] Entry with header [ENST00000473810.1|ENSG00000239255.1|OTTHUMG00000157482|OTTHUMT00000348942.1|AC007620.1-201|AC007620.1|25|processed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:50.973] [puff::index::jointLog] [warning] Entry with header [ENST00000603775.1|ENSG00000271544.1|OTTHUMG00000184300|OTTHUMT00000468575.1|AC006499.8-201|AC006499.8|23|processed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:52.758] [puff::index::jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602|OTTHUMT00000485301.1|TRBD1-201|TRBD1|12|TR_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:54.950] [puff::index::jointLog] [warning] Entry with header [ENST00000543745.1|ENSG00000255972.1|OTTHUMG00000168883|OTTHUMT00000401485.1|AC026333.1-201|AC026333.1|28|processed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.202] [puff::index::jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844|OTTHUMT00000410670.1|TRDD1-201|TRDD1|8|TR_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.202] [puff::index::jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845|OTTHUMT00000410671.1|TRDD2-201|TRDD2|9|TR_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.202] [puff::index::jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846|OTTHUMT00000410672.1|TRDD3-201|TRDD3|13|TR_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435|OTTHUMT00000326213.1|IGHD7-27-201|IGHD7-27|11|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429|OTTHUMT00000326207.1|IGHD1-26-201|IGHD1-26|20|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436|OTTHUMT00000326214.1|IGHD6-25-201|IGHD6-25|18|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427|OTTHUMT00000326205.1|IGHD5-24-201|IGHD5-24|20|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000437320.1|ENSG00000227196.1|OTTHUMG00000152438|OTTHUMT00000326216.1|IGHD4-23-201|IGHD4-23|19|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390571.1|ENSG00000211911.1|OTTHUMG00000152434|OTTHUMT00000326212.1|IGHD3-22-201|IGHD3-22|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390572.1|ENSG00000211912.1|OTTHUMG00000152428|OTTHUMT00000326206.1|IGHD2-21-201|IGHD2-21|28|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432|OTTHUMT00000326210.1|IGHD1-20-201|IGHD1-20|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431|OTTHUMT00000326209.1|IGHD6-19-201|IGHD6-19|21|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433|OTTHUMT00000326211.1|IGHD5-18-201|IGHD5-18|20|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000431870.1|ENSG00000227800.1|OTTHUMG00000152437|OTTHUMT00000326215.1|IGHD4-17-201|IGHD4-17|16|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390578.1|ENSG00000211918.1|OTTHUMG00000152430|OTTHUMT00000326208.1|IGHD2-15-201|IGHD2-15|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369|OTTHUMT00000326003.1|IGHD1-14-201|IGHD1-14|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370|OTTHUMT00000326004.1|IGHD6-13-201|IGHD6-13|21|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367|OTTHUMT00000326001.1|IGHD5-12-201|IGHD5-12|23|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000431440.2|ENSG00000232543.2|OTTHUMG00000152368|OTTHUMT00000326002.1|IGHD4-11-201|IGHD4-11|16|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390583.1|ENSG00000211923.1|OTTHUMG00000152354|OTTHUMT00000325960.1|IGHD3-10-201|IGHD3-10|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390584.1|ENSG00000211924.1|OTTHUMG00000152352|OTTHUMT00000325958.1|IGHD3-9-201|IGHD3-9|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390585.1|ENSG00000211925.1|OTTHUMG00000152361|OTTHUMT00000325967.1|IGHD2-8-201|IGHD2-8|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357|OTTHUMT00000325963.1|IGHD1-7-201|IGHD1-7|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353|OTTHUMT00000325959.1|IGHD6-6-201|IGHD6-6|18|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360|OTTHUMT00000325966.1|IGHD5-5-201|IGHD5-5|20|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.499] [puff::index::jointLog] [warning] Entry with header [ENST00000414852.1|ENSG00000233655.1|OTTHUMG00000152355|OTTHUMT00000325961.1|IGHD4-4-201|IGHD4-4|16|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.500] [puff::index::jointLog] [warning] Entry with header [ENST00000390590.1|ENSG00000211930.1|OTTHUMG00000152358|OTTHUMT00000325964.1|IGHD3-3-201|IGHD3-3|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.500] [puff::index::jointLog] [warning] Entry with header [ENST00000390591.1|ENSG00000211931.1|OTTHUMG00000152356|OTTHUMT00000325962.1|IGHD2-2-201|IGHD2-2|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.500] [puff::index::jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359|OTTHUMT00000325965.1|IGHD1-1-201|IGHD1-1|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.501] [puff::index::jointLog] [warning] Entry with header [ENST00000518246.1|ENSG00000254045.1|OTTHUMG00000152060|OTTHUMT00000325154.1|IGHVIII-22-2-201|IGHVIII-22-2|28|IG_V_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.502] [puff::index::jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622|OTTHUMT00000468982.1|IGHD5OR15-5A-201|IGHD5OR15-5A|23|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.502] [puff::index::jointLog] [warning] Entry with header [ENST00000603326.1|ENSG00000271317.1|OTTHUMG00000184621|OTTHUMT00000468981.1|IGHD4OR15-4A-201|IGHD4OR15-4A|19|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.502] [puff::index::jointLog] [warning] Entry with header [ENST00000604950.1|ENSG00000282520.1|OTTHUMG00000184598|OTTHUMT00000468928.1|IGHD3OR15-3A-201|IGHD3OR15-3A|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.502] [puff::index::jointLog] [warning] Entry with header [ENST00000603077.1|ENSG00000282599.1|OTTHUMG00000184594|OTTHUMT00000468924.1|IGHD2OR15-2A-201|IGHD2OR15-2A|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.502] [puff::index::jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580|OTTHUMT00000468908.1|IGHD1OR15-1A-201|IGHD1OR15-1A|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.505] [puff::index::jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624|OTTHUMT00000468984.1|IGHD5OR15-5B-201|IGHD5OR15-5B|23|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.505] [puff::index::jointLog] [warning] Entry with header [ENST00000603693.1|ENSG00000270451.1|OTTHUMG00000184611|OTTHUMT00000468945.1|IGHD4OR15-4B-201|IGHD4OR15-4B|19|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.505] [puff::index::jointLog] [warning] Entry with header [ENST00000603935.1|ENSG00000282089.1|OTTHUMG00000184596|OTTHUMT00000468926.1|IGHD3OR15-3B-201|IGHD3OR15-3B|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.505] [puff::index::jointLog] [warning] Entry with header [ENST00000604102.1|ENSG00000282268.1|OTTHUMG00000184595|OTTHUMT00000468925.1|IGHD2OR15-2B-201|IGHD2OR15-2B|31|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:55.505] [puff::index::jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585|OTTHUMT00000468915.1|IGHD1OR15-1B-201|IGHD1OR15-1B|17|IG_D_gene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:56.430] [puff::index::jointLog] [warning] Entry with header [ENST00000579054.1|ENSG00000266416.1|OTTHUMG00000179204|OTTHUMT00000445280.1|AC130289.2-201|AC130289.2|28|processed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:47:57.304] [puff::index::jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398|OTTHUMT00000487783.1|AC073539.14-201|AC073539.14|28|unprocessed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 10:49:06.436] [puff::index::jointLog] [warning] Removed 829 transcripts that were sequence duplicates of indexed transcripts.
[2020-12-26 10:49:06.436] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2020-12-26 10:49:06.448] [puff::index::jointLog] [info] Replaced 151,122,967 non-ATCG nucleotides
[2020-12-26 10:49:06.448] [puff::index::jointLog] [info] Clipped poly-A tails from 1,829 transcripts
wrote 231443 cleaned references
[2020-12-26 10:49:09.969] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-12-26 10:49:40.159] [puff::index::jointLog] [info] ntHll estimated 2628436199 distinct k-mers, setting filter size to 2^36
Threads = 12
Vertex length = 31
Hash functions = 5
Filter size = 68719476736
Capacity = 2
Files:
salmon-decoy-sa-index/ref_k31_fixed.fa

**So, using gffread, I created a new transcripts FASTA file:**
gffread -w salmon_transcripts.fa -g GRCh38.primary_assembly.genome.fa gencode.v36.annotation.gtf

**Using this new salmon_transcripts.fa, I re-ran the salmon index command with decoys shown above, but the same warnings appeared again:**

[Step 1 of 4] : counting k-mers
[2020-12-26 11:30:08.799] [puff::index::jointLog] [warning] Entry with header [ENST00000473810.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:08.951] [puff::index::jointLog] [warning] Entry with header [ENST00000603775.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:10.751] [puff::index::jointLog] [warning] Entry with header [ENST00000632684.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:12.936] [puff::index::jointLog] [warning] Entry with header [ENST00000543745.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000415118.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000434970.2], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000448914.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000439842.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390567.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000452198.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390569.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000437320.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390571.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390572.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000450276.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390574.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390575.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000431870.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390578.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000451044.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390580.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390581.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000431440.2], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390583.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390584.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390585.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000430425.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000454691.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390588.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000414852.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390590.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390591.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.484] [puff::index::jointLog] [warning] Entry with header [ENST00000454908.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.484] [puff::index::jointLog] [warning] Entry with header [ENST00000518246.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000604642.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000603326.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000604950.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000603077.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000605284.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604446.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000603693.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000603935.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604102.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604838.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:14.411] [puff::index::jointLog] [warning] Entry with header [ENST00000579054.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:15.280] [puff::index::jointLog] [warning] Entry with header [ENST00000634174.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:31:24.590] [puff::index::jointLog] [warning] Removed 829 transcripts that were sequence duplicates of indexed transcripts.
[2020-12-26 11:31:24.590] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2020-12-26 11:31:24.641] [puff::index::jointLog] [info] Replaced 151,122,967 non-ATCG nucleotides
[2020-12-26 11:31:24.641] [puff::index::jointLog] [info] Clipped poly-A tails from 1,829 transcripts
wrote 231443 cleaned references
[2020-12-26 11:31:28.118] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-12-26 11:31:58.286] [puff::index::jointLog] [info] ntHll estimated 2628436199 distinct k-mers, setting filter size to 2^36
Threads = 12
Vertex length = 31
Hash functions = 5
Filter size = 68719476736
Capacity = 2
Files:
salmon-decoy-sa-index/ref_k31_fixed.fa
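
One way to confirm that both runs warn about the same transcripts is to pull the transcript ID out of each warning line and compare the two sets. This is only a sketch: it assumes the output of each run was saved to a file, that the warning lines have the format shown above, and `warned_ids` is a helper name introduced here, not something salmon provides.

```shell
# warned_ids LOGFILE: print the unique transcript IDs named in
# "Entry with header [...]" warnings (hypothetical helper, not part of salmon).
warned_ids() {
  # Capture the text after "Entry with header [" up to the first "|" or "]",
  # so both GENCODE-style and gffread-style headers reduce to the ENST ID.
  sed -n 's/.*Entry with header \[\([^]|]*\).*/\1/p' "$1" | sort -u
}
```

With two saved logs (the names `run1.log` and `run2.log` are placeholders), `comm -3 <(warned_ids run1.log) <(warned_ids run2.log)` in bash would print nothing if both runs flagged the same transcripts.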

**My concern is: will this cause problems for the rest of the downstream analysis?**

Thanks,
Tima

@rob-p
Collaborator

rob-p commented Dec 26, 2020

Hi @Tima-Ze,

This should not cause any trouble with downstream analysis. The indexer is simply informing you that the transcripts named in these warnings are no longer than the k-mer length (31), which is the seed length used for alignment. No fragment can be aligned to such a transcript, so it will always have an abundance of 0 in the resulting quant.sf files. This isn't a problem, as these transcripts are too short to be measured via RNA-seq anyway; the indexing messages just let you know this in advance. You can safely ignore these warnings for your downstream analysis.
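
To see ahead of time which transcripts fall into this category, one can list the FASTA records whose sequence is no longer than the k-mer length. The snippet below is a minimal sketch and not part of salmon: the function name `short_records` is hypothetical, and it measures the raw sequence length, so a record that only becomes too short after poly-A clipping would not be listed.

```shell
# short_records FASTA [K]: print "<id><TAB><length>" for every record whose
# sequence length is <= K (default 31). Hypothetical helper, not part of salmon.
short_records() {
  local fasta="$1" k="${2:-31}"
  awk -v k="$k" '
    # Emit the record collected so far if it is short enough.
    function report() { if (id != "" && len <= k) printf "%s\t%d\n", id, len }
    # Header line: report the previous record, then start a new one.
    /^>/ { report(); id = substr($1, 2); len = 0; next }
    # Sequence line: accumulate length across wrapped lines.
    { len += length($0) }
    END { report() }
  ' "$fasta"
}
```

For example, `short_records salmon_transcripts.fa 31` should print the short D-gene and pseudogene entries named in the warnings, each with its length.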
