Salmon indexing on UCSC genome.fa files fail for mm9 #49

Closed
vd4mmind opened this issue Mar 17, 2016 · 5 comments

Comments

@vd4mmind

I intend to run salmon on a set of RNA-Seq data that has been sitting in our lab for a long time. The data are for mm9, and since there are >50 samples I intend to use Salmon version 0.6.0. I have successfully run earlier versions of salmon, without alignment mode, on hg19 data from both UCSC and NCBI (spiked-in and non-spiked-in data). We recently downloaded and compiled the latest version, and I am now trying to run the indexing on the UCSC mm9 genome.fa file so that I can build quasi-mapping indexes and then run quant on my samples downstream, getting read counts as well as TPM much faster than with any other tool. Can you tell me what the problem is?

Command line used
salmon index -t /path_to/genome.fa -i salmonquasi-indexes --type quasi -k 31

Here is the error message from the RapMap indexer:

Version Info: This is the most recent version of Salmon.
index ["salmonquasi-indexes"] did not previously exist  . . . creating it
[2016-03-17 10:41:34.655] [jointLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
Elapsed time: 53.9731s

Replaced 96385738 non-ATCG nucleotides
Clipped poly-A tails from 0 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.196609s
Writing sequence data to file . . . done
Elapsed time: 1.56391s
[info] Building 64-bit suffix array (length of generalized text is 2654911539 )
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 126.003s
done
Elapsed time: 883.472s
processed 615000000 positions
salmon: /home/vagrant/salmon/external/install/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> >; Key = long unsigned int; HashFcn = rapmap::utils::KmerKeyHasher; ExtractKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SelectKey; SetKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SetKey; EqualKey = std::equal_to<long unsigned int>; Alloc = google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.
Aborted

I also checked the log file, and it shows nothing except the following:

more indexing.log
[2016-03-17 10:41:34.655] [jointLog] [info] building index

Contents of the output directory:

-rw-r--r-- 1 vdas DPT          59 Mar 17 10:41 indexing.log
-rw-r--r-- 1 vdas DPT   331863951 Mar 17 10:42 rsd.bin
-rw-r--r-- 1 vdas DPT  2654912013 Mar 17 10:43 txpInfo.bin
-rw-r--r-- 1 vdas DPT 21239292320 Mar 17 10:59 sa.bin

So can you give me a workaround or inputs to solve this issue? Thanks

@mdshw5
Contributor

mdshw5 commented Mar 17, 2016

It appears that you're trying to index the entire mm9 genome using salmon. Both salmon and rapmap are designed to work with a smaller sequence space, such as what you would find in a transcriptome. Your log file shows that salmon processed 615,000,000 bases from the genome and then aborted. Depending on how many transcripts are in your feature file, a human transcriptome might be 5-10X smaller.
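A quick way to see how large a sequence space a FASTA file represents is to total its bases before indexing. The helper below is only a sketch: `fasta_total_bases` is a hypothetical name, not part of salmon or rapmap, and it assumes an uncompressed FASTA file.

```shell
# Hypothetical helper (not part of salmon): print the total number of
# bases in an uncompressed FASTA file, ignoring header lines.
fasta_total_bases() {
    awk '
        !/^>/ {
            gsub(/[[:space:]]/, "")   # drop stray whitespace and CRs
            total += length($0)       # add this sequence line
        }
        END { printf "%.0f\n", total } # avoid exponent notation for big totals
    ' "$1"
}
```

Running `fasta_total_bases /path_to/genome.fa` and comparing the result with the same count for a transcript FASTA should make the size difference described above concrete.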

@rob-p
Collaborator

rob-p commented Mar 17, 2016

Hi @vd4mmind,

Indeed, @mdshw5 is spot on. The issue you're seeing is the result of the hash table failing to allocate sufficient memory during a doubling step while attempting to build a hash table for all 31-mers in the mouse genome. Apart from the memory requirements of building a quasi-index on the genome (which we're actually working to mitigate because we think it could be useful in another context), such an index won't be particularly useful for quantification. Salmon treats each entry in the multi-FASTA file as a distinct transcriptional target. Thus, even if the index did build successfully here, you'd be quantifying the abundance of different chromosomes and contigs rather than the transcripts. What you should do (as pointed out by @mdshw5 above) is grab a file that contains the mouse transcripts, or take your mm9 genome and an appropriate GTF file and use a tool like gffread to extract the transcript sequences.
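Since each FASTA entry becomes one quantification target, listing the entry names shows what would actually be quantified. This is a minimal sketch; `fasta_targets` is a hypothetical helper name, and it assumes the target name is the first whitespace-delimited token of each header line.

```shell
# Hypothetical helper: print the name of every entry in a FASTA file,
# i.e. the first token of each header line with the leading '>' removed.
fasta_targets() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For example, `fasta_targets /path_to/genome.fa | head` would be expected to print chromosome and contig names for a genome file, whereas a transcript FASTA would list transcript identifiers.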

@vd4mmind
Author

Ah yes, this is true; I realise it now. In fact, I have always run salmon on the transcripts file for human rather than on the genome. We do not have a transcripts FASTA file for mm9 in our lab, so I will create one and then build the index on it. My bad, and thanks for the suggestions. I will report back here once the index has been built. If it's not a problem, I would like to keep this ticket open until then.

@vd4mmind
Author

I have a question, if you would not mind answering it: where can I get the transcripts.gtf file for mm9? Is there a link I can download it from, or do I have to create it on my own? I am a bit confused, and the different forums are adding to my confusion, so any suggestion would be welcome.

@vd4mmind
Author

I have done the required work. Sorry for bothering everyone. I downloaded the refGene.gtf file for mm9 from UCSC, which contains the transcript information, and then used gffread to build transcript.fa for mm9. Finally I ran salmon index, and to my surprise it finished in under 3 minutes. Thanks for all the suggestions. I always like learning something new every day. Closing the issue.
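The two steps described above can be sketched roughly as follows. The exact commands used were not posted, so this is an assumption: the function names and all file names are placeholders, the gffread call uses its `-w` (write transcript sequences) and `-g` (genome FASTA) options, and the salmon options are copied from the original command in this issue. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
# Hypothetical wrapper: echo the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch of the workflow: extract transcript sequences from the genome
# using a GTF annotation, then index the transcripts (not the genome).
# Arguments: genome FASTA, GTF file, output transcript FASTA, index dir.
build_transcript_index() {
    genome=$1
    gtf=$2
    out_fa=$3
    index_dir=$4

    # Step 1: gffread writes spliced transcript sequences to $out_fa.
    run gffread -w "$out_fa" -g "$genome" "$gtf" || return 1

    # Step 2: same salmon options as the original command in this issue,
    # but pointed at the transcript FASTA.
    run salmon index -t "$out_fa" -i "$index_dir" --type quasi -k 31
}
```

A call might look like `build_transcript_index /path_to/genome.fa refGene.gtf transcript.fa salmonquasi-indexes`, with the paths adjusted to your own layout.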


3 participants