low mapping rate ? #160

Open
atasub opened this issue Oct 6, 2017 · 13 comments

Comments

@atasub

atasub commented Oct 6, 2017

I recently ran Salmon in quasi-mapping-based mode, and when I checked the salmon_quant.log file I saw that the mapping rate was around 65-68% for all of the samples. Do you have any suggestions for improving the mapping rate? I used "--libType A" to infer the library type and got a warning that "Greater than 5% of the fragments disagreed with the provided library type", but I guess this is not an issue. This is an example of one of the "lib_format_counts.json" files:

{
    "read_files": "( /mnt/dznehomes/homes/simonj/RNAseq_pipeline/frontal_data/samples/Trimmed_FASTQ_files/00116_GFM_R1_trimmed.fastq.gz, /mnt/dznehomes/homes/simonj/RNAseq_pipeline/frontal_data/samples/Trimmed_FASTQ_files/00116_GFM_R2_trimmed.fastq.gz )",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 0.9241470144855659,
    "num_compatible_fragments": 34584460,
    "num_assigned_fragments": 37423115,
    "num_consistent_mappings": 334748580,
    "num_inconsistent_mappings": 28046150,
    "MSF": 0,
    "OSF": 32448,
    "ISF": 20518131,
    "MSR": 0,
    "OSR": 487250,
    "ISR": 334748580,
    "SF": 1833525,
    "SR": 5088606,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}
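
As a quick check of the warning against this file, here is a minimal sketch. It assumes that `compatible_fragment_ratio` is simply `num_compatible_fragments / num_assigned_fragments` (this reproduces the value above, but it is an inference from the numbers, not something stated in the thread), and the helper name `incompatible_fraction` is hypothetical.

```python
import json


def incompatible_fraction(counts: dict) -> float:
    """Fraction of assigned fragments that disagree with the expected library type.

    Assumes the ratio is num_compatible_fragments / num_assigned_fragments.
    """
    assigned = counts["num_assigned_fragments"]
    compatible = counts["num_compatible_fragments"]
    return 1.0 - compatible / assigned


# Values copied from the lib_format_counts.json shown above.
example = json.loads("""
{
    "expected_format": "ISR",
    "compatible_fragment_ratio": 0.9241470144855659,
    "num_compatible_fragments": 34584460,
    "num_assigned_fragments": 37423115
}
""")

frac = incompatible_fraction(example)
print(f"{frac:.1%} of assigned fragments disagree with {example['expected_format']}")
```

For this sample the incompatible fraction comes out a little under 8%, which is above the 5% threshold mentioned in the warning.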
@rob-p
Collaborator

rob-p commented Oct 6, 2017

Hi @atasub,

It's hard to say exactly if this mapping rate is much lower than expected or not. Many RNA-seq experiments do end up with a mapping rate of 65-70%. One thing that might contribute to a lower mapping rate would be short reads relative to the minimum required exact match length (default of 31). If your reads are relatively short (after trimming, which it looks like you are doing here) --- say ~50bp, then one might try lowering the k value with which the index is built. This will allow more sensitive mapping.
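
To make the read-length point concrete, here is a minimal sketch that estimates the median read length of a trimmed, gzipped FASTQ file and assembles a `salmon index` command with a smaller k. The function names are hypothetical, the file is assumed to use plain 4-line FASTQ records, and k = 23 is only an illustrative value, not a recommendation from this thread. The odd-k check reflects my understanding that Salmon expects an odd k; treat that as an assumption too.

```python
import gzip
from itertools import islice
from statistics import median


def median_read_length(fastq_gz: str, max_reads: int = 100_000) -> float:
    """Median sequence length over the first max_reads records of a gzipped FASTQ."""
    lengths = []
    with gzip.open(fastq_gz, "rt") as fh:
        # In a 4-line FASTQ record the sequence is the 2nd line (index 1, step 4).
        for seq in islice(fh, 1, 4 * max_reads, 4):
            lengths.append(len(seq.rstrip("\n")))
    return median(lengths)


def salmon_index_command(transcripts_fa: str, index_dir: str, k: int) -> list:
    """Build (but do not run) a salmon index command with a custom k."""
    if k % 2 == 0:
        # Assumption: Salmon wants an odd k-mer length.
        raise ValueError("k should be odd")
    return ["salmon", "index", "-t", transcripts_fa, "-i", index_dir, "-k", str(k)]


# Print the command that would rebuild the index with a smaller k.
print(" ".join(salmon_index_command("transcripts.fa", "transcripts_index_k23", 23)))
```

If `median_read_length` on the trimmed files returns something close to 50, rebuilding the index with a smaller k and re-running `salmon quant` would show whether the minimum match length is what limits the mapping rate.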

However, the other thing to try is simply to align one of these samples to the genome with a tool like STAR or HISAT2 and look at their mapping rate to known features. If it's similar, then the other reads could be accounted for by e.g. intron retention or even contamination. Finally, @vals has an excellent series of blog posts on investigating and addressing low mapping rates (albeit in single-cell data) that you might find useful. Let me know what you find.

@atasub
Author

atasub commented Oct 6, 2017

Hi @rob-p ,
thank you for your reply, it was very helpful.
I had used HISAT2 before and got an overall alignment rate of around 97-99%. So, in this case, would you recommend using alignment-based mode with HISAT2-based BAM files?

@roryk
Contributor

roryk commented Oct 6, 2017

Almost always, when I've seen a low mapping rate to the transcriptome together with a high genomic mapping rate, the culprit is that the sample failed and had little to no RNA in it, so what actually got sequenced was DNA. I'm guessing you'll see a lot of intergenic reads in your HISAT2 alignments.

@vals
Contributor

vals commented Oct 6, 2017

Hi @atasub

If you're using the same reference and gene annotation for HISAT2 and Salmon but getting lower mapping rate with Salmon, you probably have some DNA contamination.

You should get the same gene expression results from either strategy, because in the end the GTF file for the genome and the FASTA file for the transcriptome are equivalent.

@hiraksarkar
Collaborator

Hi @atasub ,
This is interesting. Can you tell us a little about the read lengths? I am currently looking into RNA-seq files with poor mapping rates, so I'm curious to know more. Also, can you try running Salmon in selective alignment mode? I'm not sure whether it will improve the mapping rate or the quantification, but it is worth a try: https://github.com/COMBINE-lab/salmon/tree/selective-alignment (the associated pre-print is here).

@roryk
Contributor

roryk commented Oct 6, 2017

Another red flag would be a high rRNA rate going along with it-- the rRNA depletion methods don't work 100%, and if you have no mRNA then the rRNA rate will tend to be higher.

@vals
Contributor

vals commented Oct 6, 2017

Yes, that is also a good explanation. I recommend putting human rRNA in the Salmon index.
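
If rRNA sequences are added to the transcriptome FASTA before indexing, the share of mapped reads they absorb can be read off `quant.sf`. The sketch below assumes the standard tab-separated `quant.sf` columns (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`) and that you know the names of the rRNA records you added. The function name and any record names passed to it are hypothetical.

```python
import csv


def rrna_read_fraction(quant_sf: str, rrna_names: set) -> float:
    """Fraction of Salmon-assigned reads (NumReads) that went to rRNA records.

    This is a fraction of mapped reads, not of all sequenced reads.
    """
    total = 0.0
    rrna = 0.0
    with open(quant_sf, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            n = float(row["NumReads"])
            total += n
            if row["Name"] in rrna_names:
                rrna += n
    return rrna / total if total else 0.0
```

For example, calling `rrna_read_fraction("transcripts_quant/quant.sf", names)` with `names` set to the FASTA headers of the rRNA sequences you appended gives a quick per-sample rRNA rate to compare across samples.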

@InesdeSantiago

@hiraksarkar
How can I use selective mapping in Salmon? I don't see any info in the docs (http://salmon.readthedocs.io/en/latest/salmon.html?highlight=selective)

Say, I have paired-end data, I do:
./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq -o transcripts_quant

How can I specify selective mapping over quasi-mapping?

@hiraksarkar
Collaborator

hiraksarkar commented Feb 21, 2018

@InesdeSantiago
Sorry that the docs are not updated yet; it's definitely on our to-do list. Selective alignment needs a separate index from the normal quasi-index. The steps are as follows.
If you are in the Salmon root folder, the most up-to-date branch that implements selective alignment is the one checked out below:

git checkout rescue-orphan
(re-build it)
build/src/salmon index -i selective_alignment_ind -t transcript.fa
build/src/salmon quant -i selective_alignment_ind -l A -1 reads1.fq -2 reads2.fq -o transcript_quant --softFilter --editDistance 4 --rangeFactorization 4

We strongly recommend these options when using selective alignment, as they almost always produce superior results (I am considering making them the default soon :) )

Please let me know if you run into problems with any of the above steps, or if the results are not what you expect.
Thanks again for using selective alignment and Salmon.

@InesdeSantiago

@hiraksarkar
Thanks!
So, apart from the extra options (softFilter, editDistance, rangeFactorization) the only difference is the indexed genome file?

@hiraksarkar
Collaborator

@InesdeSantiago
That is correct. Just to be sure: Salmon is not designed for the genome, so you probably want to use it only with the transcriptome.

@InesdeSantiago

@hiraksarkar. Yes, force of habit, I meant the transcriptome! ;-)

@RaymondSHANG

Hi, I have a similar case, with a 30~40% mapping rate from Salmon. When I tried HISAT2, the mapping rate went up to >80%. I sorted the SAM files into BAM with samtools sort, and then Qualimap 2 gave me these QC results:

Exonic: 31,212,828 / 41.39%
Intronic: 39,191,136 / 51.97%
Intergenic: 5,008,406 / 6.64%
Intronic/intergenic overlapping exon: 6,243,753 / 8.28%

There is not much DNA contamination, but there is a large proportion of intronic mappings.
What can I do with these data? Any suggestions?
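
As a rough consistency check on these numbers (a back-of-the-envelope sketch, not a Salmon or Qualimap feature): if the exonic share of genome-aligned reads is taken as a proxy for what a transcriptome-only index can capture, the expected Salmon mapping rate lands in the reported 30~40% range. The genome mapping rate of 0.80 is an assumption, since the post only says ">80%".

```python
# Read counts copied from the Qualimap summary above.
counts = {
    "exonic": 31_212_828,
    "intronic": 39_191_136,
    "intergenic": 5_008_406,
}

total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

# Assumption: the post reports ">80%" for HISAT2, so 0.80 is a lower bound.
genome_mapping_rate = 0.80

# Rough proxy: only exonic reads are expected to map to a transcriptome index.
expected_transcriptome_rate = genome_mapping_rate * fractions["exonic"]

for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
print(f"expected transcriptome mapping rate: {expected_transcriptome_rate:.1%}")
```

Under these assumptions, about a third of all reads are exonic, so the low Salmon mapping rate is consistent with most of the remaining reads being intronic rather than with a problem in the quantification step.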
