Questions about salmon mapping rate #482
Comments
Hi @lauraht,

Thanks for the detailed question (with data!). I will try to answer these in order.

(1) The mapping rate of selective alignment is generally expected to be similar to that of quasi-mapping, but there are some important exceptions. You can find some aggregate statistics in Supplementary Figure 1 of the pre-print that introduces selective alignment. There, "oracle" is a method that aligns the reads both to the transcriptome with Bowtie2 and to the genome with STAR, and removes from the Bowtie2 results those reads that align better to the genome.

(2) It is certainly possible that some samples get very little to no mapping. However, there are a few points about how the data are processed that are worth being aware of before you write such samples off.
(3) This is a great question, and I don't have a "standard" practice for such things. Generally, I might consider setting a threshold based on the number of reads that could be mapped rather than on the relative mapping rate itself. For example, if you have a really large sample, even a lower mapping rate may be OK if the total number of mapped reads is respectable. However, "respectable" is a weasel word here, and I don't have any specific suggestion as to what number would be ideal. I also don't think that doing it based on mapping percentage is a bad idea, and in that case, requiring at least a 30-40% mapping rate could certainly be argued to be a reasonable QC metric (a rough sketch of such a filter is at the end of this comment).

Please let me know if you dig into any of the above (specifically the points raised in (2)) and find anything interesting or would like to discuss further.

Best,
Rob
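(As a concrete illustration of the filtering idea in (3): the sketch below applies both an absolute mapped-read threshold and a percentage threshold. The field names are the ones salmon writes to `aux_info/meta_info.json` in each quantification directory; the cutoff values and the `quants/` directory layout are placeholders, not recommendations.)

```bash
#!/usr/bin/env bash
# Keep a sample only if it clears BOTH an absolute mapped-read count
# and a mapping-rate threshold. Both cutoffs below are placeholders.
MIN_MAPPED=1000000   # hypothetical minimum number of mapped reads
MIN_RATE=30          # hypothetical minimum mapping rate (percent)

for d in quants/*/; do
  info="$d/aux_info/meta_info.json"
  n=$(jq '.num_mapped' "$info")       # absolute mapped-read count
  r=$(jq '.percent_mapped' "$info")   # mapping rate as a percentage
  if [ "$n" -ge "$MIN_MAPPED" ] && [ "$(echo "$r >= $MIN_RATE" | bc -l)" -eq 1 ]; then
    echo "PASS $d (mapped=$n, rate=$r%)"
  else
    echo "FAIL $d (mapped=$n, rate=$r%)"
  fi
done
```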
Hi Rob,

Thank you so much for your very thorough and comprehensive explanations (with statistics) and advice! I have looked into these datasets.

The 5 datasets with minimal mapping rates:

SRR9007475
SRR5866113
SRR448056
SRR1535539
SRR3129941

The 4 datasets with a 0% mapping rate (they do not have much other information available):

SRR764657
SRR067901
SRR2913241
SRR1182629

I will dig into the other possibilities that you suggested when I get time, and post an update. Thank you very much for all your help!
Hi @lauraht,

So I decided to explore just one of these datasets to see if I could figure out what might be going on.

The first thing I did was to quality and adapter trim the data. The next thing I tried was indexing with a smaller k (a really small one in this case). The final thing I tried was seeing how the mapping rate changed as I altered the mapping parameters.

So, my conclusion, at least on this sample, is that the main issue is data quality. Trimming the reads and indexing with a smaller k can lead to a somewhat higher mapping rate, but the quality of the underlying data seems to be the limiting factor.
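(In outline, the steps above look roughly like the sketch below. The file names, the choice of fastp as the trimmer, k=19, and `--minScoreFraction` as the knob being varied are all illustrative stand-ins rather than the exact commands and values used in this investigation.)

```bash
# 1) Quality and adapter trim the reads (fastp shown as one possible trimmer).
fastp -i sample_1.fastq.gz -I sample_2.fastq.gz \
      -o trimmed_1.fastq.gz -O trimmed_2.fastq.gz

# 2) Re-index with a smaller k than the default of 31; a smaller k is
#    more forgiving of short or error-prone reads.
salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i idx_k19 -k 19

# 3) Quantify, optionally relaxing the minimum alignment-score fraction
#    to see how the mapping rate responds.
salmon quant -i idx_k19 -l A \
      -1 trimmed_1.fastq.gz -2 trimmed_2.fastq.gz \
      --minScoreFraction 0.55 -o quant_k19
```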
Hi Rob,

You actually did a very thorough investigation of this dataset for me! I greatly appreciate it!

From what you found, it looks like salmon's mapping rate (with default parameters) could be quite a reasonable indicator of a dataset's quality. This is very helpful information.

Thank you so much again for your time and help!
Hi @lauraht,
Yes, I think so too. The one small modification I might put here is "salmon's mapping rate (with default parameters, and after trimming)". Otherwise, I think this makes sense, as salmon's alignment procedure will try quite hard to find a suitable alignment for the read if one exists.

If there's no objection, I'll close this issue, as I think the main question has been resolved. However, if you have a related question, or run into something else, please feel free to open up a new issue.
Hi Rob,
I have a few questions regarding salmon mapping rate and would appreciate your advice.
(1) Generally, is the mapping rate of selective alignment expected to be lower than that of the previous quasi-mapping?
I am currently using salmon 1.1.0. I re-built the transcriptome index using v1.1.0 with the default options (I did not use the --decoys option). I have a few datasets on which I had previously run the older salmon version 0.9.1, so I can compare their mapping rates with the current version. For both versions, I used the default options when running salmon quant (using "-l A"). The following is the comparison of mapping rates:

Except for the last dataset (for which the new version has a slightly higher mapping rate than the old version), the new version has lower mapping rates than the old version. For dataset SRR9698990, the mapping rate of the new version is considerably lower than that of the old version.
Since v1.1.0 uses selective alignment by default while v0.9.1 uses quasi-mapping, I was wondering whether it is expected behavior for the mapping rate of selective alignment (without --decoys in the index) to be generally lower than that of quasi-mapping?
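(For reference, by "using --decoys" I mean the documented decoy-aware indexing recipe, which as I understand it looks roughly like the sketch below; the FASTA file names are placeholders for the Ensembl GRCh38 genome and cDNA files.)

```bash
# Decoy names are the sequence IDs of the genome FASTA.
grep '^>' GRCh38.dna.primary_assembly.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# The "gentrome" is the transcriptome followed by the genome.
cat Homo_sapiens.GRCh38.cdna.all.fa GRCh38.dna.primary_assembly.fa > gentrome.fa

# Build the decoy-aware index (k=31 is the default).
salmon index -t gentrome.fa -d decoys.txt -i salmon_index_decoy -k 31
```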
(2) I ran the new salmon version v1.1.0 on a set of SRA datasets (fastq). The resulting mapping rates fall in all ranges (from 0.0001% to 99%), and some datasets did not get any mapping at all. For example, the following datasets got very minimal mapping rates:

SRR9007475
SRR5866113
SRR448056
SRR1535539
SRR3129941
And the following datasets did not get any mapping (“salmon was only able to assign 0 fragments to transcripts in the index”):
SRR764657
SRR067901
SRR2913241
SRR1182629
These are all human RNA-seq datasets sequenced on Illumina platforms. I used the GRCh38 transcriptome (Homo_sapiens.GRCh38.cdna.all.fa) from Ensembl (without using --decoys in the index).

I was wondering what might be the possible reasons for these datasets to have minimal or zero mapping rates?
(3) I plan to use the salmon quantification results of this set of SRA datasets to obtain gene expression levels (via tximport). Since the resulting mapping rates fall in all ranges, with a considerable number of datasets having quite low mapping rates, I was wondering approximately above what level of mapping rate a dataset could be considered "useful" for estimating gene expression (based on your experience)? In other words, what mapping-rate threshold do you think could be used to filter out "not useful" datasets? I was considering a 40% mapping rate as a possible threshold.
Thank you very much for your help!