Questions about salmon mapping rate #482

Closed · lauraht opened this issue Feb 6, 2020 · 5 comments

lauraht commented Feb 6, 2020

Hi Rob,

I have a few questions regarding salmon mapping rate and would appreciate your advice.

(1) Generally, is the mapping rate with selective alignment expected to be lower than with the previous quasi-mapping?

I am currently using salmon 1.1.0. I rebuilt the transcriptome index with v1.1.0 using the default options (I did not use the --decoys option). I have a few datasets that I had previously run through the older salmon version 0.9.1, so I can compare their mapping rates against the current version. For both versions, I ran salmon quant with the default options (using “-l A”). The following is the comparison of mapping rates:

| Dataset     | v0.9.1   | v1.1.0   |
|-------------|----------|----------|
| ERR1586786  | 59.1893% | 59.1338% |
| ERR2399747  | 59.5182% | 58.2916% |
| SRR9698990  | 67.3904% | 57.9677% |
| SRR10883703 | 40.4469% | 40.5271% |

Except for the last dataset (where the new version has a slightly higher mapping rate), the new version yields lower mapping rates than the old version. For SRR9698990, the new version's mapping rate is considerably lower.

Since v1.1.0 uses selective alignment by default while v0.9.1 uses quasi-mapping, I was wondering whether it is expected behavior that the mapping rate with selective alignment (without --decoys in the index) is generally lower than with quasi-mapping.
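
(For reference, the indexing and quantification commands were essentially the defaults, roughly as sketched below; file names, sample names, and the thread count are placeholders, and single-end datasets would use -r instead of -1/-2.)

    # Index built with salmon v1.1.0, default k-mer size (31), no decoys
    salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i salmon_index -p 8

    # Quantification with default options and automatic library-type detection
    salmon quant -i salmon_index -l A \
        -1 SAMPLE_1.fastq.gz -2 SAMPLE_2.fastq.gz \
        -p 8 -o quants/SAMPLE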

(2) I ran the new salmon version v1.1.0 on a set of SRA datasets (fastq). The resulting mapping rates span the whole range from 0.0001% to 99%, and some datasets did not get any mapping at all. For example, the following datasets got minimal mapping rates:

| Dataset    | Mapping rate |
|------------|--------------|
| SRR9007475 | 0.00208642%  |
| SRR5866113 | 0.0171606%   |
| SRR448056  | 0.0524797%   |
| SRR1535539 | 0.00304504%  |
| SRR3129941 | 0.000626134% |

And the following datasets did not get any mapping (“salmon was only able to assign 0 fragments to transcripts in the index”):
SRR764657
SRR067901
SRR2913241
SRR1182629

These are all human Illumina RNA-seq datasets. I used the GRCh38 transcriptome (Homo_sapiens.GRCh38.cdna.all.fa) from Ensembl (without --decoys in the index).

I was wondering what the possible reasons might be for those datasets to have minimal or zero mapping rates.

(3) I plan to use the salmon quantification results of this set of SRA datasets to obtain gene expression levels (via tximport). Since the mapping rates span the whole range, with a considerable number of datasets having quite low rates, I was wondering approximately what mapping rate a dataset needs before it can be considered “useful” for estimating gene expression levels (based on your experience). In other words, what mapping-rate threshold would you use to filter out “not useful” datasets? I was considering 40% as a possible threshold.

Thank you very much for your help!

rob-p commented Feb 6, 2020

Hi @lauraht,

Thanks for the detailed question (with data!). I will try to answer these in order.

(1) The mapping rate of selective alignment is generally expected to be similar to that of quasi-mapping, but there are some important exceptions. You can find some aggregate statistics in supplementary figure 1 of the pre-print that introduces selective alignment:

[image: aggregate mapping-rate statistics, Supplementary Figure 1 of the selective-alignment pre-print]

Here, "orcale" is a method that aligns the reads both to the transcriptome with Bowtie2 and the genome with STAR, and removes reads from the Bowtie2 bam that seem to be spuriously aligned to the transcriptome (they align to the genome outside of an annotated transcriptome with a better score than that assigned by Bowtie2 within the transcriptome). Here, you can see that in most cases most methods map a similar number of reads, but there are definitely samples where methods map more reads than the oracle, and sometimes quasi-mapping maps quite a few more. This is, to a large extent, because it doesn't validate those mappings and some of them may be spurious (i.e. the exact matches used to find the mapping in the given location would not support a high quality alignment at that location).

(2) It is certainly possible that some samples get very little to no mapping. However, there are a few points about how the data are processed that are worth being aware of before you write such samples off.

  • There is a change in default behavior between salmon < 0.13 and >= 0.13 regarding which mappings are considered "concordant" and therefore used for quantification by default. Specifically, starting with 0.14, "dovetail" alignments (as described in the Bowtie2 manual) are considered discordant. This is the same default behavior imposed by Bowtie2. If you look in the meta_info.json file for some of these samples (which is in the aux_info subdirectory of the quantification directory for a sample), you can see how many mappings are being discarded by virtue of being dovetail mappings. It is possible to allow such alignments (i.e. consider them concordant) by passing the --allowDovetail flag. It is not the case that such alignments are always "bad"; it's simply that one would not expect many fragments to align in such a way, and if they constitute the overwhelming majority of the mappings, one might be suspicious about the underlying data.

  • Selective alignment actually aligns the reads to the transcriptome. For this purpose, it performs end-to-end alignment. This means that if you suspect that the sample may contain adapters or very low-quality read ends, the reads should be trimmed prior to quantification. It is, therefore, worth checking how the mapping rate changes for some of these samples if the reads are trimmed first.

  • Selective alignment is more robust than quasi-mapping to the chosen value of -k, the minimum match length used when searching for alignments. I noticed that some of the samples contain relatively short reads, so you might see if the mapping rate changes if you adopt a smaller value of -k in the index (e.g. we use 23 in the pre-print).

  • You mention that this index doesn't contain any decoy sequence. This, of course, should not affect the mapping rate. However, I'd be quite curious to see, if you index the reference using the whole genome as a decoy (i.e. the SAF method from the pre-print), how many reads are discarded because they map better to a decoy sequence (this information can also be obtained from meta_info.json). This will tell you whether something strange might be going on in those samples, such as extensive rRNA contamination, that would lead to the observed mapping rates. You could also get this information using a tool like STAR, by asking it to produce both genomic and transcriptomic alignments for the reads (by passing --quantMode TranscriptomeSAM) and comparing the mapping rate in the transcriptomic BAM to that in the genomic BAM, to see where the reads might be deriving from. (A rough sketch of these checks appears just below this list.)
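
A rough sketch of the checks above; file names, sample names, and thread counts are placeholders, jq is assumed to be available, and the decoy-aware indexing follows the standard salmon recipe:

    # 1) How many fragments were discarded as dovetail or decoy mappings?
    jq '.num_dovetail_fragments, .num_decoy_fragments' \
        quants/SAMPLE/aux_info/meta_info.json

    # 2) Re-quantify while treating dovetail alignments as concordant
    salmon quant -i salmon_index -l A -1 SAMPLE_1.fastq.gz -2 SAMPLE_2.fastq.gz \
        --allowDovetail -p 8 -o quants_dovetail/SAMPLE

    # 3) (Optional) adapter/quality trim before re-quantifying, e.g. with fastp
    fastp -i SAMPLE.fastq.gz -o SAMPLE_trimmed.fastq.gz -q 10 -w 8

    # 4) Build a decoy-aware index (whole genome as decoy) with a smaller k
    grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//g' > decoys.txt
    cat Homo_sapiens.GRCh38.cdna.all.fa GRCh38.primary_assembly.genome.fa > gentrome.fa
    salmon index -t gentrome.fa -d decoys.txt -i salmon_index_decoy_k23 -k 23 -p 8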

(3) This is a great question, and I don't have a "standard" practice for such things. Generally, I might consider setting a threshold based on the number of reads that could be mapped rather than the relative mapping rate itself. For example, if you have a really large sample, even a lower mapping rate may be OK if the total number of mapped reads is respectable. However, "respectable" is a weasel word here, and I don't have any specific suggestion as to what number would be ideal. I also don't think that doing it based on mapping percentage is a bad idea either, and in that case, requiring at least a 30-40% mapping rate could certainly be argued to be a reasonable QC metric.
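
For instance, a filter on the absolute number of mapped fragments could look roughly like the following sketch (the threshold value is purely hypothetical and jq is assumed to be available; it reads num_mapped from each sample's meta_info.json):

    THRESHOLD=1000000   # hypothetical cutoff; tune to your data and goals
    for dir in quants/*/; do
        meta="$dir/aux_info/meta_info.json"
        [ -f "$meta" ] || continue
        mapped=$(jq '.num_mapped' "$meta")
        if [ "$mapped" -ge "$THRESHOLD" ]; then
            echo "KEEP $dir ($mapped mapped fragments)"
        else
            echo "DROP $dir ($mapped mapped fragments)"
        fi
    done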

Please let me know if you dig into any of the above (specifically the points raised in (2)) and find anything interesting or would like to discuss further.

Best,
Rob

lauraht commented Feb 8, 2020

Hi Rob,

Thank you so much for your very thorough and comprehensive explanations (with statistics) and advice!
I really appreciate it!

I have looked into the meta_info.json files of these datasets. It looks like none of them has dovetail fragments. The following is the relevant information from meta_info.json for each dataset:

For the 5 datasets with minimal mapping rates:

SRR9007475:

    "num_processed": 64991581,
    "num_mapped": 1356,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 1272836,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 3823916,

SRR5866113:

    "num_processed": 8065000,
    "num_mapped": 1384,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 5154685,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 165001284,

SRR448056:

    "num_processed": 13530942,
    "num_mapped": 7101,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 54,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 14420,

SRR1535539:

    "num_processed": 8045873,
    "num_mapped": 245,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 111743,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 593521,

SRR3129941:

    "num_processed": 57495682,
    "num_mapped": 360,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 358659,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 6631423,

For the 4 datasets with 0 mapping rate (they do not have other info besides "num_processed" and "num_mapped"):

SRR764657:

    "num_processed": 28342632,
    "num_mapped": 0,

SRR067901:

    "num_processed": 3571366,
    "num_mapped": 0,

SRR2913241:

    "num_processed": 40070874,
    "num_mapped": 8,

SRR1182629:

    "num_processed": 15381872,
    "num_mapped": 0,

I will dig into the other possibilities that you suggested when I get time, and post the update.

Thank you very much for all your help!

rob-p commented Feb 9, 2020

Hi @lauraht,

So I decided to explore just one of these to see if I could figure out what might be going on. The following is with respect to SRR9007475. First, even though I processed the data with the latest version of the develop branch (which will become 1.2.0), I got essentially identical results to what you reported. Simply aligning the data against an index built on the human Gencode v26 transcriptome (with no decoys) gives me a mapping rate of 0.00378202832148367%.

The first thing I did was to quality- and adapter-trim the data (using fastp -i SRR9007475.fastq.gz -o SRR9007475_trimmed.fastq.gz -q 10 -w 8) and ... whoa. This is the fastp HTML report: fastp.html.zip. The first astounding statistic: the mean read length before trimming is 51bp (these are relatively short single-end reads). The mean read length after trimming is 21bp! So the average read length is, in fact, less than the k-mer length used for indexing (the default is k=31). On the trimmed data, the mapping rate goes up to 2.3545475882931305%, which is still very low, but now there's somewhat of an explanation: the average read is shorter than a single k-mer.

So, the next thing I tried was indexing with a smaller k; a really small one in this case, k=15. Then I re-ran on the trimmed reads (the fact that trimming took us from 51bp down to 21bp suggests that the reads had a lot of low-quality bases, adapter contamination, or both). Under this setting I still get a very low mapping rate, but it is much higher: 16.766993524863488%.
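
Concretely, the re-index and re-quantification steps were along these lines (a sketch; the reference file, output paths, and thread count are placeholders):

    # Re-index with a much smaller k to accommodate ~21bp trimmed reads
    salmon index -t gencode.v26.transcripts.fa -i gencode_v26_k15 -k 15 -p 8

    # Re-quantify the trimmed single-end reads
    salmon quant -i gencode_v26_k15 -l A -r SRR9007475_trimmed.fastq.gz \
        -p 8 -o quants_trimmed_k15/SRR9007475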

The final thing I tried was seeing how the mapping rate changed as I altered --minScoreFraction, the salmon parameter that determines the alignment score a read must achieve in order to be considered validly mapped. The default is 0.65, meaning a read cannot have a score < 0.65 * the maximum achievable score for a read of its length. In the case of a 21bp read, the best possible score is 42, so a read must obtain a score >= 27 in order to be mapped. This is already a pretty poor mapping, but I reduced the threshold even further, to 0.3 (so any read with a score > 12 would pass). This led to a mapping rate of ~46%. However, at this point, I'm not sure I would be confident in such mappings. For example, the situation here would be a 21bp read with multiple mismatches and, much of the time, one or more indels.
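
The score-threshold experiment amounts to the same quant run with --minScoreFraction lowered from its default of 0.65 (a sketch, with placeholder paths):

    salmon quant -i gencode_v26_k15 -l A -r SRR9007475_trimmed.fastq.gz \
        --minScoreFraction 0.3 -p 8 -o quants_msf03/SRR9007475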

So, my conclusion, at least for this sample, is that the main issue is data quality. Trimming the reads and indexing with a smaller k can bring the mapping rate to ~16%, and then allowing really bad alignments can take it up to ~46% (and even higher: when I set --minScoreFraction to 0.1, I get a mapping rate of 57%). But the level of confidence one can derive from poorly-aligned 21bp reads is (and probably should be) quite low. I can't say, of course, that this is the situation with the other samples, as I've not looked at them. However, for this sample, and likely for some or all of the others, there is likely a data-quality issue. So perhaps the first order of business should be deciding what level of data quality is worth trying to extract a signal from, and where the line is for determining whether a sample is worth the trouble.

lauraht commented Feb 9, 2020

Hi Rob,

You actually did a very thorough investigation on this dataset for me! I greatly appreciate it!

From what you found out, it looks like salmon’s mapping rate (with default parameters) could be a quite reasonable indicator of the dataset’s quality. This is very helpful information.

Thank you so much again for your time and help!

rob-p commented Feb 9, 2020

Hi @lauraht,

From what you found out, it looks like salmon’s mapping rate (with default parameters) could be a quite reasonable indicator of the dataset’s quality.

Yes, I think so too. The one small modification I might put here is "salmon’s mapping rate (with default parameters, and after trimming)". Otherwise, I think this makes sense, as salmon's alignment procedure will try quite hard to find a suitable alignment for the read if one exists.

If there's no objection, I'll close this issue as I think the main question has been resolved. However, if you have a related question, or run into something else, please feel free to open up a new issue.
