Salmon read number discrepancy across identical runs #613
Comments
Hi @rmurray2,

Thank you for the report. First, I just want to mention that I don't believe v0.99.0 is an officially released version number. That is, there was a v0.14.x and a v0.15.0 (released in source only), and then the versions moved to 1.0.0 and beyond. However, this behavior certainly isn't related to that.

There are 2 things going on that can lead to this effect. The first one, which is relatively easy to test, is that there may be small changes in when the inferred library type starts to be enforced (if it is not […]

The second and more fundamental thing going on is that the observed behavior is intended. Even with a single thread of execution provided to salmon for mapping and quantification, there is a separate background thread that simply consumes reads from file and puts them in memory for quantification. While e.g. pairing information between files is guaranteed to be preserved, exact read order is not. This can lead to differences in the order in which reads are processed and, as a result, differences in the initialization conditions of the optimization. The ultimate result is that for transcripts that have large inferential uncertainty, different numbers of reads can be assigned between runs.

We have thought a lot about this behavior, what it means, and how the […]

One can make an argument for trying to provide a way to enforce removal of this variation (which, granted, would be a challenge). However, we decided against even attempting this because it doesn't properly address any issue with respect to an actual biological analysis. That is, even if you could fix, precisely, the update order and initialization conditions for a specific sample to eliminate any variation between runs, almost all experiments consist of multiple samples. In other samples, the same transcript fractions could give rise to a slightly different set of observed fragments that induce exactly the same type of variation under uncertainty; and since that uncertainty is baked into the sample, it cannot and should not be removed. Having exact replication of a sample at a numerical threshold below the inferential uncertainty for a transcript conveys false confidence in the precision of the estimate. This is why, for transcript-level analysis, we highly recommend having salmon produce posterior Gibbs samples (with the […]
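To make the point about inferential uncertainty concrete, here is a minimal Python sketch of one way to ask whether a run-to-run difference for a single transcript is small relative to the spread of its posterior samples. Everything in it is an assumption for illustration: the function name `within_posterior_spread`, the cutoff `k`, and the sample values are made up, and this is not how salmon itself reports or evaluates uncertainty.

```python
import statistics


def within_posterior_spread(run_a, run_b, posterior_samples, k=2.0):
    """Return True if two point estimates differ by no more than
    k standard deviations of the posterior samples.

    k is an arbitrary illustrative cutoff, not a salmon parameter.
    """
    spread = statistics.stdev(posterior_samples)
    return abs(run_a - run_b) <= k * spread


# Hypothetical posterior read counts for one ambiguous transcript
# (made-up numbers, not real salmon output).
samples = [0.0, 12.0, 0.0, 41.0, 3.0, 77.0, 0.0, 29.0, 55.0, 8.0]

# Two hypothetical point estimates from repeated runs: 0 reads vs 30 reads.
consistent = within_posterior_spread(0.0, 30.0, samples)
print("difference within posterior spread:", consistent)
```

If the check passes, the two runs disagree by less than the estimate's own uncertainty, which is the situation described above: the difference is not something exact replication could meaningfully remove.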
We noticed recently that running quantification multiple times (using the exact same settings) on the same file with salmon v0.99.0 resulted in some transcripts having different read counts (NumReads column) across runs.
For the transcripts that differ across runs, one value is zero and the other is non-zero (I recall seeing ~30 and ~75). We only looked at one biological replicate, but didn't see any transcript for which multiple runs produced more than two distinct values (it's either zero or one particular non-zero number, never a third value). Fewer than 10 transcripts in total were affected.
We tried specifying that the quantification be done on one CPU core, thinking that perhaps the discrepancy was coming from multithreading somehow, but we observed the same phenomenon.
salmon quant -i salmon_index_noThread_2 -l A -r input_file.fq.gz -g Mus_musculus.GRCm38.100.gtf --validateMappings -o out/fq_quant
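For anyone who wants to reproduce the comparison, here is a small Python sketch that lists transcripts whose NumReads differ between two runs. It assumes the standard tab-separated `quant.sf` layout (Name, Length, EffectiveLength, TPM, NumReads); the helper names and the tiny in-memory tables are made up for illustration and stand in for real output files.

```python
import csv
import io


def read_num_reads(handle):
    """Map transcript name -> NumReads from a quant.sf-style TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


def differing_transcripts(run_a, run_b, tol=1e-6):
    """Return {name: (a, b)} for transcripts whose NumReads differ by more than tol."""
    diffs = {}
    for name in sorted(set(run_a) | set(run_b)):
        a = run_a.get(name, 0.0)
        b = run_b.get(name, 0.0)
        if abs(a - b) > tol:
            diffs[name] = (a, b)
    return diffs


# Made-up tables standing in for the quant.sf files of two runs.
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
run1 = io.StringIO(HEADER + "tx1\t1000\t850.0\t10.0\t120.0\n"
                          + "tx2\t900\t750.0\t0.0\t0.0\n")
run2 = io.StringIO(HEADER + "tx1\t1000\t850.0\t10.0\t120.0\n"
                          + "tx2\t900\t750.0\t2.5\t30.0\n")

diffs = differing_transcripts(read_num_reads(run1), read_num_reads(run2))
for name, (a, b) in diffs.items():
    print(name, a, b)
```

With real data, the `io.StringIO` objects would be replaced by opened files, e.g. `open("out/fq_quant/quant.sf")` for the run above and the corresponding file from a second output directory (whatever path was used for the repeat run).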