Thoughts on a more scalable algorithm for multiple references #94
As described in #99, I have written a new function that pushes the combining back to the last step. The new combining step is run at the very end and recomputes the scores in a manner that is (more) comparable across references. For each cell, it identifies the assigned label from each reference and collects all markers for those labels. It then recomputes the score for each assigned label in each reference and chooses the reference with the highest recomputed score. While there is still a score calculation across a union of markers, this should be faster and more specific, as the union is only taken across the assigned labels for each cell, not across all labels in all references. In effect, we shift the compute time from the initial per-reference scoring to the final combining step.
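A minimal sketch of this per-cell recompute step, in Python for illustration only. The actual implementation is in the SingleR R package; the function name `combine_by_recompute`, the dict layouts, and the use of plain Pearson correlation (SingleR actually uses a rank-based Spearman correlation) are all hypothetical simplifications here:

```python
def pearson(x, y):
    """Plain Pearson correlation; stand-in for SingleR's rank-based score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def combine_by_recompute(cell_expr, assigned, markers, profiles):
    """Pick the best reference for a single cell.

    cell_expr: {gene: expression} for the test cell.
    assigned:  {ref_name: label assigned within that reference}.
    markers:   {(ref_name, label): set of marker genes for that label}.
    profiles:  {(ref_name, label): {gene: mean expression}}.
    """
    # Union of markers across the *assigned* labels only,
    # not across all labels in all references.
    genes = set()
    for ref, label in assigned.items():
        genes |= markers[(ref, label)]
    genes = sorted(genes)

    # Recompute a comparable score for each reference's assigned label
    # and keep the reference with the highest recomputed score.
    best = None
    for ref, label in assigned.items():
        prof = profiles[(ref, label)]
        score = pearson([cell_expr.get(g, 0.0) for g in genes],
                        [prof.get(g, 0.0) for g in genes])
        if best is None or score > best[1]:
            best = (ref, score)
    return best
```

Because `genes` only covers the assigned labels, the expensive all-markers union never materializes; the recomputed scores are still on a common gene set, so they can be compared across references.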
In testing with diverse datasets, I find the new and old methods to perform comparably in terms of score quality:
(*** = available memory may not have been equal across runs, and n = 1 for each.)

1 & 4: For quality data where the provided refs match the test cells and cell-type borders are non-fuzzy, both methods perform great:

2: For low genes-per-cell data where the provided refs match the test cells but borders between cell types are fuzzy due to progressive differentiation:

3: For quality data where the provided refs did not match the test cells:
Great insights!
My No. 3 and 4 above can both be used. They both have published annotations =). I just need to nudge my labmate to help me get them organized properly and either into scRNAseq or directly into SingleR.

3 = Grubman et al., Nature Neuroscience 2019, https://www.nature.com/articles/s41593-019-0539-4
A reproducible example to test performance. Note that this is without the changes Aaron made last night. This uses an everything-and-the-kitchen-sink approach, with all the references that contain basically any immune/haematopoietic cells.

Setup
Runs

Fine labels - single method
Fine labels - single method - no fine tuning
Main labels - single method
Main labels - single method - no fine tuning
Overall, the recompute method does speed things up, at least for this dataset. Another observation is that I think it generally provides better granularity compared to the common method.
The recompute results:
This is somewhat muddied here due to the overlapping/similar labels between the references, but in general, there are fewer cells classified with low granularity (e.g. CD4+ T cell) using this method. I've also observed this in other comparisons. Will post images and more info after finishing harmonizing labels between references, once I get a chance.
Excellent insights from @dtm2451 and @j-andrews7. It seems that we get a modest speed boost in many cases and accuracy that is, at the very least, no worse than before.

I'll add one more theoretical comment for future reference. As I may have previously mentioned, the inter-reference score calculations are done using the markers for the top-scoring label in each reference. That is, we take the markers that distinguish the top-scoring label from other labels in the same reference, and then we take the union of those markers across all references. This approach improves speed by avoiding the use of a union of all markers across all references.

A potential pitfall of this approach is that, in theory, we may not include "negative markers", i.e., genes that are not expressed in the true label for a given cell. Without these negatives, we are computing correlations across only the set of positive markers, equivalent to looking for within-cell-type structure rather than across-cell-type structure. If you want to visualize this, have a look at common depictions of Simpson's paradox, where removing a group of points can change the correlation. In practice, I don't think this is a problem, mostly because we would only fail to include negative markers if all of the top-scoring labels are correct! Well, perhaps it's not so clear-cut, but the most egregious failures of the inter-reference score calculation should only occur when the top-scoring labels are all close enough to the correct cell type that it doesn't really matter all that much.

Anyway, I think I'm pretty happy with this and will merge #99. I have found this division of labor among ourselves to be quite effective and I hope we will be able to continue doing this moving forward; I no longer have the knowledge (and certainly not the time) to wade through datasets to test things out to this depth. See also ab61e15 if you want to impress people in a bar.
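A toy numeric illustration of the negative-marker point (all numbers hypothetical; `pearson` is a plain Pearson correlation standing in for the rank-based score): restricted to positive markers only, a correct and an incorrect profile can be indistinguishable, whereas including the negative markers restores the across-cell-type structure that separates them.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# A cell of the true type: high on its positive markers, ~zero on negatives.
cell_pos, cell_neg = [10.0, 8.0, 6.0], [0.0, 1.0]
true_pos, true_neg = [9.0, 8.0, 7.0], [0.0, 0.0]     # correct label's profile
wrong_pos, wrong_neg = [3.0, 2.0, 1.0], [9.0, 10.0]  # wrong label's profile

# Positive markers only: both profiles rank the markers the same way, so both
# correlations are ~1 -- we only see within-cell-type structure, and the
# correct and wrong labels are indistinguishable.
r_true_pos = pearson(cell_pos, true_pos)    # ~1.0
r_wrong_pos = pearson(cell_pos, wrong_pos)  # also ~1.0!

# Adding the negative markers flips the wrong label's correlation negative
# while the correct label's stays high.
r_true_all = pearson(cell_pos + cell_neg, true_pos + true_neg)
r_wrong_all = pearson(cell_pos + cell_neg, wrong_pos + wrong_neg)
```

This is the Simpson's-paradox-style effect described above: dropping a group of points (the negative markers) changes which correlation wins.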
Listing thoughts here so that I don't forget it. Maybe someone would be willing to help try to implement this, it shouldn't take (much) C++ code.
Context:
Our current approach to dealing with multiple references is to do the assignment within each reference; compare the top non-fine-tuned scores across references; and pick the (possibly fine-tuned) label from the reference with the highest non-fine-tuned score. Most of the work is done within each reference, which is logistically convenient and side-steps problems with batch effects between references. The fact that we compare non-fine-tuned scores is rather unavoidable, because the scores are not comparable after fine-tuning, but that's probably okay for the most part; the goal is just to get us close enough to avoid obviously wrong matches to completely different cell types.
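The current per-cell combining rule can be sketched as follows (illustrative Python, not the package's actual code; the tuple layout and reference names are assumptions):

```python
def combine_current(per_ref):
    """per_ref: {ref_name: (fine_tuned_label, top_raw_score)}, where
    top_raw_score is the best score *before* fine-tuning, since scores
    are no longer comparable across references after fine-tuning.

    Returns the (possibly fine-tuned) label from the reference whose
    non-fine-tuned top score is highest."""
    best_ref = max(per_ref, key=lambda r: per_ref[r][1])
    return per_ref[best_ref][0]
```

For example, `combine_current({"HPCA": ("T cells", 0.55), "Blueprint": ("CD4+ Tcm", 0.61)})` would return `"CD4+ Tcm"`: the non-fine-tuned score picks the reference, but the fine-tuned label is what gets reported.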
However, a practical difficulty with this approach is that we must always take the union of genes from all references in the initial search for each reference, otherwise the scores are not directly comparable. This does not scale well as we combine references that contain more diverse cell types; it is possible that we will end up including a good chunk of the transcriptome in our initial score calculation. This slows everything down, which is bad enough, but more problematic is the increased risk of irrelevant noise overwhelming the signal. The latter can potentially cause the correct cell type to be discarded before we can even perform fine-tuning that might salvage the situation.
Proposed solution:
The lightest touch is to adjust how we combine results from multiple references. Namely, we perform assignment within each reference using SingleR() as before, without expanding to the union of all markers. We take the final label (possibly after fine-tuning!) for each cell from each reference; I will refer to these as the set of "top labels". We then do a second round of fine-tuning across those top labels, using the same algorithm as before but on the relevant set of markers.

The question here is: what is the relevant set of markers? If we were comparing, e.g., top labels A1 (A is the reference, 1 is the label) and B2, I would say that this is the union of all genes involved in A1 vs A2, A3, ..., Am and B2 vs B1, B3, ..., Bn. This allows us to get a reasonably informative subset of genes without being inundated by batch effects. (There are some failure points where, say, each reference contains only a single cell type, so you never get markers that define the differences between cell types... but that seems pretty pathological.)
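A sketch of that marker selection (illustrative Python; the nested pairwise-marker dict layout is an assumption, loosely mirroring per-reference pairwise marker lists):

```python
def relevant_markers(top_labels, pairwise_markers):
    """Union of markers distinguishing each cell's top label from every
    other label *within its own reference*.

    top_labels:       {ref_name: top label for this cell in that reference}.
    pairwise_markers: {ref_name: {(label_x, label_y): set of genes
                      upregulated in label_x relative to label_y}}.
    """
    genes = set()
    for ref, top in top_labels.items():
        # Collect markers for (top label vs each other label) in this ref.
        for (a, b), mk in pairwise_markers[ref].items():
            if a == top and b != top:
                genes |= mk
    return genes
```

So comparing top labels A1 and B2 collects the A1-vs-A2, A1-vs-A3, ... sets plus the B2-vs-B1, B2-vs-B3, ... sets; this union is far smaller than the union over all labels in all references, and every pairwise set was computed within a single reference, so batch effects never enter the marker choice.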
Despite the fact that we're still taking a union, the number of genes involved is still much lower than the union of all markers across all labels across all references, which should mitigate problems with noise and speed. This approach is also appealing as it builds off exactly the same results from the assignments to individual references, reducing the scope for discrepancies in the combined results. For starters, we would now use the top label after within-reference fine-tuning, getting around the uncomfortable situation that we find ourselves in with the current algorithm.