Is it appropriate to use RNAseq data to annotate flow cytometry data via SingleR? #246

denvercal1234GitHub · 2023-09-06T19:24:41Z

Thanks so much for this great package.

I performed clustering of my flow cytometry data and have the object as sce.

Would you mind giving me some insights on the appropriateness of using RNAseq as refs to annotate clusters of flow cytometry data?

Briefly, I compensated and bi-exponentially transformed my flow data in FlowJo, then exported the data as channel values so that I do not need to transform the data in R for clustering. Once I have the clusters of my data as an sce object, I apply SingleR:

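```r
library(magrittr)              # provides %>%
library(SummarizedExperiment)  # provides assays()

# Annotate each cell against two RNA-seq references using their fine labels.
# Note: this assignment overwrites the sce object with the results tibble.
F37_sce_backboneClustering <- assays(F37_sce_backboneClustering)$exprs %>%
    Matrix::Matrix(sparse = FALSE) %>%
    SingleR::SingleR(
        ref = list(DICE = DICE_ref, Monaco = Monaco_ref),
        labels = list(DICE_ref$label.fine, Monaco_ref$label.fine),
        de.method = "wilcox", de.n = 50
    ) %>%
    as.data.frame() %>%
    tibble::as_tibble(rownames = "cell")
```

Thank you again for your help.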

dtm2451 · 2023-09-06T20:45:21Z

Hi there,

This is an interesting question.

Aside from the facts that 1) you are using an RNA reference for protein data, and we know these don't always correlate perfectly, and 2) your flow data likely has only a handful of markers compared to the thousands in a sequencing dataset, I think 3) it's also possible that flow data might break a primary assumption made in the SingleR algorithm. Namely, that a cell with higher expression of a 'markerA' than a 'markerB' will also have a higher measured value for 'markerA' relative to 'markerB'.

We can assume this to be true in (properly normalized) scRNAseq or bulk RNAseq data in that we expect a more highly expressed gene to have more sequencing read counts than a lowly expressed gene, within a given cell or tissue sample.

But flow cytometer tuning prioritizes signal separation within each marker individually while caring little (except in the case of heavy compensation issues) for relative leveling between markers. Thus, you might end up with very different value ranges for your different markers, and the assumption that higher expression means a higher measurement relative to a marker with lower expression may break down. (Said another way, the same measured value might translate to high expression of markerA but only medium expression of markerB.) If so, the Spearman correlation metric at the heart of SingleR's scoring may fail to score test<->ref matches accurately.
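To make the concern concrete, here is a small base-R sketch with made-up numbers. The marker names, reference values, and per-channel gains are all hypothetical and not taken from any real panel or instrument. It only illustrates how independent per-marker scaling can reorder markers within a cell and so lower the Spearman correlation against a reference profile.

```r
# Hypothetical reference profile (e.g., log-expression from RNA-seq) for 5 markers.
ref <- c(CD3 = 8, CD4 = 6, CD8 = 1, CD19 = 0.5, CD56 = 2)

# A hypothetical cell whose true abundances follow the same ordering as the reference.
truth <- c(CD3 = 7.5, CD4 = 6.5, CD8 = 1.2, CD19 = 0.4, CD56 = 2.5)

# Assumed per-channel gains: each detector is tuned independently of the others.
gain <- c(CD3 = 0.2, CD4 = 1, CD8 = 3, CD19 = 1, CD56 = 2)
measured <- truth * gain

# Spearman correlation only sees the within-cell ordering of the markers.
rho_truth <- cor(truth, ref, method = "spearman")        # 1
rho_measured <- cor(measured, ref, method = "spearman")  # 0.4

print(c(truth = rho_truth, measured = rho_measured))
```

Within any single channel the gain is a monotone rescaling, so gating on that marker is unaffected; the drop comes entirely from comparing values across channels.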

Of course, this is just theoretical. I've never actually looked at how values scale between markers in any of the flow data I analyzed in the past, and am just making some hypothetical extensions from how I remember compensating and adjusting voltages before running my samples. So I am curious about how well you think SingleR performed for your flow data after you run it!

LTLA · 2023-09-11T09:17:32Z

I too would be curious. In addition to the concerns raised by Dan, there is also the issue of the number of genes involved. Flow cytometry uses fewer features, even when highly multiplexed (10-20 nowadays, maybe?) and each cell type can probably expect to be positive for one or two markers, with the rest being background noise. This doesn't give a lot for the Spearman correlation to work with, especially as it's not allowed to consider the magnitude of the signal in the positive markers; a single strongly upregulated marker won't translate to a big effect in SingleR's scoring.
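A toy base-R illustration of the magnitude point, using an invented 6-marker panel (all names and values below are hypothetical). Two cells differ only in how strongly the one positive marker is expressed, yet their Spearman correlations with the reference are the same because the ranks are unchanged. Pearson correlation is shown purely for contrast and is not a suggestion about what SingleR should use.

```r
# Hypothetical reference profile for one cell type: m1 is the only positive marker.
ref <- c(m1 = 9, m2 = 1.0, m3 = 0.8, m4 = 1.2, m5 = 0.9, m6 = 1.1)

# Two hypothetical cells with the same background noise.
# In 'weak', m1 is barely above background; in 'strong', it is far above.
weak <- c(m1 = 1.5, m2 = 1.2, m3 = 0.7, m4 = 0.9, m5 = 1.1, m6 = 1.0)
strong <- weak
strong["m1"] <- 100

# Rank-based scores cannot tell the two cells apart.
rho_weak <- cor(weak, ref, method = "spearman")
rho_strong <- cor(strong, ref, method = "spearman")

# A magnitude-aware score (Pearson) does distinguish them.
r_weak <- cor(weak, ref)
r_strong <- cor(strong, ref)

print(c(rho_weak = rho_weak, rho_strong = rho_strong,
        r_weak = r_weak, r_strong = r_strong))
```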

denvercal1234GitHub · 2023-09-11T09:57:56Z

It would be really good to adapt SingleR to address at least some of these caveats for flow data, either to predict cell types of flow data using other flow data or to predict them using RNAseq. @LTLA and @dtm2451, do you by chance know of packages that do either of these tasks?

LTLA · 2023-09-11T10:20:17Z

There might be something in the flow* set of Bioconductor packages that would try to do this.

If not, I would suggest just doing something very simple to begin with, e.g., nearest neighbor classification. Use BiocNeighbors to build an index with the average reference profile for each cell type, and then just search for the nearest neighbor for each cell in the test dataset. Some tricks may need to be applied, e.g., to use correlation-based distances and to account for differences in the number of reference profiles per cell type.
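A minimal sketch of the nearest-neighbor idea above, on simulated data. The cell type names, marker names, and values are all invented, and it assumes the test and reference markers have already been matched to the same columns. Each profile is centered and scaled to unit length so that Euclidean distance is a monotone function of Pearson correlation, which is one way to get a correlation-based distance. `BiocNeighbors::queryKNN()` is used when the package is installed, with a brute-force base-R fallback otherwise.

```r
set.seed(1)

# Hypothetical reference: one averaged profile per cell type (rows) over 8 markers.
ref_avg <- rbind(
    Tcell = c(5, 4, 0, 0, 1, 0, 0, 2),
    Bcell = c(0, 0, 5, 4, 0, 1, 0, 2),
    NK    = c(0, 1, 0, 0, 5, 4, 3, 2)
)
colnames(ref_avg) <- paste0("marker", 1:8)

# Simulated test cells: noisy copies of the reference profiles, 20 per type.
truth <- rep(rownames(ref_avg), each = 20)
noise <- matrix(rnorm(length(truth) * ncol(ref_avg), sd = 0.3),
                ncol = ncol(ref_avg))
test <- ref_avg[truth, ] + noise

# Center each row and scale it to unit length. For such rows, the squared
# Euclidean distance equals 2 * (1 - Pearson correlation). Rank-transforming
# each row beforehand would give the Spearman analogue.
standardize <- function(x) {
    x <- x - rowMeans(x)
    x / sqrt(rowSums(x^2))
}
ref_std <- standardize(ref_avg)
test_std <- standardize(test)

# Find the nearest reference profile for each test cell.
if (requireNamespace("BiocNeighbors", quietly = TRUE)) {
    nn <- BiocNeighbors::queryKNN(ref_std, query = test_std, k = 1)$index[, 1]
} else {
    # Brute-force fallback: highest correlation = smallest distance.
    nn <- max.col(test_std %*% t(ref_std), ties.method = "first")
}

predicted <- rownames(ref_avg)[nn]
accuracy <- mean(predicted == truth)
print(table(predicted, truth))
```

Using a single averaged profile per cell type sidesteps the issue of unequal numbers of reference profiles per type. With several profiles per type, some form of weighting or voting would be needed, and that is not shown here.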

Modifying SingleR to do this is theoretically straightforward but practically difficult as there are many places in SingleR's optimized C++ code where integer ranks are expected, under the assumption that Spearman's correlation is the way to go. I wouldn't undertake this modification without some expectation that it would work.
