Is it appropriate to use RNAseq data to annotate flow cytometry data via SingleR? #246

Open
denvercal1234GitHub opened this issue Sep 6, 2023 · 4 comments

denvercal1234GitHub commented Sep 6, 2023

Hi, @LTLA,

Thanks so much for this great package.

I performed clustering of my flow cytometry data and have the object as sce.

Would you mind giving me some insights on the appropriateness of using RNAseq as refs to annotate clusters of flow cytometry data?

Briefly, I compensated and biexponentially transformed my flow data in FlowJo, then exported the data as channel values so that I would not need to transform the data in R for clustering. Once I have the clusters of my data as an sce object, I apply SingleR:

Thank you again for your help.

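library(SingleR)               # SingleR()
library(SummarizedExperiment)  # assays()
library(dplyr)                 # %>% and as_tibble()

# Note: this overwrites the sce object with a tibble of the SingleR results.
F37_sce_backboneClustering <- assays(F37_sce_backboneClustering)$exprs %>%
    Matrix::Matrix(sparse = FALSE) %>%   # dense matrix of channel values
    SingleR::SingleR(
        ref = list(DICE = DICE_ref, Monaco = Monaco_ref),
        labels = list(DICE_ref$label.fine, Monaco_ref$label.fine),
        de.method = "wilcox", de.n = 50
    ) %>%
    as.data.frame() %>%
    as_tibble(rownames = "cell")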
Collaborator

dtm2451 commented Sep 6, 2023

Hi there,

This is an interesting question.

Aside from the facts that 1) you are using an RNA reference for protein data, and we know these don't always correlate perfectly, and 2) your flow data likely has only a handful of markers compared to the thousands in a sequencing dataset, I think 3) it's also possible that flow data might break a primary assumption made in the SingleR algorithm. Namely, that a cell with higher expression of 'markerA' than of 'markerB' will also have a higher measured signal for 'markerA' relative to 'markerB'.

We can assume this to be true in (properly normalized) scRNAseq or bulk RNAseq data in that we expect a more highly expressed gene to have more sequencing read counts than a lowly expressed gene, within a given cell or tissue sample.

But flow cytometer tuning prioritizes signal separation within each marker individually while caring little (except in the case of heavy compensation issues) for relative leveling between markers. Thus, you might end up with very different value ranges for your different markers, and the assumption that higher expression means higher measurement relative to a marker with lower expression may break down. (Said another way, the same measured value might translate to high expression of markerA but only medium expression of markerB.) If so, the Spearman correlation metric at the heart of SingleR's scoring may fail to score test<->ref matches accurately.
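
To make the concern concrete, here is a minimal base-R sketch with entirely made-up numbers. The marker abundances and the per-channel `gain` values are assumptions for illustration only, not real cytometer settings. It shows how channel-specific scaling can reorder markers within a cell and so change its Spearman correlation with a reference profile:

```r
# Hypothetical "true" abundance of four markers in one cell,
# and a matching reference profile (all values are made up).
true_cell <- c(CD3 = 8, CD4 = 6, CD8 = 1, CD19 = 2)
reference <- c(CD3 = 9, CD4 = 7, CD8 = 2, CD19 = 3)

# Made-up per-channel gains standing in for detector/fluorochrome differences.
gain <- c(CD3 = 0.1, CD4 = 1, CD8 = 5, CD19 = 1)
measured <- true_cell * gain

# Spearman correlation only sees the within-cell ordering of the markers.
rho_true     <- cor(true_cell, reference, method = "spearman")
rho_measured <- cor(measured,  reference, method = "spearman")

# Compare how the markers are ordered before and after scaling.
rank(true_cell)
rank(measured)
```

In this toy case the true profile has the same marker ordering as the reference, while the rescaled measurements do not, so the correlation drops even though the underlying cell is unchanged.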

Of course, this is just theoretical. I've never actually looked at how values scale between markers in any of the flow data I analyzed in the past, and am just making some hypothetical extensions from how I remember compensating and adjusting voltages before running my samples. So I am curious about how well you think SingleR performed for your flow data after you run it!

Owner

LTLA commented Sep 11, 2023

I too would be curious. In addition to the concerns raised by Dan, there is also the issue of the number of features involved. Flow cytometry uses far fewer features than sequencing, even when highly multiplexed (10-20 nowadays, maybe?), and each cell type can probably be expected to be positive for one or two markers, with the rest being background noise. This doesn't give a lot for the Spearman correlation to work with, especially as it's not allowed to consider the magnitude of the signal in the positive markers; a single strongly upregulated marker won't translate to a big effect in SingleR's scoring.
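
The point about magnitude can be illustrated with a small base-R sketch using made-up values for a hypothetical five-marker panel. Two cells that differ only in how strongly the single positive marker is expressed get the same Spearman correlation with the reference, whereas a magnitude-aware measure such as Pearson correlation tells them apart:

```r
# Made-up reference profile: one clearly positive marker (CD19), rest background.
reference   <- c(CD3 = 0.2, CD4 = 0.1, CD8 = 0.3, CD19 = 9.0, CD56 = 0.4)

# Two hypothetical cells that differ only in the strength of the CD19 signal.
weak_cell   <- c(CD3 = 0.3, CD4 = 0.2, CD8 = 0.4, CD19 = 0.5, CD56 = 0.1)
strong_cell <- c(CD3 = 0.3, CD4 = 0.2, CD8 = 0.4, CD19 = 50,  CD56 = 0.1)

# The ranks are identical in both cells, so Spearman cannot distinguish them.
rho_weak   <- cor(weak_cell,   reference, method = "spearman")
rho_strong <- cor(strong_cell, reference, method = "spearman")

# Pearson correlation does respond to the size of the CD19 signal.
r_weak   <- cor(weak_cell,   reference, method = "pearson")
r_strong <- cor(strong_cell, reference, method = "pearson")
```

With only a handful of markers, the ordering among the background channels, which is mostly noise, contributes as much to the rank correlation as the one informative marker.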

@denvercal1234GitHub
Author

It would be really good to adapt SingleR to address at least some of these caveats for flow data, either to predict cell types of flow data using other flow data or of flow data using RNAseq. @LTLA and @dtm2451, do you by chance know of any packages that do either of these tasks?

Owner

LTLA commented Sep 11, 2023

There might be something in the flow* set of Bioconductor packages that would try to do this.

If not, I would suggest just doing something very simple to begin with, e.g., nearest neighbor classification. Use BiocNeighbors to build an index with the average reference profile for each cell type, and then just search for the nearest neighbor for each cell in the test dataset. Some tricks may need to be applied, e.g., to use correlation-based distances and to account for differences in the number of reference profiles per cell type.
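
As a rough sketch of that idea, the base-R code below averages the reference profiles per label (so a cell type with many reference profiles does not dominate) and assigns each test cell to the centroid with the highest rank correlation. The function names `rank_scale` and `classify_nearest` and all data values are hypothetical, and the brute-force search here stands in for an index-based search such as BiocNeighbors' `queryKNN()`, which is not used in this sketch:

```r
# Hypothetical helper: rank, center, and scale each column to unit length.
# After this, the dot product of two columns is their Spearman correlation,
# and squared Euclidean distance equals 2 * (1 - correlation).
rank_scale <- function(mat) {
  r <- apply(mat, 2, rank)
  r <- sweep(r, 2, colMeans(r))
  sweep(r, 2, sqrt(colSums(r^2)), "/")
}

# Hypothetical classifier: nearest reference centroid by correlation distance.
# Assumes no profile is constant across markers (that would give a zero norm).
classify_nearest <- function(test, ref, ref_labels) {
  labels <- unique(as.character(ref_labels))
  # One average profile per label, giving a markers x labels matrix.
  centroids <- sapply(labels, function(l) {
    rowMeans(ref[, ref_labels == l, drop = FALSE])
  })
  corr <- crossprod(rank_scale(test), rank_scale(centroids))  # cells x labels
  # Highest correlation is the nearest neighbor in Euclidean distance.
  labels[max.col(corr, ties.method = "first")]
}

# Toy data (made-up values); rows are markers, columns are profiles or cells.
markers <- c("CD3", "CD4", "CD8", "CD19", "CD56")
ref <- cbind(T4_a = c(9, 8, 1, 1, 2),
             T4_b = c(8, 9, 2, 1, 1),
             B_a  = c(1, 2, 1, 9, 2))
rownames(ref) <- markers
ref_labels <- c("CD4 T", "CD4 T", "B")

test <- cbind(cellA = c(7, 6, 1, 2, 1),
              cellB = c(2, 1, 1, 8, 3))
rownames(test) <- markers

pred <- classify_nearest(test, ref, ref_labels)
```

Swapping `rank()` for the identity in `rank_scale` would give a Pearson-based distance instead, which is one way to let signal magnitude contribute. Whether either choice works well on real flow data would need to be checked empirically.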

Modifying SingleR to do this is theoretically straightforward but practically difficult as there are many places in SingleR's optimized C++ code where integer ranks are expected, under the assumption that Spearman's correlation is the way to go. I wouldn't undertake this modification without some expectation that it would work.
