Choice of normalization strategy with multiple batches #12

Closed
gdagstn opened this issue Dec 10, 2019 · 4 comments

gdagstn commented Dec 10, 2019

Hi,

First of all, I just wanted to say it's a pleasure to read your documentation, answers, and even the code you write, as they are always clear and full of opportunities for learning.

Now to the matter at hand: I have a fairly large single-nucleus RNA-seq dataset from 8 individuals (8 separate 10X runs). These were prepped/sequenced in 2 different batches, but I am not interested in the inter-individual variability. In other words, I think it is safe to say that I can remove the batch effect (batches 1 and 2) by removing the individual effect (individuals 1:8).

Having read up on scran and batchelor and looked at your workflow on the Bioconductor OSCA website, however, I am still undecided about the best strategy for normalizing my data, so I wanted to ask for advice.

I have already carried out all the necessary count-level QC (filtering out cells with outlying low/high library sizes or high % mitochondrial content, removing empty droplets, removing unidentified genes, etc.). This was done on a merged count matrix, so there is no subsetting problem.
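
For reference, the QC went roughly along these lines (just a sketch with made-up object names; `sce.raw` is the merged, unfiltered SingleCellExperiment with an `individual` column in its colData, and the mitochondrial gene pattern is only indicative):

```r
library(DropletUtils)
library(scuttle)

# Call cells vs empty droplets on the raw, unfiltered counts.
e.out <- emptyDrops(counts(sce.raw))
sce <- sce.raw[, which(e.out$FDR <= 0.001)]

# Per-cell QC metrics, treating mitochondrial genes as a subset.
is.mito <- grep("^MT-", rownames(sce))
qc <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

# Flag outliers on library size and mito %, blocking on the individual.
discard <- isOutlier(qc$sum, log = TRUE, type = "both", batch = sce$individual) |
    isOutlier(qc$subsets_Mito_percent, type = "higher", batch = sce$individual)
sce <- sce[, !discard]
```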

Now for normalization, I reckon I have the following options:

  • scran pooling and deconvolution normalization on all individuals in the merged object, ignoring the individual: quickCluster, then computeSumFactors, then logNormCounts. This appears to be what you did for the pancreas datasets in the OSCA tutorials (although I do not know whether the different individuals were different 10X captures in that case).

  • scran pooling and deconvolution as before, only done separately on each sample (i.e. subsetting the object by individual and running the normalization steps separately). It's very fast and easy to parallelize, which I don't dislike, and it may make more sense if the clustering results are largely different for each individual. However, this may still introduce some bias, as the size factors may be on different scales. I have to say I do not see large differences in coverage across batches/individuals, so I do not expect this to be a big issue. Perhaps plotting the deconvolution size factors for each individual separately against the library size factors would shed some light on whether that's the case.

  • multiBatchNorm on the merged object, specifying the batch, then logNormCounts. This method should solve the size-factor scaling issue, but it is unclear to me whether it still uses the clustering + deconvolution approach (which I wanted to use given its success in some benchmarks) or whether it is a different method. Moreover, in the OSCA tutorials (chapter 13) the use of combineVar is suggested for HVG selection, whereas I thought it would be sufficient to model the mean-variance trend with a blocked design. A rough sketch of these options is shown after this list.
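
Roughly, for options 1 and 3 (just a sketch; `sce` is the merged SingleCellExperiment and `individual` is a colData column name I'm making up):

```r
library(scran)
library(scuttle)
library(batchelor)

## Option 1: pooling/deconvolution on the merged object, ignoring the individual.
clust <- quickCluster(sce)
sce1 <- computeSumFactors(sce, clusters = clust)
sce1 <- logNormCounts(sce1)

## Diagnostic for option 2: deconvolution vs library size factors, coloured by individual.
plot(librarySizeFactors(sce1), sizeFactors(sce1), log = "xy",
     col = as.integer(factor(sce1$individual)),
     xlab = "Library size factor", ylab = "Deconvolution size factor")

## Option 3: multiBatchNorm on the merged object, specifying the batch
## (not sure whether a separate logNormCounts() is still needed afterwards).
sce3 <- multiBatchNorm(sce, batch = sce$individual)
```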

I would then use fastMNN to remove any further batch effect that would still be present.

So, the question: what do you think is the most sensible approach? Is there something I'm missing?

Thanks for your time.

LTLA commented Dec 11, 2019

The typical workflow that I follow is:

  1. scran pooling and normalization within each object to compute size factors. The size factors are not comparable across objects, but that's okay, because
  2. multiBatchNorm to adjust the size factors' scale so that they are comparable across objects; this will also do the log-transformation, so no need for logNormCounts().
  3. Then fastMNN(). (A rough sketch of these steps is below.)
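
Roughly, assuming one SingleCellExperiment per individual (only two shown here, with matching rownames; the HVG selection is an extra, not one of the three steps above):

```r
library(scran)
library(batchelor)

# 1. Deconvolution size factors within each object; not comparable across objects yet.
sce1 <- computeSumFactors(sce1, clusters = quickCluster(sce1))
sce2 <- computeSumFactors(sce2, clusters = quickCluster(sce2))

# 2. Rescale the size factors so they are comparable across objects, and compute
#    log-normalized counts (no separate logNormCounts() call needed).
normed <- multiBatchNorm(sce1, sce2)

# Feature selection across batches, combining per-batch variance models.
dec <- combineVar(modelGeneVar(normed[[1]]), modelGeneVar(normed[[2]]))
hvgs <- getTopHVGs(dec, n = 2000)

# 3. MNN correction on the rescaled log-expression values.
corrected <- fastMNN(normed[[1]], normed[[2]], subset.row = hvgs)
```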

Technically speaking, fastMNN() will do another round of normalization (cosine). TBH I don't really think this is necessary, but it sometimes makes things better, and it doesn't seem to do any harm otherwise, so I've just left it in.

One might think that the cosine normalization would allow us to skip step 2, and that's probably true to some extent, but the nice thing about doing step 2 is that it gives you a baseline to compare before and after MNN correction (sans the obvious differences in sequencing depth between objects), which is handy for checking that the batch correction is actually making a difference.
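
Continuing the sketch above, that before/after check could look something like this (just one way to eyeball it):

```r
library(scater)

# Baseline: the multiBatchNorm'ed but uncorrected data, coloured by batch.
uncorrected <- cbind(normed[[1]], normed[[2]])
uncorrected$batch <- rep(c("1", "2"), c(ncol(normed[[1]]), ncol(normed[[2]])))
uncorrected <- runPCA(uncorrected, subset_row = hvgs)
uncorrected <- runTSNE(uncorrected, dimred = "PCA")
plotTSNE(uncorrected, colour_by = "batch")

# After correction: fastMNN() stores the corrected coordinates in the "corrected" reducedDim.
corrected <- runTSNE(corrected, dimred = "corrected")
plotTSNE(corrected, colour_by = "batch")
```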

Some of these steps are put into practice in Chapter 31.

gdagstn commented Dec 11, 2019

Great, thanks for the explanation.

Regarding your comment on cosine normalization, would it then make sense to apply cosine normalization to the object before fastMNN to compare baseline vs corrected?

LTLA commented Dec 11, 2019

Technically yes, if you wanted an accurate like-for-like comparison. But generally speaking, if it takes more effort than multiBatchNorm(), then it probably requires fastMNN() anyway.

Also, it's hard to interpret what cosine normalization actually does in terms of removing bias. By comparison, multiBatchNorm() is easy to explain - you're just equalizing coverage across batches - and the resulting expression values can actually be used for plotting and stuff. You can't use the cosine-normalized values for much except batch correction and clustering.
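That said, if you do want the like-for-like baseline, it's only a couple of lines (a sketch; `uncorrected` here is a merged, multiBatchNorm'ed SingleCellExperiment with a logcounts assay):

```r
library(batchelor)

# Cosine-normalize the log-expression values, mimicking what fastMNN() does internally,
# to get an "uncorrected but cosine-normalized" baseline.
assay(uncorrected, "cosine.logcounts") <- cosineNorm(logcounts(uncorrected))

# Downstream comparisons (PCA, TSNE, clustering) can then be run on this assay,
# but the values aren't very interpretable for anything else.
```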

gdagstn commented Dec 11, 2019

Thanks, this answers all my questions.
