Choice of normalization strategy with multiple batches #12

Closed
gdagstn opened this issue Dec 10, 2019 · 4 comments

gdagstn commented Dec 10, 2019

Hi,

First of all, I just wanted to say it's a pleasure to read your documentation, answers, and even the code you write, as they are always clear and full of opportunities for learning.

Now to the matter at hand: I have a fairly large single-nucleus RNA-seq dataset from 8 individuals (8 separate 10X runs). These were prepped/sequenced in 2 different batches, but I am not interested in the inter-individual variability. In other words, I think it is safe to say that I can remove the batch effect (batches 1 and 2) by removing the individual effect (individuals 1:8).

Having read up on scran and batchelor and looked at your workflow on the Bioconductor OSCA website, however, I am still undecided about the best strategy for normalizing my data, so I wanted to ask for advice.

I have already carried out all the necessary count-level QC (filtering out cells with outlying low/high library sizes or high % mitochondrial content, removing empty droplets, removing unidentified genes, etc.). This was done on a merged count matrix, so there is no subsetting problem.
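
For reference, the QC went roughly along these lines (just a sketch with made-up object names; `sce.raw` is the merged, unfiltered SingleCellExperiment with an `individual` column in its colData, and the mitochondrial gene pattern is only indicative):

```r
library(DropletUtils)
library(scuttle)

# Call cells vs empty droplets on the raw, unfiltered counts.
e.out <- emptyDrops(counts(sce.raw))
sce <- sce.raw[, which(e.out$FDR <= 0.001)]

# Per-cell QC metrics, treating mitochondrial genes as a subset.
is.mito <- grep("^MT-", rownames(sce))
qc <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

# Flag outliers on library size and mito %, blocking on the individual.
discard <- isOutlier(qc$sum, log = TRUE, type = "both", batch = sce$individual) |
    isOutlier(qc$subsets_Mito_percent, type = "higher", batch = sce$individual)
sce <- sce[, !discard]
```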

Now for normalization, I reckon I have the following options:

  • scran pooling and deconvolution normalization on all individuals in the merged object, ignoring the individual: quickCluster, then computeSumFactors, then logNormCounts. This appears to be what you did for the pancreas datasets in the OSCA tutorials (although I do not know whether the different individuals were different 10X captures in that case).

  • scran pooling and deconvolution as before, only done separately on each sample (i.e. subsetting the object by individual and running the normalization steps separately). It's very fast and easy to parallelize, which I don't dislike, and it may make more sense if the clustering results are largely different for each individual. However, this may still introduce some bias, as the size factors may be on different scales. I have to say I do not see large differences in coverage across batches/individuals, so I do not expect this to be a big issue. Perhaps plotting the deconvolution size factors for each individual separately against the library size factors would shed some light on whether that's the case.

  • multiBatchNorm on the merged object, specifying the batch, then logNormCounts. This method should solve the size-factor scaling issue, but it is unclear to me whether it still uses the clustering + deconvolution approach (which I wanted to use given its success in some benchmarks) or whether it is a different method. Moreover, in the OSCA tutorials (chapter 13) the use of combineVar is suggested for HVG selection, whereas I thought it would be sufficient to model the mean-variance trend with a blocked design. A rough sketch of these options is shown after this list.
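
Roughly, for options 1 and 3 (just a sketch; `sce` is the merged SingleCellExperiment and `individual` is a colData column name I'm making up):

```r
library(scran)
library(scuttle)
library(batchelor)

## Option 1: pooling/deconvolution on the merged object, ignoring the individual.
clust <- quickCluster(sce)
sce1 <- computeSumFactors(sce, clusters = clust)
sce1 <- logNormCounts(sce1)

## Diagnostic for option 2: deconvolution vs library size factors, coloured by individual.
plot(librarySizeFactors(sce1), sizeFactors(sce1), log = "xy",
     col = as.integer(factor(sce1$individual)),
     xlab = "Library size factor", ylab = "Deconvolution size factor")

## Option 3: multiBatchNorm on the merged object, specifying the batch
## (not sure whether a separate logNormCounts() is still needed afterwards).
sce3 <- multiBatchNorm(sce, batch = sce$individual)
```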

I would then use fastMNN to remove any further batch effect that would still be present.

So, the question: what do you think is the most sensible approach? Is there something I'm missing?

Thanks for your time.

LTLA commented Dec 11, 2019

The typical workflow that I follow is:

  1. scran pooling and normalization within each object to compute size factors. The size factors are not comparable across objects, but that's okay, because
  2. multiBatchNorm to adjust the size factors' scale so that they are comparable across objects; this will also do the log-transformation, so no need for logNormCounts().
  3. Then fastMNN(). (A rough sketch of these steps is below.)
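
Roughly, assuming one SingleCellExperiment per individual (only two shown here, with matching rownames; the HVG selection is an extra, not one of the three steps above):

```r
library(scran)
library(batchelor)

# 1. Deconvolution size factors within each object; not comparable across objects yet.
sce1 <- computeSumFactors(sce1, clusters = quickCluster(sce1))
sce2 <- computeSumFactors(sce2, clusters = quickCluster(sce2))

# 2. Rescale the size factors so they are comparable across objects, and compute
#    log-normalized counts (no separate logNormCounts() call needed).
normed <- multiBatchNorm(sce1, sce2)

# Feature selection across batches, combining per-batch variance models.
dec <- combineVar(modelGeneVar(normed[[1]]), modelGeneVar(normed[[2]]))
hvgs <- getTopHVGs(dec, n = 2000)

# 3. MNN correction on the rescaled log-expression values.
corrected <- fastMNN(normed[[1]], normed[[2]], subset.row = hvgs)
```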

Technically speaking, fastMNN() will do another round of normalization (cosine). TBH I don't really think this is necessary, but it sometimes makes things better, and it doesn't seem to do any harm otherwise, so I've just left it in.

One might think that the cosine normalization would allow us to skip step 2, and that's probably true to some extent, but the nice thing about doing step 2 is that it gives you a baseline to compare before and after MNN correction (sans the obvious differences in sequencing depth between objects), which is handy for checking that the batch correction is actually making a difference.
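
Continuing the sketch above, that before/after check could look something like this (just one way to eyeball it):

```r
library(scater)

# Baseline: the multiBatchNorm'ed but uncorrected data, coloured by batch.
uncorrected <- cbind(normed[[1]], normed[[2]])
uncorrected$batch <- rep(c("1", "2"), c(ncol(normed[[1]]), ncol(normed[[2]])))
uncorrected <- runPCA(uncorrected, subset_row = hvgs)
uncorrected <- runTSNE(uncorrected, dimred = "PCA")
plotTSNE(uncorrected, colour_by = "batch")

# After correction: fastMNN() stores the corrected coordinates in the "corrected" reducedDim.
corrected <- runTSNE(corrected, dimred = "corrected")
plotTSNE(corrected, colour_by = "batch")
```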

Some of these steps are put into practice in Chapter 31.

gdagstn commented Dec 11, 2019

Great, thanks for the explanation.

Regarding your comment on cosine normalization, would it then make sense to apply cosine normalization to the object before fastMNN to compare baseline vs corrected?

LTLA commented Dec 11, 2019

Technically yes, if you wanted an accurate like-for-like comparison. But generally speaking, if it takes more effort than multiBatchNorm(), then it probably requires fastMNN() anyway.

Also, it's hard to interpret what cosine normalization actually does in terms of removing bias. By comparison, multiBatchNorm() is easy to explain - you're just equalizing coverage across batches - and the resulting expression values can actually be used for plotting and stuff. You can't use the cosine-normalized values for much except batch correction and clustering.
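That said, if you do want the like-for-like baseline, it's only a couple of lines (a sketch; `uncorrected` here is a merged, multiBatchNorm'ed SingleCellExperiment with a logcounts assay):

```r
library(batchelor)

# Cosine-normalize the log-expression values, mimicking what fastMNN() does internally,
# to get an "uncorrected but cosine-normalized" baseline.
assay(uncorrected, "cosine.logcounts") <- cosineNorm(logcounts(uncorrected))

# Downstream comparisons (PCA, TSNE, clustering) can then be run on this assay,
# but the values aren't very interpretable for anything else.
```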

gdagstn commented Dec 11, 2019

Thanks, this answers all my questions.
