Choice of normalization strategy with multiple batches #12
The typical workflow that I follow is:
Technically speaking, one might think that cosine normalization would allow us to skip step 2, and that's probably true to some extent. But the nice thing about doing step 2 is that it gives you a baseline to compare before and after MNN correction (sans the obvious differences in sequencing depth between objects), which is nice for checking that the batch correction is actually making a difference. Some of these steps are put into practice in Chapter 31.
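A sketch of that kind of workflow in `scran`/`batchelor` terms (hedged: the object name `sce` and its `batch` column are assumed, and step 2 is taken to be the `multiBatchNorm`-style rescaling discussed in this thread, not a quote of the original list):

```r
library(scran)
library(batchelor)

# 'sce' is an assumed SingleCellExperiment with a batch label in colData.
# 1. Per-cell size factors via pooling and deconvolution.
clusters <- quickCluster(sce, block = sce$batch)
sce <- computeSumFactors(sce, clusters = clusters)

# 2. Rescale size factors across batches so the log-counts are comparable,
#    giving a pre-correction baseline for clustering/plots.
sce <- multiBatchNorm(sce, batch = sce$batch)

# 3. MNN correction; fastMNN() cosine-normalizes internally by default.
dec <- modelGeneVar(sce, block = sce$batch)
hvgs <- getTopHVGs(dec, n = 2000)
corrected <- fastMNN(sce, batch = sce$batch, subset.row = hvgs)
```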
Great, thanks for the explanation. Regarding your comment on cosine normalization, would it then make sense to do cosine normalization on the object before
Technically yes, if you wanted an accurate like-for-like comparison. But generally speaking, if it takes more effort than Also, it's hard to interpret what cosine normalization actually does in terms of removing bias. By comparison,
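For that kind of like-for-like comparison, `batchelor` exports the cosine normalization that `fastMNN()` applies internally, so one could (hypothetically) apply it to the uncorrected values first:

```r
library(batchelor)

# Cosine-normalize the uncorrected log-expression values, mimicking what
# fastMNN() does internally (cos.norm = TRUE by default), so that
# before/after comparisons are on the same scale. 'sce' is an assumed
# SingleCellExperiment with log-counts already computed.
cos.logcounts <- cosineNorm(logcounts(sce))
```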
Thanks, this answers all my questions.
Hi,
First of all, I just wanted to say it's a pleasure to read your documentation, answers, and even the code you write, as they're always clear and full of opportunities for learning.
Now to the matter at hand: I have a quite large dataset of single-nucleus RNA-seq from 8 individuals (8 separate 10X runs). These were prepped/sequenced in 2 different batches, but I am not interested in the inter-individual variability. In other words, I think it is safe to say that I can remove the batch effect (batches 1 and 2) by removing the individual effect (individuals 1:8).
By reading up on `scran` and `batchelor`, and looking at your workflow on the Bioconductor OSCA website, I am still undecided as to what is the best strategy to normalize my data, and I wanted to ask for advice.
I have already conducted all the necessary count-level QC (filtering out cells with outlying low/high library sizes or high % mitochondrial content, removing empty droplets, removing non-identified genes, etc.). This was done on a merged count matrix, so there is no subsetting problem.
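Those QC steps could be sketched along these lines (hedged: object and column names such as `sce.raw` and `sce$batch` are placeholders, not from the original post):

```r
library(DropletUtils)
library(scater)

# Remove empty droplets from the raw barcode-by-gene matrix
# ('sce.raw' is a placeholder SingleCellExperiment of unfiltered counts).
e.out <- emptyDrops(counts(sce.raw))
sce <- sce.raw[, which(e.out$FDR <= 0.001)]

# Compute per-cell QC metrics, including mitochondrial content.
is.mito <- grepl("^MT-", rownames(sce))
qc <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

# Flag outliers, blocking on batch so thresholds adapt to each run.
libsize.drop <- isOutlier(qc$sum, log = TRUE, type = "both", batch = sce$batch)
mito.drop <- isOutlier(qc$subsets_Mito_percent, type = "higher", batch = sce$batch)
sce <- sce[, !(libsize.drop | mito.drop)]
```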
Now for normalization, I reckon I have the following options:

1. `scran` pooling and deconvolution normalization on all individuals in the merged object, ignoring the individual: `quickCluster`, then `computeSumFactors`, then `logNormCounts`. This appears to be what you did in the pancreas datasets in the OSCA tutorials (although I do not know whether in that case the different individuals were different 10X captures).
2. `scran` pooling and deconvolution as before, only done separately on each sample (i.e., subsetting the object by individual and running the normalization steps separately). It's very fast and easy to parallelize, which I don't dislike, and it may make more sense if clustering results are largely different for each individual. However, this may still introduce some bias, as the size factors may have different scales. I have to say I do not have large differences in coverage across batches/individuals, so I do not expect this to be a big issue. Perhaps plotting the deconvolution size factors for each individual separately against the library size factors would shed some light on whether that's the case.
3. `multiBatchNorm` on the merged object, specifying the batch, then `logNormCounts`. This method should solve the size-factor scaling issue, but it is unclear to me whether it still uses the clustering + deconvolution approach (which I wanted to use given its success in some benchmarks) or whether it is a different method. Moreover, in the OSCA tutorials (chapter 13) the use of `combineVar` is suggested for HVG selection, whereas I thought it would be sufficient to model the mean-variance trend with a blocked design.

I would then use `fastMNN` to remove any further batch effect that would still be present.

So, the question: what do you think is the most sensible approach? Is there something I'm missing?
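To make option 2 concrete, here is a sketch of per-individual deconvolution plus the diagnostic plot mentioned above (hedged: `sce` and its `individual` column are assumed names, not from the original post):

```r
library(scran)
library(scater)

# Option 2: deconvolution size factors computed separately per individual.
by.indiv <- lapply(unique(sce$individual), function(i) {
    sub <- sce[, sce$individual == i]
    clusters <- quickCluster(sub)
    computeSumFactors(sub, clusters = clusters)
})

# Diagnostic: deconvolution vs. library size factors, one panel per
# individual, to check whether the per-individual factors diverge.
for (sub in by.indiv) {
    plot(librarySizeFactors(sub), sizeFactors(sub), log = "xy",
         xlab = "Library size factors", ylab = "Deconvolution size factors")
}
```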
Thanks for your time.