
Are Dino-normalized counts comparable across datasets? #1

Closed · mbernste opened this issue Apr 5, 2022 · 10 comments

mbernste commented Apr 5, 2022

Hi @JBrownBiostat! We're looking to use Dino at my company and I had a quick question: are Dino-normalized counts comparable across datasets? That is, if we normalize two datasets separately can one then combine them? (not taking into account batch effects, of course).

On one hand, this may be a bad idea because the latent Gamma distributions are different between the two datasets with different means. On the other hand, I wonder if you could view the combined normalized values as posterior samples from the union of the model used in each dataset (i.e., the union of all the gammas used in the mixture models of both datasets), but the mixing weights for a cell in Dataset 1 would be zero for the latent gammas fit in Dataset 2. In this latter view, combining datasets seems valid.

Sorry if I am not making sense here! Let me know what I can clarify.

mbernste changed the title from "Are Dino-normalized accounts comparable across datasets?" to "Are Dino-normalized counts comparable across datasets?" on Apr 5, 2022
JBrownBiostat (Owner) commented Apr 5, 2022

Hi @mbernste!

That's great to hear, and I hope Dino helps with the work you are doing.

To a first approximation, your intuition sounds correct. If you reconfigured Dino to fit separate mixtures of gammas to subsets of samples, as in a simple one-factor factorial design -- equivalent to giving weight 0 to the appropriate subsets of observed data in each separate fit -- you would get results matching the case where you normalized each dataset separately, up to tuning parameters that depend on the number of input data points. I would not expect those differences to be practically meaningful for normal, non-pathological datasets. Similarly, while there is no particular reason to believe that the two fitted mixtures of gammas would be the same across datasets -- they will be affected by different proportions of cell types, treatment groups, batch effects, etc. -- I would not expect those differences to cause practical problems in downstream analysis, especially when you are already planning to test for or accommodate such differences.

The one big thing you need to consider is the target sequencing depth. By default, Dino scales the means of the posterior distributions from which normalized values are sampled to match the mean sequencing depth (column sum) of your input gene-by-sample count matrix; that is, the posteriors are constructed such that the expected column sum of the sampled, normalized values equals the mean sequencing depth of the input samples, for every cell. That average sequencing depth will not typically be the same across datasets, so you would observe a global fold-change across all genes between the normalized results if this is not accounted for.
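In rough notation (my symbols, just restating the default described above):

```latex
E\!\left[\sum_{g} \hat{y}_{gj}\right] \;\approx\; \bar{d} \qquad \text{for every cell } j,
```

where \hat{y}_{gj} is a sampled normalized value for gene g in cell j and \bar{d} is the mean of the input column sums d_j.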

The easiest way to deal with this is to supply custom scale factors through the depth argument in the Dino call. These factors could, for example, be calculated jointly across the datasets you are considering, or against a standard level such as 10,000 counts, i.e., such that the expected normalized sequencing depth is 10,000 for all samples.

Please let me know if I can clarify further on the topic of supplying custom scale factors, or anything else, of course!

mbernste commented Apr 5, 2022

Thank you so much for the detailed reply! Is there a specific reason you scale the posterior means to match the mean sequencing depth? Is this basically just so the normalized values are in a range that is "intuitive"? That is, why not just use the unscaled posterior that is in units of "latent expression"? Unless I am missing something.

JBrownBiostat (Owner) commented Apr 5, 2022

The choice of the mean sequencing depth (default) as the target is, just as you say, so that the normalized values are in an intuitive range.

However, the use of some target sequencing depth which is constant across samples is vital to this or (almost) any other normalization method. If one were to use the un-scaled posterior distribution, then the results would be corrected for depth-dependent differences in the proportion of zeros in the data, but not for depth-dependent differences in expected expression. In other words, the excess variance in the sequenced expression due to random differences in sequencing depth would remain and no normalization would actually have taken place.

As an example of what I am describing in the context of other techniques, CPT or CPM (counts-per-thousand/million) simply scale the per-sample expression such that each sample has one thousand or one million counts post-normalization (CPT can also be counts-per-ten-thousand). Scran, Median-Ratio, and TMM take an approach similar to Dino in that they each scale per-sample expression such that each sample has a post-normalization depth roughly equal to the average of the pre-normalization sequencing depths (with some method-specific caveats and differences). I should note, though, that these methods don't handle well the high proportions of per-sample zeros that Dino was specifically designed to deal with.
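For a concrete point of reference, CPM is just a per-column rescaling. A minimal sketch in R, assuming a gene-by-cell count matrix `counts` (this is the generic definition, not code from any of the packages named above):

```r
## Counts-per-million: rescale each cell (column) so its counts sum to 1e6
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
```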

mbernste commented Apr 5, 2022

Thank you again for your explanation. Though I am getting the sense that I am missing something fundamental. Can you clarify what you mean by "target sequencing depth"?

Looking at the posterior in the paper, I am not sure what is meant by the "target depth". The delta_j is the observed sequencing depth. Is that different?

[Screenshot: the posterior distribution equations from the Dino paper]

Furthermore, my (perhaps incorrect) understanding is that the true, latent value of lambda_j would implicitly normalize for sequencing depth, because it is the "true expression" of the gene, which only produces the mean count when multiplied by the scaling factor. Thus, it wouldn't need scaling.

I will continue to ponder your explanation a bit more!

JBrownBiostat (Owner) commented Apr 5, 2022

@mbernste, more than happy to clarify. And apologies for any confusion I may have caused; the phrase "target sequencing depth" doesn't appear with that wording in the original paper. What I am referencing when I say target sequencing depth shows up at the end of paragraph 1 of section 2.1:

"For convenience of interpretation on the λ_gj, calculated LS values are scaled prior to use in the Dino model such that δ_jMed=1 where jMed is the index of the cell with the median LS across cells." (I mentioned above that the mean sequencing depth is the target for simplicity, but technically it is the median as described in the paper)

This is dealing with both a practical and a philosophical problem. You are actually already hinting at the philosophical problem in your question; that being that we can't know the "true expression" of the gene, even statistically. When using reasonable methods, we know that the observed expression is always going to be some (unknown) fraction of the true RNA counts in a particular cell/sample. As such, the latent λ_gj is better understood as modeling a value proportional to the "true expression." This interpretation was not explicitly stated in the paper.

The practical problem is that if the values δ_j remained as observed sequencing depths (and were not scaled as described in the quoted section of the paper), then the corresponding sampled values of λ_gj, which are, of course, the normalized values, would all be very small (uniformly less than 1 with high probability). You can see this in the first equation from the crop of the paper you shared -- before the Poisson component has been incorporated -- by considering that the Poisson component is the distribution on the counts, y, given λ, so λ would have to be small when δ_j is defined as the sequencing depth (which is much greater than y_gj for normal data).

This is all a long way of saying that the target sequencing depth is the "scale" on which the latent, depth-corrected means λ_gj are considered. Mathematically, it is the (possibly unobserved) sequencing depth associated with the cell for which δ_j = 1. This can be seen again by examining the first equation above, since δ_j = 1 is the only value for which δ_j drops out of the model, leaving the expectation of the counts y_gj, given λ_gj, defined solely in terms of the latent λ_gj. Using the default approach, where the median sequencing depth is scaled to 1, the expected sum of λ_gj across genes is then roughly the same as the sequencing depth of the median cell. If we were to leave the δ_j as raw, un-scaled sequencing depths, then the expected sum of λ_gj across genes would be roughly 1.
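Put in symbols, this is roughly (my paraphrase of the setup quoted above, with d_j the raw library size of cell j; not the paper's exact equations):

```latex
\delta_j \;=\; \frac{d_j}{d_{j_{\mathrm{Med}}}}, \qquad
y_{gj} \mid \lambda_{gj} \;\sim\; \mathrm{Poisson}\!\left(\delta_j \, \lambda_{gj}\right),
```

so for the cell with δ_j = 1 the depth term drops out, E[y_gj | λ_gj] = λ_gj, and that cell's (possibly hypothetical) depth is the scale on which the normalized λ_gj live.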

mbernste commented Apr 5, 2022

Awesome I get it now!! Thank you!

A final, more practical question: if I pass a scalar like 10,000 as the depth argument, this will scale all sequencing depths by 1/10,000? So if I do this for each dataset, the datasets would be roughly comparable?

This seems like a big advantage over scTransform or analytic Pearson residuals, for which datasets that are normalized separately cannot really be combined (if I am not mistaken).

mbernste commented Apr 5, 2022

Or perhaps what I would need to do would be to scale the counts in each dataset so that the median cell (in terms of total count, i.e., library size) is equal to 10,000?

mbernste commented Apr 5, 2022

To provide some context: at a high level, what I'm hoping to do is normalize a bunch of datasets separately so that, later, we can pull whatever subset of datasets we need and pool them together for aggregate analysis. Thus, it's crucial that the datasets are normalized in complete isolation from one another.

JBrownBiostat (Owner) commented Apr 5, 2022

Closer to the second option you are considering.

If the goal is to have the target sequencing depth always be 10,000 so that datasets which are separately normalized can be compared (BTW, this seems like a reasonable approach to me), then you would do the following (a short R sketch follows the list):

  1. Calculate sequencing depths d_j for each sample/cell j in the usual manner
  2. Scale the calculated d_j by 1/10,000 such that a depth of 10,000 pre-scaling is transformed to 1 post-scaling
  3. Take the log of the scaled depths, d_j (Dino uses these parameters on the log scale)
  4. Pass the vector of transformed depths as the depth argument to Dino, leaving the sequenced counts unaltered
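In case a concrete version helps, the four steps above look roughly like this in R. Here `counts` is a placeholder for your gene-by-cell count matrix and `normMat` for the output; the exact call is a sketch, but the depth argument and the log scaling are as described above:

```r
library(Dino)

d <- colSums(counts)          # 1. per-cell sequencing depths (column sums)
d_scaled <- d / 1e4           # 2. a pre-scaling depth of 10,000 maps to 1
log_depth <- log(d_scaled)    # 3. Dino uses these depths on the log scale
normMat <- Dino(counts, depth = log_depth)  # 4. sequenced counts left unaltered
```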

In case it is helpful, there is an example of a similar procedure to the one I am detailing in the vignette for Dino on Bioconductor under section 3.6, although in that case scale factors calculated by Scran are being used as an alternate measure of library size to the simple column sums of the count matrix we have been considering here.

And yes, I wouldn't perform this type of dataset merging for a residuals-based method like scTransform. You might get away with it if the data were particularly well behaved, but it doesn't strike me as worth the risk.

mbernste commented Apr 5, 2022

Awesome, thank you so much. This was incredibly helpful!
