-
Notifications
You must be signed in to change notification settings - Fork 20
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compositionality in GLLVM approach #23
Comments
Hi @chassenr ! Thanks for the question. I'm not quite sure what the sequencing data mean in practice. Could you explain? |
Hi @JenniNiku Thanks! Cheers, |
It sounds to me like you want to include an offset to account for the different number of sequences in samples. However, I'm not sure that answers your question. As long as the sample per species is a count, then applying one of the regular count distributions (Poisson, NB, ZIP) shouldn't be a problem. |
It is more about accounting for the dependence of counts on each other per sample, to avoid spurious correlations. Usually, I use a clr transformation (e.g. ALDEx2, SPIEC-EASI). |
Would you mind describing your data in detail? At the moment it's not very clear what you're after :). |
I assume my data would be very similar to the microbial dataset used in the vignette. If no internal standard was used for sequencing, the sequence counts will not reflect actual abundances. I will try to explain the compositionality problem with a simple example. Sample 1: 3 species (absolute abundance: 10, 30, 60; sequence proportions: 0.1, 0.3, 0.6) The total number of sequences per sample (the library size or sequencing depth) has no ecological meaning (it is determine by how much DNA you put on the sequencer), but it constrains the sequence counts per OTU, i.e. all OTU counts always have to add up to the library size (100%). In the example this leads to an 'increase' in the proportions of the 2 species in sample 2 because they have to fill up this sequence space. However, this apparent increase is caused by the compositional character of the data and not a change in actual abundances. Therefore, the counts in an OTU table cannot be used as they are for statistical models and even taking the percentages is insufficient. |
The above explanation doesn't clarify much for me I'm afraid. I don't know much about sequencing or microbial biology (but maybe Jenni knows more? ). If you want to model absolute abundances relative to their sequence proportions, an offset should take care of that as far as I can tell. However, in your previous message you wrote that you want to account for dependence between counts of species in a sample. Are you implying that you expect them to be correlated somehow, i.e. in a pseudo replication kind of way? Alternatively, you can include additional sample-specific intercepts so that the mean abundance of both samples and species is accounted for outside of the ordination, giving focus on composition. Do I understand correctly that you have multiple sequences for each sample? If so, how is that reflected in your data? |
Hi @chassenr, Have you tried the example analysis from the vignette? I tried to follow the example with my microbial data (ASV table of 32 samples and 15 000 taxa, 10 environmental variables) but calculations took me days and I ended up with errors or nothing in the plots. I assumed it was because of such a large number of ASVs. |
@kbitenieks, any specific errors you had trouble with? GLLVMs are complex statistical models, and with so many taxa calculations are bound to take a while. However, just having many taxa should not create any errors. Lack of data in some taxa might, though. |
To @chassenr , so is the interest more in the proportions of the absolute abundances than the abundance itself in a sequence? For proportion/coverage data, package has not suitable distribution at the moment. However, we are developing GLLVMs for beta distribution. I haven't considered the possibility for Dirichlet, at all yet, so I don't know if that would work with these models. To @kbitenieks , I would recommend to start with a smaller subset of the data where most sparse columns are left out, and see how it works. Also, with large datasets, try to fit a model without calculating standard errors at first, using argument sd.errors = FALSE, as that part takes a long time for huge datasets. |
@JenniNiku My point is that sequence counts do not reflect abundance, and are therefore unsuited for statistical ecological models without transformation. As far as I can tell, the data in the vignette are untransformed sequence counts, which are used directly in GLLVM. Furthermore, the total number of sequences per sample differs between samples. This difference has no ecological meaning. Given the peculiarities of sequence counts, I question whether it is a valid to use such counts in GLLVM. Or did I miss something in the GLLVM package? Introducing a distribution for percentages will also not solve the problem, as sequence percentages are not independent (see compositionality, https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full). Therefore the more recent packages for sequence data analysis, use log ratio transformations. @kbitenieks : Which kind of input data do you use: full OTU table, rarefied to equal library size, rare OTUs filtered? |
@chassenr, I would argue that statistical models should be formulated to account for the properties of data, also in the case of sequencing data, so that transformations can be avoided as much as possible (and as much information can be retained in the data as possible). As such, the question to answer is (from my perspective) "what GLLVM formulation accounts for the properties of sequencing data?". |
As chassenr rightfully pointed out, the sample sizes in a NGS data set, i.e. the total observations in a sample, have no biological meaning, and are technical artifacts inherited by the sequencing device, a now generally accepted "feature" of NGS data. There are two consequences resulting from this feature: i) the need for normalizing for unequal sample sizes and ii) to check if the resulting data is compositional or not. Traditionally, however, it was either scaling to sample sums (relative/proportional data), or subsampling to an equal sampling size ("rarefying"). Both have their problems, and are disregarded for a number of applications (differential abundance, abundance modelling, correlation analysis), but are still probably fine for beta diversity. Weiss et al, 2017 (Microbiome), is a very good start into this topic. However, the problem does not end with just adjusting for biologically irrelevant sample sizes. I think the gllvm package would greatly benefit from having recommendations how to deal best with the two problems. |
Does this reference: "Negative binomial mixed models for analyzing microbiome count data", give any leads? They account for library size by including an offset, as suggested above. |
Thanks for the heads up. I thing applying offsets is basically the same as using proportional data, isnt it? I remember using this approach for GAMMs with zero-inflated negative binomial distributions on NGS data. |
@trichterhb I would love to comment on that, but I'm still not completely clear on what you mean by "compositionality". Care to have another try to elaborate? |
Here is an evaluation of NB for sequence data: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224909 |
I think the issue is best explained here: Imagine you are measuring soil texture as either loam or silt, in % per unit of soil. Such a data set only holds relative information, and by knowing loam, you also know silt percentage, giving you only one degree of freedom. Likewise, the issue with NGS data is that the last unknown sample size is determined by the instrument (as it e.g. can only yield 5 million observations per run). the issues with NGS count data are thus two fold: first, unequal, but meaningless sample sizes for which various normalizations exist (some of which cause compositionality within the sample), second, count totals are fixed by the instrument, and are thus intrinsically delivering compositional data. The second point depends on your personal philosophy (there is no observer in existence which can deliver infinite data, and a sequencer delivers much more observations than an eye), but the field of microbial ecology is definitely moving into adopting this view point. Since gllvm seem to be a very promising approach and since an NGS data is used as an example if i am not mistaken, it would be extremely helpful to have this method "fit" for recent trends in microbial ecology. |
I'm a bit late to the game here but, I believe, as @BertvanderVeen pointed out by using an offset and a random or fixed row effect for each sample is in effect modelling the data as a composition, as well as accounting for sequencing depth differences. But I digress- the properties that make compositional data different is that fact they are a simplex. So another set of distributions that could be considered are multivariate categorical distributions. With this in mind @BertvanderVeen and @JenniNiku could I make request to have the NM, GDM and DM distributions considered for the GLLVM package? Thanks Harrison, JG, Calder, WJ, Shastry, V, & Buerkle, CA 2020, ‘Dirichlet‐multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data’ Molecular Ecology Resources, vol. 20, no. 2, pp. 481–497, doi: 10.1111/1755-0998.13128. Kim J, Zhang Y, Day J, Zhou H. MGLM: An R Package for Multivariate Categorical Data Analysis. R J. 2018 Jul;10(1):73-90. doi: 10.32614/rj-2018-015. PMID: 32523781; PMCID: PMC7286576. |
Thanks for that confirmation, @ch16S. It's an interesting point that you mention, though I'm not too familiar with some of the distributions that you mention. Similarly to the binomial distribution, I'm not sure an analytically tractable solution is available for the multinomial distribution, though I might be wrong. Either how, the development of EVA might make that more straightforward. Having said that, at first glance it seems to me that those distributions prevent the assumption of independence in the observations (being multivariate in nature, that would make sense!). Currently, that assumption is key in the implementation of GLLVMs in the R-package, since it allows us to provide (in a relatively straightforward way compared to the alternative) tractable solutions to the marginal log-likelihood of GLLVMs. Not assuming independence (e.g., due to spatially structured residuals) makes the maths a lot more complex (for (E)VA at least)! That's not to say it's not a possibility, I just don't see it happening anytime soon. But @JenniNiku might have a different perspective on that of course. |
Interesting points here. I'm not that familiar with all of those distributions, so I would need to get to know them at first. |
Sorry I'm somewhat late to the party here... an appropriate procedure to account for compositionality in GLLVM is to analyse the raw counts but with a row effect in the model (adding the argument row.eff="fixed" or row.eff="random"). This controls for differences in library size so that all remaining terms in the model estimate compositional effects. Bert suggested an offset, which is a similar idea, but it assumes the row effect is known a priori. There are some curly questions about whether the matrix of model coefficients should then be constrained, partitioned into a "main effect" and remaining compositional terms, but that consideration is more about parameterisation of the model than anything else, and is secondary to the concern of which model to actually fit! |
Indeed, and my second suggestion a row intercept. Thanks for contributing to the discussion David. |
Out of curiosity is there circumstances where you would use a row offset and a row effect? And in terms of the curly questions about the row effect is there an example of how to partition that code into a main effect and compositional terms? |
row offset and row effect - I guess so, the offset would be accounting for effects we know about, the row effect is picking up residual effects not captured by offset. code to partition - well I guess I've done something relevant to this in mvabund::manyglm(..., comp=TRUE). Although the start of this discussion was about sequencing data for which this particular function would not be very practical (at least, not without recoding it from scratch, comp=TRUE co-opts code written for a different purpose and it does not come across gracefully). |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Hi @JenniNiku ,
a colleague recommended your R package for GLLVM and I am very curious, how you applied this approach to microbial data, specifically sequencing data. As sequencing data is compositional, I was wondering how you accounted for this in your approach, as the typical models that are used for species count data (e.g. poisson, NB) are not necessarily suitable for sequencing data?
Thanks!
Cheers,
Christiane
The text was updated successfully, but these errors were encountered: