Deviance residuals after inclusion of covariates still include variation from covariates #72

Closed
ghost opened this issue Apr 12, 2017 · 4 comments


ghost commented Apr 12, 2017

Hi!

We are currently trying to use MAST on a dataset of 443 pancreatic alpha cells from 6 healthy donors (136, 92, 117, 44, 28 and 26 cells per donor, respectively) and 443 cells from 4 disease donors (obtained with SmartSeq2). Among a number of things we are doing with MAST, we are trying to look at correlations between genes across the 443 healthy cells. To do that, we thought we would remove variation arising from CDR and donor differences by looking at the deviance residuals after including CDR and donor as covariates in the model, like this:

colData(scaRaw)$cngeneson <- scale(rowSums(testData != 0))
colData(scaRaw)$donor <- as.factor(subjectArrayH)

zlmResidDE <- zlm.SingleCellAssay(~cngeneson+donor, scaRaw, hook=deviance_residuals_hook)

residDE <- zlmResidDE@hookOut
resMatrix <- t(do.call(rbind, residDE))

Before looking at the correlations, we clustered the cells to check whether the cell-to-cell variation is now larger than the donor differences. We used the SIMLR algorithm with the 1000 most variable genes, like this:

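library(SIMLR)

disp <- zlmResidDE@dispersion[, 1]  # per-gene dispersion (first column)
cut1 <- 1000                        # number of most variable genes to keep
noClusters <- 6

# rank(-disp) == 1 is the largest dispersion; <= keeps cut1 genes (barring ties), < would keep cut1 - 1
test2 <- resMatrix[, rank(-disp) <= cut1]  # here we choose the most variable genes
res <- SIMLR(t(test2), c = noClusters, cores.ratio = 0.5)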

The result was this plot:

image

As you can see, there is still a lot of clustering according to donor, although at least the donors do not fall into separate clusters. This suggests that donor differences still mask variation due to the biological state of the cell. When we tried the same thing with a similar dataset of the same cell type (424 cells from 4 donors) obtained with a different sequencing platform (CEL-Seq), we had really good results: including only CDR in the model left cells clustering according to donor (left), but including donor as a covariate left cells clustering independently of donor origin (right).

image

In addition, the biological variation was preserved as we got meaningful correlation structures out of the remaining deviance residuals...

Would you have any idea where this problem with the first dataset could come from, and why it is not present in the second? Or any suggestion for how we could improve the deviance residuals? Maybe we just have to disregard our first dataset because it does not give us enough statistical power for this kind of analysis, but that would be a real shame.

Thanks!
Best wishes,

Alexander

@gfinak
Member

gfinak commented Apr 12, 2017

The donor effects may be non-linear. These would be captured by SIMLR but not by the MAST model.
Have you inspected the residuals of some of those most variable genes by more standard methods to verify model assumptions?
Cheers,
Greg
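One way to do the kind of per-gene check Greg describes, sketched in base R on simulated data. The helper `check_gene_residuals` and the simulated gene are hypothetical and not part of MAST; with real data, `resid` would be one column of `resMatrix`, `donor` the donor factor, and `cdr` the `cngeneson` covariate. This is a rough sketch of standard diagnostics, not a definitive test of the MAST model assumptions.

```r
# Hypothetical helper: simple diagnostics for one gene's residual vector.
check_gene_residuals <- function(resid, donor, cdr) {
  ok <- is.finite(resid)                 # drop NA/Inf residuals, if any
  resid <- resid[ok]
  donor <- droplevels(donor[ok])
  cdr <- cdr[ok]
  # Leftover mean shift between donors (expected to vanish after a linear fit)
  p_mean <- anova(lm(resid ~ donor))[["Pr(>F)"]][1]
  # Donor-specific spread, which a model for the mean does not remove
  p_var <- fligner.test(resid, donor)$p.value
  # Curvature in CDR that a linear CDR term would miss
  p_curve <- anova(lm(resid ~ cdr), lm(resid ~ poly(cdr, 2)))[["Pr(>F)"]][2]
  c(mean_shift = p_mean, variance = p_var, cdr_curvature = p_curve)
}

# Simulated stand-in for one gene: donors differ in variance, not only in mean.
set.seed(1)
donor <- factor(rep(c("D1", "D2", "D3", "D4"), each = 50))
cdr <- as.numeric(scale(rnorm(200)))
sd_by_donor <- c(D1 = 0.5, D2 = 0.5, D3 = 2, D4 = 2)
y <- 1 + 0.5 * cdr + rnorm(200, sd = sd_by_donor[as.character(donor)])
resid <- residuals(lm(y ~ cdr + donor))

p <- check_gene_residuals(resid, donor, cdr)
print(signif(p, 3))
```

In this simulation the donor mean shift is gone by construction, but the variance test still flags a donor effect. Plotting residuals against donor (`boxplot(resid ~ donor)`) and against CDR gives the same information visually.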

@amcdavid
Member

I agree with Greg: SIMLR is using a cell-to-cell distance matrix and will capture non-linear structure in the data. Regressing out variables with the deviance residuals (roughly speaking) makes each gene orthogonal to the nuisance covariates. But that does not make the distance matrix orthogonal to the nuisance covariates (e.g. if cells are more similar to each other within ID than they are between ID). I don't know of anyone who has solved this problem yet for non-linear dimensionality reduction.
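A small simulation can illustrate the point. It is a toy sketch in base R with made-up data, not the MAST hurdle model: the donor changes the spread of each gene rather than its mean, ordinary linear regression stands in for the per-gene fit, and all names are hypothetical.

```r
set.seed(7)
n_per <- 40
n_genes <- 30
donor <- factor(rep(c("A", "B"), each = n_per))

# Donor affects the spread of every gene, not its mean (a non-linear effect).
sd_cell <- ifelse(donor == "A", 0.5, 2)
expr <- matrix(rnorm(2 * n_per * n_genes, sd = sd_cell), ncol = n_genes)

# Regress donor out of each gene separately (cells x genes residual matrix).
resid_mat <- apply(expr, 2, function(g) residuals(lm(g ~ donor)))

# Per gene, the residuals are uncorrelated with donor ...
max_cor <- max(abs(cor(resid_mat, as.integer(donor))))

# ... but the cell-to-cell distances still depend on donor.
d <- as.matrix(dist(resid_mat))
is_a <- donor == "A"
upper <- upper.tri(matrix(0, n_per, n_per))
within_a <- mean(d[is_a, is_a][upper])
within_b <- mean(d[!is_a, !is_a][upper])
between <- mean(d[is_a, !is_a])

print(c(max_cor = max_cor, within_a = within_a,
        within_b = within_b, between = between))
```

Here every gene is numerically orthogonal to donor, yet cells from donor A remain much closer to each other than to cells from donor B, so a distance-based method can still recover donor.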

You could explicitly regress out the nuisance covariates from the distance matrix (rather than the expression matrix) and then run your favorite dimensionality reduction algorithm, e.g. resid(lm(distance_matrix ~ covariates)).
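One possible reading of that suggestion, sketched in base R on simulated data. Covariates are per cell while distances are per pair of cells, so this sketch assumes pair-level predictors (`same_donor`, `cdr_diff`), which are hypothetical choices and not something specified in the thread. The pairs are not independent, so only the residuals of the fit are used, not its p-values.

```r
set.seed(42)
n_cells <- 60
n_genes <- 50
donor <- factor(rep(c("D1", "D2", "D3"), each = 20))
cdr <- rnorm(n_cells)

# Simulated cells x genes matrix with a gene-specific donor offset left in.
offsets <- matrix(rnorm(nlevels(donor) * n_genes, sd = 1.5), ncol = n_genes)
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells) +
  offsets[as.integer(donor), ]

d <- dist(expr)  # Euclidean distances, stored as the lower triangle

# Row/column index of every pair, in the same column-major order dist() uses.
idx <- which(lower.tri(matrix(0, n_cells, n_cells)), arr.ind = TRUE)
i <- idx[, "row"]
j <- idx[, "col"]

# Assumed pair-level nuisance covariates.
pair_df <- data.frame(
  d = as.vector(d),
  same_donor = donor[i] == donor[j],
  cdr_diff = abs(cdr[i] - cdr[j])
)

fit <- lm(d ~ same_donor + cdr_diff, data = pair_df)

# Rebuild a symmetric matrix of residual distances.
d_resid <- matrix(0, n_cells, n_cells)
d_resid[lower.tri(d_resid)] <- residuals(fit)
d_resid <- d_resid + t(d_resid)

before <- tapply(pair_df$d, pair_df$same_donor, mean)
after <- tapply(residuals(fit), pair_df$same_donor, mean)
print(rbind(before = before, after = after))
```

The residual matrix has negative entries and is no longer a proper distance, so it may need shifting (for example subtracting its minimum off-diagonal value) before it is passed to a method that expects non-negative distances.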

@amcdavid
Member

amcdavid commented Apr 12, 2017

Also, @AlexanderAivazidis, is it OK if I repost your question on the Bioconductor support site, so that we can close this issue when it's resolved but keep the question alive for other users?

@ghost
Author

ghost commented Apr 12, 2017

Thanks! This makes sense. Sure, you can repost this question on the Bioconductor support site; that is probably a better place for it. I will look into your suggestions and post my results if I make any progress on this problem.
