Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warnings/errors produced while running MAST with mixed effect #133

Closed
jmchan88 opened this issue Apr 29, 2020 · 16 comments
Closed

Warnings/errors produced while running MAST with mixed effect #133

jmchan88 opened this issue Apr 29, 2020 · 16 comments

Comments

@jmchan88
Copy link

Hi there,

I was wondering if you might be able to help me with an issue I've been having when running MAST including a random effect for batch. Despite following previous threads related to this topic, I keep getting a number of warnings and errors.

In particular, I have data for ~53,000 cells from 20 patients (batch) where I want to compare 2 groups of cells (condition), after calculating the centered cellular detection rate (cngeneson). I placed "batch" as a vector of 20 factors in my sca data frame. I then run the following command:

library(lme4)
zlmCond <- zlm(~condition + cngeneson + (1|batch), sca = sca, method = 'glmer', ebayes = F, strictConvergence=F, fitArgsD = list(nAGQ = 0))

However, I get a number of different warnings and errors, 3 of which worry me:

==============

Warning message:
In nobs.default(object, use.fallback = use.fallback) :
no 'nobs' method is available

Error : number of levels of each grouping factor must be < number of observations

Error in (function (cl, name, valueClass) :
assignment of an object of class “list” is not valid for @‘optimMsg’ in an object of class “LMERlike”; is(value, "character") is not TRUE

==============

It's unclear to me why 'nobs' is not recognized as a method, as it appears to be included in lme4 which I already loaded. It seems like the levels of my grouping factor (20) are well below the number of observations (53,000). I have no idea what to make of the 3rd error.

Thank you for any help you can provide!

@gfinak
Copy link
Member

gfinak commented Apr 29, 2020

Can you provide a minimal reproducible example? Otherwise we are just speculating.

@jmchan88
Copy link
Author

Thank you so much for taking the time to take a look at this! I've attached the sca object that I'm working with but downsampled to 1000 cells in each condition to reduce file size. I anticipate that the above command should reproduce the errors that I'm seeing.

sca.mast_example.rds.zip

@gfinak
Copy link
Member

gfinak commented Apr 29, 2020

Could you post the sca object with all cells but downsampled to several genes that give the error and others that don't?

@gfinak
Copy link
Member

gfinak commented Apr 29, 2020

Okay so just looking quickly at your data set, it looks like you haven't filtered out genes that aren't expressed in any cells. That's likely the problem. When I filter on genes with at least 10% expression the errors you describe go away.
We describe the whole filtering and analysis pipeline in the MAST paper. I suggest reading it.

@jmchan88
Copy link
Author

Ah I see! I will make those changes and see how it goes. Thank you so much for checking!

@gfinak gfinak closed this as completed Apr 30, 2020
@jmchan88
Copy link
Author

Hmm taking a look, it seems that filtering out genes expressed in at least 10% cells still produces the same errors/warnings for many, albeit possibly at lower frequency. They include:

Error : number of levels of each grouping factor must be < number of observations

In addition: Warning messages:
1: In nobs.default(object, use.fallback = us

2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00948906 (tol = 0.002, component 1)

5: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue

  • Rescale variables?

None of these strictly stop the code from running, so I'll take a look at the end results once it completes on the computer cluster (the code noticeably takes a much longer time than with fixed effects using bayesglm, understandably)

@gfinak
Copy link
Member

gfinak commented Apr 30, 2020

Can you post code that does the filtering and leads to the same errors? I am not experiencing this side-effect.

@jmchan88
Copy link
Author

jmchan88 commented Apr 30, 2020

Sure! This is the code I've been using below:

==================================

suppressMessages(library(MAST))
suppressMessages(library(lme4))
suppressMessages(library(stringr))
suppressMessages(library(data.table))
suppressMessages(require("Matrix"))

args <- commandArgs(trailingOnly = TRUE)

# command line arguments
mtx_file <- args[1]
g_file <- args[2]
bc_file <- args[3]
obs_file <- args[4]
obs <- args[5]
cluster1 <- args[6]
cluster2 <- args[7]
out_dir <- args[8]

# Normalized data frame
counts = as.matrix(readMM(mtx_file))
bc = read.csv(bc_file,header=F,col.names = 'Cell', stringsAsFactors = F)
g = read.csv(g_file,header=F,col.names = 'Gene', stringsAsFactors = F)

rownames(counts) = bc$Cell
colnames(counts) = g$Gene

ms <- rowSums(counts)
norm_df <- counts/ms * median(ms)

# Log transform
data_df <- log(norm_df + 1)

# Clusters
obs_df = read.csv(obs_file,header=T,stringsAsFactors=F, sep = '\t', row.names=1)
cell_clusters = as.character(obs_df[,obs]); names(cell_clusters) <- rownames(obs_df)
batch0 = as.character(obs_df[,'batch']); names(batch0) <- rownames(obs_df)

pairwise_de <- function(df1, df2, batch1, batch2, title){
        # ORdered data frame
        o_df <- rbind(df1, df2)
        #Filter out genes expressed by less than 10% of cells
        ind_col = colSums(o_df!=0)/nrow(o_df) > 0.1
        o_df = o_df[,ind_col]
        batch = c(batch1, batch2)
        condition <- rep(c(1, 2), times=c(nrow(df1), nrow(df2)))
        # Prepare for MAST
        wellKey <- rownames(o_df)
        cdata <- data.frame(cbind(wellKey=wellKey, condition=condition))
        fdata <- data.frame(primerid=colnames(o_df))

        # SCa data
        sca <- FromMatrix( t(o_df), cdata, fdata)
        cdr2 <-colSums(assay(sca)>0)
        colData(sca)$cngeneson <- scale(cdr2)
    colData(sca)$condition <- factor(unlist(as.list(condition)))
    colData(sca)$condition <- relevel(colData(sca)$condition, 2)
    colData(sca)$batch <- factor(batch)

    # Fits
    zlmCond <- zlm(~condition + cngeneson + (1|batch), sca = sca, method = 'glmer', ebayes = F, strictConvergence=F,fitArgsD = list(nAGQ = 0),
                   fitArgsC = list(control = lmerControl(check.conv.singular = .makeCC(action = "ignore",  tol = 1e-4)), verbose=0))
    summaryDt <- summary(zlmCond, doLRT='condition1')$datatable

        # Significance table
        fcHurdle <- merge(summaryDt[contrast=='condition1' & component=='H',.(primerid, `Pr(>Chisq)`)],
                summaryDt[contrast=='condition1' & component=='logFC', .(primerid,coef, ci.hi, ci.lo)], by='primerid')

        # FDR
        fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
        data.table::setorder(fcHurdle, fdr)

        # Export data
        write.csv(as.data.frame(fcHurdle), sprintf("%s/%s.csv", out_dir, title), quote=FALSE)
}

df1 <- data_df[cell_clusters == cluster1, ]
df2 <- data_df[cell_clusters == cluster2, ]
batch1 = batch0[ cell_clusters == cluster1]
batch2 = batch0[ cell_clusters == cluster2]

pairwise_de(df1, df2, batch1, batch2, sprintf("%s.%s_vs_%s", obs, cluster1, cluster2))

@gfinak
Copy link
Member

gfinak commented Apr 30, 2020

> library(MAST)
> sca<-readRDS("~/Downloads/sca.mast_example.rds")
> sca_filtered<-filterLowExpressedGenes(sca,threshold=0.1)
Filtered out 42.64 percent of genes at 10.00 percent threshold
zlmCond <- zlm(~condition + cngeneson + (1|batch), sca = sca_filtered, method = 'glmer', ebayes = F, strictConvergence=F, fitArgsD = list(nAGQ = 0))
> zlmCond <- zlm(~condition + cngeneson + (1|batch), sca = sca_filtered, method = 'glmer', ebayes = F, strictConvergence=F, fitArgsD = list(nAGQ = 0))
Loading required namespace: lme4
 Completed [==============================>]  100% with 0 failures
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
Done!
Warning messages:
1: In optwrap(optimizer, devfun, start, rho$lower, control = control,  :
  convergence code 3 from bobyqa: bobyqa -- a trust region step failed to reduce q
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0031284 (tol = 0.002, component 1)
3: In rbind(...) :
  number of columns of result is not a multiple of vector length (arg 1841)

> table(colData(sca)$batch)

PleuralEffusion         RU1065C          RU1066         RU1080C      RU1124A_LN 
             30               5             130             131              44 
      RU1144_LN        RU1144_T          RU1145          RU1152         RU1181B 
            106              61             162              64             103 
        RU1181C         RU1195A          RU1215  RU1229A_Frozen         RU1231A 
            119              24             193             185             264 
        RU1293A     RU1311A_T_1          RU426B          RU779D 
            295               8              59              17 

So the above works relatively fine, some convergence errors.
Looking at your batch structure, you've got a pretty unbalanced design, I suspect this is the source of your problems.

@gfinak
Copy link
Member

gfinak commented Apr 30, 2020

Also, don't fit different models for each pariwise comparison. Fit one model and compute contrasts.

@jmchan88
Copy link
Author

Ah gotcha ... I was hoping that including the random effect might allow me to account for batch effect without downsampling, but I guess it makes sense that this might lead to unstable fits that don't converge.

One warning for the life of me I can't understand though is this:
In addition: Warning message:
In nobs.default(object, use.fallback = use.fallback) :
no 'nobs' method is available

It seems like it's a function in both lme4 and stats, but both are loaded for me. Do you have a sense of what this method is when I'm running MAST and if it's something specific to my R environment? In the meantime, I'll try to downsample and balance the batches. Thank you!

@gfinak
Copy link
Member

gfinak commented Apr 30, 2020

To be clear, the problem is not that you're donwnsampling, the problem is that you appear to be fitting a bunch of pairwise models. Instead fit one model and compute the contrasts of interest.

@jmchan88
Copy link
Author

Ah I see what you are saying about fitting a single model and using contrasts for multiple pairwise comparisons.

I guess for this particular case though I'm just running this code once and making a single pairwise comparison, and these 2 conditions that I'm comparing essentially cover the entire dataset; so in essence, I'm doing a single pairwise comparison for the entire dataset and getting these warnings/errors.

But your point is well taken, and definitely for instances where I'm doing multiple pairwise comparisons in the future, I'll be sure to change my code to fit a single model!

@gfinak
Copy link
Member

gfinak commented Apr 30, 2020

See if you can isolate one gene that gives the nobs error (after filtering appropriately).

@amcdavid
Copy link
Member

amcdavid commented May 1, 2020

  1. I don't think the nobs error is significant. It has to do with some generic not being present with a calculation about the residual degrees of freedom that we save for models where we are doing shrinkage of the gaussian variance, but don't actually use it here. I would like to get to the bottom of it, though, and silence it.
  2. I suspect the Error : number of levels of each grouping factor must be < number of observations, are due to some genes with low expression vs the relatively high number of batch groups. In that case, we just don't fit the continuous component of the model, so this also should be harmless.
  3. I'm perplexed by the Error in (function (cl, name, valueClass) : assignment of an object of class “list” is not valid for @‘optimMsg’ in an object of class “LMERlike”; is(value, "character") is not TRUE report.
  4. If you are getting a lot of eigenvalue or convergence complaints, etc, that probably means that some component of the design is problematic. I am a bit worried about some of the batches that only had 5 or 8 cells measured, in particular.

@jmchan88
Copy link
Author

jmchan88 commented May 1, 2020

Thank you both so much for your comments! I do think you're right that the batch sizes are unbalanced and that may be leading to errors. The batch sizes that I sent you were uniformly downsampled actually just to give you guys a smaller file size to work with, but are actually much larger. That said, there is def the same imbalance with one batch size far outnumbering the others. I just did a run without that large batch and found that a number of these errors had disappeared, so I do believe I might need to downsample a bit to try to restore some balance. I'll see how the downsampling works ... thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants