LMER Segfault for "Large" Problems #702
Oh dear -- not looking forward to this ... a few questions for context
Just FYI, on problems this large it may be worth checking out @dmbates's MixedModels.jl package for Julia.
Thanks for the quick reply! Will see if I can replicate the "problem too large" message.

I am using a compute cluster with large amounts of memory available (hundreds of GB per node). Nodes run 64-bit Linux with Intel Xeon processors. Using PBS I am allowing the job to use 80 GB. The highest memory usage I see for a solvable problem is about 40 GB. When an error occurs, the segfault occurs pretty quickly (no iterations reported in the log) and the code isn't using much memory at that point. Nothing special in terms of BLAS/LAPACK; sessionInfo() output below (abridged).

Thanks for the Julia tip. Julia isn't available to me at the moment, but it is a possible longer-term solution.

RHEL 8.6, R version 4.1.3 (2022-03-10), Matrix products: default
RHEL 6.10, R version 3.5.2 (2018-12-20), Matrix products: default
Below is the error message I receive when using lme4_1.1-17 with Matrix_1.2-15:

Error in initializePtr() :

Thanks again for your help,
Thanks for mentioning me, @bbolker.

@mckman Phillip Alday (@palday) and I have been working on problems like this in the draft version of our book Embrace Uncertainty: Mixed-effects Models with Julia. Chapters 4 and 5 relate to fitting large-scale data sets.

The big question for a case like this is whether there are observations on the same person in more than one firm. In experimental design this question is phrased as whether persons are nested within firms or crossed with firms. I am going to assume that on a data set like this there is some "crossing" of people with firms, which will require a more storage-intensive representation.

The important consideration in chapter 5 is the "fill-in" in the Cholesky factorization of a large sparse matrix. In lme4 the Cholesky factorization is performed in C code from the CHOLMOD package, which is where the error message originates. In the MixedModels.jl package the Cholesky factor, L, and the symmetric matrix, A, which stores the information needed to generate L from parameter values, are both stored in blocked form, where the sizes of the blocks are the numbers of levels of the grouping factors.

You can see in the table at the beginning of section 5.2 that we have fit models with about 54,000 levels in the second grouping factor on a server. Unfortunately, the amount of storage required and the time taken to evaluate the objective are close to quadratic in that number. Another thing to note from that table is that incorporating firms with only a few observations (equivalent to keeping movies with very few ratings in our example) is very expensive in both storage and time. If it is possible to trim back the number of firms by dropping those with very few observations, it makes it easier to fit a model.

Let us know if you want to try to fit your model with MixedModels.jl and we will provide instructions on how to install Julia (see https://github.com/JuliaLang/juliaup) and the packages you will need. I need to create an appendix with such instructions in any case.
Thanks, @dmbates, that's super-useful (and illuminating). Not your problem, but do you have any ideas why this would segfault instead of throwing the appropriate error with more recent Matrix/lme4? @mckman, can you try running just the lFormula() step on your data, to check whether the model-setup phase completes on its own?
I could try to provide an annotated version that gives copious debugging output. I want to get rid of the segfault because it's "obviously" a bug (whereas "too big" is still a problem if you want to fit models, but is not a bug in the formal sense); if we can confirm that it's derived from Matrix/CHOLMOD then I would be tempted to try to run the whole thing with valgrind ... (If you have "hundreds" of GB and can use them all, maybe we can squeeze in the problem?)

The other solution I was going to suggest was something along the lines of diamond (https://github.com/stitchfix/diamond), i.e. estimating the among-person and among-firm variances with smaller data sets (and seeing if they're reasonably stable across different samples) and then using the known variances to do a much simpler optimization problem. However, I see that diamond is (1) in Python [another cross-language problem], and more importantly (2) only does logistic and ordinal regression. I'm sure there would be an analogous solution for Gaussian problems, but it seems like a research problem ...
The fact that the older version of the Matrix package gave an error message from the CHOLMOD code makes me think that the segfault in later versions is also from down in the C code. I would not recommend trying to track down a segfault from within CHOLMOD. It can be a bit of a challenge trying to decide what is going on in that code.

The first question to answer is whether the reported point of failure means that the problem is detected during the symbolic phase, when the fill-reducing permutation is being constructed. This comes about because the CHOLMOD code is trying to handle the general problem of a symmetric sparse matrix of size ~1.5 M, which will be very difficult. It is not aware of the blocked structure of the matrix being factored.
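The "fill-in" and the importance of the fill-reducing permutation mentioned above can be seen on a toy example. The sketch below is my own Python illustration (a naive dense Cholesky, not CHOLMOD's algorithm): an "arrow" matrix with its dense row/column first produces a completely dense factor, while the reverse ordering of the same matrix produces no fill-in at all.

```python
import math

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def nnz(L, tol=1e-12):
    """Count the structural nonzeros of a factor."""
    return sum(abs(x) > tol for row in L for x in row)

n = 6
# "Arrow" matrix: dense first row/column, diagonal elsewhere.
A = [[4.0 if i == j else (1.0 if 0 in (i, j) else 0.0) for j in range(n)]
     for i in range(n)]
# The same matrix under the reverse permutation: dense row/column moved last.
P = list(reversed(range(n)))
Ap = [[A[P[i]][P[j]] for j in range(n)] for i in range(n)]

print(nnz(cholesky(A)))   # 21: the factor fills in completely (full lower triangle)
print(nnz(cholesky(Ap)))  # 11: reordered, the factor is as sparse as the matrix
```

This is why the symbolic phase spends effort constructing a good permutation before any numbers are factored; for a general sparse matrix with ~1.5 M rows that analysis is itself expensive.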
Thanks @dmbates for the comments/help. Persons are not nested within firms, but not every person works at every firm. Most persons work at only a few firms. All firms are connected for the set of firms I am estimating. I have already removed very small firms and the workers that primarily work only at those very small firms, as they don't contribute much information to the model and are costly (as you point out).

Unfortunately, I am using servers that are not completely under my control. Installing packages is OK, but I would need to have the admins install Julia.

I understand the problem is large, but what is mildly frustrating is that LMER works well for a problem of size just under the point where it segfaults. It almost seems like there is a variable type wrong or some other parameter issue where the program size isn't allowed to grow as needed. A problem just under the size cutoff is solvable with decent starting values from a smaller random sample in 8-10 hours using maybe 40 GB of memory.

I only get the segfault with later versions of Matrix (I did step back a few versions with Matrix and still received a segfault), although that is not holding the version of LMER constant at the old 1-17 version. I can test that if needed. I can also try the lFormula test if needed as well.

Thanks again for both of your help,
@mckman The juliaup installer that I linked to allows the option of installing julia in your home directory. That is what I use on servers. We tried installing it in a public directory, but it gets complicated having admins update software that they don't use.
@mckman One thing to keep in mind about R is that the only integer format in which data are stored is the 32-bit signed integer. At least that was true when I was active in R development several years ago, and I believe it is still the case. Libraries like LAPACK often have two versions, one that passes 64-bit integers and one that only uses 32-bit integers; with R it is necessary to use the 32-bit version. I can't remember if CHOLMOD, and in general the SuiteSparse library of packages, has one version for 32-bit integers and another for 64-bit integers, but it is likely that it does. If an index or offset is being stored as a 32-bit signed integer, then it rolls over to a negative value at around 2^31 elements, which for an array of doubles is reached at about 16 GiB. I don't know if this is the problem, but it, or something like it, could be.
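The rollover arithmetic above can be made concrete. This is my own illustration (not code from R or Matrix): a signed 32-bit index wraps to a negative value once it exceeds 2^31 - 1, and for an array of 8-byte doubles that element index corresponds to a 16 GiB allocation.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest element index a signed 32-bit integer can hold

# An array of 8-byte doubles hits that index limit at 2**31 * 8 bytes = 16 GiB.
print((INT32_MAX + 1) * 8 // 2**30)          # 16 (GiB of doubles at the limit)

# One step past the limit, the index wraps to a large negative value.
print(ctypes.c_int32(INT32_MAX + 1).value)   # -2147483648
```

A negative index fed to C code as an array offset is exactly the kind of thing that produces a "memory not mapped" segfault rather than a clean error.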
Thanks for the Julia install tips. Will give it a try soon. Yes, I was thinking that using 32-bit integers might be an issue somewhere in the code. Didn't know the R limitation on using them, although it appears there are some workarounds.

@bbolker below are the results from using a new version of Matrix with an older version of LMER. Segfault.

R version 3.5.2 (2018-12-20), Matrix products: default (sessionInfo() output abridged)
*** caught segfault ***

Traceback:

Working on the lFormula results.
@bbolker The lFormula command completed without errors.
Thanks to both of you for your help; I appreciate your quick and thoughtful replies. In summary, @dmbates suggests using the Julia MixedModels.jl package, but unfortunately this isn't an option for me at this time. The segfault seems to be an issue with the Matrix package, and I understand your reluctance to delve into debugging someone else's code. So for the time being, large problems (above a certain size threshold) are not solvable using LME4; the suggested alternative is the MixedModels.jl package in Julia.

What is the size of the largest known solved problem for a mixed model with two partially crossed, non-nested random effects, using LME4 and the Julia option? This kind of information is super helpful when planning what tools to use.
@mckman You could solve larger problems if you increased the memory threshold in PBS. As @dmbates noted earlier, the storage required is close to quadratic in the number of levels of the second grouping factor, so that tells you the size of the largest block in your model. Combine that with the eight bytes needed per double-precision entry and you've got an estimate for the total memory consumption.

I'm not sure any other software will be a lot better in terms of memory consumption. Other mixed-models software uses a different representation in memory, but you ultimately still need to store the relevant model matrices.
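As a back-of-envelope sketch of that estimate (my own arithmetic, not output from lme4 or MixedModels): for two partially crossed, intercept-only grouping factors, the dominant cost is the dense block of the Cholesky factor associated with the second grouping factor, roughly q2 x q2 double-precision entries.

```python
def dense_block_gib(q2: int, bytes_per_entry: int = 8) -> float:
    """Approximate size in GiB of a dense q2-by-q2 block of doubles."""
    return q2 ** 2 * bytes_per_entry / 2 ** 30

# e.g. the ~85,000 firms mentioned elsewhere in this thread:
print(round(dense_block_gib(85_000), 1))  # 53.8 GiB for that single block
```

This rough figure is consistent with the ~40 GB peak usage reported above for a problem just below the failure threshold.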
@palday Thanks for the information on how to calculate memory usage.

Unfortunately, increasing the memory threshold in PBS is not the solution. First, PBS does not directly restrict the memory available to R; R is free to use however much it wants. PBS monitors the node, and if R were to exceed the allowed memory usage then the job would be terminated. This never happens. I also monitor memory usage directly on the node executing the R job, and it never gets above about 40 GB. I have requested large amounts of memory for this job in PBS (i.e. 160 GB) and it doesn't make a difference. LMER fails fairly early on, before any iterations start, at Supernodal/cholmod_super_symbolic.c, line 562.

I don't think any other software would be better for memory consumption. LMER works great until a certain size threshold is reached, and then it segfaults or throws a "problem too large" error depending on the version of Matrix used.
Quick thoughts: (1) If it's not too much trouble, can you check if |
It isn't clear to me what building a 64-bit CHOLMOD shared library would do for you. AFAIK R's integers (in the sense of the contents of an INTSXP) are still defined to be 32-bit signed integers. There is some provision for "long vector" support, but that just allows the length of the vector to be a 64-bit integer. I believe that every time you passed an integer either to or from R (other than as the length of an R vector) you would end up converting it to a 32-bit representation.
The question is whether the required steps are done sufficiently internally to C++ that we could get away without passing integers back and forth. I see that this doesn't work:

m <- .Machine$integer.max
new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(m, m))

I'm not 100% sure why this happens, since the size of this object should be independent of the dimensions:

> M <- new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(10000L, 10000L))
> M2 <- new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(100L, 100L))
> identical(object.size(M), object.size(M2))
[1] TRUE

I am aware that I'm being naive here. I'm just poking around for one way of solving the problem, if indeed it's true that the proximal problem is with the capacity of the indexing vectors ...
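The size-independence observed in that snippet follows from triplet storage. This is my own sketch of the accounting (not the Matrix package internals): a triplet-format sparse matrix holds two integer index vectors and one double value vector for the nonzeros, plus a fixed-size dimension slot, so its footprint depends on the number of nonzeros rather than on the dimensions.

```python
def triplet_bytes(nnz: int, index_bytes: int = 4, value_bytes: int = 8) -> int:
    """Approximate payload of an i/j/x triplet-format sparse matrix.

    Two index vectors and one value vector, plus two integers for the
    dimensions; the dimensions contribute a constant regardless of size.
    """
    return nnz * (2 * index_bytes + value_bytes) + 2 * index_bytes

print(triplet_bytes(0))          # 8: an empty matrix costs the same at any Dim
print(triplet_bytes(1_000_000))  # 16000008
```

So whatever fails for Dim = c(m, m) with m = .Machine$integer.max is a limit on the index values themselves, not on the storage.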
@bbolker There is no error when running lFormula, only lmer. The Matrix package version doesn't matter for lFormula.

A package like https://cran.r-project.org/web/packages/bit64/
I have Julia installed (quite the ordeal to get MixedModels running without direct internet access on the server where I am doing the estimation) and have run a few test models. It appears to be working well so far.

From the MixedModels documentation it seems possible to pass starting covariance parameter values from a previous estimation, but I can't seem to get it to work. In R, lmer has the start= option. Any help greatly appreciated. Thanks!
@mckman It's a little bit tricky because the sorting of the parameters within the theta vector differs slightly between MixedModels and lme4. For a model with two intercept-only terms, the sorting is the same and you can do something like:

fm1 = LinearMixedModel(@formula(...), data)
fm1.optsum.initial .= theta_from_r
fit!(fm1; progress=false)  # you can disable or enable a progress bar

In other words, you can't use the convenience form for construct-and-fit.

In a few different ad-hoc experiments, we haven't seen huge gains by using different starting values, because most of the optimization time is spent in fine-tuning/verifying the optimum and not in getting into its general neighborhood. But for very large models, even saving a few iterations might save a large amount of time.
Thanks @palday, that worked.

Unfortunately, the Julia MixedModels.jl package will not estimate the largest model I can estimate in R: approximately 1.3 million persons and 85,000 firms. In top I can see julia get to about 7-8 GB of memory usage, and then I get an error: "Warning: Random effects grouping variables with many levels can cause out-of-memory errors. Try manually specifying"

The machine I am running on has 1.5 TB of memory, with 200 GB allocated in PBS. This problem is solvable using R on the same server. It would be helpful to know how much memory julia is trying to allocate, something like a dry-run option.

The levels of the person and firm random effects are stored as 9- and 15-digit strings, respectively, in the dataframe. They are factors in R, but I didn't see a similar type in julia.
@mckman The solution is in the error message: use the Grouping() contrast for the grouping variables:

# you can also include contrast coding for other variables here
contrasts = Dict(:person => Grouping(), :firm => Grouping())
fm1 = LinearMixedModel(@formula(...), data; contrasts)
@palday Thanks for the help with the contrasts syntax. Running a large problem now, which is currently using around 200 GB of memory.

Is there a straightforward way to receive iteration output when running MixedModels non-interactively (job submitted to a compute node on a cluster)? I set progress=false, which, not unexpectedly, appears to turn off all iteration output. LMER shows the covariance parameter values and the function minimization value at each iteration, which is super helpful with a long-running job. Julia seems to buffer output, leaving me a bit in the dark about what the program is doing. Not sure how you handle longer-running non-interactive jobs; any advice is most appreciated.
@palday @dmbates Thanks again for your help!!! Not sure if this issue ultimately turns out to be a 32-bit integer problem with LME4/Matrix, but the MixedModels.jl program successfully estimates problems above the LME4 size cutoff (the largest I have done so far is 1.8 million persons and 126K firms, using 135 GB of memory).

One issue with solving large problems in Julia is redirecting output to file(s). The code below redirects the iteration history to program.err and the rest of the output to program.log. Everything sent to standard out (program.log) is buffered internally by Julia until the job completes, but the iteration history is reported as the program is executing. The reported minutes per iteration is great to have, but it would be very useful to also have the covariance parameter estimates (similar to what LME4 reports).

using statements here
open("program.log", "w") do out
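The buffering issue described above is common to long-running non-interactive jobs in any language. Here is a generic sketch of the usual workaround, in Python for illustration (the Julia specifics differ): write one line per iteration and flush immediately, so redirected output is visible while the job runs instead of appearing only at the end.

```python
import io
import sys

def log_progress(step: int, objective: float, out=sys.stdout) -> None:
    """Emit one progress line per iteration and flush right away."""
    out.write(f"iter {step}: objective = {objective:.6f}\n")
    out.flush()  # defeat buffering so `tail -f` on the log shows live progress

# Demonstrate against an in-memory stream instead of a real log file.
buf = io.StringIO()
for k, obj in enumerate([10.5, 9.8, 9.75]):
    log_progress(k, obj, out=buf)
print(buf.getvalue(), end="")
```

The same pattern applies to a covariance-parameter trace: format the current parameter vector into the progress line before flushing.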
Just seeing this now. Matrix >= 1.5-0 replaces certain CHOLMOD-based C functions with ones written from scratch, making Matrix more efficient in many cases, since CHOLMOD is not "aware" of all of the Matrix subclasses. It's likely that these changes are culpable here, despite being tested quite extensively (though not so much for "huge" problems). Having said that, Cholesky factorization is still done by CHOLMOD, so the bug is probably somewhere else.

I'd be grateful for an as-minimal-as-possible reproducible example, including steps to simulate a data set and a function call reproducing the segfault.
FWIW, a number of protection stack bugs were detected in Matrix 1.5-0 by Tomas Kalibera's checking tools. It wouldn't surprise me to learn that the segfault here is related: the problem is large, so the garbage collector is "active". So it would be useful to know if you are able to reproduce the segfault under Matrix-devel, which you can install from sources.

Normally I'd recommend installing from R-Forge, but R-Forge seems to be "stuck" on an old commit.
No synthetic data available yet that replicates the issue; however, I tested the 1.5-3 version of Matrix (SVN revision 3759) and the segfault is still present.
I am estimating a relatively large earnings model using the lmer command. The model is a regression of log annual earnings on experience and hours (the fixed effects) and partially crossed person and firm random effects.
LMER command:
f0 <- lmer(lrae ~ exper + exper2 + exper3 + exper4 + hours + hours2 + hours3 + hours4 + (1 | person) + (1 | firm), data = indat, verbose=3, start=pars)
lmer works well: the estimation is fast and uses relatively little memory. However, when I cross a certain size threshold I receive the following segfault message.
*** caught segfault ***
address 0xb8, cause 'memory not mapped'
Traceback:
1: .Call(merPredDupdateXwts, Ptr, Xwts)
2: initializePtr()
3: .Object$initialize(...)
4: initialize(value, ...)
5: initialize(value, ...)
6: methods::new(def, ...)
I have used two versions of R (3.5.2 and 4.1.3), two versions of RHEL (6.10 and 8.6), and various versions of the Matrix (1.2-15 and 1.5-1) and lme4 (1.1-17 and 1.1-30) R packages, with the same result, although with the older Matrix package I received a "problem too large" message instead of the segfault.
The number of levels in the person random effect is roughly 1.4M or more, and the number of levels in the firm random effect is roughly 100K or more, when the estimation routine fails; models with a slightly smaller number of levels in the random effects work fine. LMER certainly seems capable of solving larger models. Any help troubleshooting this issue is greatly appreciated.