
LMER Segfault for "Large" Problems #702

Open

mckman opened this issue Oct 5, 2022 · 27 comments

@mckman

mckman commented Oct 5, 2022

I am estimating a relatively large earnings model with lmer. The model regresses log annual earnings on experience and hours terms (the fixed effects), with partially crossed person and firm random effects.

The lmer call:
f0 <- lmer(lrae ~ exper + exper2 + exper3 + exper4 + hours + hours2 + hours3 + hours4 + (1 | person) + (1 | firm), data = indat, verbose=3, start=pars)

lmer works well; the estimation is fast and uses relatively little memory. However, once I cross a certain size threshold I receive the following segfault:

*** caught segfault ***
address 0xb8, cause 'memory not mapped'

Traceback:
1: .Call(merPredDupdateXwts, Ptr, Xwts)
2: initializePtr()
3: .Object$initialize(...)
4: initialize(value, ...)
5: initialize(value, ...)
6: methods::new(def, ...)

I have used two versions of R (3.5.2 and 4.1.3), two versions of RHEL (6.10 and 8.6), and various versions of the Matrix (1.2-15 and 1.5-1) and lme4 (1.1-17 and 1.1-30) packages, all with the same result. With the older Matrix package, however, I received a "problem too large" error instead of the segfault.

The person random effect has roughly 1.4 million levels and the firm random effect roughly 100K levels when the estimation routine fails. Models with a slightly smaller number of levels in the random effects work fine. lmer certainly seems capable of solving larger models. Any help troubleshooting this issue is greatly appreciated.
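
For reference, a minimal sketch of the kind of setup involved (made-up sizes and a simplified formula; the real model has the full set of experience and hours terms). Scaling n_person and n_firm up should eventually hit the threshold:

library(lme4)
set.seed(1)
n_person <- 2e5; n_firm <- 2e4; n_obs <- 2e6   # scale up toward the failure point
d <- data.frame(
  person = factor(sample.int(n_person, n_obs, replace = TRUE)),
  firm   = factor(sample.int(n_firm,   n_obs, replace = TRUE)),
  exper  = runif(n_obs)
)
d$lrae <- 0.5 * d$exper + rnorm(n_obs)
f0 <- lmer(lrae ~ exper + (1 | person) + (1 | firm), data = d, verbose = 3)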

@bbolker
Member

bbolker commented Oct 5, 2022

Oh dear -- not looking forward to this ... a few questions for context:

  • What exactly was the 'problem too large' message from the old Matrix version (maybe that will help track down the problem)?
  • What are the specs of the machine you're using (especially, how much RAM)?
  • Are you explicitly doing any parallel computation? Are you using a parallel BLAS or anything similar? (Checking a system monitor while your fits are running is a quick way to start figuring this out.)
  • Results of sessionInfo() from a clean session (especially the OS and "Matrix products/BLAS/Lapack" sections)?

Just FYI, on problems this large it may be worth checking out @dmbates's MixedModels.jl too (I can appreciate that it may not be feasible/you may not want to go the Julia route ...)

@mckman
Author

mckman commented Oct 5, 2022

Thanks for the quick reply! Will see if I can replicate the "problem too large" message.

I am using a compute cluster with large amounts of memory available (hundreds of GB per node). Nodes run 64 bit Linux with Intel Xeon processors. Using PBS I am allowing the job to use 80GB. The highest memory usage I see for a solvable problem is about 40GB. When an error occurs, the segfault occurs pretty quickly (no iterations reported in the log) and the code isn't using much memory at that point.

Nothing special in terms of BLAS/LAPACK. sessionInfo() output is below.

Thanks for the Julia tip. Julia isn't available to me at the moment, but it is a possible longer-term solution.

RHEL 8.6

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.6 (Ootpa)

Matrix products: default
BLAS/LAPACK: /lib64/libopenblas-r0.3.15.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.1.3

RHEL 6.10

R version 3.5.2 (2018-12-20)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.2

@mckman
Author

mckman commented Oct 6, 2022

Below is the error message I receive when using lme4_1.1-17 with Matrix_1.2-15.

Error in initializePtr() :
Cholmod error 'problem too large' at file:../Supernodal/cholmod_super_symbolic.c, line 562
Calls: lmer ... initialize -> -> initializePtr -> .Call
Execution halted

Thanks again for your help,
Kevin

@dmbates
Member

dmbates commented Oct 6, 2022

Thanks for mentioning me, @bbolker.

@mckman Phillip Alday (@palday) and I have been working on problems like this in the draft version of our book Embrace Uncertainty: Mixed-effects Models with Julia. Chapters 4 and 5 relate to fitting large-scale data sets.

The big question for a case like this is whether there are observations on the same person in more than one firm. In experimental design this question is phrased as whether person is nested within firm. If so, it should be fairly straightforward to fit the model in MixedModels.jl.

I am going to assume that on a data set like this there is some "crossing" of people with firms, which will require a more storage-intensive representation.

The important consideration in chapter 5 is the "fill-in" in the Cholesky factorization of a large sparse matrix. In lme4 the Cholesky factorization is performed in C code from CHOLMOD, which is where the error message originates. In the MixedModels.jl package the Cholesky factor, L, and the symmetric matrix, A, which stores the information needed to generate L from parameter values, are both stored in blocked form, where the block sizes are determined by the numbers of levels of the grouping factors -- person and firm in your case. The point of the analysis in section 5.2 of the book is that it is the number of firms that is most influential in the amount of storage required. The diagonal block, L[2,2], ends up being a triangular matrix of size number of firms by number of firms, which would be (10^10) * 8 bytes (i.e. 74 GiB) in your case. Also, the time required to perform the dense Cholesky factorization and what is known as a rank-k update of L[2,2] from L[2,1] would be substantial.
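
As a quick sanity check on that arithmetic, in R (using the approximate 100K firm count):

n_firm <- 1e5
n_firm^2 * 8          # bytes in a dense n_firm-by-n_firm block of doubles: 8e+10
n_firm^2 * 8 / 2^30   # ~74.5 GiB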

You can see in the table at the beginning of section 5.2 that we have fit models with about 54,000 levels in the second grouping factor on a server. Unfortunately, the amount of storage required and the time taken to evaluate the objective are close to quadratic in that number.

Another thing to note from that table is that incorporating firms with only a few observations (equivalent to keeping movies with very few ratings in our example) is very expensive in both storage and time. If it is possible to trim back the number of firms by dropping those with very few observations it makes it easier to fit a model.

Let us know if you want to try to fit your model with MixedModels.jl and we will provide instructions on how to install Julia (see https://github.com/JuliaLang/juliaup) and the packages you will need. I need to create an appendix with such instructions in any case.

@bbolker
Member

bbolker commented Oct 6, 2022

Thanks, @dmbates, that's super-useful (and illuminating). Not your problem, but do you have any ideas why this would segfault instead of throwing the appropriate error with more recent Matrix/lme4?

@mckman, can you

  • confirm that you only get the segfault with more recent Matrix (i.e., holding R/lme4 version constant)?
  • try running your model only with lFormula() (i.e. substitute lFormula for lmer: this should do only the model setup; see the sketch just below this list). My prediction is that you'll still get the problems there.
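
A sketch of those modular steps (this is lme4's documented modular interface; the formula is abbreviated here):

lf <- lFormula(lrae ~ exper + (1 | person) + (1 | firm), data = indat)  # model setup only
devfun <- do.call(mkLmerDevfun, lf)   # builds the merPredD object the traceback points at
opt <- optimizeLmer(devfun)
fit <- mkMerMod(environment(devfun), opt, lf$reTrms, fr = lf$fr)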

I could try to provide an annotated version that gives copious debugging output from merPredD::updateXwts (which I think is where the problem is ultimately happening), although I strongly suspect that the problem is upstream and is only triggering the segfault later on ...

I want to get rid of the segfault because it's "obviously" a bug (whereas "too big" is still a problem if you want to fit models, but is not a bug in the formal sense); if we can confirm that it's derived from Matrix/CHOLMOD then I would be tempted to try to run the whole thing under valgrind ... (If you have "hundreds" of GB and can use them all, maybe we can squeeze in the problem?)

The other solution I was going to suggest was something along the lines of diamond (https://github.com/stitchfix/diamond), i.e. estimating the among-person and among-firm variances on smaller data sets (and checking that they're reasonably stable across different samples) and then using the known variances to solve a much simpler optimization problem. However, I see that diamond (1) is in Python [another cross-language problem] and, more importantly, (2) only does logistic and ordinal regression. I'm sure there would be an analogous solution for Gaussian problems, but it seems like a research problem ...

@dmbates
Member

dmbates commented Oct 6, 2022

The fact that the older version of the Matrix package gave an error message from the cholmod code makes me think that the segfault in later versions is also from down in the C code. I would not recommend trying to track down a segfault from within cholmod. It can be a bit of a challenge trying to decide what is going on in that code.

The first question to answer is whether person is nested within firm. If so, there is no need for a fill-reducing permutation. In the blocked form it means that the [2,2] block of L will be diagonal and occupy comparatively little storage. The fact that the warning in cholmod comes from

Supernodal/cholmod_super_symbolic.c, line 562

means that the problem is detected during the symbolic phase when the fill-reducing permutation is being constructed.

This comes about because the cholmod code is trying to handle the general problem of a symmetric sparse matrix of order about 1.5 million, which will be very difficult. It is not aware of the blocked structure of the matrix being factored.

@mckman
Author

mckman commented Oct 6, 2022

Thanks @dmbates for the comments/help. Persons are not nested within firm, but not every person works at every firm. Most persons work at only a few firms. All firms are connected for the set of firms I am estimating. I have already removed very small firms and the workers that primarily work only at those very small firms as they don't contribute much information to the model and are costly (as you point out). Unfortunately, I am using servers that are not completely under my control. Installing packages is OK, but I would need to have the admins install Julia.

I understand the problem is large, but what is mildly frustrating is that lmer works well for problems just under the size at which it segfaults. It almost seems as if a variable has the wrong type, or there is some other parameter issue, such that the program size isn't allowed to grow as needed. A problem just under the size cutoff is solvable, with decent starting values from a smaller random sample, in 8-10 hours using maybe 40GB of memory.

I only get the segfault with later versions of Matrix (I did step back a few versions of Matrix and still received a segfault), although that was not holding the lme4 version constant at the old 1.1-17. I can test that if needed.

I can also try the lFormula test if needed.

Thanks again for both of your help,
Kevin

@dmbates
Member

dmbates commented Oct 6, 2022

@mckman The juliaup installer that I linked to allows the option of installing julia in your home directory. That is what I use on servers. We tried installing it in a public directory but it gets complicated having admins update software that they don't use. For me the local install looked like this:

bates@lunchbox:~$ curl -fsSL https://install.julialang.org | sh

info: downloading installer
Welcome to Julia!

This will download and install the official Julia Language distribution
and its version manager Juliaup.

While Juliaup does not seem to be installed on this system, there is a
Juliaup configuration file present from a previous installation.
✔ Do you want to continue with the installation and overwrite the existing Juliaup configuration file? · yes

Juliaup will be installed into the Juliaup home directory, located at:

  /workspace/bates/.julia/juliaup

The julia, juliaup and other commands will be added to
Juliaup's bin directory, located at:

  /workspace/bates/.julia/juliaup/bin

This path will then be added to your PATH environment variable by
modifying the profile files located at:

  /workspace/bates/.profile

Julia will look for a new version of Juliaup itself every 1440 minutes when you start julia.

You can uninstall at any time with juliaup self uninstall and these
changes will be reverted.

✔ Do you want to install with these default configuration choices? · Proceed with installation

Now installing Juliaup
Installing Julia 1.8.2+0.x64 (x64).
Julia was successfully installed on your system.

Depending on which shell you are using, run one of the following
commands to reload the the PATH environment variable:

  . /workspace/bates/.profile

bates@lunchbox:~$ . /workspace/bates/.profile 

bates@lunchbox:~$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> 
bates@lunchbox:~$ du -sh ~/.julia/juliaup/
948M	/workspace/bates/.julia/juliaup/

@dmbates
Member

dmbates commented Oct 6, 2022

@mckman One thing to keep in mind about R is that the only integer format in which data are stored is 32-bit signed integers. At least that was true when I was active in R development several years ago and I believe it is still the case. Libraries like LAPACK often have two versions, one that passes 64-bit integers and one that only uses 32-bit integers. With R it is necessary to use the 32-bit version. I can't remember if cholmod, and in general the SuiteSparse library of packages, has a version for 32-bit and another for 64-bit integers but it is likely that it does.

A 32-bit signed integer overflows at 2^31, so if an offset counting 8-byte doubles is stored that way, it rolls over to a negative value at around 16 GiB. I don't know if this is the problem, but it, or something like it, could be.
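
A quick illustration of that arithmetic in R (note that R integers signal overflow rather than silently wrapping, unlike C):

.Machine$integer.max        # 2147483647, i.e. 2^31 - 1
.Machine$integer.max + 1L   # NA, with an integer-overflow warning
2^31 * 8 / 2^30             # an index counting 8-byte doubles overflows near 16 GiB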

@mckman
Author

mckman commented Oct 6, 2022

Thanks for the Julia install tips. Will give it a try soon.

Yes, I was thinking that using 32 bit integers might be an issue somewhere in the code. Didn't know the R limitation on using them, although it appears there are some workarounds.

@bbolker below are the results from using a new version of Matrix with an older version of LMER. Segfault.

R version 3.5.2 (2018-12-20)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lme4_1.1-17 Matrix_1.5-1 data.table_1.14.0

loaded via a namespace (and not attached):
[1] minqa_1.2.4 MASS_7.3-51.1 compiler_3.5.2 Rcpp_1.0.5
[5] splines_3.5.2 nlme_3.1-137 grid_3.5.2 nloptr_1.0.4
[9] lattice_0.20-38

pars <- c(0.616,0.591)

f0 <- lmer(lrae ~ exper + exper2 + exper3 + exper4 + hours + hours2 + hours3 + hours4 + (1 | pik) + (1 | sein), data = indat, verbose=3, start=pars)

*** caught segfault ***
address 0xb8, cause 'memory not mapped'

Traceback:
1: .Call(merPredDupdateXwts, Ptr, Xwts)
2: initializePtr()
3: .Object$initialize(...)
4: initialize(value, ...)
5: initialize(value, ...)
6: methods::new(def, ...)

Working on the lFormula results.

@mckman
Author

mckman commented Oct 6, 2022

@bbolker The lFormula command completed without errors.

@mckman
Author

mckman commented Oct 7, 2022

Thanks to both of you for your help, I appreciate your quick and thoughtful replies. In summary, @dmbates suggests using the Julia MixedModels.jl package, but unfortunately this isn't an option for me at this time. The segfault seems to be an issue with the Matrix package and I understand your reluctance to delve into debugging someone else's code.

So for the time being, large problems (above a certain size threshold) are not solvable using lme4. The suggested alternative is the MixedModels.jl package in Julia.

What is the size of the largest known solved problem for a mixed model with two partially crossed, non-nested random effects, using lme4 and the Julia option? This kind of information is super helpful when planning which tools to use.

@palday
Contributor

palday commented Oct 7, 2022

@mckman You could solve larger problems if you increased the memory threshold in PBS.

As @dmbates noted earlier,

L[2,2], ends up being a triangular matrix of size number of firms by number of firms which would be (10^10) * 8 bytes (i.e. 74 GiB) in your case.

So this tells you the size of the largest block in your model. Combine that with:

  1. size of smaller blocks - depending on details of your model, could be quite small, potentially on the order of (number of persons * 8 bytes) + (number of firms * 8 bytes) (+ some small overhead). (why 8 bytes? double-precision floating point numbers take 8 bytes to store)
  2. the size of the dataframe you use to load the data (because that stays in memory in R), which I'm guessing is on the order of a few gigabytes. I think there might be ways around keeping this copy around in R; there definitely are in Julia.

and you've got an estimate of the total memory consumption. I'm not sure any other software will be a lot better in terms of memory consumption. Other mixed-models software uses a different representation in memory, but you ultimately still need to store the relevant model matrices. glmmTMB might be worth a try (because, to the best of my knowledge, it doesn't store the lower Cholesky factor, just the fixed-effects and random-effects model matrices), but I'm unsure how large the memory demands of the gradient computation are.
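
A back-of-envelope version of that recipe in R, using the approximate level counts mentioned in this thread:

n_person <- 1.4e6; n_firm <- 1e5
L22   <- n_firm^2 * 8                 # dense [2,2] block of L
small <- (n_person + n_firm) * 8      # smaller diagonal blocks, roughly
(L22 + small) / 2^30                  # ~74.5 GiB, before counting the data frame itself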

@mckman
Author

mckman commented Oct 7, 2022

@palday. Thanks for the information on how to calculate memory usage.

Unfortunately increasing the memory threshold in PBS is not the solution. First, PBS does not directly restrict the memory available to R. R is free to use however much it wants. PBS monitors the node and if R were to exceed the allowed memory usage then the job would be terminated. This never happens.

I also monitor memory usage directly on the node executing the R job and it never gets above about 40GB. I have requested large amounts of memory for this job in PBS (i.e. 160GB) and it doesn't make a difference. lmer fails fairly early on, before any iterations start, at Supernodal/cholmod_super_symbolic.c, line 562.

I don't think any other software would be better for memory consumption. lmer works great until a certain size threshold is reached, and then it segfaults or throws a "problem too large" error depending on the version of Matrix used.

@bbolker
Member

bbolker commented Oct 7, 2022

Quick thoughts:

  1. If it's not too much trouble, can you check whether lFormula() throws a "too large" error when run with old Matrix? (It would also be interesting to do a bisection study to find the version of Matrix at which we go from a too-large error to a segfault ...)
  2. It seems it may be possible to compile CHOLMOD to use 64-bit integers throughout; I don't know how to set this up.
  3. It might be worth bringing in the Matrix package maintainers @mmaechler (and possibly trying to hook Mikael Jagan as well).

@dmbates
Member

dmbates commented Oct 7, 2022

It isn't clear to me what building a 64-bit CHOLMOD shared library would do for you. AFAIK R's integers (in the sense of the contents of an INTSXP) are still defined to be 32-bit signed integers. There is some provision for "long vector" support but that just allows for the length of the vector to be a 64-bit integer.

I believe that every time you passed an integer either to or from R (other than as the length of an R vector) you would end up converting it to a 32-bit representation.

@bbolker
Member

bbolker commented Oct 7, 2022

The question is whether the required steps are done sufficiently internally to C++ that we could get away without passing integers back and forth.

I see that this doesn't work:

m <- .Machine$integer.max
new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(m, m))

Error in .T2C(x) : 'R_Calloc' could not allocate memory (6442450941 of 4 bytes)

I'm not 100% sure why this happens since the size of this object should be independent of the dimensions:

> M <- new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(10000L, 10000L))
> M2 <- new("dgTMatrix", i = integer(0), j = integer(0), Dim = c(100L, 100L))
> identical(object.size(M), object.size(M2))
[1] TRUE

I am aware that I'm being naive here. I'm just poking around for one way of solving the problem, if indeed it's true that the proximal problem is with the capacity of the indexing vectors ...

@mckman
Author

mckman commented Oct 7, 2022

@bbolker There is no error when running lFormula, only lmer. The Matrix package version doesn't matter for lFormula.

A package like bit64 (https://cran.r-project.org/web/packages/bit64/) might be helpful with passing integers back and forth?
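
(For illustration, bit64 represents 64-bit integers in the bit patterns of doubles:

library(bit64)
x <- as.integer64(.Machine$integer.max) + 1L   # 2147483648, no overflow
class(x)                                       # "integer64"

so it helps at the R level, but C code expecting 32-bit INTSXP index vectors would still have to be changed to consume it.)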

@mckman
Author

mckman commented Oct 11, 2022

I have Julia installed (quite the ordeal to get MixedModels running without direct internet access on the server where I am doing the estimation) and have run a few test models. Appears to be working well so far.

From the MixedModels documentation it seems possible to pass starting covariance parameter values from a previous estimation, but I can't seem to get it to work. In R, lmer has the start= option.

Any help greatly appreciated. Thanks!

@palday
Contributor

palday commented Oct 11, 2022

@mckman It's a little bit tricky because the sorting of the parameters within the theta vector differs slightly between MixedModels and lme4. For a model with two intercept-only terms, the sorting is the same and you can do something like:

fm1 = LinearMixedModel(@formula(...), data)
fm1.optsum.initial .= theta_from_r
fit!(fm1; progress=false) # you can disable or enable a progress bar 

In other words, you can't use the construct-and-fit convenience form fit(MixedModel, @formula(...), data).

In a few different ad-hoc experiments, we haven't seen huge gains by using different starting values because most of the optimization time is spent in fine-tuning/verifying the optimum and not in getting into its general neighborhood. But for very large models, even saving a few iterations might save a large amount of time.

@mckman
Author

mckman commented Oct 12, 2022

Thanks @palday that worked.

Unfortunately, the Julia MixedModels.jl package will not estimate the largest model I can estimate in R: approximately 1.3 million persons and 85,000 firms. In top I can see julia get to about 7-8GB of memory usage, and then I get an error.

"Warning: Random effects grouping variables with many levels can cause out-of-memory errors. Try manually specifying Grouping() contrasts for those variables.
@MixedModels /.../linearmixedmodel.jl:63
ERROR: LoadError: OutOfMemoryError()"

The machine I am running on has 1.5TB of memory, with 200GB allocated in PBS. This problem is solvable using R on the same server. Would be helpful to know how much memory julia is trying to allocate. Something like a dry-run option.

The levels of the person and firm random effects are stored as 9- and 15-digit strings, respectively, in the dataframe. They are factors in R, but I didn't see a similar type in julia.

@palday
Contributor

palday commented Oct 12, 2022

@mckman The solution is in the error message: use Grouping():

# you can also include contrast coding for other variables here
contrasts = Dict(:person => Grouping(), :firm => Grouping()) 
fm1 = LinearMixedModel(@formula(...), data; contrasts)
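
(Grouping() marks a variable as used only for grouping, so MixedModels skips materializing a dense contrasts matrix over its levels; with around a million levels, that materialization is what triggers the OutOfMemoryError above.)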

@mckman
Author

mckman commented Oct 12, 2022

@palday Thanks for the help with the contrasts syntax. Running a large problem now, which is currently using around 200GB of memory.

Is there a straightforward way to receive iteration output when running MixedModels non-interactively (job submitted to a compute node on a cluster)?

I set progress=false, which, not unexpectedly, appears to turn off all iteration output. lmer shows the covariance parameter values and the function minimization value at each iteration, which is super helpful with a long-running job. Julia seems to buffer output, leaving me a bit in the dark about what the program is doing.

Not sure how you handle longer running non-interactive jobs, any advice is most appreciated.

@mckman
Author

mckman commented Oct 15, 2022

@palday @dmbates Thanks again for your help!!! Not sure if this issue ultimately turns out to be a 32-bit integer problem with lme4/Matrix, but MixedModels.jl successfully estimates problems above the lme4 size cutoff (the largest I have done so far is 1.8 million persons and 126K firms, using 135GB of memory).

One issue with solving large problems in Julia is redirecting output to files. The code below redirects the iteration history to program.err and the rest of the output to program.log. Everything sent to standard out (program.log) is buffered internally by Julia until the job completes, but the iteration history is reported as the program executes. The reported minutes per iteration is great to have, but it would be very useful to also have the covariance parameter estimates (similar to what lme4 reports).

using statements here

open("program.log", "w") do out
open("program.err", "w") do err
redirect_stdout(out) do
redirect_stderr(err) do
Mixed Model Estimation code here
end;
end;
end;
end;

@jaganmn
Contributor

jaganmn commented Nov 3, 2022

Just seeing this now.

Matrix >= 1.5-0 replaces certain CHOLMOD-based C functions with ones written from scratch, making Matrix more efficient in many cases, since CHOLMOD is not "aware" of all of the Matrix subclasses. It's likely that these changes are culpable here, despite being tested quite extensively (though not so much for "huge" problems). Having said that, Cholesky factorization is still done by CHOLMOD, so the bug is probably somewhere else.

I'd be grateful for an as-minimal-as-possible reproducible example, including steps to simulate a data set and a function call reproducing the segfault.

BTW, zzz <- new("dgTMatrix", Dim = rep.int(.Machine$integer.max, 2L)) works just fine. The problem is show(zzz) doing something incredibly inefficient. I'd welcome a bug report ...

@jaganmn
Contributor

jaganmn commented Nov 3, 2022

FWIW, a number of protection stack bugs were detected in Matrix 1.5-0 by Tomas Kalibera's rchk tool. These persisted in Matrix 1.5-1, but are fixed in Matrix-devel. Such bugs would have allowed unprotected objects to be freed prematurely by the garbage collector, possibly in the middle of a computation.

It wouldn't surprise me to learn that the segfault here is related. The problem is large, so the garbage collector is "active". So, it would be useful to know if you are able to reproduce the segfault under Matrix-devel. You can install from sources as follows:

$ svn checkout svn://svn.r-forge.r-project.org/svnroot/matrix/pkg/Matrix
$ R CMD INSTALL Matrix

Normally I'd recommend

> install.packages("Matrix", repos = "http://R-Forge.R-project.org", type = "source")

but R-Forge seems to be "stuck" on an old commit.

@mckman
Author

mckman commented Nov 8, 2022

@jaganmn

No synthetic data available yet that replicates the issue; however, I tested version 1.5-3 of Matrix (SVN revision 3759) and the segfault is still present.
