unique gives segfault from C stack overflow #4300

Open
quantitative-technologies opened this issue Mar 11, 2020 · 9 comments · May be fixed by #6111

@quantitative-technologies

# Minimal reproducible example

library(data.table)
data <- readRDS('data_debug.rds')
dt <- unique(data)

crashes RStudio, or in the R console:

Error: segfault from C stack overflow

I can provide data_debug.rds (~300MB) if desired.

# Output of sessionInfo()

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.2

@shrektan
Member

Would you mind calling setDT(data) before the unique call and reporting the results back here?

I just want to check whether it's due to a null self pointer.

@quantitative-technologies
Author

I tried calling setDT(data) first, but the result was the same.

@shrektan
Member

shrektan commented Mar 11, 2020

Oops, having your data would be necessary then.

@quantitative-technologies
Author

Please find the data set here

@shrektan
Member

shrektan commented Mar 28, 2020

Thanks for the data, I can reproduce this.

UPDATE: the error is thrown from:

radix_r(0, nrow-1, 0); // top level recursive call: (from, to, radix)

@shrektan shrektan added the bug label Mar 28, 2020
@shrektan
Member

shrektan commented Mar 28, 2020

OK, I created a smaller data set (23MB) that reproduces this. The cause, I believe, is that the recursive function radix_r() recurses too deeply, which overflows the call stack.

I don't know why this happens for this specific dataset, as I can't reproduce it with randomly generated data.

Code

library(data.table)
dt <- readRDS('~/Downloads/data_debug_tan.rds')
dt2 <- data.table:::duplicated.data.table(dt)

Data

data_debug_tan.rds.zip

Debugged message

Error: C stack usage  8094212 is too close to the limit

@jangorecki jangorecki added this to the 1.12.9 milestone Apr 5, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@ben-schwen
Member

ben-schwen commented Nov 18, 2021

It's reproducible with the following setup:

  • A wide data.table with at least 500 columns (at least for my stack size)
  • Duplicated rows

As @shrektan pointed out, the error comes from forderv, and more specifically from void radix_r().

Example

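library(data.table)
x = matrix(rnorm(5000), nrow=10)   # 10 rows, 500 columns
idx = sample(10, 20, TRUE)         # sampling with replacement yields duplicated rows
DT = as.data.table(x[idx,])
# forderv is internal to data.table, hence the ::: access
data.table:::forderv(DT, by=names(DT), sort=FALSE, retGrp=TRUE)
# Error: segfault from C stack overflow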

Calling it a second time in the same session changes the error to:

x = matrix(rnorm(5000), nrow=10)
idx = sample(10, 20, TRUE)
DT = as.data.table(x[idx,])
forderv(DT, by=names(DT), sort=FALSE, retGrp=TRUE)
# Error in colnamesInt(x, by, check_dups = FALSE) : 
#   Internal error: savetl_init checks failed (0 100 0x556d9d49d540 0x556d9e05f430). please report to data.table issue tracker.

The error seems to have been introduced by 05c0d45; there is no segfault at 092fec3.

edit:
The problem seems to be that nradix gets so big that the recursion, which does radix += 1 until radix+1 == nradix, overflows the stack.
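
To illustrate why width and duplicated rows combine badly, here is a deliberately simplified toy model in R. It is not data.table's radix_r(): the function name toy_depth is hypothetical, and each column stands in for a single radix, whereas the real code works at a finer granularity. The only point is that a group of identical rows never shrinks to a single row, so the recursion has to walk through every remaining radix.

```r
# Toy model only: NOT data.table's radix_r(). Each column stands in for one radix.
toy_depth <- function(keys, rows = seq_len(nrow(keys)), radix = 1L) {
  # A group of one row needs no further work; running out of columns also stops.
  if (length(rows) <= 1L || radix > ncol(keys)) return(radix - 1L)
  # Split the current group on this column, then recurse into each subgroup.
  groups <- split(rows, keys[rows, radix])
  max(vapply(groups, toy_depth, integer(1), keys = keys, radix = radix + 1L))
}

set.seed(1)
x <- matrix(rnorm(500), nrow = 10)   # 10 distinct rows, 50 columns
toy_depth(x)                         # distinct rows separate on the first column
toy_depth(x[c(1, 1, 2:10), ])        # one duplicated row is carried through all columns
```

In this toy, the table with distinct rows stops after one level, while the table with a duplicated row recurses once per column, so the depth grows with the width of the table.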

@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@brooksambrose

brooksambrose commented Apr 16, 2024

I think I'm running into this with a data.table that is 18k columns, but I'm not getting any errors in R, it just crashes. If a fix is on the roadmap, is there a recommended way to de-duplicate wide data.tables in the meantime?

@ben-schwen
Member

ben-schwen commented Apr 16, 2024

> I think I'm running into this with a data.table that is 18k columns, but I'm not getting any errors in R, it just crashes. If a fix is on the roadmap, is there a recommended way to de-duplicate wide data.tables in the meantime?

You should at least get the # Error: segfault from C stack overflow message when running your code from a terminal.

AFAIA nobody is working on a fix yet. If the failing method is unique, you can fall back to the base method by calling unique.data.frame(DT) on your data.table.

edit: Another way to avoid this is to enlarge your stack size. On Linux you can set the stack size to unlimited with ulimit -s unlimited before starting R. You can check that this worked via:

Cstack_info()["size"]
#> size 
#>   NA
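
The fallback to base methods mentioned above can be sketched as follows. This is a minimal sketch, not a tested fix for the datasets in this thread: it assumes that converting to a plain data.frame first makes unique() dispatch to base R's unique.data.frame() rather than to data.table's method. The names DF and DT_unique are illustrative only.

```r
library(data.table)

set.seed(1)
x <- matrix(rnorm(5000), nrow = 10)   # 10 distinct rows, 500 columns
idx <- sample(10, 20, TRUE)           # sample with replacement to force duplicated rows
DT <- as.data.table(x[idx, ])

# Assumption: on a plain data.frame, unique() uses base R's unique.data.frame()
# and does not go through data.table's method.
DF <- as.data.frame(DT)
DT_unique <- as.data.table(unique(DF))

nrow(DT_unique) == length(unique(idx))  # one row kept per distinct source row
```

Note that as.data.frame() copies the data, so for a table of several hundred MB this temporarily needs roughly twice the memory.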

@ben-schwen ben-schwen linked a pull request Apr 30, 2024 that will close this issue