Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R 3.0.0 test out-of-memory due to test relying on R>=3.1.0 no-copy behaviour #2484

Closed
jangorecki opened this issue Nov 14, 2017 · 10 comments
Closed
Milestone

Comments

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Nov 14, 2017

Since 18790e0 R-3.0.0 started to fail on test

# segfault when rbindlist is asked to create a DT with more than 2bn rows
DT = data.table(1:1e6)
L = vector("list", 2148)
for (i in seq_along(L)) L[[i]] = DT   # many references to the same DT to avoid actually using large RAM for this test
test(1850, rbindlist(L), error="Total rows in the list is 2148000000 which is larger than the maximum number of rows, currently 2147483647")

Last entry in Rout.fail

Running test id 1849.9      Killed
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Nov 14, 2017

Thanks @jangorecki! Shall we update the dependency to R 3.1.0? That was released April 2014 and is 3.5 years old now. This seems to be a good reason. There are two places in data.table.R that comment about list() copying in R < 3.1.0. We could then remove CcopyNamedInList, .R.listCopiesNamed and tidy that up.

It was this (very welcome) change in R 3.1.0 :

Avoid duplicating the right hand side values in complex assignments when possible. This reduces copying of replacement values in expressions such as Z$a <- a0 and ans[[i]] <- tmp: some package code has relied on there being copies.
Also, a number of other changes to reduce copying of objects; all contributed by or based on suggestions by Michael Lawrence.

@mattdowle mattdowle added this to the v1.10.6 milestone Nov 14, 2017
@mattdowle mattdowle changed the title R 3.0.0 fail due to rbindlist with more than 2bn rows R 3.0.0 fail due to too-big test relying on R>=3.1.0 no-copy behaviour Nov 14, 2017
@mattdowle mattdowle changed the title R 3.0.0 fail due to too-big test relying on R>=3.1.0 no-copy behaviour R 3.0.0 test out-of-memory due to test relying on R>=3.1.0 no-copy behaviour Nov 14, 2017
@jangorecki
Copy link
Member Author

@jangorecki jangorecki commented Nov 15, 2017

3.1 sounds fine. I remember I had to escape some tests on factor which crashes in base R 3.0. I will check how check will process on R 3.1. If anyNA was already in R 3.1 this will help in two other PR by @MichaelChirico

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Nov 15, 2017

Also the following:

grep 'any[(]is.na' data.table/R/*
data.table/R/data.table.R:          if (any(is.na(w))) {
data.table/R/data.table.R:            if (any(is.na(.SDcols)) || any(abs(.SDcols)>ncol(x)) || any(abs(.SDcols)<1L)) stop(".SDcols is numeric but out of bounds (or NA)")
data.table/R/data.table.R:            if (any(is.na(.SDcols)) || any(!.SDcols %chin% names(x))) stop("Some items of .SDcols are not column names (or are NA)")
data.table/R/data.table.R:  if (any(is.na(j))) stop("NA in j")
data.table/R/data.table.R:      if (any(is.na(i))) stop("Items of 'old' not found in column names: ",paste(old[is.na(i)],collapse=","))
data.table/R/data.table.R:    if (any(is.na(o))) stop("Names in neworder not found in x: ", paste(neworder[is.na(o)], collapse=","))
data.table/R/fcast.R:  if (any(is.na(drop))) stop("'drop' must be logical TRUE/FALSE")
data.table/R/foverlaps.R:  if (any(is.na(chmatch(by.x, names(x)))))
data.table/R/foverlaps.R:  if (any(is.na(chmatch(by.y, names(y)))))
data.table/R/setkey.R:#    1) it returns NA if any(is.na(.)) where NAs are detected at R level, inefficiently

@jangorecki
Copy link
Member Author

@jangorecki jangorecki commented Nov 15, 2017

Looks like anyNA is not yet in R 3.1. Current master fail one test, probably setnames call.

Error in eval(expr, envir, enclos) : 
    1 error out of 7625 (lastID=1853, endian==little, sizeof(long double)==16, sizeof(pointer)==8) in inst/tests/tests.Rraw on Wed Nov 15 04:42:41 2017. Search tests.Rraw for test number 1555.1.
  Calls: test.data.table -> sys.source -> eval -> eval

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Nov 15, 2017

Are you sure? Change log says anyNA was added to 3.1.0:

https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

@jangorecki
Copy link
Member Author

@jangorecki jangorecki commented Nov 15, 2017

No, I am not (I don't have R on hand). Could you try to run 1555.1 on R 3.1.0? You can easily run it from docker.io/jangorecki/r-3.1.0.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Nov 16, 2017

Test 1555.1 indeed fails on R 3.1.0, but not because of anyNA; the test is:

ans1 <- fread("issue_1113_fread.txt")
ans2 <- setDT(read.table("issue_1113_fread.txt", header=TRUE))
setnames(ans2, names(ans1))
test(1555.1, ans1, ans2)

That file is this one. The error returned is:

Datasets have different column classes. First 3: MCMCOBJ(numeric!=factor)

i.e., the last column, MCMCOBJ was read by fread as numeric but read.table as factor.

That seems like an error by read.table as (by inspection) clearly the last column is numeric. IIUC that's ancillary to bug #1113, so 3.1.0 dependency is satisfied (though that test should be re-tooled).

Checking on my current R installation (3.4.1), read.table interprets that column correctly as numeric, but in 3.1.0 it's read as a string (stringsAsFactors = FALSE leads to character instead of factor), so the error in that test is identifying a base R bug rather than a data.table bug, is my read.

@jangorecki
Copy link
Member Author

@jangorecki jangorecki commented Nov 16, 2017

So we can add condition on R version and postprocess column. Something like this should be fine ans2 <- setDT({df<-read.table("issue_1113_fread.txt", header=TRUE); if(.) ... else ...}) . And comment to issue #.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Nov 17, 2017

@jangorecki since we don't know what changed/when, adding a getRVersion() check may not fix dependency for all R versions.

5d46ace

Fix added here conditions instead on whether it seems like this column we identified is causing issues; as such we'll be able to detect if there's some other reversion in another setting more easily. Spinning up docker now to confirm that solves this issue and we can return to pushing dependency to R 3.1.0

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Nov 21, 2017

Closed by #2489

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants