test.data.table(memtest=TRUE) #5515

mattdowle · 2022-11-08T02:21:44Z

memtest now an argument rather than solely an env variable
reports 10 largest RAM increases as text table analogous to the 10 longest tests by default
adds RSS column to the timings table rather than memtest.csv file, and returns the timings
no longer builds up multiple sets of timings from multiple runs, not needed
ps_mem() was being called before gc_mem() which did the gc(); now calls gc() before ps_mem() as intended
closes test.data.table() creates DT in .GlobalEnv #5514 too although not related to memtest per se

Current output below. Shows steps up where we have relatively bigger test data sizes which could be reduced. The plot is just to show shape and extents; use the table to identify the tests. This growth to 280MB might be enough to tip the overloaded CRAN Windows server to decide to kill, #5507. If we can reduce that down to 170MB it might help avoid that tipping point and kill. I still think that the CRAN Windows server is severely overloaded due to the install time being 5 times longer (214s) than the non-Windows CRAN servers (42s). If we can reduce ram usage in tests, then we should, just to show willing if nothing else. In the absence of any reply from Uwe, in particular not having the detailed logs asked for, this is the best I can guess and do. The FAIL is on CRAN Windows again now.

10 largest RAM increases (MB); see plot for cumulative effect (if any)
      ID nTest   RSS diff
 1:  637     3 211.7 33.1
 2:  819     1 218.9 20.6
 3: 1978     1 280.9 15.8
 4: 1151     2 237.6 15.7
 5: 1739     5 250.0 15.1
 6:  301     2 190.7 14.6
 7: 1158     1 220.2 12.6
 8:  167     3 177.5 11.6
 9: 1549     1 234.9  7.7
10: 1888     9 264.6  6.7

…e the env variable

….GlobalEnv, closes #5514; remove DTfun fix no longer needed.

…sion

…simpler

mattdowle · 2022-11-08T03:31:31Z

The result above was in dev mode using cc(). Results are a little different after installing but still finds similar test numbers.

      ID nTest   RSS diff
 1: 1157     1 218.4 28.2
 2: 1151     2 227.7 28.0
 3:  819     1 196.6 23.1
 4:  301     2 156.9 20.6
 5:  637     3 169.1 15.2
 6: 1848     2 249.9 14.5
 7: 1978     1 264.9 14.3
 8:  167     3 137.6 11.7
 9: 1739     5 234.3 10.6
10: 1378     5 223.6  4.5

Looking at how RSS builds from starting a fresh session :

Using :
as.integer(as.numeric(system(paste0("ps -o rss --no-headers ",Sys.getpid()),intern=TRUE))/1024)  #MB

R --vanilla                     # 50MB
R                               # 53MB
require(data.table)             # 67MB
test.data.table(memtest=TRUE)   # 123MB test 1, rising to 265MB last test
> attr(.Last.value, "timings")
         ID  time nTest   RSS  diff
      <int> <num> <int> <num> <num>
   1:     1 0.509     2 123.8    NA
   2:     2 0.003     2 123.8     0
   3:     3 0.001     1 123.8     0
   4:     4 0.001     1 123.8     0
   5:     5 0.001     1 123.8     0
  ---                              
2046:  2234 0.009     9 265.4     0
2047:  2235 0.001     2 265.4     0
2048:  2236 0.001     9 265.4     0
2049:  2237 0.001     2 265.4     0
2050:  2238 0.000     9 265.4     0

So somewhere at the beginning of test.data.table, on or before the first test, 56MB is allocated (123MB - 67MB). Might be worth investigating that after reducing the biggest tests. source() loads and parses tests.Rraw, iiuc, and maybe the parsed result takes that much. Or maybe something initializes or caches in test 1: its a simple test of tables() that doesn't really take 0.509s, iiuc, and if so it's not that 0.5s that could be tackled but the 56MB ram usage. Anwyay, that's probably going too far. All I really wanted to do is see if there was some low hanging fruit to reduce ram usage to show willing, and the 10 or so tests in the tables give us that.

ben-schwen · 2022-11-08T12:28:47Z

Loading the suggested packages takes up ~30 MB of RAM.

Line 113-118 in tests.Rraw (github is unable to preview this snippet)

for (s in sugg) {
  assign(paste0("test_",s), loaded<-suppressWarnings(suppressMessages(
    library(s, character.only=TRUE, logical.return=TRUE, quietly=TRUE, warn.conflicts=FALSE, pos="package:base")  # attach at the end for #5101
  )))
  if (!loaded) cat("\n**** Suggested package",s,"is not installed or has dependencies missing. Tests using it will be skipped.\n\n")
}

Maybe we could cut this down by moving those tests to an additional script?

jangorecki · 2022-11-08T12:58:25Z

R/test.data.table.R

+  nTest = RSS = NULL  # to avoid 'no visible binding' note
+  timings = env$timings[nTest>0]
+  if (!memtest) {
+    ans = head(timings[-1L][order(-time)], 10L)[,RSS:=NULL]   # exclude id 1 as in dev that includes JIT


in non-dev that should be still kept and not excluded, right?

good spot. now not-excluded in non-dev. I tested non-dev and test 1 doesn't appear in the top 10 as expected but good to have that covered now in case anything sneaks into test 1 in future.

jangorecki · 2022-11-08T13:11:30Z

R/test.data.table.R

+         gc() # force gc so we can find tests that use relatively larger amounts of RAM
+         timings[as.integer(num), RSS:=max(ps_mem(),RSS), verbose=FALSE]


why don't we call gc() after ps_mem()? if a test used a lot of memory but released all, and gc cleans that up, then ps_mem won't report much.

I have in my mind tests that create test input data, call test(), then clear up after the test() call, perhaps by calling rm(). So the test is more than just the test() call. ps_mem() is reporting and finding these top 10 tests. I was more thinking about getting temporary R usage out of the way (by calling gc() first) to reveal the larger test input datasets.
But yes, good idea to move gc() to be after and see what happens...

mattdowle · 2022-11-08T19:15:01Z

Using R CMD INSTALL each time followed by require(data.table); test.data.table(memtest=TRUE) in fresh R 4.2.2 session.

gc before ps_mem ............................................ gc after ps_mem

      ID nTest   RSS diff                     ID nTest   RSS diff
 1: 1151     2 228.1 28.1                1:  819     1 197.7 27.6
 2: 1157     1 217.3 26.7                2: 1151     2 227.6 27.6
 3:  819     1 197.2 23.1                3:  301     2 157.2 20.9
 4:  301     2 156.9 21.2                4: 1542    10 218.1 18.5
 5:  637     3 169.3 14.8                5:  637     3 172.8 18.4
 6: 1848     2 250.5 14.7                6: 1739     5 234.7 16.6
 7: 1978     1 265.4 14.3                7: 1848     2 250.4 14.3
 8:  167     3 137.0 12.2                8:  167     3 137.6 11.6
 9: 1739     5 234.5 10.7                9: 1978     1 262.5 11.5
10: 1378     5 223.7  4.5               10: 1378     5 199.5  4.4

The left is almost identical to earlier result in comment above, as expected because it's a rerun. Small differences there interesting to note their scale and those are assumed due to random nature of memory allocation.

Btw, for future reference, it takes just under 13 mins to run on my laptop.

So it doesn't make much difference with gc() before or after ps_mem(). Maybe in the middle of the two plots around row 1000 of attr(test.data.table(memtest=TRUE), "timings").

Comparing the sorted list of the top 10 IDs, 9 are common.

> sort(c(819,1151,301,1542,637,1739,1848,167,1978,1378))
 [1]  167  301  637  819 1151 1378 1542 1739 1848 1978
> sort(c(1151,1157,819,301,637,1848,1978,167,1739,1378))
 [1]  167  301  637  819 1151 1157 1378 1739 1848 1978

Looking at tests 1157 (out) and 1542 (in) they are both nice to find and it would be a shame to miss either. Other versions of R on different OS, compiled and configured differently, and other test files, etc, could be different of course.

Let's change the argument to control the gc() being before or after ps_mem() then.

codecov · 2022-11-08T22:31:41Z

Codecov Report

Merging #5515 (997c4ab) into master (a4c2b01) will decrease coverage by 0.06%.
The diff coverage is 67.74%.

@@            Coverage Diff             @@
##           master    #5515      +/-   ##
==========================================
- Coverage   99.51%   99.44%   -0.07%     
==========================================
  Files          80       80              
  Lines       14773    14787      +14     
==========================================
+ Hits        14701    14705       +4     
- Misses         72       82      +10

Impacted Files	Coverage Δ
R/test.data.table.R	`94.56% <67.74%> (-5.44%)`	⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

mattdowle · 2022-11-08T22:34:08Z

I ran memtest=1 and memtest=2 again and it reproduced the same ID number ordering in the top 10 as exhibited above for each. Just to have a 2nd observation to confirm before moving on to the real work of moving those tests and suggests.

mattdowle added 6 commits November 5, 2022 12:47

add memtest RSS to timings result and remove memtest.csv; simpler

028d882

memtest now an argument to test.data.table; simpler than having to us…

a5dea77

…e the env variable

add local=TRUE to source() in test 2036 to stop it overwriting DT in …

3462411

….GlobalEnv, closes #5514; remove DTfun fix no longer needed.

added reminder that memtest should be the first task in a fresh R ses…

a24cfd3

…sion

remove commented ok memtest plot code; just use base R plot instead, …

25ef45b

…simpler

test ps_mem() up front; use --no-headers instead of tail -1

6de7f19

mattdowle added this to the 1.14.5 milestone Nov 8, 2022

mattdowle added 2 commits November 7, 2022 19:31

add argument to .Rd too

3c538b3

avoid 'no visible binding' note

22997a8

mattdowle requested a review from jangorecki November 8, 2022 03:50

mattdowle added the tests label Nov 8, 2022

jangorecki reviewed Nov 8, 2022

View reviewed changes

exclude test 1 in dev only; thanks Jan

c211009

This was referenced Nov 8, 2022

Move Suggest-ed package tests out of tests.Rraw #5516

Closed

Reduce or move top 10 tests by RAM usage #5517

Closed

memtest can be 1 (gc before ps) or 2 (gc after ps); thanks Jan

997c4ab

added news item

ac2b4ea

mattdowle merged commit 3ba1c24 into master Nov 8, 2022

mattdowle deleted the memtest branch November 8, 2022 22:41

mattdowle added a commit that referenced this pull request Nov 9, 2022

GLCI memtest env variable now takes '1' not 'TRUE', #5515

a2ecd45

mattdowle modified the milestones: 1.14.7, 1.14.6 Nov 16, 2022

mattdowle added a commit that referenced this pull request Nov 16, 2022

test.data.table(memtest=TRUE) (#5515)

dfba302

mattdowle added a commit that referenced this pull request Nov 20, 2022

added memtest.id to test.data.table, #5515 tidy up

0a964d3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test.data.table(memtest=TRUE) #5515

test.data.table(memtest=TRUE) #5515

mattdowle commented Nov 8, 2022 •

edited

Loading

mattdowle commented Nov 8, 2022 •

edited

Loading

ben-schwen commented Nov 8, 2022 •

edited

Loading

jangorecki Nov 8, 2022

mattdowle Nov 8, 2022

jangorecki Nov 8, 2022

mattdowle Nov 8, 2022 •

edited

Loading

mattdowle commented Nov 8, 2022 •

edited

Loading

codecov bot commented Nov 8, 2022

mattdowle commented Nov 8, 2022

		gc() # force gc so we can find tests that use relatively larger amounts of RAM
		timings[as.integer(num), RSS:=max(ps_mem(),RSS), verbose=FALSE]

test.data.table(memtest=TRUE) #5515

test.data.table(memtest=TRUE) #5515

Conversation

mattdowle commented Nov 8, 2022 • edited Loading

mattdowle commented Nov 8, 2022 • edited Loading

ben-schwen commented Nov 8, 2022 • edited Loading

jangorecki Nov 8, 2022

Choose a reason for hiding this comment

mattdowle Nov 8, 2022

Choose a reason for hiding this comment

jangorecki Nov 8, 2022

Choose a reason for hiding this comment

mattdowle Nov 8, 2022 • edited Loading

Choose a reason for hiding this comment

mattdowle commented Nov 8, 2022 • edited Loading

codecov bot commented Nov 8, 2022

Codecov Report

mattdowle commented Nov 8, 2022

mattdowle commented Nov 8, 2022 •

edited

Loading

mattdowle commented Nov 8, 2022 •

edited

Loading

ben-schwen commented Nov 8, 2022 •

edited

Loading

mattdowle Nov 8, 2022 •

edited

Loading

mattdowle commented Nov 8, 2022 •

edited

Loading