Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

jangorecki · 2020-04-18T18:58:11Z

Closes #4380
Closes #4387
Partially address #2947 for index, not for keys. I don't think we can do much better there so I would also close that one.
Towards #2879
Resolves #1321 (a test should be added)

Examples below.
Note that output below is missing newly added attributes populated by new option retStats=TRUE. I am not adding them below to make output easier to read. retStats=TRUE will just return more attributes (all comes for free, unlike retGrp=T): any-not-ascii, any-not-utf8, any-na, any-inf-nan.

## either of
# cc()
# library(data.table); forderv = data.table:::forderv

setindex[v] will now compute groups (retGrp=T) as well, and store it together with index attribute. Note that it only applies to setindex interface, any internal index computation will use forder and will not compute groups unnecessarily (explained in computing index could find groups as well #4387).

d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
d2 = data.table(id=2:1, v2=1:2)
d2[id==1L]
attr(attr(d2, "index"), "__id")
# int [1:2] 2 1

Low level C short circuit to re-use existing keys in forder (allow re-use of index during C bmerge which was calling C forder directly merge could utilize i index #4380). Verbose codes: key opt=1, indices opt=2.

d1 = data.table(id=1:2, v1=1:2)
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
d1[d2, on="id", verbose=TRUE]
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s

Newly computed index could be now set as DT index attribute directly from C, next to where it is being computed. It means that forderv could already set the index. It is disabled by default, can be enabled by option datatable.forder.auto.index=TRUE. This option is only for debugging/development, not going to be exported. Ultimately current datatable.auto.index option will control that.

options(datatable.verbose=TRUE, datatable.forder.auto.index=TRUE)
d2 = data.table(id=2:1, v2=1:2)
o = forderv(d2, "id")
#forderLazy: setting index (retGrp=0) on DT: __id
#forderLazy: opt=-1, took 0.000s
str(o)
# int [1:2] 2 1
o = forderv(d2, "id", retGrp=TRUE) # has to compute groups
#forderLazy: index found but no retGrp: __id
#forderLazy: setting index (retGrp=1) on DT: __id
#forderLazy: opt=-1, took 0.000s
str(o)
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
o = forderv(d2, "id", retGrp=FALSE) # don't have to compute
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s
str(o)
# int [1:2] 2 1
o = forderv(d2, "id", retGrp=TRUE) # groups info still there
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s
str(o)
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1

## other functions which uses forder can benefit as well
DT = data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
uniqueN(DT)
#forderLazy: setting index (retGrp=1) on DT: __A__B__C
#forderLazy: opt=-1, took 0.000s
#[1] 10
str(attr(attr(DT,"index"),"__A__B__C"))
# int [1:12] 1 3 2 4 5 6 7 8 9 11 ...
# - attr(*, "starts")= int [1:10] 1 3 4 5 6 7 8 9 10 11
# - attr(*, "maxgrpn")= int 2
uniqueN(DT)
#forderLazy: using existing index: __A__B__C
#forderLazy: opt=2, took 0.000s
#[1] 10

It is disabled by default because of the issue caused by #507. forderv writes an attribute by reference, when invoked for example via uniqueN by group, it writes the index attribute to the first group .SD, the same index attribute is later re-used for further groups, making results incorrect:

# test 1475.02
options(datatable.verbose=TRUE, datatable.forder.auto.index=TRUE)
DT = data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
DT[, .(uN=uniqueN(.SD)), by=A]
#forder: setting index (retGrp=1) on DT: __B__C
#forder: opt=-1, took 0.000s
#forder: using existing index: __B__C
#forder: opt=2, took 0.000s
#forder: using existing index: __B__C
#forder: opt=2, took 0.000s

If we will manage to detect that a call has been invoked from j+by group, then we can set lazy=FALSE to escape all optimizations. At the same time we could setDTthreads to 1 in such case #4294.

codecov · 2020-04-18T20:51:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.50%. Comparing base (faaae13) to head (14e9168).

❗ Current head 14e9168 differs from pull request most recent head 0f15775. Consider uploading reports for the commit 0f15775 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4386      +/-   ##
==========================================
+ Coverage   97.47%   97.50%   +0.02%     
==========================================
  Files          80       80              
  Lines       14873    15030     +157     
==========================================
+ Hits        14498    14655     +157     
  Misses        375      375

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/forder.c

forder C set index directly smart opt for index retGrp=T/F no tests updated yet

jangorecki · 2020-04-20T22:29:22Z

R/setkey.R

  if (length(o)) {
+    setattr(x,"index",NULL)   # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. Only when order changes.


there is a related comment next to test 1419.62

# checks also that the prior index a is dropped (because y is keyed with no index)

but we should not drop that index, as long as order does not change then index is still valid, and it can even be useful now, because it can store groups info.

inst/tests/tests.Rraw

src/data.table.h

jangorecki · 2024-07-21T14:13:10Z

added NEWS entry.

MichaelChirico · 2024-07-27T21:30:23Z

@jangorecki I was trying to write an atime example, I saw opt=-1 here, am I missing something?

DT = data.table(a = sample(letters, 1e8, TRUE), key='a')
options(datatable.forder.reuse.sorting=TRUE, datatable.verbose=TRUE)
key(DT)
# [1] "a"
invisible(uniqueN(DT))
# forder.c received 100000000 rows and 1 columns
# forderReuseSorting: opt=-1, took 0.407s

setindex(DT, a)
invisible(uniqueN(DT))
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s

Why doesn't the setkey() approach lead to the performance improvement?

jangorecki · 2024-07-28T10:07:30Z

Most likely it does not do retGrp=T which is needed by uniqueN.

MichaelChirico · 2024-07-28T20:22:11Z

Most likely it does not do retGrp=T which is needed by uniqueN.

Got it. It does leave us in an awkward place where now setindex() can have substantially better performance than setkey(). We should think through how to avoid this, but I'm happy to defer that to a follow-up -- would #2879 suffice, or should we start a new issue for that?

This is really unfortunate:

DT=data.table(a=sample(letters))
setindex(DT, a)
options(datatable.verbose=TRUE)
uniqueN(DT)
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s
# [1] 26
setkey(DT, a)
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s
# forder took 0.000 sec
# reorder took 0.002s elapsed (0.003s cpu)
uniqueN(DT)
# forder.c received 26 rows and 1 columns
# forderReuseSorting: opt=-1, took 0.000s
# [1] 26

Can setkey retain grouping when it's already been computed, at least?

jangorecki · 2024-07-28T20:35:36Z

It can but then carry performance penalties more broadly that initially planned change. Why not follow up PR?

MichaelChirico · 2024-07-28T20:49:04Z

It can but then carry performance penalties more broadly that initially planned change. Why not follow up PR?

Filed #6318 and #6319 since I see at least two things to address in follow-up.

MichaelChirico · 2024-07-28T21:31:39Z

@Anirban166 this is a good candidate for a performance regression test, would you like to take a stab at writing one? I don't think Jan will get to it quickly :)

Added as a follow-up task: #6320

MichaelChirico · 2024-07-28T21:31:59Z

Thanks @jangorecki !!! 🚀

lazy forder

b0efcf5

jangorecki added the WIP label Apr 18, 2020

jangorecki added 3 commits April 18, 2020 21:05

fix tests

aaead65

fix tests

b0858df

respect use.index option

9df2668

jangorecki added 2 commits April 18, 2020 21:55

bmerge timings

62982ed

codecov

eec4971

jangorecki removed the WIP label Apr 18, 2020

jangorecki commented Apr 18, 2020

View reviewed changes

src/forder.c Outdated Show resolved Hide resolved

jangorecki commented Apr 18, 2020

View reviewed changes

src/forder.c Outdated Show resolved Hide resolved

jangorecki commented Apr 18, 2020

View reviewed changes

src/forder.c Outdated Show resolved Hide resolved

jangorecki added a commit that referenced this pull request Apr 18, 2020

reduce conflicts to PR #4386

3004748

jangorecki mentioned this pull request Apr 19, 2020

order could utilize existing index #3023

Open

helper function

9affbfa

jangorecki mentioned this pull request Apr 19, 2020

unique can be optimized on keyed data.tables #2947

Open

jangorecki added 3 commits April 20, 2020 10:37

reduce diff to master

b36136a

rename fix

3442b78

setindex writes groups (retGrp=TRUE)

dfe883e

forder C set index directly smart opt for index retGrp=T/F no tests updated yet

jangorecki added the WIP label Apr 20, 2020

calc order, not groups

efb3319

jangorecki linked an issue Apr 20, 2020 that may be closed by this pull request

computing index could find groups as well #4387

Closed

jangorecki added 2 commits April 20, 2020 21:42

expect to reach optimization

68b378d

skip opt for list

0bd0e1e

jangorecki mentioned this pull request Apr 20, 2020

merge could utilize i index #4380

Closed

jangorecki added 3 commits April 20, 2020 23:10

override retGrp=F by retGrp=T is legit use

71f5aeb

more backward compatiblility, no retGrp from getindex

de916aa

more verbose messages during opts in forderLazy

0a3da1d

jangorecki commented Apr 20, 2020

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

This was referenced Jul 15, 2024

Use helper to achieve identical(head(key(x), length(y)), y) #6261

Open

Centralize handling of index retrieval #6262

Open

separate index attributes from index #4467

Open

spurious whitespace change

e8b9dcd

jangorecki commented Jul 21, 2024

View reviewed changes

src/data.table.h Show resolved Hide resolved

NEWS entry for lazy forder

a48ff5e

MichaelChirico added 6 commits July 21, 2024 10:03

tidy up NEWS

59f5f21

PROTECT() on key attribute

32c6305

Merge branch 'master' into lazy-forder

050e048

rename arg/option 'lazy' -> 'reuseSorting'

02ff718

MaybeSorted->ReuseSorting

320f496

other lazy= usage

799b4a9

This was referenced Jul 28, 2024

Indexed data can outperform keyed data when cached sort is uses #6318

Open

setkey() should retain retGrp info when re-using that info from setindex() #6319

Open

MichaelChirico mentioned this pull request Jul 28, 2024

Add an atime performance regression test for forder caching #6320

Open

Merge branch 'master' into lazy-forder

ffe431f

MichaelChirico merged commit 1a84514 into master Jul 28, 2024
4 of 5 checks passed

MichaelChirico deleted the lazy-forder branch July 28, 2024 21:32

This was referenced Jul 29, 2024

Create rchk GHA #6263

Merged

PROTECT() more in forder+bmerge #6324

Merged

tdhock mentioned this pull request Aug 3, 2024

revdeps corporaexplorer/polle fail after forder re-use key #6349

Closed

MichaelChirico mentioned this pull request Aug 4, 2024

Exact test of prepare_data() broken by latest data.table kgjerde/corporaexplorer#32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

jangorecki commented Apr 18, 2020 •

edited

Loading

codecov bot commented Apr 18, 2020 •

edited

Loading

jangorecki Apr 20, 2020 •

edited

Loading

jangorecki commented Jul 21, 2024

MichaelChirico commented Jul 27, 2024

jangorecki commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024 •

edited

Loading

jangorecki commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

		if (length(o)) {
		setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. Only when order changes.

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Conversation

jangorecki commented Apr 18, 2020 • edited Loading

codecov bot commented Apr 18, 2020 • edited Loading

Codecov Report

jangorecki Apr 20, 2020 • edited Loading

Choose a reason for hiding this comment

jangorecki commented Jul 21, 2024

MichaelChirico commented Jul 27, 2024

jangorecki commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024 • edited Loading

jangorecki commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

MichaelChirico commented Jul 28, 2024

jangorecki commented Apr 18, 2020 •

edited

Loading

codecov bot commented Apr 18, 2020 •

edited

Loading

jangorecki Apr 20, 2020 •

edited

Loading

MichaelChirico commented Jul 28, 2024 •

edited

Loading