Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged
merged 62 commits into from
Jul 28, 2024

Conversation

jangorecki
Copy link
Member

@jangorecki jangorecki commented Apr 18, 2020

Closes #4380
Closes #4387
Partially address #2947 for index, not for keys. I don't think we can do much better there so I would also close that one.
Towards #2879
Resolves #1321 (a test should be added)


Examples below.
Note that output below is missing newly added attributes populated by new option retStats=TRUE. I am not adding them below to make output easier to read. retStats=TRUE will just return more attributes (all comes for free, unlike retGrp=T): any-not-ascii, any-not-utf8, any-na, any-inf-nan.

## either of
# cc()
# library(data.table); forderv = data.table:::forderv
  • setindex[v] will now compute groups (retGrp=T) as well, and store it together with index attribute. Note that it only applies to setindex interface, any internal index computation will use forder and will not compute groups unnecessarily (explained in computing index could find groups as well #4387).
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
d2 = data.table(id=2:1, v2=1:2)
d2[id==1L]
attr(attr(d2, "index"), "__id")
# int [1:2] 2 1
  • Low level C short circuit to re-use existing keys in forder (allow re-use of index during C bmerge which was calling C forder directly merge could utilize i index #4380). Verbose codes: key opt=1, indices opt=2.
d1 = data.table(id=1:2, v1=1:2)
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
d1[d2, on="id", verbose=TRUE]
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s
  • Newly computed index could be now set as DT index attribute directly from C, next to where it is being computed. It means that forderv could already set the index. It is disabled by default, can be enabled by option datatable.forder.auto.index=TRUE. This option is only for debugging/development, not going to be exported. Ultimately current datatable.auto.index option will control that.
options(datatable.verbose=TRUE, datatable.forder.auto.index=TRUE)
d2 = data.table(id=2:1, v2=1:2)
o = forderv(d2, "id")
#forderLazy: setting index (retGrp=0) on DT: __id
#forderLazy: opt=-1, took 0.000s
str(o)
# int [1:2] 2 1
o = forderv(d2, "id", retGrp=TRUE) # has to compute groups
#forderLazy: index found but no retGrp: __id
#forderLazy: setting index (retGrp=1) on DT: __id
#forderLazy: opt=-1, took 0.000s
str(o)
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
o = forderv(d2, "id", retGrp=FALSE) # don't have to compute
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s
str(o)
# int [1:2] 2 1
o = forderv(d2, "id", retGrp=TRUE) # groups info still there
#forderLazy: using existing index: __id
#forderLazy: opt=2, took 0.000s
str(o)
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1

## other functions which uses forder can benefit as well
DT = data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
uniqueN(DT)
#forderLazy: setting index (retGrp=1) on DT: __A__B__C
#forderLazy: opt=-1, took 0.000s
#[1] 10
str(attr(attr(DT,"index"),"__A__B__C"))
# int [1:12] 1 3 2 4 5 6 7 8 9 11 ...
# - attr(*, "starts")= int [1:10] 1 3 4 5 6 7 8 9 10 11
# - attr(*, "maxgrpn")= int 2
uniqueN(DT)
#forderLazy: using existing index: __A__B__C
#forderLazy: opt=2, took 0.000s
#[1] 10

It is disabled by default because of the issue caused by #507. forderv writes an attribute by reference, when invoked for example via uniqueN by group, it writes the index attribute to the first group .SD, the same index attribute is later re-used for further groups, making results incorrect:

# test 1475.02
options(datatable.verbose=TRUE, datatable.forder.auto.index=TRUE)
DT = data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
DT[, .(uN=uniqueN(.SD)), by=A]
#forder: setting index (retGrp=1) on DT: __B__C
#forder: opt=-1, took 0.000s
#forder: using existing index: __B__C
#forder: opt=2, took 0.000s
#forder: using existing index: __B__C
#forder: opt=2, took 0.000s

If we will manage to detect that a call has been invoked from j+by group, then we can set lazy=FALSE to escape all optimizations. At the same time we could setDTthreads to 1 in such case #4294.

@jangorecki jangorecki added the WIP label Apr 18, 2020
@codecov
Copy link

codecov bot commented Apr 18, 2020

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.50%. Comparing base (faaae13) to head (14e9168).

❗ Current head 14e9168 differs from pull request most recent head 0f15775. Consider uploading reports for the commit 0f15775 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4386      +/-   ##
==========================================
+ Coverage   97.47%   97.50%   +0.02%     
==========================================
  Files          80       80              
  Lines       14873    15030     +157     
==========================================
+ Hits        14498    14655     +157     
  Misses        375      375              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jangorecki jangorecki removed the WIP label Apr 18, 2020
src/forder.c Outdated Show resolved Hide resolved
src/forder.c Outdated Show resolved Hide resolved
src/forder.c Outdated Show resolved Hide resolved
jangorecki added a commit that referenced this pull request Apr 18, 2020
forder C set index directly
smart opt for index retGrp=T/F
no tests updated yet
@jangorecki jangorecki added the WIP label Apr 20, 2020
@jangorecki jangorecki linked an issue Apr 20, 2020 that may be closed by this pull request
if (length(o)) {
setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. Only when order changes.
Copy link
Member Author

@jangorecki jangorecki Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a related comment next to test 1419.62

# checks also that the prior index a is dropped (because y is keyed with no index)

but we should not drop that index, as long as order does not change then index is still valid, and it can even be useful now, because it can store groups info.

inst/tests/tests.Rraw Outdated Show resolved Hide resolved
@jangorecki
Copy link
Member Author

added NEWS entry.

@MichaelChirico
Copy link
Member

@jangorecki I was trying to write an atime example, I saw opt=-1 here, am I missing something?

DT = data.table(a = sample(letters, 1e8, TRUE), key='a')
options(datatable.forder.reuse.sorting=TRUE, datatable.verbose=TRUE)
key(DT)
# [1] "a"
invisible(uniqueN(DT))
# forder.c received 100000000 rows and 1 columns
# forderReuseSorting: opt=-1, took 0.407s

setindex(DT, a)
invisible(uniqueN(DT))
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s

Why doesn't the setkey() approach lead to the performance improvement?

@jangorecki
Copy link
Member Author

Most likely it does not do retGrp=T which is needed by uniqueN.

@MichaelChirico
Copy link
Member

MichaelChirico commented Jul 28, 2024

Most likely it does not do retGrp=T which is needed by uniqueN.

Got it. It does leave us in an awkward place where now setindex() can have substantially better performance than setkey(). We should think through how to avoid this, but I'm happy to defer that to a follow-up -- would #2879 suffice, or should we start a new issue for that?

This is really unfortunate:

DT=data.table(a=sample(letters))
setindex(DT, a)
options(datatable.verbose=TRUE)
uniqueN(DT)
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s
# [1] 26
setkey(DT, a)
# forderReuseSorting: using existing index: __a
# forderReuseSorting: opt=2, took 0.000s
# forder took 0.000 sec
# reorder took 0.002s elapsed (0.003s cpu)
uniqueN(DT)
# forder.c received 26 rows and 1 columns
# forderReuseSorting: opt=-1, took 0.000s
# [1] 26

Can setkey retain grouping when it's already been computed, at least?

@jangorecki
Copy link
Member Author

It can but then carry performance penalties more broadly that initially planned change. Why not follow up PR?

@MichaelChirico
Copy link
Member

It can but then carry performance penalties more broadly that initially planned change. Why not follow up PR?

Filed #6318 and #6319 since I see at least two things to address in follow-up.

@MichaelChirico
Copy link
Member

@Anirban166 this is a good candidate for a performance regression test, would you like to take a stab at writing one? I don't think Jan will get to it quickly :)

Added as a follow-up task: #6320

@MichaelChirico
Copy link
Member

Thanks @jangorecki !!! 🚀

@MichaelChirico MichaelChirico merged commit 1a84514 into master Jul 28, 2024
4 of 5 checks passed
@MichaelChirico MichaelChirico deleted the lazy-forder branch July 28, 2024 21:32
This was referenced Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants