check for index in setkey #3582

saraswatmks · 2019-05-22T19:42:39Z

The idea is the following:

Check in setkeyv if the data.table already has an index
If yes, use that index and don't run setindex inside it again.
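
A minimal sketch of the scenario this PR targets, using only exported data.table functions. The column names and values are made up for illustration, and whether a verbose message about index reuse is printed depends on the installed data.table version.

```r
library(data.table)

# Illustrative data; `a` is deliberately unsorted.
DT = data.table(a = c(2, 3, 2, 1, 3, 2, 1), b = 1:7)

setindex(DT, a)   # stores an ordering for `a`; the rows are not moved
indices(DT)       # "a"
key(DT)           # NULL, no key has been set yet

# setkey() physically reorders the rows. With an index on `a` already
# present, the ordering does not need to be computed a second time.
setkey(DT, a, verbose = TRUE)
key(DT)           # "a"
DT$a              # 1 1 2 2 2 3 3
```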

mattdowle · 2019-05-22T20:08:45Z

@saraswatmks Excellent! Have invited you to be project member (you need to accept the notification) so next time you can create a branch in the main project.

mattdowle · 2019-05-22T20:26:20Z

I don't see cols used in the branch when x has some indices existing. Doesn't it need to see if the requested key (i.e. cols) exists as an index? If it does it can use that ordering to change the physical order, otherwise (i.e. if no index exists for cols) it still needs to call forder.

saraswatmks · 2019-05-22T20:59:11Z

@mattdowle thanks for your quick feedback. I've made the change. Now we check:

If the given cols exist as an index. If they do, we use that existing order.
If the given cols don't exist as an index, we run forder as usual.
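
The two branches above can be sketched as a standalone function. This is not the code in R/setkey.R: `order_for_key` is a hypothetical helper, base `order()` stands in for the internal forder, and reading the `index` attribute directly assumes the storage layout discussed in this thread (index names prefixed with `__`, and a zero-length vector meaning the rows are already in that order).

```r
library(data.table)

# Hypothetical helper, for illustration only.
order_for_key = function(x, cols) {
  newkey = paste0(cols, collapse = "__")
  if (!any(indices(x) == newkey)) {
    # No matching index: compute the ordering from scratch.
    return(do.call(order, lapply(cols, function(cn) x[[cn]])))
  }
  # Matching index: reuse the stored ordering.
  o = attr(attr(x, "index"), paste0("__", newkey), exact = TRUE)
  if (length(o)) as.integer(o) else seq_len(nrow(x))
}

DT = data.table(a = c(2, 3, 2, 1, 3, 2, 1), b = 1:7)
setindex(DT, a)
order_for_key(DT, "a")   # taken from the existing index
order_for_key(DT, "b")   # no index on b, so computed with order()
```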

I am on macOS. No matter what I do, I can't run the tests locally. I need to push every time and wait for the pipeline to throw an error. Could you point me to a link so that I can test locally before pushing? I have already tried this and it hasn't helped.

codecov · 2019-05-22T21:25:02Z

Codecov Report

Merging #3582 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
+ Coverage   97.79%   97.79%   +<.01%     
==========================================
  Files          66       66              
  Lines       12904    12909       +5     
==========================================
+ Hits        12620    12625       +5     
  Misses        284      284

Impacted Files	Coverage Δ
R/setkey.R	`98.34% <100%> (+0.02%)`	⬆️
src/fwrite.c	`97.64% <0%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fa6297c...bf59b63.

MichaelChirico · 2019-05-23T00:03:21Z

hi @saraswatmks could you share more details about what's not working? especially your locale info, that seems to be the most common source of headaches

mattdowle · 2019-05-23T02:20:02Z

@saraswatmks
https://github.com/Rdatatable/data.table/wiki/Installation#openmp-enabled-compiler-for-mac
https://github.com/Rdatatable/data.table/wiki/Contributing#minimal-first-time-pr
If those don't work, then as Michael said, please post full output here showing error messages.

mattdowle · 2019-05-23T02:24:08Z

R/setkey.R

+
+  # get existing index name if any
+  found_index <- NULL
+  if(is.null(indices(x))){


Should be !is.null here? iiuc that's why line 107 isn't covered (see codecov results in the conversation tab).

sorry, my bad

jangorecki · 2019-05-23T03:25:36Z

@saraswatmks at which point R CMD build data.table --no-build-vignettes and R CMD check data.table_1.12.3.tar.gz --ignore-vignettes are not working?

saraswatmks · 2019-05-23T13:22:40Z

@MichaelChirico I get this error when I run R CMD build .

* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘data.table’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
      -----------------------------------
* installing *source* package ‘data.table’ ...
** libs
/usr/local/opt/llvm/bin/clang -fopenmp -I/anaconda3/lib/R/include -DNDEBUG   -I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include   -fPIC  -g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe -c assign.c -o assign.o
In file included from assign.c:1:
In file included from ./data.table.h:1:
/anaconda3/lib/R/include/R.h:55:11: fatal error: 'stdlib.h' file not found
# include <stdlib.h> /* Not used by R itself, but widely assumed in packages */
          ^~~~~~~~~~
1 error generated.
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/private/var/folders/6l/7tvz3rz510n4gp2gsrpq6mxr0000gp/T/RtmpMitKB2/Rinstfc6676e4c658/data.table’
      -----------------------------------
ERROR: package installation failed

saraswatmks · 2019-05-23T13:24:38Z

@jangorecki Thanks for helping. When I run R CMD check data.table_1.12.3.tar.gz --ignore-vignettes I get

* using log directory ‘/Users/manish/Documents/open-source-projects/data.table.Rcheck’
* using R version 3.4.2 (2017-09-28)
* using platform: x86_64-apple-darwin13.4.0 (64-bit)
* using session charset: UTF-8
* using option ‘--ignore-vignettes’
* checking for file ‘data.table/DESCRIPTION’ ... OK
* this is package ‘data.table’ version ‘1.12.3’
* checking package namespace information ... OK
* checking package dependencies ...Warning: unable to access index for repository https://CRAN.R-project.org/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/bioc/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/data/annotation/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/data/experiment/src/contrib:
  internet routines cannot be loaded
 ERROR
Packages suggested but not available: ‘R.utils’ ‘nanotime’

saraswatmks · 2019-05-23T13:25:20Z

Here's my sessionInfo

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        rstudioapi_0.8    magrittr_1.5      usethis_1.4.0     devtools_2.0.1   
 [6] pkgload_1.0.2     R6_2.3.0          rlang_0.3.0.1     tools_3.5.1       pkgbuild_1.0.2   
[11] data.table_1.11.8 sessioninfo_1.1.1 cli_1.0.1         withr_2.1.2       remotes_2.0.2    
[16] yaml_2.2.0        assertthat_0.2.0  digest_0.6.18     rprojroot_1.3-2   crayon_1.3.4     
[21] processx_3.2.0    callr_3.0.0       base64enc_0.1-3   fs_1.2.6          ps_1.2.1         
[26] testthat_2.0.1    glue_1.3.0        memoise_1.1.0     compiler_3.5.1    desc_1.2.0       
[31] backports_1.1.2   prettyunits_1.0.2

jangorecki · 2019-05-23T13:36:31Z

@saraswatmks install those two missing packages, or use _R_CHECK_FORCE_SUGGESTS_=false env var before running package check.
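
For reference, the environment-variable route can also be taken from inside an R session. This is only a sketch: the tarball name is the one quoted earlier in this thread, and the check call is left commented out because it needs that file to exist.

```r
# Tell R CMD check not to insist on suggested packages being installed.
Sys.setenv("_R_CHECK_FORCE_SUGGESTS_" = "false")
Sys.getenv("_R_CHECK_FORCE_SUGGESTS_")   # "false"

# Child processes started from this session inherit the variable:
# system2("R", c("CMD", "check", "data.table_1.12.3.tar.gz", "--ignore-vignettes"))
```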

saraswatmks · 2019-05-23T14:41:37Z

> @saraswatmks install those two missing packages or use _R_CHECK_FORCE_SUGGESTS_=false env var before running package check.

@jangorecki Just to make sure I am doing this the right way, what is the common way to do this? For example, I am doing the following:

Go to the terminal. Create a conda env and install all these dependencies using conda install
Run commands from the terminal.

Am I following the right way? Or is there something in RStudio which can help me skip these steps?

Also, after I installed both dependencies and ran R CMD build ., I get a new error:

* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘data.table’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
Error in loadVignetteBuilder(pkgdir, TRUE) :
  vignette builder 'knitr' not found
Execution halted

jangorecki · 2019-05-23T14:50:09Z

skip conda
use plain R
(only once) install all data.table suggested deps or (once per session) use env var to skip their tests
use existing Makefile, it makes life easier, then in console

cd data.table
make build && make check

MichaelChirico · 2019-05-23T14:52:11Z

to each his own but I cringe at the word conda. it has only caused me headaches. RStudio has build&reload and clean&reload functions which will help a lot.

I've been thinking it might be nice to have a short vignette or blog post or wiki page on "how to fix a data.table issue for a first time contributor" giving some tips and tricks on getting over the hump.

you might want to check the file cc.R which has a lot of (densely packed) accumulated knowledge on debugging & package development

MichaelChirico · 2019-05-23T14:54:07Z

PS so it doesn't get lost in the trees -- thanks for the PR! we really appreciate all community efforts at fixing&improving the package & hope the experience is more illuminating than frustrating 😁

mattdowle · 2019-05-23T18:13:36Z

R/setkey.R

+  }
+
+  # forder only if index is not present
+  if(!identical(found_index, cols)){


This works if there is one index, as the test tests. But there can be multiple indexes (each index is a set of columns). It needs to find if any of the indexes match cols and then use that one. The test needs expanding for cases of multiple indexes and cols both existing and not existing when there are multiple (so it tests it picks up the correct one).
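
To make the multiple-index case concrete, here is a small sketch with exported functions only. `has_index` is a hypothetical helper and the data is made up; the point is that each index is a set of columns and the requested cols must match one of them exactly.

```r
library(data.table)

DT = data.table(a = c(3, 1, 2), aaa = c(2, 2, 1), bbb = c(1, 3, 2))
setindex(DT, a)
setindex(DT, aaa)
setindex(DT, a, aaa)
indices(DT)   # one entry per index; multi-column names are joined by "__"

# Hypothetical helper: does any existing index match `cols` exactly?
has_index = function(x, cols) any(indices(x) == paste0(cols, collapse = "__"))

has_index(DT, "aaa")           # TRUE
has_index(DT, c("a", "aaa"))   # TRUE
has_index(DT, "bbb")           # FALSE
has_index(DT, c("aaa", "a"))   # FALSE, column order matters
```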

mattdowle · 2019-05-23T18:15:51Z

R/setkey.R

  } else {
-    o <- forderv(x, cols, sort=TRUE, retGrp=FALSE)
+      cat("using existing index for", found_index, "\n")


Needs an if (verbose) please otherwise this is always being printed.
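
A sketch of the requested guard, written as a standalone function so it can be run outside the package. `report_index_use` is a hypothetical name; only the message text is taken from the diff.

```r
# Hypothetical stand-in for the branch under review.
report_index_use = function(found_index,
                            verbose = getOption("datatable.verbose", FALSE)) {
  # Print only when the caller asked for verbose output.
  if (verbose) cat("using existing index for", found_index, "\n")
  invisible(found_index)
}

report_index_use("a", verbose = FALSE)   # prints nothing
report_index_use("a", verbose = TRUE)    # prints the message
```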

mattdowle · 2019-05-23T18:24:15Z

R/setkey.R

-    cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")
+
+  # get existing index name if any
+  found_index <- NULL


We prefer = in data.table please. I don't agree with the common and widely marketed advice to use <-. Single = is the same as C and many other languages. I sometimes hear, e.g. from Python folk, that R code using <- looks "old"/"not modern" and I see their point. I've always used =. I save <- for use when passing function arguments to do an assign and pass in one go; e.g. write(DT, file=f<-tempfile()); ... do something with f ... . I'm often swapping between C and R (which I think more people should do too since C is not as hard as some people want you to believe). When doing this, using = to mean assign and == to mean equals in two languages consistently (R and C) is nice. And by the way, the people who laugh at R for using L postfix to mean integer ... it comes from C (it's the same as C) and it makes a lot of sense.
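
The assign-and-pass idiom mentioned above, as a short base-R sketch. The file contents are arbitrary; inside a call, `con = f = tempfile()` would not parse as intended, which is why `<-` is kept for this one case.

```r
x = c(3, 1, 2)   # ordinary assignment with =

# `f <- tempfile()` assigns f and passes the same value as the `con` argument.
writeLines("hello", con = f <- tempfile())
txt = readLines(f)   # f is still available after the call
unlink(f)
txt                  # "hello"
```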

thanks for detailed insight. made the change.

saraswatmks · 2019-05-23T19:04:00Z

@MichaelChirico Thanks to you guys for your kindness and patience. I've been wanting to contribute for a long time.
What else needs to be done in this PR ?

mattdowle · 2019-05-24T02:30:48Z

@saraswatmks Thanks for contributing, and welcome!

> What else needs to be done in this PR ?

I made some comments above to be addressed please.

mattdowle

Looks good. Thanks for making the previous changes.
More requests below. I hope they make sense and I tried to explain the rationale behind them for future reference. Almost there! I wrote quite a bit intending a series of comments in order, but they seem to have appeared in a different order after I submitted: just read them all quickly first before getting bogged down into any particular one.
Also, please add a news item, and please add yourself to the bottom of contributors list in DESCRIPTION. Your name will then appear on CRAN contributor field on the next update.

mattdowle · 2019-05-25T02:37:12Z

inst/tests/tests.Rraw

+                 aaa = c(1,1,2,2,2,1,1,2,2,2))
+setindex(DT, a)
+test(1419.60, allIndicesValid(DT), TRUE)
+test(1419.61, setkey(DT, a, verbose=TRUE), output="using existing index for a")


This check on output= is necessary and good but not quite sufficient. It's possible that setkey() prints this message but then doesn't set the key properly. So there needs to be another test just after this one that makes sure the key has been changed. Unfortunately the test data chosen doesn't make this a good test because the input data is trivially sorted already. So my first thought is to change the input data so it's random e.g. DT = data.table(a=c(2,3,2,1,3,2,1), aaa=...) then add test 1419.62 to check that DT$a is c(1,1,2,2,2,3,3) afterwards. Or something like that. Then it is checking both that i) the setkey() has changed the physical order and ii) the output= also checks that it changed the order for the correct reason (using the existing index).
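
A sketch of the result check described above, using the suggested unsorted input. It uses plain `stopifnot()` rather than the `test()` harness from tests.Rraw, which is internal to the package; the `aaa` values are made up so that the row movement is visible.

```r
library(data.table)

DT = data.table(a = c(2, 3, 2, 1, 3, 2, 1), aaa = 1:7)
setindex(DT, a)
setkey(DT, a)

# i) the key is recorded, and ii) the physical row order really changed.
stopifnot(identical(key(DT), "a"))
stopifnot(identical(DT$a, c(1, 1, 2, 2, 2, 3, 3)))
# The other column moves with its rows; ties keep their original order.
stopifnot(identical(DT$aaa, c(4L, 7L, 1L, 3L, 6L, 2L, 5L)))
```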

mattdowle · 2019-05-25T02:47:56Z

R/setkey.R

+  # get existing index name if any
+  found_index = NULL
+  if(!is.null(indices(x))) found_index <- names(attributes(attributes(x)$index))
+  new_possible_index = paste0("__", cols, collapse="")


These 3 lines would be simpler as just one line :
index = paste0(cols, collapse="__")
See the next 3 comments in combination below ...

mattdowle · 2019-05-25T02:51:37Z

R/setkey.R

+  new_possible_index = paste0("__", cols, collapse="")
+
+  # forder only if index is not present
+  if(!any(new_possible_index == found_index)){


Then this if() would be :
if (!any(index == indices(x)))
If x doesn't have any indices, indices(x) returns NULL and any("anything"==NULL) is FALSE in R (which is nice).
The idea is just to save repeating variable names and use fewer lines of code so we have less to maintain in future. Not too few lines to the extent of making it hard to understand. But in this case, what I'm suggesting is simpler and worth doing and easier to read and check.
Using indices() is cleaner than fetching the attributes() directly, because we have an isolating interface. If we ever change the attribute structure or names in future, we only need to change the code inside indices() not try and find everywhere that reads the attributes directly.
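
The NULL comparison behaviour relied on here can be checked in base R. The strings are placeholders.

```r
cmp = "anything" == NULL
length(cmp)                      # 0, comparing with NULL gives a zero-length logical
any(cmp)                         # FALSE, any() of nothing is FALSE
!any("a" == NULL)                # TRUE, so the forder branch runs when there are no indices
any("a" == c("a", "a__b"))       # TRUE when one of the index names matches
```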

mattdowle · 2019-05-25T02:54:10Z

R/setkey.R

  } else {
-    o = forderv(x, cols, sort=TRUE, retGrp=FALSE)
+      # find the matching index
+      ix =  found_index[which(found_index == new_possible_index)]


Then this line can be removed.

mattdowle · 2019-05-25T02:55:00Z

R/setkey.R

+          cat("using existing index for", gsub("^__","", ix), "\n")
+          o <- attr(attributes(x)$index, which=ix, exact = TRUE)
+      } else {
+          o <- attr(attributes(x)$index, which=ix, exact = TRUE)


And these two which=ix become which=index

@mattdowle I've made the changes. Correct me if I am wrong, I think we need ix here because, let's take a scenario where a user set multiple indexes. So when we do indices(DT), we get a, b, __a__b, if the user does setkey(DT, a), setkey(DT, b), setkey(DT, a, b) for a table DT = data.table(a = c(...), b = c(...), c = c(...)) . In this case, we need to find the ix index of which index is matched.

Great - thanks again. Almost there!
Look at line 104 though. I commented before that line 104 could be removed. Line 104 is, in the end, just renaming index as idx, iiuc. This which= is looking up the item by name anyway. So line 104 can be removed and then replace this which=idx with which=index.
Once that's done, perhaps rename the local variable index to thiskey or newkey? That way it reads a little better and avoids the same name "index" as the attribute name. For example, line 94 will probably read a little better as if (!any(indices(x)==newkey)).

Done. Thanks for your patience.

mattdowle · 2019-05-25T03:02:09Z

inst/tests/tests.Rraw

+setindex(DT, a)
+setindex(DT, aaa)
+test(1419.62, allIndicesValid(DT), TRUE)
+test(1419.63, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")


Same request here as above for 1419.61. This test needs to check the result of this setkey() command has done the correct thing (e.g. changed the physical row order properly) as well as doing that for the right reason (output=).
Since setkey() returns DT invisibly, the easiest way is just to pass an appropriate y=data.table(...) into this test(). See other tests() in this file for examples for single calls to test() which use x, y and output= too.

mattdowle

Thanks for previous. Still the results need checking in tests. It's ok - I'll do them. Since setkey() is a core part of data.table we have to ensure any changes to it are nailed down.

mattdowle · 2019-05-28T19:29:18Z

inst/tests/tests.Rraw

+test(1419.62, setkey(DT, a, verbose=TRUE), output="using existing index for a")
+
+# check setkey incase of existing multiple indexes
+DT <- data.table(a = c(3,3,4,4,5,6,1,1,7,2,2),


a has 11 items here but aaa and bbb have 10, so there's a warning when test.data.table() runs in dev and when users run test.data.table(). It's a lot easier and faster to run locally first before pushing. See the cc.R script in the root of the project: that's what I use in dev.

mattdowle · 2019-05-28T19:33:50Z

inst/tests/tests.Rraw

+setindex(DT, a)
+setindex(DT, aaa)
+test(1419.65, allIndicesValid(DT), TRUE)
+test(1419.66, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")


This doesn't check the result of setkey is correct. Please see previous comments where that was requested : #3582 (comment) and #3582 (comment)

thanks for chipping in. I wasn't fully aware of test function parameters. Now I understand it better. Also, I will try to fix my local environment (I still don't know how am I gonna do that) to avoid stupid errors in the PR.

mattdowle · 2019-05-28T20:08:51Z

I added the result checks to the tests. ~~That actually finds a problem in setindex() where it is incorrectly using an existing index.~~ That commit will fail on test 1419.66 and then I'll fix that next.
...
That turned out to be just a problem in this PR: not started the index with __ so although it was supposed to be picking up the index and using it, it was just fetching NULL and thinking the table was already sorted by that index. Testing the output was correct was important to catch this. I changed those two lines to one line that uses the existing getindex() function which knows how to fetch the index in one place with checks.
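
The naming issue can be reproduced by looking at the `index` attribute directly. This is a sketch of the failure mode only: it pokes at internal storage, which the review above advises against in package code, and the exact layout may differ between data.table versions.

```r
library(data.table)

DT = data.table(a = c(2, 3, 2, 1, 3, 2, 1))
setindex(DT, a)

idx = attr(DT, "index")
names(attributes(idx))             # stored names carry the "__" prefix
attr(idx, "a", exact = TRUE)       # NULL, the unprefixed name finds nothing
attr(idx, "__a", exact = TRUE)     # the stored ordering for `a`
```

A NULL ordering looks the same as "nothing to reorder", which is why only a check on the resulting row order exposed the problem.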

mattdowle · 2019-05-28T21:13:37Z

inst/tests/tests.Rraw

 options(datatable.auto.index = TRUE)
-test(1376.12, list(DT[a==2L], indices(DT)), list(DT[9L],"a"))
+test(1376.12, list(DT[a==2L], indices(DT)), list(DT[2L],"a"))


These 3 tests changes from 9L to 2L were strange that they were needed. Anyway, the latest commit leaves these tests unchanged which seems more correct.

Yep: those 3 tests should not have been changed from 9L in this PR. The setkey(DT,b) between tests 1376.07 and 1376.08 had broken under this PR and wasn't changing the row order properly.

jangorecki · 2023-12-09T14:17:30Z

This change will be superseded (possibly taking it all away during git merge) by #4386, which does the same optimization (among others) but at the C level rather than in R.

check for index in setkey

572b64f

mattdowle added this to the 1.12.4 milestone May 22, 2019

explicitly naming the parameters

c8ddc62

using cols to check existing index

bfefa71

mattdowle reviewed May 23, 2019

add test

0c64f51

adding test, fixing bug

d46358f

mattdowle reviewed May 23, 2019

MichaelChirico mentioned this pull request May 24, 2019

Purge <- usage when not strictly necessary #3590

Closed

check for multi-indexes

057938f

saraswatmks added 4 commits May 24, 2019 21:18

fix bug

7503aea

Merge branch 'master' into fix_2889

3bce910

use verbose param in test

3b0d3f5

Merge branch 'fix_2889' of github.com:saraswatmks/data.table into fix…

3986117

…_2889

mattdowle requested changes May 25, 2019

saraswatmks added 8 commits May 27, 2019 14:17

add more test, description, news, fixes as suggested

153ae6c

fix conflict, change from master

f2934a2

update news

1809782

fix issue

a6aa862

set index in test1419.62

9288121

fix test error

b500253

fix test sequence

230e5d1

fix testcase 1376.09, 1376.11, 1376.12

97d64a5

saraswatmks force-pushed the fix_2889 branch from 5670922 to 97d64a5 Compare May 27, 2019 21:17

jangorecki reviewed May 28, 2019

remove which index

70b5760

mattdowle reviewed May 28, 2019

added checks on y too to new setkey tests

fcb0c5b

mattdowle reviewed May 28, 2019

uses getindex(), 2-space indentation, 2L back to 9L in test 1376

bf59b63

mattdowle approved these changes May 28, 2019

mattdowle merged commit a62faf9 into Rdatatable:master May 28, 2019

mattdowle added a commit that referenced this pull request May 28, 2019

news item only (FR not bug, and link to issue not pr for consistency), …

b25d163

…#3582

adamsardar mentioned this pull request Dec 21, 2020

Move to using = rather than <- adamsardar/stoneTrees#27

Closed
