check for index in setkey #3582

Merged
merged 21 commits into from
May 28, 2019

Contributor

@saraswatmks saraswatmks commented May 22, 2019

Closes #2889

The idea is the following:

  1. Check in setkeyv whether the data.table already has an index.
  2. If it does, reuse that index instead of running forder inside setkeyv again.

@mattdowle mattdowle added this to the 1.12.4 milestone May 22, 2019
@mattdowle
Member

@saraswatmks Excellent! Have invited you to be project member (you need to accept the notification) so next time you can create a branch in the main project.

@mattdowle
Member

I don't see cols used in the branch when x has some indices existing. Doesn't it need to see if the requested key (i.e. cols) exists as an index? If it does it can use that ordering to change the physical order, otherwise (i.e. if no index exists for cols) it still needs to call forder.

@saraswatmks
Contributor Author

saraswatmks commented May 22, 2019

@mattdowle thanks for your quick feedback. I've made the change. Now we check:

  1. If the given cols exist as an index, we use that existing order.
  2. If the given cols don't exist as an index, we run forder as usual.
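
The two cases above can be sketched from the user's side as follows. This is only an illustration under assumptions: the tables `DT` and `DT_plain` and their values are made up, and the snippet shows the observable behaviour, not the internal change to setkeyv.

```r
library(data.table)

# Made-up data, deliberately not sorted by (a, b)
DT = data.table(a = c(2, 1, 2, 1), b = c(4, 3, 2, 1), v = 1:4)
cols = c("a", "b")

# Case 1: an index on cols exists, so setkeyv() has a stored ordering to reuse
setindexv(DT, cols)
idx_before = indices(DT)     # index names join the columns with "__"
setkeyv(DT, cols)

# Case 2: no index exists, so the ordering is computed as usual
DT_plain = data.table(a = c(2, 1, 2, 1), b = c(4, 3, 2, 1), v = 1:4)
setkeyv(DT_plain, cols)

# Either way the physical row order and the key must end up the same
stopifnot(identical(DT$v, DT_plain$v), identical(key(DT), key(DT_plain)))
```

Whichever branch is taken, the result has to be identical; only the work done to get there differs.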

I am on macOS. No matter what I do, I can't run the tests locally. I need to push every time and wait for the pipeline to report an error. Could you point me to a guide so that I can test locally before pushing? I have already tried this and it hasn't helped.

@codecov

codecov bot commented May 22, 2019

Codecov Report

Merging #3582 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
+ Coverage   97.79%   97.79%   +<.01%     
==========================================
  Files          66       66              
  Lines       12904    12909       +5     
==========================================
+ Hits        12620    12625       +5     
  Misses        284      284

Impacted Files   Coverage Δ
R/setkey.R       98.34% <100%> (+0.02%) ⬆️
src/fwrite.c     97.64% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fa6297c...bf59b63.

@MichaelChirico
Member

hi @saraswatmks could you share more details about what's not working? especially your locale info, that seems to be the most common source of headaches

@mattdowle
Member

mattdowle commented May 23, 2019

R/setkey.R Outdated

# get existing index name if any
found_index <- NULL
if(is.null(indices(x))){
Member

@mattdowle mattdowle May 23, 2019

Should be !is.null here? iiuc that's why line 107 isn't covered (see codecov results in the conversation tab).

Contributor Author

sorry, my bad

@jangorecki
Member

@saraswatmks at which point R CMD build data.table --no-build-vignettes and R CMD check data.table_1.12.3.tar.gz --ignore-vignettes are not working?

@saraswatmks
Contributor Author

saraswatmks commented May 23, 2019

@MichaelChirico I get this error when I run R CMD build .

* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘data.table’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
      -----------------------------------
* installing *source* package ‘data.table’ ...
** libs
/usr/local/opt/llvm/bin/clang -fopenmp -I/anaconda3/lib/R/include -DNDEBUG   -I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include   -fPIC  -g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe -c assign.c -o assign.o
In file included from assign.c:1:
In file included from ./data.table.h:1:
/anaconda3/lib/R/include/R.h:55:11: fatal error: 'stdlib.h' file not found
# include <stdlib.h> /* Not used by R itself, but widely assumed in packages */
          ^~~~~~~~~~
1 error generated.
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/private/var/folders/6l/7tvz3rz510n4gp2gsrpq6mxr0000gp/T/RtmpMitKB2/Rinstfc6676e4c658/data.table’
      -----------------------------------
ERROR: package installation failed

@saraswatmks
Contributor Author

@jangorecki Thanks for helping. When I run R CMD check data.table_1.12.3.tar.gz --ignore-vignettes I get

* using log directory ‘/Users/manish/Documents/open-source-projects/data.table.Rcheck’
* using R version 3.4.2 (2017-09-28)
* using platform: x86_64-apple-darwin13.4.0 (64-bit)
* using session charset: UTF-8
* using option ‘--ignore-vignettes’
* checking for file ‘data.table/DESCRIPTION’ ... OK
* this is package ‘data.table’ version ‘1.12.3’
* checking package namespace information ... OK
* checking package dependencies ...Warning: unable to access index for repository https://CRAN.R-project.org/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/bioc/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/data/annotation/src/contrib:
  internet routines cannot be loaded
Warning: unable to access index for repository https://bioconductor.org/packages/3.5/data/experiment/src/contrib:
  internet routines cannot be loaded
 ERROR
Packages suggested but not available: ‘R.utils’ ‘nanotime’

@saraswatmks
Contributor Author

Here's my sessionInfo

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        rstudioapi_0.8    magrittr_1.5      usethis_1.4.0     devtools_2.0.1   
 [6] pkgload_1.0.2     R6_2.3.0          rlang_0.3.0.1     tools_3.5.1       pkgbuild_1.0.2   
[11] data.table_1.11.8 sessioninfo_1.1.1 cli_1.0.1         withr_2.1.2       remotes_2.0.2    
[16] yaml_2.2.0        assertthat_0.2.0  digest_0.6.18     rprojroot_1.3-2   crayon_1.3.4     
[21] processx_3.2.0    callr_3.0.0       base64enc_0.1-3   fs_1.2.6          ps_1.2.1         
[26] testthat_2.0.1    glue_1.3.0        memoise_1.1.0     compiler_3.5.1    desc_1.2.0       
[31] backports_1.1.2   prettyunits_1.0.2

@jangorecki
Member

jangorecki commented May 23, 2019

@saraswatmks install those two missing packages, or use _R_CHECK_FORCE_SUGGESTS_=false env var before running package check.
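
Both options can also be driven from an R session. The sketch below is an assumption about workflow rather than something prescribed in this thread: it sets the variable for the current session, and a check launched from that session as a child process would inherit it.

```r
# Option 1: install the missing suggested packages
# (needs network access, so it is left commented out here)
# install.packages(c("R.utils", "nanotime"))

# Option 2: tell R CMD check not to insist on suggested packages,
# for this session and the processes it starts
Sys.setenv("_R_CHECK_FORCE_SUGGESTS_" = "false")
force_suggests = Sys.getenv("_R_CHECK_FORCE_SUGGESTS_")
print(force_suggests)

# A check started from here would inherit the variable, for example:
# system2("R", c("CMD", "check", "data.table_1.12.3.tar.gz", "--ignore-vignettes"))
```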

@saraswatmks
Contributor Author

saraswatmks commented May 23, 2019

@saraswatmks install those two missing packages or use _R_CHECK_FORCE_SUGGESTS_=false env var before running package check.

@jangorecki Just to make sure I am doing this the right way, what is the common way to do it? For example, I am doing the following:

  1. Go to terminal. Create a conda env and install all these dependencies using conda install
  2. Run commands from terminal.

Am I following the right way? Or is there something in RStudio which can help me skip these steps?

Also, after I installed both dependencies and ran R CMD build ., I get a new error:

* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘data.table’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
Error in loadVignetteBuilder(pkgdir, TRUE) :
  vignette builder 'knitr' not found
Execution halted

@jangorecki
Member

jangorecki commented May 23, 2019

  1. skip conda
  2. use plain R
  3. (only once) install all data.table suggested deps or (once per session) use env var to skip their tests
  4. use existing Makefile, it makes life easier, then in console
cd data.table
make build && make check

@MichaelChirico
Member

to each his own but I cringe at the word conda. it has only caused me headaches. RStudio has build&reload and clean&reload functions which will help a lot.

I've been thinking it might be nice to have a short vignette or blog post or wiki page on "how to fix a data.table issue for a first time contributor" giving some tips and tricks on getting over the hump.

you might want to check the file cc.R which has a lot of (densely packed) accumulated knowledge on debugging & package development

@MichaelChirico
Member

PS so it doesn't get lost in the trees -- thanks for the PR! we really appreciate all community efforts at fixing&improving the package & hope the experience is more illuminating than frustrating 😁

R/setkey.R Outdated
}

# forder only if index is not present
if(!identical(found_index, cols)){
Member

@mattdowle mattdowle May 23, 2019

This works if there is one index, as the test tests. But there can be multiple indexes (each index is a set of columns). It needs to find if any of the indexes match cols and then use that one. The test needs expanding for cases of multiple indexes and cols both existing and not existing when there are multiple (so it tests it picks up the correct one).

Contributor Author

fixed it.

R/setkey.R Outdated
} else {
o <- forderv(x, cols, sort=TRUE, retGrp=FALSE)
cat("using existing index for", found_index, "\n")
Member

Needs an if (verbose) please otherwise this is always being printed.

Contributor Author

fixed it.

R/setkey.R Outdated
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")

# get existing index name if any
found_index <- NULL
Member

@mattdowle mattdowle May 23, 2019

We prefer = in data.table please. I don't agree with the common and widely marketed advice to use <-. Single = is the same as C and many other languages. I sometimes hear, e.g. from Python folk, that R code using <- looks "old"/"not modern" and I see their point. I've always used =. I save <- for use when passing function arguments to do an assign and pass in one go; e.g. write(DT, file=f<-tempfile()); ... do something with f ... . I'm often swapping between C and R (which I think more people should do too since C is not as hard as some people want you to believe). When doing this, using = to mean assign and == to mean equals in two languages consistently (R and C) is nice. And by the way, the people who laugh at R for using L postfix to mean integer ... it comes from C (it's the same as C) and it makes a lot of sense.
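
The convention described above can be illustrated with base R alone. This is a small sketch, not code from the PR: the variable names are made up, and `writeLines()` stands in for the `write(DT, file=f<-tempfile())` example so that the snippet runs without data.table.

```r
# Plain assignment with `=`, as preferred in data.table's code base
n = 3L                      # the L suffix makes this an integer
x = seq_len(n)

# `<-` kept for assigning while passing an argument:
# the temp file path is stored in `f` and passed as `con` in one go
writeLines(as.character(x), con = f <- tempfile())

# ... do something with f ...
lines_back = readLines(f)
print(lines_back)
unlink(f)
```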

Contributor Author

thanks for detailed insight. made the change.

@saraswatmks
Contributor Author

@MichaelChirico Thanks to you guys for your kindness and patience. I've been wanting to contribute for a long time.
What else needs to be done in this PR?

@mattdowle
Member

mattdowle commented May 24, 2019

@saraswatmks Thanks for contributing, and welcome!

What else needs to be done in this PR ?

I made some comments above to be addressed please.

Member

@mattdowle mattdowle left a comment

Looks good. Thanks for making the previous changes.
More requests below. I hope they make sense and I tried to explain the rationale behind them for future reference. Almost there! I wrote quite a bit intending a series of comments in order, but they seem to have appeared in a different order after I submitted: just read them all quickly first before getting bogged down into any particular one.
Also, please add a news item, and please add yourself to the bottom of contributors list in DESCRIPTION. Your name will then appear on CRAN contributor field on the next update.

aaa = c(1,1,2,2,2,1,1,2,2,2))
setindex(DT, a)
test(1419.60, allIndicesValid(DT), TRUE)
test(1419.61, setkey(DT, a, verbose=TRUE), output="using existing index for a")
Member

This check on output= is necessary and good but not quite sufficient. It's possible that setkey() prints this message but then doesn't set the key properly. So there needs to be another test just after this one that makes sure the key has been changed. Unfortunately the test data chosen doesn't make this a good test because the input data is trivially sorted already. So my first thought is to change the input data so it's random e.g. DT = data.table(a=c(2,3,2,1,3,2,1), aaa=...) then add test 1419.62 to check that DT$a is c(1,1,2,2,2,3,3) afterwards. Or something like that. Then it is checking both that i) the setkey() has changed the physical order and ii) the output= also checks that it changed the order for the correct reason (using the existing index).
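
A standalone version of that two-part check might look like the sketch below. It uses `stopifnot()` instead of the package's own `test()` harness, and the `id` column is an addition for illustration so that whole rows can be seen moving together. The verbose output is captured but not asserted on, since its exact wording depends on the data.table version.

```r
library(data.table)

# Shuffled input as suggested above, plus a companion column
DT = data.table(a = c(2, 3, 2, 1, 3, 2, 1), id = 1:7)
setindex(DT, a)

# Capture whatever setkey() prints in verbose mode
msg = capture.output(setkey(DT, a, verbose = TRUE))

# i) the physical row order has changed and the key is set
stopifnot(identical(DT$a, c(1, 1, 2, 2, 2, 3, 3)))
stopifnot(identical(DT$id, c(4L, 7L, 1L, 3L, 6L, 2L, 5L)))
stopifnot(identical(key(DT), "a"))

# ii) the reason reported by setkey() can then be inspected
print(msg)
```

With already sorted input the first group of checks would pass even if setkey() did nothing, which is why the shuffled data matters.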

R/setkey.R Outdated
# get existing index name if any
found_index = NULL
if(!is.null(indices(x))) found_index <- names(attributes(attributes(x)$index))
new_possible_index = paste0("__", cols, collapse="")
Member

These 3 lines would be simpler as just one line :
index = paste0(cols, collapse="__")
See the next 3 comments in combination below ...

R/setkey.R Outdated
new_possible_index = paste0("__", cols, collapse="")

# forder only if index is not present
if(!any(new_possible_index == found_index)){
Member

Then this if() would be :
if (!any(index == indices(x)))
If x doesn't have any indices, indices(x) returns NULL and any("anything"==NULL) is FALSE in R (which is nice).
The idea is just to save repeating variable names and use fewer lines of code so we have less to maintain in future. Not too few lines to the extent of making it hard to understand. But in this case, what I'm suggesting is simpler and worth doing and easier to read and check.
Using indices() is cleaner than fetching the attributes() directly, because we have an isolating interface. If we ever change the attribute structure or names in future, we only need to change the code inside indices() not try and find everywhere that reads the attributes directly.
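
The matching logic suggested above can be tried out in base R. In this sketch `has_index()` is a hypothetical helper, not a data.table function, and its `existing` argument stands in for the result of `indices(x)`.

```r
# Hypothetical helper: does `cols` match one of the existing index names?
has_index = function(existing, cols) {
  newkey = paste0(cols, collapse = "__")   # e.g. c("a", "b") becomes "a__b"
  any(newkey == existing)                  # comparing with NULL gives logical(0)
}

has_index(NULL, "a")                         # FALSE: no indices at all
has_index(c("a", "aaa"), "aaa")              # TRUE: picks out the matching one
has_index(c("a", "aaa"), c("a", "aaa"))      # FALSE: "a__aaa" is not an index
has_index(c("a", "a__b"), c("a", "b"))       # TRUE: multi-column index
```

Because `any(logical(0))` is `FALSE`, the no-index case needs no separate `is.null()` branch.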

R/setkey.R Outdated
} else {
o = forderv(x, cols, sort=TRUE, retGrp=FALSE)
# find the matching index
ix = found_index[which(found_index == new_possible_index)]
Member

Then this line can be removed.

R/setkey.R Outdated
cat("using existing index for", gsub("^__","", ix), "\n")
o <- attr(attributes(x)$index, which=ix, exact = TRUE)
} else {
o <- attr(attributes(x)$index, which=ix, exact = TRUE)
Member

And these two which=ix become which=index

Contributor Author

@mattdowle I've made the changes. Correct me if I am wrong, but I think we need ix here. Take a scenario where a user sets multiple indexes: indices(DT) returns a, b, a__b (stored as the attributes __a, __b, __a__b) if the user does setindex(DT, a), setindex(DT, b), setindex(DT, a, b) on a table DT = data.table(a = c(...), b = c(...), c = c(...)). In this case, we need ix to find which index is matched.

Member

@mattdowle mattdowle May 28, 2019

Great - thanks again. Almost there!
Look at line 104 though. I commented before that line 104 could be removed. Line 104 is, in the end, just renaming index as idx, iiuc. This which= is looking up the item by name anyway. So line 104 can be removed and then replace this which=idx with which=index.
Once that's done, perhaps rename the local variable index to thiskey or newkey? That way it reads a little better and avoids the same name "index" as the attribute name. For example, line 94 will probably read a little better as if (!any(indices(x)==newkey)).

Contributor Author

Done. Thanks for your patience.

setindex(DT, a)
setindex(DT, aaa)
test(1419.62, allIndicesValid(DT), TRUE)
test(1419.63, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")
Member

@mattdowle mattdowle May 25, 2019

Same request here as above for 1419.61. This test needs to check the result of this setkey() command has done the correct thing (e.g. changed the physical row order properly) as well as doing that for the right reason (output=).
Since setkey() returns DT invisibly, the easiest way is just to pass an appropriate y=data.table(...) into this test(). See other tests() in this file for examples for single calls to test() which use x, y and output= too.

Member

@mattdowle mattdowle left a comment

Thanks for previous. Still the results need checking in tests. It's ok - I'll do them. Since setkey() is a core part of data.table we have to ensure any changes to it are nailed down.

test(1419.62, setkey(DT, a, verbose=TRUE), output="using existing index for a")

# check setkey incase of existing multiple indexes
DT <- data.table(a = c(3,3,4,4,5,6,1,1,7,2,2),
Member

a has 11 items here but aaa and bbb have 10, so there's a warning when test.data.table() runs in dev and when users run it. It's a lot easier and faster to run locally first before pushing. See the cc.R script in the root of the project: that's what I use in dev.

setindex(DT, a)
setindex(DT, aaa)
test(1419.65, allIndicesValid(DT), TRUE)
test(1419.66, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")
Member

This doesn't check the result of setkey is correct. Please see previous comments where that was requested : #3582 (comment) and #3582 (comment)

Contributor Author

thanks for chipping in. I wasn't fully aware of the test function's parameters. Now I understand it better. Also, I will try to fix my local environment (I still don't know how I am going to do that) to avoid stupid errors in the PR.

@mattdowle
Member

mattdowle commented May 28, 2019

I added the result checks to the tests. That actually finds a problem in setindex() where it is incorrectly using an existing index. That commit will fail on test 1419.66 and then I'll fix that next.
...
That turned out to be just a problem in this PR: the index name was not prefixed with __, so although it was supposed to be picking up the index and using it, it was just fetching NULL and thinking the table was already sorted by that index. Testing that the output was correct was important to catch this. I changed those two lines to one line that uses the existing getindex() function which knows how to fetch the index in one place with checks.
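
The failure mode can be reproduced in base R with a stand-in for the attribute layout seen in the review snippets above. The object `index_attr` and its values are made up for illustration; only the `__` naming comes from the thread.

```r
# Stand-in for the "index" attribute: an empty integer vector whose
# attributes are named "__<cols>" and hold the stored ordering
index_attr = structure(integer(0), "__a" = c(3L, 1L, 2L))

# Looking up the name without the prefix silently returns NULL
o_wrong = attr(index_attr, "a", exact = TRUE)
# Looking up the prefixed name returns the stored ordering
o_right = attr(index_attr, "__a", exact = TRUE)

is.null(o_wrong)     # TRUE
length(o_wrong)      # 0, so by length alone it looks like "nothing to reorder"
print(o_right)
```

No error is raised for the wrong name, so only a check on the resulting row order can reveal the mistake.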

options(datatable.auto.index = TRUE)
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[9L],"a"))
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[2L],"a"))
Member

It was strange that these 3 test changes from 9L to 2L were needed. Anyway, the latest commit leaves these tests unchanged, which seems more correct.

Member

Yep: those 3 tests should not have been changed to 9L in this PR. The setkey(DT,b) in between tests 1376.07 and 1376.08 had broken under this PR and wasn't changing the row order properly.

@jangorecki
Member

This change will be superseded (possibly taking it all away during git merge) by #4386, which does the same optimization (among others) but at the C level rather than in R.

Successfully merging this pull request may close these issues.

Setting key to an existing index can skip forder step