Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Request] assigning with := should retain keys better #2372

Closed
MarkusBonsch opened this issue Sep 20, 2017 · 3 comments
Closed

[Request] assigning with := should retain keys better #2372

MarkusBonsch opened this issue Sep 20, 2017 · 3 comments
Milestone

Comments

@MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Sep 20, 2017

Currently, keys and secondary indices are dropped, if any of the key columns is altered via a := assignment.
This is not necessary, however, since the key is still valid for all key columns before the assigned one.
Given this, we can better retain keys and secondary indices and improve data.table performance.
An example:

DT <- data.table(x1 = sample(1:10, 100, replace = TRUE),
                 x2 = sample(1:10, 100, replace = TRUE),
                 x3 = sample(1:10, 100, replace = TRUE),
                 y  = rnorm(100))

setkey(DT, x1, x2, x3)

## DT has a key on x1, x2, x3
print(key(DT))
# [1] "x1" "x2" "x3"

## assigning to x3 drops the key completely:
DT[2, x3 := 1000]
print(key(DT))
# NULL

## even though the sorting of x1 and x2 is of course not affected:
print(length(data.table:::forderv(DT, by = c("x1", "x2"))) == 0)
# TRUE


@MarkusBonsch
Copy link
Contributor Author

@MarkusBonsch MarkusBonsch commented Sep 22, 2017

I am working on a solution. Just need to memCheck for memory leaks before creating a pull request

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Sep 28, 2017

This was closed by your PR #2389 - thanks! Awesome! The text at the top of the PR "closes [#2372]" had brackets around the number so it didn't auto close this issue. I edited that text after the merge but it still didn't link up. So I guess it needs to be right at the time of the merge.

@mattdowle mattdowle closed this Sep 28, 2017
@MarkusBonsch
Copy link
Contributor Author

@MarkusBonsch MarkusBonsch commented Oct 4, 2017

I have discovered a minor fault in my implementation:

I am assuming that secondary indices behave exactly like the key.
This is, however, wrong:

  • the key is just a piece of information on the existing ordering.
  • The secondary key is a recipe for reordering.

If there is a key on columns x and y, it is a fully valid key on x as well. If there is an index on columns x and y, it is also a fully valid index on x, but it may cause spurious reordering in y if used as an index for x. The only secondary index that is fully equivalent to a key is a integer(0) index.
Although, there is currently no really clear policy on ordering based on indices (issue #2400), this can cause unexpected behaviour and needs to be fixed.

Therefore, I need to change my implementation as follows:

  • if any column in a secondary index has been modified, drop it
  • except if it is an integer(0) index: then, the index is still valid for all columns until the assigned one (exclusive)

I will fix asap and create a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

2 participants
You can’t perform that action at this time.