setDT and setkeyv inside function #5618

Open
ghost opened this issue Mar 22, 2023 · 4 comments

Comments

@ghost

ghost commented Mar 22, 2023

Hi all! This is likely related to #4783 and #4816.

Consider this example:

library(data.table)

x <- tibble::tibble(a = c(3,2,1))
foo <- function(dt) {
  print(address(dt))
  setDT(dt)
  print(address(dt))
  setkeyv(dt, "a")
  return(dt[])
}

> foo(x)
[1] "0x106cbd5c8"
[1] "0x115082a00"
   a
1: 1
2: 2
3: 3

The above is expected. As I understand it, setDT makes a shallow copy of x and then points dt at that new object, hence the change in address from 0x106cbd5c8 to 0x115082a00. What I did not expect is that the order of the original variable, x, also changed:

> x
   a
1: 1
2: 2
3: 3

Is this expected behavior? I thought that the setDT inside foo would make a shallow copy of the passed object and would then reference that copy instead of the original, so that setkeyv would only reorder the data inside the function, not the data outside it, just like in the example in this SO post, where the data in the original variable is not modified.

Question: Why does setkeyv reorder x, even though we are calling setDT inside foo?

@ghost
Author

ghost commented Mar 24, 2023

Please let me know if I can add more information or context to make this question more understandable.

@msummersgill

One of the comments on an answer to your linked Stack Overflow question summarizes the big picture in broad strokes:

Using setDT within a function on an argument will probably never have consistent semantics and is almost guaranteed to get you nasty surprises.
-Ofek Shilon

If you don't want a data.table in the global environment to be changed, your best bet is to make an explicit copy inside your function using the copy() function. Otherwise, any of the modify-by-reference capabilities, e.g. := and the set* functions, will probably have side effects.
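
A minimal sketch of that approach, in case it helps. The function name sorted_copy is made up for illustration; copy(), setDT() and setkeyv() are the regular data.table functions. The idea is to take a deep copy first, so every by-reference operation afterwards only touches the private copy:

```r
library(data.table)

# Hypothetical helper: sort by key on a private deep copy so the
# caller's table is left untouched.
sorted_copy <- function(dt, cols) {
  dt <- copy(dt)      # deep copy: new column vectors, not just new pointers
  setDT(dt)           # converts a copied data.frame; harmless for a data.table
  setkeyv(dt, cols)   # reorders only the private copy
  dt[]
}

x <- data.table(a = c(3, 2, 1))
res <- sorted_copy(x, "a")
res$a   # 1 2 3
x$a     # 3 2 1, original order preserved
```

The cost is one full copy of the data per call, which is the price of not surprising the caller.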

@ghost
Author

ghost commented Mar 24, 2023

Thanks for the comment. Indeed, the side effect was really surprising, and I wanted to understand how it happens mechanically. After setDT, the usual changes to dt, like adding an extra column, are local and do not change x (as in the SO post I referenced). But for some reason setkeyv changes both dt and x, and I am trying to understand why and how.

@dvg-p4
Contributor

dvg-p4 commented May 16, 2023

"I thought that the setDT inside foo would make a shallow copy of the passed object, and would then reference that copy instead of the original, so that setkeyv would only arrange the data inside the function, but not the data outside the function."

This assumption actually contradicts itself. When you make a shallow copy, you create a new list of pointers to the same data:

 x -- x$a
          \ 
           ---> a
          /
dt -- dt$a

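so if you change dt$a in place (e.g. reordering it with setkeyv()), you necessarily also change x$a, since it's the same vector in memory. Adding a column, by contrast, only adds a pointer to dt's own list, which is why that kind of change stays local. We can see this if we look at the addresses more closely: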

> foo <- function(dt) {
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     setDT(dt)
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     setkeyv(dt, "a")
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     return(dt[])
+ }
> x <- data.frame(a = c(3,2,1))
> address(x)
[1] "0x11f8acbc0"
> address(x$a)
[1] "0x111a15798"
> foo(x)
[1] "dt: 0x11f8acbc0   dt$a: 0x111a15798"
[1] "dt: 0x10ee0a600   dt$a: 0x111a15798"
[1] "dt: 0x10ee0a600   dt$a: 0x111a15798"
   a
1: 1
2: 2
3: 3
> x
   a
1: 1
2: 2
3: 3
> address(x)
[1] "0x11f8acbc0"
> address(x$a)
[1] "0x111a15798"
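
The same check can be done programmatically instead of comparing addresses by eye. This is only a sketch: shares_column and after_setDT are hypothetical names, and the comparison relies on address() from data.table returning the same string for the same vector.

```r
library(data.table)

# Hypothetical helper: TRUE when x and y hold the very same vector for `col`.
shares_column <- function(x, y, col) {
  identical(address(x[[col]]), address(y[[col]]))
}

x <- data.table(a = c(3, 2, 1))
alias <- x          # plain assignment: same object, same columns
deep  <- copy(x)    # deep copy: freshly allocated columns

shares_column(x, alias, "a")  # TRUE
shares_column(x, deep, "a")   # FALSE

# The situation from this issue: setDT() on a function argument.
df <- data.frame(a = c(3, 2, 1))
after_setDT <- function(d) {
  setDT(d)                    # d is rebound to a new data.table in this frame
  shares_column(df, d, "a")   # compare against the caller's object
}
after_setDT(df)
```

Going by the addresses printed above, the last call should report that the column is still shared, which is exactly why setkeyv() reaches the caller's data.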

If you want a deep copy, you should use as.data.table():

> foo <- function(df) {
+     print(paste("df:", address(df), "  df$a:", address(df$a)))
+     dt <- as.data.table(df)
+     print(paste("df:", address(df), "  df$a:", address(df$a)))
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     setkeyv(dt, "a")
+     print(paste("df:", address(df), "  df$a:", address(df$a)))
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     return(dt[])
+ }
> x <- data.frame(a = c(3,2,1))
> foo(x)
[1] "df: 0x10b7e2b70   df$a: 0x1111fcb78"
[1] "df: 0x10b7e2b70   df$a: 0x1111fcb78"
[1] "dt: 0x11194ce00   dt$a: 0x10b5524a8"
[1] "df: 0x10b7e2b70   df$a: 0x1111fcb78"
[1] "dt: 0x11194ce00   dt$a: 0x10b5524a8"
   a
1: 1
2: 2
3: 3
> x
  a
1 3
2 2
3 1

If you want consistent reference semantics, you should call setDT() on the data.frame before passing it to the function:

> bar <- function(dt) {
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     setkeyv(dt, "a")
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a)))
+     dt[, b := 7]
+     print(paste("dt:", address(dt), "  dt$a:", address(dt$a), "  dt$b:", address(dt$b)))
+     return(dt[])
+ }
> y <- data.frame(a = c(7,6,5))
> address(y)
[1] "0x1101fa8c8"
> address(y$a)
[1] "0x13eb354d8"
> setDT(y)
> address(y)
[1] "0x111a0be00"
> address(y$a)
[1] "0x13eb354d8"
> bar(y)
[1] "dt: 0x111a0be00   dt$a: 0x13eb354d8"
[1] "dt: 0x111a0be00   dt$a: 0x13eb354d8"
[1] "dt: 0x111a0be00   dt$a: 0x13eb354d8   dt$b: 0x111998598"
   a b
1: 5 7
2: 6 7
3: 7 7
> y
   a b
1: 5 7
2: 6 7
3: 7 7
> address(y)
[1] "0x111a0be00"
> address(y$a)
[1] "0x13eb354d8"
> address(y$b)
[1] "0x111998598"
