Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

programming on data.table #4304

Merged
merged 39 commits into from
May 10, 2021
Merged

programming on data.table #4304

merged 39 commits into from
May 10, 2021

Conversation

jangorecki
Copy link
Member

@jangorecki jangorecki commented Mar 14, 2020

This PR is meant to introduce new (base-R like) interface for parametrizing data.table queries. #2655
It uses base R substitute style substitution, but extends it for substitution of names of the argument in a function calls. This allows to give names of a columns from a variables as well, which is not possible using just base R substitute.

# data
DT = data.table(x = 1:5, y = 5:1)

# parameters
in_col_name = "x"
fun = "sum"
fun_arg1 = "na.rm"
fun_arg1val = TRUE
out_col_name = "sum_x"

# query
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

# desired query
DT[, .(sum_x = sum(x, na.rm=TRUE))]

# proposed API
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
  env = list(in_col_name=in_col_name, fun=fun, fun_arg1=fun_arg1, fun_arg1val=fun_arg1val, out_col_name=out_col_name)]

working example, checkout to programming branch, start R

cc(F) # build and attach pkg

substitute2(
  .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
  env = list(
    in_col_name = "x",
    fun = "sum",
    fun_arg1 = "na.rm",
    fun_arg1val = TRUE,
    out_col_name = "sum_x"
  )
)
#.(sum_x = sum(x, na.rm = TRUE))

Feedback welcome!

@renkun-ken
Copy link
Member

Really like the solution here. Desperately need programming with data.table in my code. The design looks concise and powerful enough for most of my use cases.

R/programming.R Outdated Show resolved Hide resolved
@renkun-ken

This comment has been minimized.

@jangorecki

This comment has been minimized.

@MichaelChirico

This comment has been minimized.

@codecov
Copy link

codecov bot commented Mar 17, 2020

Codecov Report

Merging #4304 (d00f292) into master (fa72d74) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4304   +/-   ##
=======================================
  Coverage   99.46%   99.46%           
=======================================
  Files          73       75    +2     
  Lines       14611    14706   +95     
=======================================
+ Hits        14533    14628   +95     
  Misses         78       78           
Impacted Files Coverage Δ
src/init.c 100.00% <ø> (ø)
R/data.table.R 99.94% <100.00%> (+<0.01%) ⬆️
R/programming.R 100.00% <100.00%> (ø)
R/test.data.table.R 100.00% <100.00%> (ø)
src/programming.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa72d74...d00f292. Read the comment docs.

@jangorecki

This comment has been minimized.

@renkun-ken

This comment has been minimized.

@renkun-ken

This comment has been minimized.

@jangorecki

This comment has been minimized.

@jangorecki
Copy link
Member Author

jangorecki commented Sep 15, 2020

@jan-glx I don't think there is a need for an extra dependency for something that base R can handle well. The only more tricky thing in base R metaprogramming is to substitute names of arguments, but this not less tricky with rlang where you need to use rlang's := operator, that would conflict with DT's use of := . Are there any advantages of using rlang over base R here? This PR brings base R way into [.
Another reason is API stability, I don't see that to be priority in tidyverse packages and personally I would prefer to use base R just because of that. I don't want to criticize here. There are always two sides, tidyverse goes towards "always improving", while base R or data.table prefers to be more conservative and provide longer lifecycle for their code.

@jan-glx
Copy link
Contributor

jan-glx commented Sep 15, 2020

I feel like there would not be a conflict, just a - perhaps confusing - dual use of := :
This would be achieved by a slightly different perspective on the problem: Instead of providing meta-programming for data.table queries, one would just support tidyeval's quasi-quotation in .() and perhaps in :=().

# desired query
DT[, .(sum_x = sum(x, na.rm=TRUE))]

# your proposed API
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
  env = list(in_col_name=in_col_name, fun=fun, fun_arg1=fun_arg1, fun_arg1val=fun_arg1val, out_col_name=out_col_name)]

# tidyeval would look much simpler and readable to me (variables would )
DT[, .(!!out_col_name := (!!fun)(!!in_col_name, !!fun_arg1 := !!fun_arg1val))]

The same would work in by/keyby and on.
For assignments in j one could either

DT[, .`:=`(!!out_col_name := .((!!fun)(!!in_col_name, !!fun_arg1 := !!fun_arg1val))]
# which would substitute to:
# DT[, `:=`(sum_x = sum(x, na.rm=TRUE))]

or allow to use bare names in .() on the left hand side of :=.
Not sure about i though. In any case, I have not really thought this through and tidyeval might indeed not be mature enough (note, however, that I only suggested to use the syntax, not the package), I was really only wondering what speaks against this since it seems so much simpler from a user's (that is used to tidyeval) perspective.

@MichaelChirico
Copy link
Member

@jan-glx thanks for your input.

understand there are different tastes but for me personally that version with all the !! does not look simple or intuitive at all.

one point worth mentioning about the env approach is that i think it aligns with a typical usage pattern for me, which is to keep these kind of metaparameters in a list for the whole analysis.

So for me, a more representative use case would be:

# at top of script
params = list(
  in_col_name=in_col_name, 
  fun=fun, 
  fun_arg1=fun_arg1, 
  fun_arg1val=fun_arg1val, 
  out_col_name=out_col_name,
  # ... other hyperparameters ...
)
# ... any other setup, data i/o, etc ...

DT[env = params, , .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

# other queries using the same hyperparameters
DT[env = params, ...] 

@jangorecki
Copy link
Member Author

jangorecki commented Sep 16, 2020

I strongly agree with Michael about !!. This type of logical operations has its valid use cases, consider having integer variable that you want to as.logical(var). How much more typing you need rather than !!var? I agree it is not a best practice but if you just play with data interactively, and not write a code or script, then !! as coercion to logical does work well.

@JohnMount
Copy link

I think !! is bad idea. It isn't in fact a no-op operation (see !!c(0, 1, 2)). Also possibly the rlang team will eventually get a special glyph or parse change into the R language and then data.table will be stuck using an abandon convention. I think the rlang team did get a change into R on how !! itself is parsed to avoid something like !(!()).

@jangorecki
Copy link
Member Author

@jan-glx a very fresh example of confusing use of := from rlang: https://stackoverflow.com/a/63926023/2490497

@renkun-ken
Copy link
Member

renkun-ken commented Sep 17, 2020

Most of my use case is programmatically determine the list to supply to env=, i.e.,

dt[, X := Y + Z, env = list(X = paste0("x", i), Y = paste0("y", i), Z = paste0("z", i))

which looks much cleaner than

X <- paste0("x", i)
Y <- paste0("y", i)
Z <- paste0("z", i)
dt[, !!X := !!Y + !!Z]

in terms of the scoping of the X,Y,Z variables.

@jangorecki
Copy link
Member Author

Another example of feature suggested in this PR can be seen on this SO question which asks for quite complex parametrizing of DT query: https://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia

@mattdowle mattdowle added this to the 1.14.1 milestone May 9, 2021
@mattdowle mattdowle merged commit 79c3d3e into master May 10, 2021
@mattdowle mattdowle deleted the programming branch May 10, 2021 07:11
@renkun-ken
Copy link
Member

Very excited to see the PR is finally merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vignette for programmatic data.table (eval, get, etc) Alternative way to dynamic symbol usage in j
8 participants