
melt with custom variable columns using variable_table attribute #4731

Merged — 66 commits, May 9, 2021

Conversation

@tdhock (Member) commented Oct 1, 2020

Closes #2551
Closes #2575
Closes #3396
Background: there are several issues (#3396, #2575, #2551) involving melt on a data table whose column names each encode more than one piece of information, e.g.

  1. Petal.Length, Sepal.Length, Petal.Width, ... (part, dimension)
  2. a_1, b_1, a_2, b_2, ... (letter, number)
  3. sex_child1, age_child1, sex_child2, age_child2, ... (feature, number)

In these situations we would like output columns with the names shown in parentheses above, but with the current melt the variable column is either the full column name (if there is one output value column),

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
options(datatable.print.class=TRUE)
DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)
melt(DT, measure.vars=patterns("_"))
#>       id variable value
#>    <num>   <fctr> <num>
#> 1:     1      a_1    10
#> 2:     1      b_2    21
#> 3:     1      a_2    20

or an integer (if there are multiple output value columns),

melt(DT, measure.vars=patterns(a="a", b="b"))
#>       id variable     a     b
#>    <num>   <fctr> <num> <num>
#> 1:     1        1    10    21
#> 2:     1        2    20    NA

Comparison: tidyr::pivot_longer can be used with names_sep/names_pattern/names_transform arguments to get a more useful molten/tall output table in these cases.
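For comparison, a tidyr call for the DT example above could look like the following. This is a sketch, assuming tidyr >= 1.0.0; the regex and the letter/number output column names are my own choices for this example, not taken from this PR:

```r
library(data.table)
library(tidyr)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# names_pattern splits each measured column name into two pieces,
# and names_transform converts the second piece to integer.
long <- pivot_longer(
  DT,
  cols = matches("_"),
  names_to = c("letter", "number"),
  names_pattern = "(.)_(.)",
  names_transform = list(number = as.integer),
  values_to = "value")
print(long)
```

This produces one row per measured cell, with letter/number columns analogous to the measure() output shown below.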

Proposal: the goal of this PR is to fix these issues and give melt feature parity with tidyr::pivot_longer.
The solution involves a new function, measure, which lets us do this if we want a single output value column,

melt(DT, measure.vars=measure(letter, number=as.integer, sep="_"))
#>       id letter number value
#>    <num> <char>  <int> <num>
#> 1:     1      a      1    10
#> 2:     1      b      2    21
#> 3:     1      a      2    20

and we can use the special value.name keyword if we want multiple output value columns,

melt(DT, measure.vars=measure(value.name, number=as.integer, sep="_"))
#>       id number     a     b
#>    <num>  <int> <num> <num>
#> 1:     1      1    10    NA
#> 2:     1      2    20    21

Note in the code above that the output does not include the variable column at all. Instead, the more relevant letter/number columns are output.

Some points to discuss:

  • do we keep the name measure for this new function, or do we want to call it something else?
  • the measure function allows specification of type conversion, e.g. as.integer above (similar to the names_transform argument of tidyr::pivot_longer).
  • do we keep the value.name keyword (for consistency with the value.name argument), or do we change it to something else to avoid confusion?
  • internally this works because measure() returns a vector/list with a special variable_table attribute, which is a data table with columns letter/number. New fmelt C code recognizes this attribute and uses it to create the desired output. Is variable_table a good name for this attribute, or shall we call it something else?
  • I did a bunch of timings comparing the new code with existing data.table and tidyr solutions (see comment with figures below), and the new code is really fast.
  • I had to re-write do_patterns, renaming it to eval_with_cols, so that it supports adding a cols argument to measure as well as patterns. A nice new feature is that it works with ANY user-provided function that has an argument named cols, e.g.,
remotes::install_github("tdhock/nc@new-measure")
melt(DT, measure.vars=nc::measure(letter="[ab]", "_", number="[12]", as.integer))
#>       id letter number value
#>    <num> <char>  <int> <num>
#> 1:     1      a      1    10
#> 2:     1      b      2    21
#> 3:     1      a      2    20
melt(DT, measure.vars=nc::measure(column="[ab]", "_", number="[12]", as.integer))
#>       id number     a     b
#>    <num>  <int> <num> <num>
#> 1:     1      1    10    NA
#> 2:     1      2    20    21
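To illustrate that extension point, here is a minimal sketch of a user-defined function with a cols argument that builds its own variable_table directly. The function name underscore_measure is hypothetical (not part of this PR), and the example assumes a data.table version in which the variable_table support from this PR is available (merged in 1.14.1/1.14.2 development):

```r
library(data.table)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# Any function can compute measure.vars; here we select the columns
# containing an underscore and attach a variable_table describing
# the two pieces of each selected column name.
underscore_measure <- function(cols) {
  measured <- which(grepl("_", cols))
  pieces <- tstrsplit(cols[measured], "_")
  setattr(measured, "variable_table", data.table(
    letter = pieces[[1]],
    number = as.integer(pieces[[2]])))
  measured
}
result <- melt(DT, measure.vars = underscore_measure(cols = names(DT)))
print(result)
```

The returned integer vector carries the variable_table attribute, so melt outputs letter/number columns instead of the usual variable column, matching the measure(...) output shown earlier.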

I know this is a really big PR --- please tell me if there is anything I can do to help make it easier to review.
Thanks.

@tdhock (Member, Author) commented Oct 2, 2020

this PR includes commits from other PRs #4720 and #4723 (those are simpler and should be reviewed/merged first)

@tdhock (Member, Author) commented Oct 2, 2020

I expected the new method to be as fast as, or faster than, existing approaches, so I did some timings to verify that empirically (source code for the timing experiments/plots). The approaches I considered are:

  • tidyr::pivot_longer uses the names_pattern/names_transform arguments (typical usage of tidyr).
  • data.table::melt.old.join/set use melt the old way (no variable_table attribute), as users would typically do with current master. The result of melt is post-processed either by set/:= or by joining to a data table with info extracted from the input column names.
  • data.table::melt.new.pattern/sep use the new measure.vars=measure(...), as users typically will after this PR is merged (uses the new variable_table attribute).
  • data.table::melt.new.var_tab directly passes the new variable_table attribute in measure.vars. Typically users would not do this directly, but it is interesting here for comparison --- the only difference between var_tab and pattern/sep is the overhead of calling measure(...).
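For concreteness, the old.set-style post-processing (melt as on current master, then set to add columns split out of the variable names) can be sketched as follows, reusing the small DT example from above:

```r
library(data.table)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# Old way: melt first, then split the full column name afterwards.
long <- melt(DT, measure.vars = patterns("_"))
pieces <- tstrsplit(as.character(long$variable), "_")
set(long, j = "letter", value = pieces[[1]])
set(long, j = "number", value = as.integer(pieces[[2]]))
print(long)
```

The old.join variant would instead build a small lookup table of (variable, letter, number) and join it to the molten result.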

The first comparison figure plots computation time as a function of the number of rows in the input:

[Figure figure-who-rows-dt: computation time vs. number of rows]

It is clear from the who data (above right) that (1) all data.table methods are faster than tidyr, (2) the new data.table methods are slightly faster than the old.join method for large data, and (3) new.pattern is slower than new.var_tab by less than 0.01 seconds, indicating that the overhead of calling measure(...) is very small. The iris data (above left) additionally show that (4) there is not much difference between the new pattern/sep methods, and (5) both are comparable to the old.set method.

The second figure below plots the computation time as a function of the number of columns in the input:

[Figure figure-who-cols-dt: computation time vs. number of columns]

Similar trends are evident in this comparison, and we can also see that old.join is slower than old.set.

Overall these timings provide convincing evidence that the new code is at least as fast as the existing data.table methods, and sometimes slightly faster. Additionally, the convenience of the measure(...) function costs only a very small amount of computation time.

codecov bot commented Oct 2, 2020

Codecov Report

Merging #4731 (c927a52) into master (ebc14ce) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #4731      +/-   ##
==========================================
+ Coverage   99.44%   99.45%   +0.01%     
==========================================
  Files          73       73              
  Lines       14469    14612     +143     
==========================================
+ Hits        14388    14532     +144     
+ Misses         81       80       -1     
Impacted Files Coverage Δ
R/data.table.R 99.94% <100.00%> (ø)
R/fmelt.R 100.00% <100.00%> (ø)
R/utils.R 100.00% <100.00%> (ø)
src/fmelt.c 99.64% <100.00%> (+0.21%) ⬆️
src/init.c 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ebc14ce...c927a52.

@tdhock (Member, Author) commented Oct 3, 2020

Example with iris data:

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (7a73a777) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
iris.dt = data.table(iris)[c(1,150)]
melt(iris.dt, measure.vars=measure(part, value.name, sep="."))
#>      Species  part Length Width
#> 1:    setosa Sepal    5.1   3.5
#> 2: virginica Sepal    5.9   3.0
#> 3:    setosa Petal    1.4   0.2
#> 4: virginica Petal    5.1   1.8
melt(iris.dt, measure.vars=measure(value.name, dim, sep="."))
#>      Species    dim Sepal Petal
#> 1:    setosa Length   5.1   1.4
#> 2: virginica Length   5.9   5.1
#> 3:    setosa  Width   3.5   0.2
#> 4: virginica  Width   3.0   1.8
melt(iris.dt, measure.vars=measure(part, dim, sep="."))
#>      Species  part    dim value
#> 1:    setosa Sepal Length   5.1
#> 2: virginica Sepal Length   5.9
#> 3:    setosa Sepal  Width   3.5
#> 4: virginica Sepal  Width   3.0
#> 5:    setosa Petal Length   1.4
#> 6: virginica Petal Length   5.1
#> 7:    setosa Petal  Width   0.2
#> 8: virginica Petal  Width   1.8

@TysonStanley (Member) commented
I don't have time right now to go through this in any depth, but the user functionality looks fantastic. I think it is intuitive and lets users more simply access the speed/efficiency without, I assume, any breaking changes.

@tdhock tdhock changed the title Custom variable columns using variable_table attribute melt with custom variable columns using variable_table attribute Jan 22, 2021
@tdhock (Member, Author) commented Jan 22, 2021

There are lots of conflicts in tests.Rraw due to the test numbers. When this PR was initiated the test numbers I used were new, but they now conflict with test numbers added in other PRs merged into master since then. Going forward, what is the recommended way to choose new test numbers so as to avoid conflicts with other PRs? I checked the Contributing wiki page but did not see any mention of this issue.

@MichaelChirico (Member) commented
That's a pain point with our test numbering system; we don't really have a workaround.

But nothing is needed on your end -- if that's the only conflict, it will be fixed by Matt during the merge.

@mattdowle mattdowle added this to the 1.14.1 milestone May 9, 2021