
melt with custom variable columns using variable_table attribute #4731

Merged — 66 commits, May 9, 2021

Conversation

@tdhock (Member) commented Oct 1, 2020

Closes #2551
Closes #2575
Closes #3396
Background: there are several issues (#3396, #2575, #2551) involving melt on a data table whose column names each encode more than one piece of information, e.g.

  1. Petal.Length, Sepal.Length, Petal.Width, ... (part, dimension)
  2. a_1, b_1, a_2, b_2, ... (letter, number)
  3. sex_child1, age_child1, sex_child2, age_child2, ... (feature, number)

In these situations we would like output columns with the names shown in parentheses above, but with the current melt the variable column is either the full column name (if there is one output value column),

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
options(datatable.print.class=TRUE)
DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)
melt(DT, measure.vars=patterns("_"))
#>       id variable value
#>    <num>   <fctr> <num>
#> 1:     1      a_1    10
#> 2:     1      b_2    21
#> 3:     1      a_2    20

or an integer (if there are multiple output value columns),

melt(DT, measure.vars=patterns(a="a", b="b"))
#>       id variable     a     b
#>    <num>   <fctr> <num> <num>
#> 1:     1        1    10    21
#> 2:     1        2    20    NA

Comparison: tidyr::pivot_longer can be used with names_sep/names_pattern/names_transform arguments to get a more useful molten/tall output table in these cases.
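For comparison, a tidyr call for the DT example above could look like the following. This is a sketch, assuming tidyr >= 1.0.0; the regex and the letter/number output column names are my own choices for this example, not taken from this PR:

```r
library(data.table)
library(tidyr)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# names_pattern splits each measured column name into two pieces,
# and names_transform converts the second piece to integer.
long <- pivot_longer(
  DT,
  cols = matches("_"),
  names_to = c("letter", "number"),
  names_pattern = "(.)_(.)",
  names_transform = list(number = as.integer),
  values_to = "value")
print(long)
```

This produces one row per measured cell, with letter/number columns analogous to the measure() output shown below.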

Proposal: the goal of this PR is to fix these issues and give melt feature parity with tidyr::pivot_longer.
The solution involves a new function, measure, which lets us do this if we want a single output value column,

melt(DT, measure.vars=measure(letter, number=as.integer, sep="_"))
#>       id letter number value
#>    <num> <char>  <int> <num>
#> 1:     1      a      1    10
#> 2:     1      b      2    21
#> 3:     1      a      2    20

and we can use the special value.name keyword if we want multiple output value columns,

melt(DT, measure.vars=measure(value.name, number=as.integer, sep="_"))
#>       id number     a     b
#>    <num>  <int> <num> <num>
#> 1:     1      1    10    NA
#> 2:     1      2    20    21

Note in the code above that the output does not include the variable column at all. Instead, the more relevant letter/number columns are output.

Some points to discuss:

  • do we keep the name measure for this new function, or do we want to call it something else?
  • the measure function allows specification of type conversion, e.g. as.integer above (similar to the names_transform argument of tidyr::pivot_longer).
  • do we keep the value.name keyword (for consistency with the value.name argument), or do we change it to something else to avoid confusion?
  • internally this works because measure() returns a vector/list with a special variable_table attribute, which is a data table with columns letter/number. New fmelt C code recognizes this attribute and uses it to create the desired output. Is variable_table a good name for this attribute, or shall we call it something else?
  • I did a bunch of timings comparing the new code with existing data.table and tidyr solutions (see comment with figures below), and the new code is really fast.
  • I had to re-write do_patterns, renaming it to eval_with_cols, so that it supports adding a cols argument to measure as well as patterns. A nice new feature is that it works with ANY user-provided function that has an argument named cols, e.g.,
remotes::install_github("tdhock/nc@new-measure")
melt(DT, measure.vars=nc::measure(letter="[ab]", "_", number="[12]", as.integer))
#>       id letter number value
#>    <num> <char>  <int> <num>
#> 1:     1      a      1    10
#> 2:     1      b      2    21
#> 3:     1      a      2    20
melt(DT, measure.vars=nc::measure(column="[ab]", "_", number="[12]", as.integer))
#>       id number     a     b
#>    <num>  <int> <num> <num>
#> 1:     1      1    10    NA
#> 2:     1      2    20    21
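To illustrate that extension point, here is a minimal sketch of a user-defined function with a cols argument that builds its own variable_table directly. The function name underscore_measure is hypothetical (not part of this PR), and the example assumes a data.table version in which the variable_table support from this PR is available (merged in 1.14.1/1.14.2 development):

```r
library(data.table)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# Any function can compute measure.vars; here we select the columns
# containing an underscore and attach a variable_table describing
# the two pieces of each selected column name.
underscore_measure <- function(cols) {
  measured <- which(grepl("_", cols))
  pieces <- tstrsplit(cols[measured], "_")
  setattr(measured, "variable_table", data.table(
    letter = pieces[[1]],
    number = as.integer(pieces[[2]])))
  measured
}
result <- melt(DT, measure.vars = underscore_measure(cols = names(DT)))
print(result)
```

The returned integer vector carries the variable_table attribute, so melt outputs letter/number columns instead of the usual variable column, matching the measure(...) output shown earlier.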

I know this is a really big PR --- please tell me if there is anything I can do to help make it easier to review.
Thanks.

@tdhock (Member, Author) commented Oct 2, 2020

this PR includes commits from other PRs #4720 and #4723 (those are simpler and should be reviewed/merged first)

@tdhock (Member, Author) commented Oct 2, 2020

I expected the new method to be as fast as, or faster than, existing approaches, so I did some timings to verify that empirically (source code for the timing experiments/plots). The approaches I considered are:

  • tidyr::pivot_longer uses the names_pattern/names_transform arguments (typical usage of tidyr).
  • data.table::melt.old.join/set use melt the old way (no variable_table attribute), as users would typically do with current master. The result of melt is post-processed either by set/:= or by joining to a data table with info extracted from the input column names.
  • data.table::melt.new.pattern/sep use the new measure.vars=measure(...), as users typically will after this PR is merged (uses the new variable_table attribute).
  • data.table::melt.new.var_tab directly passes the new variable_table attribute in measure.vars. Typically users would not do this directly, but it is interesting here for comparison --- the only difference between var_tab and pattern/sep is the overhead of calling measure(...).
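For concreteness, the old.set-style post-processing (melt as on current master, then set to add columns split out of the variable names) can be sketched as follows, reusing the small DT example from above:

```r
library(data.table)

DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)

# Old way: melt first, then split the full column name afterwards.
long <- melt(DT, measure.vars = patterns("_"))
pieces <- tstrsplit(as.character(long$variable), "_")
set(long, j = "letter", value = pieces[[1]])
set(long, j = "number", value = as.integer(pieces[[2]]))
print(long)
```

The old.join variant would instead build a small lookup table of (variable, letter, number) and join it to the molten result.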

The first comparison figure plots computation time as a function of the number of rows in the input:

[Figure figure-who-rows-dt: computation time vs. number of rows]

It is clear from the who data (above right) that (1) all data.table methods are faster than tidyr, (2) the new data.table methods are slightly faster than the old.join method for large data, and (3) new.pattern is slower than new.var_tab by less than 0.01 seconds, indicating that the overhead of calling measure(...) is very small. The iris data (above left) additionally show that (4) there is not much difference between the new pattern/sep methods, and (5) both are comparable to the old.set method.

The second figure below plots the computation time as a function of the number of columns in the input:

[Figure figure-who-cols-dt: computation time vs. number of columns]

Similar trends are evident in this comparison, and we can also see that old.join is slower than old.set.

Overall these timings provide convincing evidence that the new code is at least as fast as the existing data.table methods, and sometimes slightly faster. Additionally, the convenience of the measure(...) function costs only a very small amount of computation time.

codecov bot commented Oct 2, 2020

Codecov Report

Merging #4731 (c927a52) into master (ebc14ce) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #4731      +/-   ##
==========================================
+ Coverage   99.44%   99.45%   +0.01%     
==========================================
  Files          73       73              
  Lines       14469    14612     +143     
==========================================
+ Hits        14388    14532     +144     
+ Misses         81       80       -1     
Impacted Files Coverage Δ
R/data.table.R 99.94% <100.00%> (ø)
R/fmelt.R 100.00% <100.00%> (ø)
R/utils.R 100.00% <100.00%> (ø)
src/fmelt.c 99.64% <100.00%> (+0.21%) ⬆️
src/init.c 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ebc14ce...c927a52.

@tdhock (Member, Author) commented Oct 3, 2020

Example with iris data:

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (7a73a777) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
iris.dt = data.table(iris)[c(1,150)]
melt(iris.dt, measure.vars=measure(part, value.name, sep="."))
#>      Species  part Length Width
#> 1:    setosa Sepal    5.1   3.5
#> 2: virginica Sepal    5.9   3.0
#> 3:    setosa Petal    1.4   0.2
#> 4: virginica Petal    5.1   1.8
melt(iris.dt, measure.vars=measure(value.name, dim, sep="."))
#>      Species    dim Sepal Petal
#> 1:    setosa Length   5.1   1.4
#> 2: virginica Length   5.9   5.1
#> 3:    setosa  Width   3.5   0.2
#> 4: virginica  Width   3.0   1.8
melt(iris.dt, measure.vars=measure(part, dim, sep="."))
#>      Species  part    dim value
#> 1:    setosa Sepal Length   5.1
#> 2: virginica Sepal Length   5.9
#> 3:    setosa Sepal  Width   3.5
#> 4: virginica Sepal  Width   3.0
#> 5:    setosa Petal Length   1.4
#> 6: virginica Petal Length   5.1
#> 7:    setosa Petal  Width   0.2
#> 8: virginica Petal  Width   1.8

@TysonStanley (Member) commented
I don't have time right now to go through this in any depth, but the user functionality looks fantastic. I think it is intuitive and lets users more simply access the speed/efficiency without, I assume, any breaking changes.

@tdhock tdhock changed the title Custom variable columns using variable_table attribute melt with custom variable columns using variable_table attribute Jan 22, 2021
@tdhock (Member, Author) commented Jan 22, 2021

There are lots of conflicts in tests.Rraw due to the test numbers. When this PR was initiated the test numbers I used were new, but they now conflict with test numbers added in other PRs merged into master since then. Going forward, what is the recommended way to choose new test numbers so as to avoid conflicts with other PRs? I checked the Contributing wiki page but did not see any mention of this issue.

@MichaelChirico (Member) commented
That's a pain point with our test numbering system; we don't really have a workaround.

But nothing is needed on your end -- if that's the only conflict, it will be fixed by Matt during the merge.

@mattdowle mattdowle added this to the 1.14.1 milestone May 9, 2021