Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Odd behaviour of melt() when using the cols argument in measure() #5063

Closed
UweBlock opened this issue Jul 4, 2021 · 13 comments · Fixed by #6077
Closed

Odd behaviour of melt() when using the cols argument in measure() #5063

UweBlock opened this issue Jul 4, 2021 · 13 comments · Fixed by #6077
Labels

Comments

@UweBlock
Copy link
Contributor

UweBlock commented Jul 4, 2021

measure() is a great enhancement to melt() - thanks a lot for this!

However, while experimenting with the new measure() function when trying to find an answer for this question on SO I came across an odd behaviour of melt() when the cols argument is used with measure():

# Minimal reproducible example

df <- fread("Country    AXOX1991    AXOX1992    BXOX1991    BXOX1992    CXOX1991    CXOX1992
Afghanistan 1   2   3   4   5   6
USA 6   5   4   3   2   1")

This gives the expected result (note that the AXO columns intentionally have be left out of the reshape):

melt(df,  measure.vars = measure(var, year, 
                                 pattern = "([BC]XO)X(\\d{4})"))
       Country AXOX1991 AXOX1992 var year value
1: Afghanistan        1        2 BXO 1991     3
2:         USA        6        5 BXO 1991     4
3: Afghanistan        1        2 BXO 1992     4
4:         USA        6        5 BXO 1992     3
5: Afghanistan        1        2 CXO 1991     5
6:         USA        6        5 CXO 1991     2
7: Afghanistan        1        2 CXO 1992     6
8:         USA        6        5 CXO 1992     1

Now, when I include cols in the call to measure()

(cols <- names(df)[4:7])

[1] "BXOX1991" "BXOX1992" "CXOX1991" "CXOX1992"

melt(df,  measure.vars = measure(var, year, cols = cols,
                                 pattern = "([BC]XO)X(\\d{4})"))

I get an unexpected and obviously wrong result and a warning:

   BXOX1992 CXOX1991 CXOX1992 var year       value
1:        4        5        6 BXO 1991 Afghanistan
2:        3        2        1 BXO 1991         USA
3:        4        5        6 BXO 1992           1
4:        3        2        1 BXO 1992           6
5:        4        5        6 CXO 1991           2
6:        3        2        1 CXO 1991           5
7:        4        5        6 CXO 1992           3
8:        3        2        1 CXO 1992           4

Warning message:
In melt.data.table(df, measure.vars = measure(var, year, :
'measure.vars' [Country, AXOX1991, AXOX1992, BXOX1991, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

The documentation says on the pattern argument:

Perl-compatible regex with capture groups to match to cols.

Perhaps, I have completely misunderstood the meaning of the cols argument. Unfortunately, I have found no example of the proper use of cols neither in the documentation of measure() nor in the reshape vignette.


Even stranger, when using a function to create the character vector of column names in the call, I get an error message right away:

melt(df,  measure.vars = measure(var, year, cols = names(df)[4:7],
                                 pattern = "([BC]XO)X(\\d{4})"))

Error in (function (fun.list, sep = "_", pattern, cols, multiple.keyword = "value.name", :
cols must be a character vector of column names

# Output of sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0  
@Henrik-P
Copy link

Henrik-P commented Jul 4, 2021

I also struggle to understand the cols argument. My data and pattern are even simpler, like:

d = data.table(x_1 = 1, x_2 = 2, y_1 = 3, y_2 = 4)
d
#   x_1 x_2 y_1 y_2
# 1:  1   2   3   4

In contrast to @uwe, I manage to select the cols using names(dt)[i], in the sense it doesn't error. However, the result isn't always what I would expect, and it seems to depend on which columns are selected. E.g. selecting the first two columns gives the result I expect:

melt(d, measure.vars = measure(var, ix, pattern = "(.*)_(.*)", cols = names(d)[1:2])) 
#    y_1 y_2 var ix value
# 1:   3   4   x  1     1
# 2:   3   4   x  2     2

However, when selecting the two "y" columns, only the "var" and "ix" columns are correct. I would expect columns "y_1" and "y_2" to be "x_1" and "x_2" with their corresponding values, and the "value" column to have values from "y_1" and "y_2".

melt(d, measure.vars = measure(var, ix, pattern = "(.*)_(.*)", cols = names(d)[3:4]))
#    y_1 y_2 var ix value
# 1:   3   4   y  1     1
# 2:   3   4   y  2     2

What am I missing here?

A big thanks to everyone involved in developing measure - extremely useful!

@tlapak
Copy link
Contributor

tlapak commented Jul 4, 2021

Thank you for testing new features before they are released! The cols argument to measure() has indeed not been handled properly.

@tlapak
Copy link
Contributor

tlapak commented Jul 5, 2021

Check you brackets. cols is an argument of measure. I haven't actually checked against dev, but it should error out. Works as expected in the PR I linked.

@Henrik-P
Copy link

Henrik-P commented Jul 5, 2021

Sorry @tlapak for me being sloppy. Deleted.

@tdhock
Copy link
Member

tdhock commented Jul 5, 2021

I wrote measure() and I never intended for the user to specify the cols argument. The idea was to mimic patterns() -- the user does not specify cols argument, but it is automatically set to the names of the table that was provided to melt.
so that is the problem, essentially the behavior is undefined.
for the use cases above, instead of specifying cols, I would recommend using a different pattern,

> d = data.table(x_1 = 1, x_2 = 2, y_1 = 3, y_2 = 4)
> melt(d, measure.vars = measure(ix, pattern = "x_(.*)"))
     y_1   y_2     ix value
   <num> <num> <char> <num>
1:     3     4      1     1
2:     3     4      2     2

@tdhock
Copy link
Member

tdhock commented Jul 5, 2021

I think this is a problem with documentation. The OP was able to get the result they wanted by specifying the pattern arg without the cols arg. So would it fix the issue to change the documentation of the cols arg to something like the following? "cols: a character vector that is automatically provided, with all of the column names of the data table input to melt. The user should NOT provide this argument; to specify the columns to be measured, use the pattern argument."

@UweBlock
Copy link
Contributor Author

UweBlock commented Jul 6, 2021

Thank you for the detailed explanation, @tdhock !

My impression was that the cols argument is to be used together with the sep/patterns argument to define a subset of columns of the dataset which need to be reshaped and on which sep/patterns is to be applied.

This would help in cases where sep is to simplistic to define the columns to reshape. Also, it would make it less complex to write regular expressions for patterns (especially for users who are less acquainted with regular expressions). In my example #5063 I could have written pattern = "(\\w+)X(\\d{4})" instead of pattern = "([BC]XO)X(\\d{4})" if cols would restrict the columns to reshape.

So, if cols does not have this expected (at least by me) functionality or any other meaning to the user, then I would second @MichaelChirico suggestion to remove cols from the API.

@tdhock
Copy link
Member

tdhock commented Jul 6, 2021

yes let's remove cols arg if possible, or at least clarify in the documentation that it should not be specified by the user.

@MichaelChirico
Copy link
Member

We can make !missing(cols) an error

@tdhock
Copy link
Member

tdhock commented Jul 7, 2021

that is an interesting idea @MichaelChirico but I don't think that will work. (we will get error even when user does not specify cols, because it is filled in automatically -- not missing)

@skanskan
Copy link

Hello.

Where can we find the documentation and examples on how to use measure() ?

Thanks.

@tdhock
Copy link
Member

tdhock commented Oct 29, 2021

?measure and search for measure( in datatable-reshape vignette (in github master)

@tdhock
Copy link
Member

tdhock commented Jul 1, 2023

Going back to the example in your original post, if you run that code using #5115, then you will get the informative error message below, instead of the warning and wrong result:

> melt(df,  measure.vars = measure(var, year, cols = cols, pattern = "([BC]XO)X(\\d{4})"))
Erreur : user should not provide cols argument to measure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment