Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: expansion of melt functionality for handling names of output #1547

Closed
MichaelChirico opened this issue Feb 25, 2016 · 17 comments
Closed

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 25, 2016

I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and melt, but there's some crucial information lost in the process -- variable names are tossed & it's not clear we can predict the resulting order.

An example:

set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
               paste0("meas", rep(1:3, c(3, 3, 2)), 
                      rep(c("_jan", "_feb", "_mar"))))
DT
#        meas1_jan meas1_feb  meas1_mar  meas2_jan
#  1: 0.6600937557 0.3694227 0.04429776 0.02691871
#  2: 0.5562533571 0.2260477 0.99935621 0.62940965
#  3: 0.7800425424 0.9500860 0.05847563 0.90400149
#  4: 0.3372507696 0.1853700 0.96434815 0.23819722
#  5: 0.0008951684 0.2743278 0.33468493 0.55100025
#  6: 0.4064611420 0.6773607 0.76984110 0.03732329
#  7: 0.5057564366 0.6481860 0.46362786 0.10610388
#  8: 0.2232209216 0.3236526 0.94694898 0.32251033
#  9: 0.9359388519 0.1219411 0.79383274 0.52573046
# 10: 0.1367496471 0.0121765 0.80466101 0.86377631
#       meas2_feb  meas2_mar  meas3_jan meas3_feb
#  1: 0.438328012 0.48073709 0.15475894 0.6031853
#  2: 0.951975449 0.38903743 0.41983518 0.1806789
#  3: 0.006685105 0.53661894 0.01102345 0.6490597
#  4: 0.719372826 0.49205557 0.24464973 0.9969787
#  5: 0.318251195 0.98845165 0.75659034 0.3054178
#  6: 0.117685706 0.10667707 0.60810269 0.1472457
#  7: 0.210446290 0.76481919 0.70049390 0.4345142
#  8: 0.900549510 0.68583183 0.93117619 0.3676016
#  9: 0.628168668 0.02526911 0.55742879 0.7729294
# 10: 0.498467443 0.17884387 0.97037215 0.1924867

We want to melt grouping each month's data, so patterns is quite helpful:

melt(DT[ , lapply(.SD, mean)],
     measure.vars = patterns("_jan", "_feb", "_mar"))

Looks beautiful, but the output less so:

#    variable    value1    value2    value3
#1:        1 0.4542663 0.3788571 0.6180074
#2:        2 0.4204972 0.4789930 0.4648342
#3:        3 0.5354431 0.4650098        NA

We've lost a lot of info. Does value1 correspond to jan, feb, or mar? Does variable == 1 mean meas1, meas2, or meas3?

I hope that value1 means jan, value2 means feb, and value3 means mar, but this is unclear; ditto that valuek corresponds to variable == k. This is especially true if the column order of the input is potentially unknown.

The value.name and variable.name fields are pretty useless to help us here.

It seems the robust way to deal with this as of now is to specify the measure.vars in a list, like so:

melt(DT[ , lapply(.SD, mean)],
     measure.vars = list(paste0("meas", 1:3, "_jan"),
                         paste0("meas", 1:3, "_feb"),
                         paste0("meas", 1:2, "_mar")))

With this, confidence is restored in the order of output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and to make patterns seem a lot less helpful as a function.

@arunsrinivasan
Copy link
Member

Good point. It'd be nice to have value column and variable levels filled by guessing. But I won't be working on melt/cast anytime soon for now. This stuff is in R-side, IIRC. PRs are of course welcome.

@jangorecki do you think you can take a look at this? We'll need a way to extract the names and levels for value/variable column..

@mrdwab
Copy link

mrdwab commented Feb 26, 2016

@MichaelChirico, It seems like we're thinking along the same lines for these things.

For what it's worth, I always find that "stubs" at the end of a variable name are more cumbersome to deal with. This is the case even with base R's reshape function.

Here's something that I had put together. Something in there slows it down to ~ half the speed of just using melt (but it's still very fast).

Even with that function, I would propose changing the name format to "jan_mean1" and so on.

Here's an example:

set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
               paste0("meas", rep(1:3, c(3, 3, 2)), 
                      rep(c("_jan", "_feb", "_mar"))))
ReshapeLong_(setnames(DT[, lapply(.SD, mean)], 
                      gsub("(.*)_(.*)", "\\2_\\1", names(DT))),
             stubs = c("jan", "feb", "mar"))
#    variable       jan       feb       mar
# 1:   _meas1 0.4542663 0.3788571 0.6180074
# 2:   _meas2 0.4204972 0.4789930 0.4648342
# 3:   _meas3 0.5354431 0.4650098        NA

My rudimentary version of this was merged.stack in "splitstackshape", which has an atStart argument. However, it needs an "id" column the way it was written:

merged.stack(DT[, lapply(.SD, mean)][, id := sequence(.N)], 
             var.stubs = c("_jan", "_feb", "_mar"), 
             sep = "var.stubs", atStart = FALSE)
#    id .time_1      _jan      _feb      _mar
# 1:  1   meas1 0.4542663 0.3788571 0.6180074
# 2:  1   meas2 0.4204972 0.4789930 0.4648342
# 3:  1   meas3 0.5354431 0.4650098        NA

@mrdwab
Copy link

mrdwab commented Feb 26, 2016

@Arun, I think we were discussing about using a character vector in patterns (which could then be reused in the value.name argument). (Probably in a chat on SO.)

@jangorecki
Copy link
Member

@arunsrinivasan sure
@mrdwab be aware "@Arun" won't notify our Arun, it's different person :)

@jangorecki jangorecki self-assigned this Feb 26, 2016
@arunsrinivasan
Copy link
Member

@jangorecki this is not a priority just so you know :-).

@jangorecki
Copy link
Member

@MichaelChirico I don't see why value.name would be useless. If you pass a vector of names corresponding to pattern (or list elements if list used) as value.name = c("jan","feb","mar") you get expected column names. Still this could be extended to extract from names of measure.vars list / patterns(...), but it wouldn't unlock any new capabilities.
I do agree on variable column values, currently for multi measure vars it is just an integer sequence, so no straight way to put expected values there. This seems to be Cfmelt code, what I could do is to remap integers after Cfmelt processing.

@jangorecki
Copy link
Member

library(data.table)

set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
               paste0("meas", rep(1:3, c(3, 3, 2)), 
                      rep(c("_jan", "_feb", "_mar"))))

melt(DT[ , lapply(.SD, mean)],
     measure.vars = list(paste0("meas", 1:3, "_jan"),
                         paste0("meas", 1:3, "_feb"),
                         paste0("meas", 1:2, "_mar")),
     variable.name = "var.label",
     value.name = c("jan","feb","mar"))
#   var.label       jan       feb       mar
#1:     meas1 0.4542663 0.3788571 0.6180074
#2:     meas2 0.4204972 0.4789930 0.4648342
#3:     meas3 0.5354431 0.4650098        NA
melt(DT[ , lapply(.SD, mean)],
     measure.vars = patterns("_jan", "_feb", "_mar"),
     variable.name = "var.label",
     value.name = c("jan","feb","mar"))
#   var.label       jan       feb       mar
#1:     meas1 0.4542663 0.3788571 0.6180074
#2:     meas2 0.4204972 0.4789930 0.4648342
#3:     meas3 0.5354431 0.4650098        NA

it trims value.name values (also ,_) from column names that goes intovariablecolumn values.
In general it should be handled with vectorized longest common substring, which in base R seems to be quite a big block of code.
For exact changes see: master...jangorecki:named-melt - it is yet very error prone (fails one example now), it passvalue.nametogsubpattern argument so any special chars like dot will now have unexpected output.
Before I go any further with that I would prefer to be sure we want such solution, to post-processCfmeltwithvariable column values remapping. Feel free to pick it up further.

@MichaelChirico
Copy link
Member Author

@jangorecki hmm, I thought I had tried that for value.name, must have been too late at night to have been thinking straight.

Certainly this capability is not clear from the man page:

value.name: name for the molten data values column. The default name is 'value'

Not clear that it accepts name_s_ -- reads as if value.name must be passed length-one character.

While we're on the man page, this phrasing is misleading:

measure.vars: ... Use the function patterns to provide multiple patterns conveniently

patterns is not a function, as evidenced by its implementation:

measure.sub = substitute(measure.vars)
if (is.call(measure.sub) && measure.sub[[1L]] == "patterns") {
    measure.vars = lapply(eval(measure.sub), grep, names(data))
}

If it were a function, this workaround/approach would work

mos <- c("jan","feb","mar")

melt(DT[ , lapply(.SD, mean)],
     measure.vars = do.call("patterns",as.list(paste0("_", mos))),
     value.name = mos)

@jangorecki
Copy link
Member

For programmatic use you need to have that call in place. You can do it with substitute, just wrap below call with eval.

substitute(
    melt(DT[ , lapply(.SD, mean)],
         measure.vars = .measure.vars,
         value.name = mos),
    list(.measure.vars = as.call(c(as.name("patterns"),paste0("_", mos))))
)

Anyway I'm not quite convinced to Cfmelt post-process remapping variable values. Any better solution from me would probably need to wait until I pick up some C skills.

@jangorecki jangorecki removed their assignment Mar 3, 2016
@MichaelChirico
Copy link
Member Author

@jangorecki certainly the dox can be improved, value.name should be improved:

name for the molten data values column. The default name is 'value'

Almost impossible to guess that value.name will correctly handle an input of length >1 from that

@franknarf1
Copy link
Contributor

This may have already been covered/ obvious (I see Jan mentioned "this could be extended to extract from names of measure.vars list / patterns(...)"), but it would be nice to write value.name into patterns:

melt(crm, meas=patterns(N = "^N", a = "^a", b = "^b", d = "^d", g = "^g"))

(Borrowed from an SO question.)

@arunsrinivasan
Copy link
Member

Oh I like the named patterns function.

@franknarf1
Copy link
Contributor

(I think this still falls under the same general issue:)

library(data.table)
df <- data.table(
  a1 = c(1,2,3), 
  a2 = c(2,3,4), 
  b1 = c(4,5,6), 
  b2 = c(5,6,7)
)
aux <- data.table(
  id = c("a1", "a2", "b1", "b2"),
  group = c("a", "a", "b", "b")
)

Suppose the goal is to have (a1, b1) rbinded over (a2, b2):

# current syntax
aux_sp = split(aux$id, aux$group)
melt(df, meas = aux_sp, value.name = names(aux_sp))

# desired syntax, with names taken from the named list aux_sp
melt(df, meas = aux_sp)

Taken from SO: http://stackoverflow.com/q/42375113/

@MichaelChirico
Copy link
Member Author

@franknarf1 I was just coming back here to suggest the same thing about naming the "arguments" to patterns. I think that's a very natural implementation for this.

@MichaelChirico
Copy link
Member Author

@mrdwab
Copy link

mrdwab commented Jan 18, 2018

@MichaelChirico , @mattdowle , I like the addition of named patterns. However, I think there were two embedded questions in this issue, but only one of them is closed. Isn't part of the question also how to make the "variable" column retain more descriptive details when melting, and not just return integer values?

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 18, 2018

@mrdwab indeed, I had missed that, however #2551 is dedicated to specifically that (and is on my to-do list), so rest assured the feature is still forthcoming 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants