Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dcast: add argument to error on "Aggregate function missing, defaulting to 'length'" #5386

Open
ecoRoland2 opened this issue May 19, 2022 · 13 comments

Comments

@ecoRoland2
Copy link

We all love seeing Aggregate function missing, defaulting to 'length' as it usually indicates a bug (possibly upstream resulting in unexpected input). I really need this to be an error but can't even use options(warn = 2) because dcast only gives a message and not a warning.

I suggest modifying dcast.data.table. One easy option would be a strict argument that defaults to FALSE but promotes the message to an error if set to TRUE. If we want to avoid additional arguments, we could have fun.aggregate accept NA as input with the same result.

Right now the best way to catch this case appears to be stopifnot(anyDuplicated(DT[, .(LHS, RHS)]) == 0L) but I dislike littering my code with this.

@ecoRoland2 ecoRoland2 changed the title dcast: add option to error on "Aggregate function missing, defaulting to 'length'" dcast: add argument to error on "Aggregate function missing, defaulting to 'length'" May 19, 2022
@tdhock tdhock added the reshape dcast melt label May 19, 2022
@AltfunsMA
Copy link

Upvoted. Any solution seems better than the current state. Even simply changing the 'message' to 'warning' would make debugging a lot easier if we've got a number of calls to dcast

@Nj221102
Copy link
Contributor

Nj221102 commented Mar 29, 2024

Hi , i am working on this issue
my approach

  1. Introduce a new argument, "strict", to the function signature, defaulting to "FALSE".
  2. Modify the function logic to check the value of "strict" and raise an error instead of a warning, when its true.

is this all that needs to be done here ?

@tdhock
Copy link
Member

tdhock commented Mar 29, 2024

I agree that a warning would be more useful than a message. Rather than adding a new strict argument, I would propose just changing the message to a warning, is that OK with others?

@MichaelChirico
Copy link
Member

One free approach for downstreams is to throw on message since it is a condition:

DT = data.table(a = rep(1:2, 3), b = rep(1:2, each=3L), c = 1:6)
tryCatch(dcast(DT, a ~ b, value.var="c"), message=stop)

You could also accomplish this with globalCallingHandlers():

globalCallingHandlers(message = stop)
dcast(DT, a ~ b, value.var="c")

(or message=warning with options(warn=2))

So, I'm not sure we need to do anything here. message() does seem like a roughly appropriate level of condition, though I'm not opposed to making it a warning. I would definitely not add a new argument for this.

This does seem like a case for #5913 -- if we instead emitted a classed message like dt_missing_agg_function_message, it would be much more palatable to add a handler for that than a global upgrade of message conditions:

globalCallingHandlers(dt_missing_agg_function_message = stop)

@ecoRoland2
Copy link
Author

ecoRoland2 commented Apr 8, 2024

I would still argue it should be a warning and not a message. I have never actually wanted length as aggregate function. Whenever I have seen this message, it was unintended and indicated issues in upstream code or wrong assumptions about the input data.

The approach suggested by @MichaelChirico is elegant but data.table users who are familiar with globalCallingHandlers are probably a rare exception. So, this would need to be well documented in help("dcast.data.table").

@MichaelChirico
Copy link
Member

TBF I have wanted fun.aggregate=length many times, and there are plenty of such usages on GitHub:

https://github.com/search?q=lang%3AR+%2F%28%5B%5E%3A%5D%7C%5B%5E2%5D%3A%3A%29dcast%5B%28%5D%2F+%2Ffun%5B.%5Daggregate%5Cs*%3D%5Cs*length%2F+-path%3Adcast.data.table+-path%3Afcast.R&type=code

But would be fine with needing to provide that explicitly when needed instead of relying on it as a default. @ecoRoland2 / @Nj221102 would you like to file a PR upgrading the message() to warning()?

@ecoRoland2
Copy link
Author

I suspect that there are quite a few real-world applications with undiscovered bugs due to this "silent default".

@MichaelChirico If you are fine with that, wouldn't it follow that the default fun.aggregate=length should be formally deprecated? I don't know your procedure for that but it probably looks like 1. Add a deprecation message in the next release, 2. Turn the message into a warning in the release after that, 3. Throw an error if there are duplicates and no fun.aggregate is supplied.

@Nj221102
Copy link
Contributor

Nj221102 commented Apr 15, 2024

TBF I have wanted fun.aggregate=length many times, and there are plenty of such usages on GitHub:

https://github.com/search?q=lang%3AR+%2F%28%5B%5E%3A%5D%7C%5B%5E2%5D%3A%3A%29dcast%5B%28%5D%2F+%2Ffun%5B.%5Daggregate%5Cs*%3D%5Cs*length%2F+-path%3Adcast.data.table+-path%3Afcast.R&type=code

But would be fine with needing to provide that explicitly when needed instead of relying on it as a default. @ecoRoland2 / @Nj221102 would you like to file a PR upgrading the message() to warning()?

@MichaelChirico for now what would be better , upgrade this from a basic "message" to a "warning," or just add the deprecation message as suggested by @ecoRoland2.

@tdhock
Copy link
Member

tdhock commented Apr 15, 2024

I don't think deprecation of length as default would be a good idea, that is a really big change.

@tdhock
Copy link
Member

tdhock commented Apr 15, 2024

also I'm OK with not changing anything, and just using @MichaelChirico 's work-around,

tryCatch(dcast(DT, a ~ b, value.var="c"), message=stop)

@MichaelChirico
Copy link
Member

I am not convinced deprecation is a good idea. Even upgrading to warning can be a breaking change (for packages and strict production scripts where options(warn=2) is set).

@ecoRoland2
Copy link
Author

Using length as default and throwing a message is just a very bad design decision. The fact that a message was considered necessary is sufficient proof of that. (And I know that it wasn't a decision made by the data.table maintainers.)

In my opinion, upgrading to a warning is a breaking change. Basically, you will have to supply length explicitely because a warning isn't something you should ignore. That's exactly the same consequence as deprecation would have.

I would be fine if it was throwing a classed message (and in that case might even consider adding the global calling handler to my Rprofile.site). But I still think users shouldn't need to do that to get safe behavior.

@MichaelChirico
Copy link
Member

Using length as default and throwing a message is just a very bad design decision

Sorry, I disagree. I really only support upgrading message --> warning as the class of signal. #5914 is now merged, meaning we can make the signal class more specific than the R-provided ones as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants