
Is there any reason for j to evaluate when i returns 0 rows? #2662

Closed
MichaelChirico opened this issue Mar 7, 2018 · 10 comments

@MichaelChirico
Member

DT = data.table(a = NA)
DT[!is.na(a), plot(density(a))]

This returns

Error in density.default(a) : argument 'x' must be numeric

I'm surprised we got to j. Is there any reason we'd want to evaluate j here?

I guess this is to cover grouping situations where we want to make sure each group returns the same shape, even if it has 0 rows?

If so, is there not something we can do to be smarter about how to handle this?

(in this situation, I'm not even sure what my expected output is, TBH)

@franknarf1
Contributor

Is there any reason for j to evaluate when i returns 0 rows?

I use .N there all the time. As in:

X[Y, .N, by=.EACHI]
# or
X[cond, stopifnot(.N == 0L)]

I probably wrote the latter a dozen times yesterday.

I imagine I might also do j = if (.N == 0L) bah else gah there too. For your use-case, I guess this solves it. I also do j = {gah; NULL} though that doesn't work in your case.
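A runnable sketch of these guard idioms, using made-up data (the threshold and column names here are only for illustration):

```r
library(data.table)

DT <- data.table(a = c(1.5, 2.5, NA))

# j evaluates even when i selects zero rows, so guard with .N:
res <- DT[a > 10, if (.N == 0L) NULL else mean(a)]
# a > 10 matches no rows, so j returns NULL rather than NaN or an error

# With by = .EACHI every group evaluates j, including non-matching ones:
X <- data.table(id = c(1L, 1L, 2L), v = 1:3, key = "id")
Y <- data.table(id = c(1L, 3L), key = "id")
counts <- X[Y, .N, by = .EACHI]   # id 3 matches nothing, so its .N is 0
```

The second example is exactly why unconditional evaluation of j is useful: the non-matching group still produces a row with N = 0.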

Somewhat related: #2022

@MichaelChirico
Member Author

Interesting... so maybe there's a compromise to strike, where j is inspected for structure but not evaluated?

@franknarf1
Contributor

franknarf1 commented Mar 7, 2018

Hm, maybe you have a better example? Your example fails not only with zero rows but also with one (assuming you use a variable of a valid class, so not NA):

library(data.table)

# zero rows
data.table(a = NA_real_)[!is.na(a), plot(density(a))]
# Error in density.default(a) : 
#   need at least 2 points to select a bandwidth automatically

# one row
data.table(a = 1)[!is.na(a), plot(density(a))]
# Error in density.default(a) : 
#   need at least 2 points to select a bandwidth automatically

(So this seems to argue for the if (.N > threshold) do_stuff idiom here, I mean.)
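A sketch of that `if (.N > threshold)` idiom applied to the density example (the threshold of 2 is an assumption based on density()'s bandwidth requirement):

```r
library(data.table)

# density() needs at least 2 points to pick a bandwidth, so guard on .N:
DT <- data.table(a = c(NA_real_, 3.1))
res <- DT[!is.na(a), if (.N >= 2L) density(a) else NULL]
# only one non-NA value here, so j returns NULL instead of erroring
```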

@brianbolt

FWIW, the behaviour of data.table changed in 1.9.8 so that j is evaluated regardless of how many rows i returns. There is some lengthy discussion here, along with examples:

#759 (comment)

For my team, this change was/is a big problem. It caused us to lock data.table to 1.9.6 with no plans to update, as it is too much overhead to find all the uses of data.table that would be affected by this change. It would help us immensely if there were an option to not evaluate j when .N is 0.

@jangorecki
Member

I used to use if(.N > 0) to deal with it.

@brianbolt you should be able to use dtq package to identify each single case like this without much hassle https://github.com/jangorecki/dtq

@mattdowle
Member

mattdowle commented Mar 16, 2018

Here's the news item from 1.9.8, in the potentially-breaking-changes section at the top:

[screenshot of the 1.9.8 NEWS item]

The driver was := adding columns consistently (a side-effect of j that happens even when i is empty).

#759 contains this comment from me:

What is this expression please (roughly) and why does it break on empty input? I'm thinking we should focus on your RHS of := so that it returns numeric(), integer() or character() on empty input.
And you must have switches in your current code after this statement then? To cope with the new b column existing or not depending on whether i returns all-FALSE or not?

In this case, the example data and example expression are critical.

Consider the example provided so far:

DT = data.table(a = NA)
DT[!is.na(a), plot(density(a))]
Error in density.default(a) : argument 'x' must be numeric

but you get that error anyway, regardless of how many rows i returns:

> DT[, plot(density(a))]
Error in density.default(a) : argument 'x' must be numeric
> plot(density(NA))
Error in density.default(NA) : argument 'x' must be numeric
> plot(density(c(NA,NA,NA)))
Error in density.default(c(NA, NA, NA)) : argument 'x' must be numeric

In this comment, Uwe changed his RHS expression to be a different function that is robust to empty input. That's fixing the root cause and seems right to me.
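A minimal sketch of that root-cause fix for the := case (the data and RHS here are illustrative, not Uwe's actual expression): since `cumsum(numeric(0))` returns `numeric(0)`, the RHS is well-defined on empty input and the column can be added consistently.

```r
library(data.table)

DT <- data.table(a = c(1, 2, NA))

# RHS robust to empty input: cumsum(numeric(0)) is numeric(0), so :=
# can add column b consistently even when i matches no rows.
DT[a > 100, b := cumsum(a)]
has_b <- "b" %in% names(DT)   # b exists, NA for the unmatched rows
```

This is the behaviour the 1.9.8 change was designed to guarantee: the shape of the result no longer depends on whether i happens to match anything.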

I added a further comment here to show what I mean. Similarly, we need better examples to consider.

@brianbolt So you're stuck at 1.9.6 but it is working. You know you're depending on that behaviour in 1.9.6 (because it broke in 1.9.8), but pinpointing which lines depend on it is what you'd like, so you can migrate safely. If I understand correctly, you accept the new behaviour is the right way to go? If not, let's continue to discuss the cases, with better examples from you. If yes, what might work is a modified version of 1.9.6 that emits a trace every time i is empty and does not run j. That way you get no other breakage, but you've identified the dependent lines and can fix them. If that's what you'd like, maybe that's possible.

@mattdowle
Member

mattdowle commented Mar 16, 2018

Anyway, it's nice to hear people have production code using data.table. Most people weighing up the options don't think you exist. If folk who do have production code using data.table could think of ways to support the project, that would be great, @brianbolt.

@jaapwalhout

jaapwalhout commented Mar 16, 2018

@mattdowle I'm planning to use data.table in production code as well. It would be nice if there were a function to check R script files (.R) for breaking changes. If you like the idea, I could elaborate on this in a FR.

(Although this isn't support as such, IMO it could serve as a direction for improving the chances of data.table being used in production code.)

@mattdowle
Member

mattdowle commented Mar 16, 2018

I had a few more thoughts on this. In no particular order.

  1. The news item in 1.10.0 had a note:

[screenshot of the 1.10.0 NEWS note]

I do admit that I could have handled 1.9.8 better, and there should have been a migration option for this change. It wasn't "the best we could have done": I could have added an option too, and should have done.

  2. I much appreciate that @brianbolt has made a constructive suggestion. He hasn't just complained about the breaking change, but suggests there should have been an option. Agreed. I know that Google are stuck at v1.9.6 too and that something is preventing the upgrade, but I don't know what (I did ask). You aren't at Google by any chance, are you, Brian? Is this the issue? I know that Google have a very strict single monolithic version model for all projects, described in detail in this paper. Is this what we're dealing with?

  3. @jaapwalhout Yes, please elaborate. What about somehow analyzing changes to the existing tests in tests.Rraw? It's likely very tough though.

  4. When you're upgrading the software used in production systems, everyone knows it is very tough anyway. It always breaks. You don't make any changes to any versions of anything unless you do it in a staging environment first. You run the system in parallel in staging for a while and fix the problems there. Once you have a mirror system that's upgraded, working, and producing the same results as un-upgraded production, say for a full week with no problems, staging becomes production. You just flip a flag. It takes an enormous amount of effort to tee your feeds and duplicate all activity to staging, but you have to do it. When you manage a production system this way, it becomes easier to upgrade and maintain. The longer you leave production un-upgraded, the harder the upgrade is when you finally are forced to do it. Trying to do it in one weekend (when production is offline), with the risk of not being back up by Monday morning, is the way it was done in the '90s.

  5. Are we really talking about one mistake here, the breaking change without an option in 1.9.8? Or are there others?

The fault in this case was not providing an upgrade option. Over in PR #2652 I've been tamed by others on this project, and now we're not only providing options to return the old behaviour but also not even changing the default yet, instead giving notice that it will change in future. This is taking an enormous amount of time for so few people on this project. I haven't even started the CRAN_Release procedures yet to check reverse dependencies (over 300). We should have a pre-release code freeze too at some point.

@jangorecki
Member

jangorecki commented Jan 29, 2019

The question asked by Michael in this issue has been answered.
The follow-up about breaking changes was also well explained by Matt. I can add that we try to make transitions around breaking changes as smooth as possible, providing old-behaviour options, warnings, and even more sophisticated mechanisms, before forcing users to update their code.
For production systems I would recommend putting tests inside the R package you use in production, and running the test environment using options(repos=c("https://Rdatatable.github.io/data.table","your_cran_mirror")).
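That repository setup might look like the following sketch; the CRAN mirror URL here is a stand-in for whichever mirror you actually use.

```r
# Put the data.table development repo ahead of a CRAN mirror, so tests
# run against the in-development version before it reaches CRAN.
options(repos = c(
  datatable = "https://Rdatatable.github.io/data.table",
  CRAN      = "https://cloud.r-project.org"   # placeholder mirror
))
repos <- getOption("repos")
```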
