
Is there any reason for j to evaluate when i returns 0 rows? #2662

Closed
MichaelChirico opened this issue Mar 7, 2018 · 10 comments

@MichaelChirico
Member

DT = data.table(a = NA)
DT[!is.na(a), plot(density(a))]

This returns

Error in density.default(a) : argument 'x' must be numeric

I'm surprised we got to j. Is there any reason we'd want to evaluate j here?

I guess this is to cover grouping situations where we want to make sure each group returns the same shape, even if it has 0 rows?

If so, is there not something we can do to be smarter about how to handle this?

(in this situation, I'm not even sure what my expected output is, TBH)

@franknarf1
Contributor

Is there any reason for j to evaluate when i returns 0 rows?

I use .N there all the time. As in:

X[Y, .N, by=.EACHI]
# or
X[cond, stopifnot(.N == 0L)]

I probably wrote the latter a dozen times yesterday.

I imagine I might also do j = if (.N == 0L) bah else gah there too. For your use-case, I guess this solves it. I also do j = {gah; NULL} though that doesn't work in your case.
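A runnable sketch of these guard idioms, using made-up data (the threshold and column names here are only for illustration):

```r
library(data.table)

DT <- data.table(a = c(1.5, 2.5, NA))

# j evaluates even when i selects zero rows, so guard with .N:
res <- DT[a > 10, if (.N == 0L) NULL else mean(a)]
# a > 10 matches no rows, so j returns NULL rather than NaN or an error

# With by = .EACHI every group evaluates j, including non-matching ones:
X <- data.table(id = c(1L, 1L, 2L), v = 1:3, key = "id")
Y <- data.table(id = c(1L, 3L), key = "id")
counts <- X[Y, .N, by = .EACHI]   # id 3 matches nothing, so its .N is 0
```

The second example is exactly why unconditional evaluation of j is useful: the non-matching group still produces a row with N = 0.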

Somewhat related: #2022

@MichaelChirico
Member Author

Interesting... so maybe there's a compromise to strike, where j is inspected for structure but not evaluated?

@franknarf1
Contributor

franknarf1 commented Mar 7, 2018

Hm, maybe you have a better example? Your example fails not only with zero rows but also with one (assuming you use a variable of a valid class, so not NA):

library(data.table)

# zero rows
data.table(a = NA_real_)[!is.na(a), plot(density(a))]
# Error in density.default(a) : 
#   need at least 2 points to select a bandwidth automatically

# one row
data.table(a = 1)[!is.na(a), plot(density(a))]
# Error in density.default(a) : 
#   need at least 2 points to select a bandwidth automatically

(So this seems to argue for the if (.N > threshold) do_stuff idiom here, I mean.)
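A sketch of that `if (.N > threshold)` idiom applied to the density example (the threshold of 2 is an assumption based on density()'s bandwidth requirement):

```r
library(data.table)

# density() needs at least 2 points to pick a bandwidth, so guard on .N:
DT <- data.table(a = c(NA_real_, 3.1))
res <- DT[!is.na(a), if (.N >= 2L) density(a) else NULL]
# only one non-NA value here, so j returns NULL instead of erroring
```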

@brianbolt

FWIW, the behaviour of data.table changed in 1.9.8 so that j is evaluated regardless of how many rows i returns. There is some lengthy discussion here, along with examples:

#759 (comment)

For my team, this change was/is a big problem. It caused us to lock data.table to 1.9.6 with no plans to update, as it is too much overhead to find all the uses of data.table that would be affected by this change. It would help us immensely if there were an option to not evaluate j when .N is 0.

@jangorecki
Member

I used to use if(.N > 0) to deal with it.

@brianbolt you should be able to use dtq package to identify each single case like this without much hassle https://github.com/jangorecki/dtq

@mattdowle
Member

mattdowle commented Mar 16, 2018

Here's the news item from 1.9.8, in the potentially-breaking-changes section at the top:

[screenshot of the 1.9.8 NEWS item]

The driver was := adding columns consistently (a side-effect of j that happens even when i is empty).

#759 contains this comment from me:

What is this expression please (roughly) and why does it break on empty input? I'm thinking we should focus on your RHS of := so that it returns numeric(), integer() or character() on empty input.
And you must have switches in your current code after this statement then? To cope with the new b column existing or not depending on whether i returns all-FALSE or not?

In this case, the example data and example expression are critical.

Consider the example provided so far:

DT = data.table(a = NA)
DT[!is.na(a), plot(density(a))]
Error in density.default(a) : argument 'x' must be numeric

but you get that error anyway, regardless of how many rows i returns:

> DT[, plot(density(a))]
Error in density.default(a) : argument 'x' must be numeric
> plot(density(NA))
Error in density.default(NA) : argument 'x' must be numeric
> plot(density(c(NA,NA,NA)))
Error in density.default(c(NA, NA, NA)) : argument 'x' must be numeric

In this comment, Uwe changed his RHS expression to be a different function that is robust to empty input. That's fixing the root cause and seems right to me.
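A minimal sketch of that root-cause fix for the := case (the data and RHS here are illustrative, not Uwe's actual expression): since `cumsum(numeric(0))` returns `numeric(0)`, the RHS is well-defined on empty input and the column can be added consistently.

```r
library(data.table)

DT <- data.table(a = c(1, 2, NA))

# RHS robust to empty input: cumsum(numeric(0)) is numeric(0), so :=
# can add column b consistently even when i matches no rows.
DT[a > 100, b := cumsum(a)]
has_b <- "b" %in% names(DT)   # b exists, NA for the unmatched rows
```

This is the behaviour the 1.9.8 change was designed to guarantee: the shape of the result no longer depends on whether i happens to match anything.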

I added a further comment here to show what I mean. Similarly, we need better examples to consider.

@brianbolt So you're stuck at 1.9.6 but it is working. You know you're depending on that behaviour in 1.9.6 (because it broke in 1.9.8), but pinpointing which lines depend on it is what you'd like, so you can migrate safely. If I understand correctly, you accept the new behaviour is the right way to go? If not, let's continue to discuss the cases, with better examples from you. If yes, what might work is a modified version of 1.9.6 that emits a trace every time i is empty and does not run j. That way you get no other breakage, but you've identified the dependent lines and can fix them. If that's what you'd like, maybe that's possible.

@mattdowle
Member

mattdowle commented Mar 16, 2018

Anyway, it's nice to hear people have production code using data.table. Most people weighing up the options don't think you exist. If folk who do have production code using data.table could think of ways to support the project, that would be great, @brianbolt.

@jaapwalhout

jaapwalhout commented Mar 16, 2018

@mattdowle I'm planning to use data.table in production code as well. It would be nice if there were a function to check R script files (.R) for breaking changes. If you like the idea, I could elaborate on this in a FR.

(Although this isn't support as such, IMO it could serve as a direction for improving the chances of data.table being used in production code.)

@mattdowle
Member

mattdowle commented Mar 16, 2018

I had a few more thoughts on this. In no particular order.

  1. The news item in 1.10.0 had a note:

[screenshot of the 1.10.0 NEWS note]

I do admit that I could have handled 1.9.8 better, and there should have been a migration option for this change. It wasn't "the best we could have done": I could have added an option too, and should have done.

  2. I much appreciate that @brianbolt has made a constructive suggestion. He hasn't just complained about the breaking change, but suggests there should have been an option. Agreed. I know that Google are stuck at v1.9.6 too and that something is preventing the upgrade, but I don't know what (I did ask). You aren't at Google by any chance, are you, Brian? Is this the issue? I know that Google have a very strict single monolithic version model for all projects, described in detail in this paper. Is this what we're dealing with?

  3. @jaapwalhout Yes, please elaborate. What about somehow analyzing changes to the existing tests in tests.Rraw? It's likely very tough though.

  4. When you're upgrading the software used in production systems, everyone knows it is very tough anyway. It always breaks. You don't make any changes to any versions of anything unless you do it in a staging environment first. You run the system in parallel in staging for a while and fix the problems there. Once you have a mirror system that's upgraded, working, and producing the same results as un-upgraded production, say for a full week with no problems, staging becomes production. You just flip a flag. It takes an enormous amount of effort to tee your feeds and duplicate all activity to staging, but you have to do it. When you manage a production system this way, it becomes easier to upgrade and maintain. The longer you leave production un-upgraded, the harder the upgrade is when you finally are forced to do it. Trying to do it in one weekend (when production is offline), with the risk of not being back up by Monday morning, is the way it was done in the '90s.

  5. Are we really talking about one mistake here, the breaking change without an option in 1.9.8? Or are there others?

The fault in this case was not providing an upgrade option. Over in PR #2652 I've been tamed by others on this project, and now we're not only providing options to return the old behaviour but also not even changing the default yet, instead giving notice that it will change in future. This is taking an enormous amount of time for so few people on this project. I haven't even started the CRAN_Release procedures yet to check reverse dependencies (over 300). We should have a pre-release code freeze too at some point.

@jangorecki
Member

jangorecki commented Jan 29, 2019

The question asked by Michael in this issue has been answered.
The follow-up about breaking changes was also well explained by Matt. I can add that we try to make transitions around breaking changes as smooth as possible, providing old-behaviour options, warnings, and even more sophisticated mechanisms, before forcing users to update their code.
For production systems I would recommend putting tests inside the R package you use in production, and running the test environment using options(repos=c("https://Rdatatable.github.io/data.table","your_cran_mirror")).
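That repository setup might look like the following sketch; the CRAN mirror URL here is a stand-in for whichever mirror you actually use.

```r
# Put the data.table development repo ahead of a CRAN mirror, so tests
# run against the in-development version before it reaches CRAN.
options(repos = c(
  datatable = "https://Rdatatable.github.io/data.table",
  CRAN      = "https://cloud.r-project.org"   # placeholder mirror
))
repos <- getOption("repos")
```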
