Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

Open
danielkrizian opened this issue Oct 5, 2014 · 24 comments

Comments

@danielkrizian
Copy link

@danielkrizian danielkrizian commented Oct 5, 2014

Please consider the following feature request.

Currently, nomatch accepts only NA or 0:

DT[i, nomatch = NULL]
# Error in if (!is.na(nomatch) && nomatch != 0L) stop("nomatch must either be NA or 0, or (ideally) NA_integer_ or 0L") : 

By expanding it to accept wider range of values, one can perform for example na.fill operation during the join operation, not after, lending to faster code. Related SO

Proposed behavior:

DT[i, nomatch = any_character_or_real]     # fill with supplied value; could be 0, "foo", even NA ...
DT[i, nomatch = NA]                        # fill with NA. Special case of above. Same behavior as today.
DT[i, nomatch = 0]                         # fill with zeroes. Special case of above. 
DT[i, nomatch = NULL]                      # omit non-matched rows. Equivalent to today's nomatch = 0
DT[i, nomatch = locf]                      # reserved symbol. Equivalent to roll=+Inf
DT[i, nomatch = nocb]                      # reserved symbol. Equivalent to roll=-Inf

Alternatively, this could be achieved by introducing new argument fill, accepting any value:
[.data.table(..., fill=NA, ...)


jangorecki: adding some features to scope

  • rollequal=TRUE|FALSE
nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled
@danielkrizian danielkrizian changed the title Expand `nomatch` to accept any value - control fills, rolls, omits all in one place Expand `nomatch` to accept any value (control fills, rolls, omits during joins) Oct 5, 2014
@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Dec 13, 2014

Merging with #940 as it is the same feature.
Extending Daniel's examples for an prepared error message in case of nomatch.

data.table join nomatch=quote(stop())

We have 2 dts:

library(data.table)
dict <- data.table(symbol = c("a","b","c"), ratio = c(5,10,15), key = "symbol")
dt <- data.table(symbol = c("a","b","c","d"), value = rnorm(4))

Which we want join:

dict[dt][,list(symbol, new_value = ratio * value)]

Consider the case when we expect to have match on all data in the join.
In case of nomatch we want to raise error.
Currently it is possible using:

new_dt <- dict[dt]
if(any(is.na(new_dt$ratio))) stop("missing data in the reference dictionary")

Which makes us unable to make chaining in such case.
If we could use something like this:

dict[dt,nomatch=quote(stop("missing data in the reference dictionary"))
     ][,list(symbol, new_value = ratio * value)]

just to stop process and raise the error in case of any nomatch.

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Apr 23, 2015
@arunsrinivasan

This comment has been minimized.

Copy link
Member

@arunsrinivasan arunsrinivasan commented Apr 23, 2015

Added to 1.9.6. Will add a global option and issue a warning message so that transition can be made in the next major release. Setting the global option will allow users to use the new implementation already.

nomatch = NA - current behaviour (and default).
nomatch = 0 - fills with 0 instead of removing non-matched rows.
nomatch = NULL - current behaviour of nomatch = 0.
nomatch = LOCF(val, rollends=TRUE/FALSE) and nomatch = NOCB(val, rollends=TRUE/FALSE) - could potentially replace roll and rollends arguments??
nomatch = val - should type convert appropriately and use val for all columns.

@arunsrinivasan arunsrinivasan added the High label Apr 23, 2015
@DavidArenburg

This comment has been minimized.

Copy link
Member

@DavidArenburg DavidArenburg commented Apr 23, 2015

I think that second option is a bit revolutionary and potentially going to "piss" a lot of people who either don't read documentation/warnings or will have to rewrite huge code chunks...

@eantonya

This comment has been minimized.

Copy link
Contributor

@eantonya eantonya commented Apr 23, 2015

It's bad enough that this will break a lot of code, but now we'd have to type even more characters for this very common operation? Uhh...

I don't like just saying I dislike smth without presenting a viable alternative, but the only thing I can think of is a new argument, which sucks in its own little way :-\

@arunsrinivasan

This comment has been minimized.

Copy link
Member

@arunsrinivasan arunsrinivasan commented Apr 23, 2015

On extra characters, I guess you're talking about "roll" arguments? I'm not quite certain what are the extra characters otherwise.. And, I am not sure about the "roll" part, hence the "??". The idea is to first implement nomatch = <any_value> first.

# previously
X[Y, roll = Inf, rollends=TRUE]
# after
X[Y, nomatch = LOCF(Inf, TRUE)] # default case can be just nomatch = "locf"

Is this what you find hard?

For one, I find it straightforward, as "roll" actually is used to "fill in" missing values. Second, the number of arguments that are mutually exclusive in [.data.table are one too many, and this helps address that.. (by removing roll and rollends as separate arguments). Right now, using roll=TRUE and nomatch = 0L returns the rolled result, which I find confusing. Providing nomatch = roll is explicit in that it asks the results to be rolled in the event of a no match.

On breaking code.. like I said before, in this version (1.9.6), we'll have a global argument which will be set to current behaviour and will issue a warning that nomatch = 0L should be replaced to nomatch = NULL and the current behaviour will be deprecated in the next version (which is plenty time), and we'll add a new vignette to explain the benefits. There will also be an option to switch off this warning. (or at least this is what I've in mind right now). @mattdowle thoughts?

I think this is a hugely welcoming change, similar to by = .EACHI (which you advocated btw, @eddi, which also would have resulted in lot of code being broken). The point is to bring in as much consistency as possible, by transitioning smoothly (and by giving ample time for users to shift their code), and this kind of "major" change shouldn't happen a lot (it's a lot of effort for us as well), but it's essential here.

@eantonya

This comment has been minimized.

Copy link
Contributor

@eantonya eantonya commented Apr 23, 2015

I'm talking about typing NULL instead of 0, I didn't even think about the roll. For me, nomatch=0 has been a major annoyance from the very beginning, and I'm not too happy about it becoming even more annoying.

That said I totally understand the need for filling with other numbers/throwing an error, I'm just not particularly happy with this implementation.

Actually the root of my annoyance comes from nomatch=NA being the default, instead of nomatch=NULL/0. Since we're talking about a breaking change, can we please discuss making NULL/0 the default? I'll make a short "pro" argument: it's the default for base merge and ime by far the more common type of a join/merge.

@arunsrinivasan arunsrinivasan modified the milestones: v1.9.8, v1.9.6 Apr 23, 2015
@arunsrinivasan

This comment has been minimized.

Copy link
Member

@arunsrinivasan arunsrinivasan commented Apr 23, 2015

Taking into account what Gsee mentioned about having a release as fallback in case changes are too major (and possibly affecting current code too much), I've changed it to 1.9.8. 1.9.6 will keep things as expected for people who don't want their code to be affected (and they need not upgrade if they wish so).

I agree, purely on consistency, that nomatch = NULL should be the default. IMO, we'll have to do that as well. The default should be an inner join.

This means, this'll be default in 2.0.0 (depending on what Matt feels as well).

@eantonya

This comment has been minimized.

Copy link
Contributor

@eantonya eantonya commented Apr 23, 2015

Ok, if nomatch=NULL is made the default, then I have no issues, and the change would be great.

@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Apr 23, 2015

As for breaking the code. The sooner we have it as an option to better tested it will be, so if you are not in rush with 1.9.6 (sad) it can be worth to put it there. But for the breaking changes - moving from option to default - the best will be 2.0.0 release.

If we consider adding new argument to [ to it is worth to review other open issues which would like to get own new arguments. Maybe they can be nicely combined. The one very similar would be #614, or maybe also #830 would need one - this can be combined with roll maybe?

PS. do you drop the nomatch=quote(stop("my error")) FR at all?

@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Sep 27, 2015

If such breaking change is going to happened I think we should put some info packageStartupMessage yet before 1.9.8 release - example one below.

"There is a breaking change coming in next stable release to CRAN (2.0.0) related to `nomatch` argument. That affects joins and binary search filtering. To keep the current behavior of `nomatch` in 2.0.0+ recode missing `nomatch` to `nomatch=NA` and `nomatch=0` to `nomatch=NULL` / missing. Alternatively use `options('datatable.oldnomatch'=TRUE)`."

Maybe we can also provide an option datatable.newnomatch which could makes new behavior available in 1.9.8? This can make switch easier.

@MichaelChirico

This comment has been minimized.

Copy link
Member

@MichaelChirico MichaelChirico commented Jan 27, 2016

adding that this feature would make a lot of problems simpler and that breaking existing code with nomatch = 0 is worth it in the long run as I've always thought that choice of notation was strange anyway.

@gsee

This comment has been minimized.

Copy link

@gsee gsee commented Jan 29, 2016

Meanwhile, there are people writing production code today that need the current functionality of nomatch=0L and every time they write a new line of code that includes it, they're increasing their workload in the future if the behavior is changed. Regardless of whether the behavior of nomatch=0 is ever changed, it would be nice to be able to start using nomatch=NULL, because at least we know that once that is implemented, it's not likely to break in the foreseeable future.

@MichaelChirico

This comment has been minimized.

Copy link
Member

@MichaelChirico MichaelChirico commented Jan 29, 2016

@gsee that's a good idea, but I do think you're exaggerating claims of an increased workload... ctrl+F will still only need to be pressed once (or grep could be used to do all R scripts at once)

@gsee

This comment has been minimized.

Copy link

@gsee gsee commented Jan 29, 2016

The point is that I don't like writing a line of code when I know it's going to break in the future. If I had a way to write code such that I wouldn't have to fix it, I'd prefer to do that.

@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Jan 29, 2016

@gsee totally agree. Many times when I'm using nomatch I'm concerned if I can write it in a way it could be supported in future. One option is to use nomatch = if(packageVersion("data.table") >= "2.0.0") NULL else 0 but I'm not sure if it is recommended way.
Another option is to not upgrade your production environment. If everything works as expected then I would say it is a good practice.
Having good test coverage for own package would easily detect most of the issues when migration to new version of a dependency. Giving API for future changes, as the one written above, can help package maintainers. Additionally pointing what test they could include in their test workflow.

@mekkim

This comment has been minimized.

Copy link

@mekkim mekkim commented Feb 3, 2016

Just adding my two-cents to highlight that, as @jangorecki knows, this is an issue that commonly comes up on StackExchange, etc. The nomatch=0 syntax always confused me (nomatch=null always seemed more logical). Here's a current thread with a very basic case. Thanks for your great work and this upcoming feature, however it is implemented!

@skanskan

This comment has been minimized.

Copy link

@skanskan skanskan commented Jul 11, 2017

Could you provide an example on how to use this nomatch roll method, please?
For example, how can I do this
zoo::na.locf(c(NA, 1,2,NA, NA, 1, NA))
using just data.table?

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
jangorecki added a commit that referenced this issue Oct 4, 2018
@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Oct 4, 2018

This significant breaking change can be relatively safely implemented using one of the following strategies.

  • regular procedure

1.12.0: allow nomatch=NULL, identical to nomatch=0L
1.13.0: nomatch=0L prints message about future change
1.14.0: nomatch=0L raises warning
1.15.0: nomatch=0L raises error
1.16.0: nomatch=0L fill missing rows with 0L

So anyone upgrading DT at least once per each major release will not be able to get incorrect results because of nomatch=0L raising an error at some point, before changing to requested behavior in this issue. Of course options can be provided to relax those rules.

  • "forever" compatibility

Optionally to limit breaking change to extreme we can keep 0L mapping to NULL "forever" unless special option options("datatable.nomatch.fill.0"=TRUE). Then the old code is still portable to new versions, and the not-so-common case of filling non matching rows with 0 can be achieved, with a little bit extra effort (setting option) than any other nomatch value. This would need to be smart enough to determine if nomatch was hardcoded or provided as parameter, and if parameter than warn (just because user might want to fill non-matching rows with dynamically calculated value).

@MichaelChirico

This comment has been minimized.

Copy link
Member

@MichaelChirico MichaelChirico commented Oct 4, 2018

So, in 1.15.0, only nomatch = NULL is accepted? Anything else is an error? Or nomatch = x will fill with x unless x == 0?

@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Oct 4, 2018

@MichaelChirico no decisions yet, I just thrown few ideas to make transition smooth. We can also ask users to wrap nomatch=0 into nomatch=I(0) if they want to fill with zeros, just for few releases to make nomatch=0 backward compatible for longer period. R's lazy evaluation gives many ways to handle transition. Other nomatch values can be supported anytime as they won't break any code.
Note that in current releases frequency 1.15.0 will be released around 2024 :-)

@mattdowle mattdowle added this to the 1.12.0 milestone Oct 4, 2018
@mattdowle

This comment has been minimized.

Copy link
Member

@mattdowle mattdowle commented Oct 4, 2018

Yep: nomatch=NULL good and agree with Jan's PR #3095.

The desire is to specify fill values in nomatch=, right? So nomatch=0L will eventually mean use 0L instead of NA to fill the missing rows. But what about differing column types? There might be a mixture of character columns and numeric columns to fill and what should match=0L mean for the character columns?

Rather than changing the meaning of nomatch=0L, how about allowing a new list value.

nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled

This would allow the ability of filling with 0 rather than NA, without needing to change the meaning of nomatch=0L and the associated breakage.

Btw, the reason for the default being nomatch=NA (outer join) is for statistical safety. I found that in the majority of cases in my day to day work, I expected there to be data available and if there wasn't, then I wanted to measure how much was missing or at least be tripped up with the NAs afterwards. It's just too easy to miss incorrect joins due to missing data when the default is inner join. Inner join to me means ignore missing data silently which doesn't feel robust. If you expect missing data and that's ok, then I like to see nomatch=NULL in the code to remind myself that missing data is being dropped silently. I also like the nomatch="error" option idea too. In other words, I'm not keen on changing the default behaviour away from nomatch=NA since there's a good reason for it.

@jangorecki

This comment has been minimized.

Copy link
Member

@jangorecki jangorecki commented Oct 5, 2018

Very good question about filling multiple columns, and brilliant idea to address it retaining nomatch=0L backward compatible at the same time.

To include also rollequal=TRUE|FALSE #857

mattdowle added a commit that referenced this issue Oct 6, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@MichaelChirico MichaelChirico changed the title Expand `nomatch` to accept any value (control fills, rolls, omits during joins) Expand nomatch to accept any value (control fills, rolls, omits during joins) Feb 19, 2019
@MichaelChirico MichaelChirico mentioned this issue Feb 19, 2019
6 of 30 tasks complete
@ethanbsmith

This comment has been minimized.

Copy link

@ethanbsmith ethanbsmith commented May 14, 2019

it seems like there are 2 discrete ideas that are being combined in a single parameter:

  1. the nature of the join (inner or right)
  2. the fill values to use for a left join

granted, if you do an inner join, you don't need fill values, but adding new parameters would get you around the transition and backward compatibility quite easily:

add join.type as a new parameter (currently only allowing inner, right)
add join.fill as a new parameter taking a list of name/value pars as described in Matt's Oct4th post.

the join.type could be implemented now, with warnings and backward compatibility would be a simple thunk to the new parameter.

this would also leave open the possibility for supporting full join in the future, or possibly even the sacrilegious left join ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.