%notin% is not safe in the way it handles NA by default #5481

Kamgang-B · 2022-10-08T10:05:55Z

IMHO, the typical use of the function %notin% is likely expected to be DT[lhs %notin% rhs), ...] where 1- rhs contains no missing value and 2- the user wants to return/modify rows where lhs contains only values in rhs.

Also, I don't expect users to do something like !lhs %notin% (since %in% is already convenient for this operation).

For these reasons, I think that it is better to be on the safe side by allowing DT[lhs %notin% rhs,...], to return/modify only rows whose values are in rhs. In doing so, the user will have to explicitly add NA to the rhs if he also wants to include rows with missing values.

Consider the following example:

 dt = data.table(x=c(1:3, NA, 4L, NA), y=1:6, z=10*c(3, 1, 4, 8, 3, 8))

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA     4    80
5:     4     5    30
6:    NA     6    80

dt[x %notin% 1:3, y := z]

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA    80    80
5:     4    30    30
6:    NA    80    80

In doing this operation, I don't really think users expect the rows where x is NA to be modified.

So, even if %notin% is meant to provide a more memory-efficient version of !lhs %in% rhs% (IIRW), I also think that it would better to handle missing values more safely.

P.S.: I wonder if it's also possible to export a functional alternative of %notin%. something like notin(x, table, nomatch=-1L).

The text was updated successfully, but these errors were encountered:

ben-schwen · 2022-10-09T21:06:16Z

I don't see the point in your example.

What would be your supposed output for
dt[x %notin% 1:3, y := z][] ?
And what would be the supposed output for
dt[x %in% 1:3, y := z][] ?

Kamgang-B · 2022-10-09T22:09:32Z

My point is that dt[x %notin% 1:3, y := z][] modifies y if x is missing and I think that it should not since we do not know if the unkown real values of missing values in x are contained in 1:3 or not.

I expected the following

dt[x %notin% 1:3, y := z][]
	   x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA     4    80
5:     4    30    30
6:    NA     6    80

The output below is what I expected for %in%; and in practice, the output of dt[!x %in% 1:3] is not what users likely typically want (IMHO) if x contains missing values.

dt[x %in% 1:3, y := z][]    # note that this does not modify y if x is missing
	   x     y     z
   <int> <int> <num>
1:     1    30    30
2:     2    10    10
3:     3    40    40
4:    NA     4    80
5:     4     5    30
6:    NA     6    80

I don't really know how other users typically use %in% in negation but my tipical use is DT[!x %in% c(val1, val2, ..., NA), ...] (where val1, val2, ... are not missing) and not DT[!x %in% c(val1, val2, ...,valn), ...], so that rows where x is missing are not included in the output.

hdn012 · 2022-10-12T15:24:34Z

@Kamgang-B I don't think this is an issue at all.
In your case of dt[x %in% 1:3, y := z] with the note of "note that this does not modify y if x is missing". I think that it is pretty clear from the syntax that NA is not in 1:3, therefore, no modification. The note is superfluous. If I wanted the behavior that require NA to be modified, I would have write dt[x %in% c(1:3, NA), y := z].
Similarly, dt[x %notin% c(1:3, NA), y := z] would bring out your preferred output. Writing dt[x %notin% 1:3, y := z] would mean it is true that NA is not in 1:3 and modification is required.
The functional syntax for %notin% is `%notin%`(x, table) with backticks (similar to `+`(2, 3) is for 2 + 3 or `%notin%`(dt$x, 1:3) is for dt$x %notin% 1:3)

jangorecki · 2022-10-12T15:36:14Z

I think it is a good question.
Should be addressed before releasing this feature to CRAN. By addressing I mean, well documenting, and eventually if there is agreement, changing behavior.
Personally my use cases were !x%in%values, but maybe I simply didn't have NAs in x. Both situations can be surprise to users but I think that simply negating IN may be a less frequently a surprise.

datocrats-org · 2022-10-21T20:36:05Z

Maybe I'm missing the meaning... dt[x %notin% 1:3 implies that x is not in 1,2,3 and that would include NA because NA is not 1, 2, or 3

It would be great if R could handle a mixed type vector or list though I'm not sure that it can, I think NA being skipped in this case makes perfect sense. data.table is especially useful whereby you can assign a value in j contingent on i whereby you can say
dt[is.na(b)==TRUE,y:=z] as a second step here.

I would discourage chaining it or building "NA handling" into the awesome %notin% because you can repeat the above any number of times in very legible manner, without needing to learn the internal mechanics of the method and its NA handling or type safety.

mczek · 2023-02-15T20:46:37Z

I agree with @datocrats-org that the behavior makes sense because just as NA is not %in% the vector 1:3, NA should be %notin% the vector 1:3. To make things clear for users I'd be happy to add an example including NA in the left hand side as it is in the test cases

Kamgang-B · 2023-04-16T07:59:48Z

@hdn012 @datocrats-org @mczek
I understand when you say that NA is not in 1:3. In this sense, you are right and %notin% is the logical complement of %in%. But my point is different. I am talking more bout the user intention/expectation when using %notin%. When doing x %notin% 1:3, does the user really expect the result to be TRUE when x is missing!? I feel (and maybe only me) like it's more natural to have TRUE if x is not in 1:3 AND is not missing (even if it would no longer be consistent with !x %in% 1:3).

@hdn012

The functional syntax for %notin% is %notin%(x, table) with backticks (similar to +(2, 3) is for 2 + 3 or %notin%(dt$x, 1:3) is for dt$x %notin% 1:3)

I think that I was not clear enough when talking about the functional form. I know that `%notin%`(x, table) is the functional form of dt$x %notin% 1:3. I am talking about something in the spirit of %like% and like (the latter is more flexible because it provides more arguments (vector, pattern, ignore.case = FALSE, fixed = FALSE, perl = FALSE) than the former one (vector, pattern)).
So, the idea is that a new function notin could provide an additional argument that allows users to have better control of NAs. Something like:

notin = function(x, table, na=FALSE) {   # na=FALSE/TRUE-->set NA to FALSE/TRUE
  if(na && !anyNA(table)) table = c(table, NA)
  x %notin% table                        # where %notin% is the current implement of %notin%
}

@jangorecki

Both situations can be surprise to users

That's why I think that it would be nice to complement %notin% with a more flexible version like notin(x, table, na=FALSE) (similar to %like% and like).

Kindly consider exporting a new more flexible function notin as a feature request (if the current implementation of %notin% is kept unchanged).

jangorecki · 2023-04-16T13:18:49Z

I like the idea

jangorecki · 2023-11-06T12:27:04Z

If there is nobody willing to submit such extended function notin now then we can proceed with documentation improvement, ideally also including examples

jangorecki added this to the 1.14.5 milestone Oct 12, 2022

jangorecki added the documentation label Oct 12, 2022

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

jangorecki added the dev label Nov 6, 2023

jangorecki added a commit that referenced this issue Nov 6, 2023

notin docs #5481

978cf51

jangorecki mentioned this issue Nov 6, 2023

notin docs #5481 #5729

Merged

jangorecki closed this as completed in #5729 Dec 8, 2023

jangorecki added a commit that referenced this issue Dec 8, 2023

notin docs #5481 (#5729)

4e7d46b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

%notin% is not safe in the way it handles NA by default #5481

%notin% is not safe in the way it handles NA by default #5481

Kamgang-B commented Oct 8, 2022 •

edited

Loading

ben-schwen commented Oct 9, 2022 •

edited

Loading

Kamgang-B commented Oct 9, 2022

hdn012 commented Oct 12, 2022

jangorecki commented Oct 12, 2022 •

edited

Loading

datocrats-org commented Oct 21, 2022 •

edited

Loading

mczek commented Feb 15, 2023

Kamgang-B commented Apr 16, 2023 •

edited

Loading

jangorecki commented Apr 16, 2023

jangorecki commented Nov 6, 2023

%notin% is not safe in the way it handles NA by default #5481

%notin% is not safe in the way it handles NA by default #5481

Comments

Kamgang-B commented Oct 8, 2022 • edited Loading

ben-schwen commented Oct 9, 2022 • edited Loading

Kamgang-B commented Oct 9, 2022

hdn012 commented Oct 12, 2022

jangorecki commented Oct 12, 2022 • edited Loading

datocrats-org commented Oct 21, 2022 • edited Loading

mczek commented Feb 15, 2023

Kamgang-B commented Apr 16, 2023 • edited Loading

jangorecki commented Apr 16, 2023

jangorecki commented Nov 6, 2023

Kamgang-B commented Oct 8, 2022 •

edited

Loading

ben-schwen commented Oct 9, 2022 •

edited

Loading

jangorecki commented Oct 12, 2022 •

edited

Loading

datocrats-org commented Oct 21, 2022 •

edited

Loading

Kamgang-B commented Apr 16, 2023 •

edited

Loading