Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

%notin% is not safe in the way it handles NA by default #5481

Closed
Kamgang-B opened this issue Oct 8, 2022 · 9 comments · Fixed by #5729
Closed

%notin% is not safe in the way it handles NA by default #5481

Kamgang-B opened this issue Oct 8, 2022 · 9 comments · Fixed by #5729

Comments

@Kamgang-B
Copy link
Contributor

Kamgang-B commented Oct 8, 2022

IMHO, the typical use of the function %notin% is likely expected to be DT[lhs %notin% rhs), ...] where 1- rhs contains no missing value and 2- the user wants to return/modify rows where lhs contains only values in rhs.

Also, I don't expect users to do something like !lhs %notin% (since %in% is already convenient for this operation).

For these reasons, I think that it is better to be on the safe side by allowing DT[lhs %notin% rhs,...], to return/modify only rows whose values are in rhs. In doing so, the user will have to explicitly add NA to the rhs if he also wants to include rows with missing values.

Consider the following example:

 dt = data.table(x=c(1:3, NA, 4L, NA), y=1:6, z=10*c(3, 1, 4, 8, 3, 8))

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA     4    80
5:     4     5    30
6:    NA     6    80

dt[x %notin% 1:3, y := z]

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA    80    80
5:     4    30    30
6:    NA    80    80

In doing this operation, I don't really think users expect the rows where x is NA to be modified.

So, even if %notin% is meant to provide a more memory-efficient version of !lhs %in% rhs% (IIRW), I also think that it would better to handle missing values more safely.

P.S.: I wonder if it's also possible to export a functional alternative of %notin%. something like notin(x, table, nomatch=-1L).

@ben-schwen
Copy link
Member

ben-schwen commented Oct 9, 2022

I don't see the point in your example.

What would be your supposed output for
dt[x %notin% 1:3, y := z][] ?
And what would be the supposed output for
dt[x %in% 1:3, y := z][] ?

@Kamgang-B
Copy link
Contributor Author

My point is that dt[x %notin% 1:3, y := z][] modifies y if x is missing and I think that it should not since we do not know if the unkown real values of missing values in x are contained in 1:3 or not.

I expected the following

dt[x %notin% 1:3, y := z][]
	   x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA     4    80
5:     4    30    30
6:    NA     6    80

The output below is what I expected for %in%; and in practice, the output of dt[!x %in% 1:3] is not what users likely typically want (IMHO) if x contains missing values.

dt[x %in% 1:3, y := z][]    # note that this does not modify y if x is missing
	   x     y     z
   <int> <int> <num>
1:     1    30    30
2:     2    10    10
3:     3    40    40
4:    NA     4    80
5:     4     5    30
6:    NA     6    80

I don't really know how other users typically use %in% in negation but my tipical use is DT[!x %in% c(val1, val2, ..., NA), ...] (where val1, val2, ... are not missing) and not DT[!x %in% c(val1, val2, ...,valn), ...], so that rows where x is missing are not included in the output.

@hdn012
Copy link

hdn012 commented Oct 12, 2022

@Kamgang-B I don't think this is an issue at all.
In your case of dt[x %in% 1:3, y := z] with the note of "note that this does not modify y if x is missing". I think that it is pretty clear from the syntax that NA is not in 1:3, therefore, no modification. The note is superfluous. If I wanted the behavior that require NA to be modified, I would have write dt[x %in% c(1:3, NA), y := z].
Similarly, dt[x %notin% c(1:3, NA), y := z] would bring out your preferred output. Writing dt[x %notin% 1:3, y := z] would mean it is true that NA is not in 1:3 and modification is required.
The functional syntax for %notin% is `%notin%`(x, table) with backticks (similar to `+`(2, 3) is for 2 + 3 or `%notin%`(dt$x, 1:3) is for dt$x %notin% 1:3)

@jangorecki
Copy link
Member

jangorecki commented Oct 12, 2022

I think it is a good question.
Should be addressed before releasing this feature to CRAN. By addressing I mean, well documenting, and eventually if there is agreement, changing behavior.
Personally my use cases were !x%in%values, but maybe I simply didn't have NAs in x. Both situations can be surprise to users but I think that simply negating IN may be a less frequently a surprise.

@jangorecki jangorecki added this to the 1.14.5 milestone Oct 12, 2022
@datocrats-org
Copy link

datocrats-org commented Oct 21, 2022

Maybe I'm missing the meaning... dt[x %notin% 1:3 implies that x is not in 1,2,3 and that would include NA because NA is not 1, 2, or 3

It would be great if R could handle a mixed type vector or list though I'm not sure that it can, I think NA being skipped in this case makes perfect sense. data.table is especially useful whereby you can assign a value in j contingent on i whereby you can say
dt[is.na(b)==TRUE,y:=z] as a second step here.

I would discourage chaining it or building "NA handling" into the awesome %notin% because you can repeat the above any number of times in very legible manner, without needing to learn the internal mechanics of the method and its NA handling or type safety.

@mczek
Copy link
Contributor

mczek commented Feb 15, 2023

I agree with @datocrats-org that the behavior makes sense because just as NA is not %in% the vector 1:3, NA should be %notin% the vector 1:3. To make things clear for users I'd be happy to add an example including NA in the left hand side as it is in the test cases

@Kamgang-B
Copy link
Contributor Author

Kamgang-B commented Apr 16, 2023

@hdn012 @datocrats-org @mczek
I understand when you say that NA is not in 1:3. In this sense, you are right and %notin% is the logical complement of %in%. But my point is different. I am talking more bout the user intention/expectation when using %notin%. When doing x %notin% 1:3, does the user really expect the result to be TRUE when x is missing!? I feel (and maybe only me) like it's more natural to have TRUE if x is not in 1:3 AND is not missing (even if it would no longer be consistent with !x %in% 1:3).

@hdn012

The functional syntax for %notin% is %notin%(x, table) with backticks (similar to +(2, 3) is for 2 + 3 or %notin%(dt$x, 1:3) is for dt$x %notin% 1:3)

I think that I was not clear enough when talking about the functional form. I know that `%notin%`(x, table) is the functional form of dt$x %notin% 1:3. I am talking about something in the spirit of %like% and like (the latter is more flexible because it provides more arguments (vector, pattern, ignore.case = FALSE, fixed = FALSE, perl = FALSE) than the former one (vector, pattern)).
So, the idea is that a new function notin could provide an additional argument that allows users to have better control of NAs. Something like:

notin = function(x, table, na=FALSE) {   # na=FALSE/TRUE-->set NA to FALSE/TRUE
  if(na && !anyNA(table)) table = c(table, NA)
  x %notin% table                        # where %notin% is the current implement of %notin%
}

@jangorecki

Both situations can be surprise to users

That's why I think that it would be nice to complement %notin% with a more flexible version like notin(x, table, na=FALSE) (similar to %like% and like).

Kindly consider exporting a new more flexible function notin as a feature request (if the current implementation of %notin% is kept unchanged).

@jangorecki
Copy link
Member

I like the idea

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki jangorecki added the dev label Nov 6, 2023
@jangorecki
Copy link
Member

If there is nobody willing to submit such extended function notin now then we can proceed with documentation improvement, ideally also including examples

jangorecki added a commit that referenced this issue Nov 6, 2023
jangorecki added a commit that referenced this issue Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants