Fixing crash when attempting to join on character(0) #4272

tlapak · 2020-03-01T21:35:03Z

Attempting to join or merge on character(0) currently crashes R in two out of three possible cases, at least on Windows:

library("data.table")

X = data.table(A='a')
Y = data.table(B='b')

X[Y, on=character(0)] #crashes
merge(X, Y, by.x=character(0), by.y=character(0)) #also crashes
merge(X, Y, by=character(0)) #doesn't crash
# Error in merge.data.table(X, Y, by = character(0)) : 
#  A non-empty vector of column names for `by` is required.

Turns out that merge checks the length of by but does not check the length of by.x or by.y (either is sufficient as the equality is checked). Likewise, [.data.table, or rather .parse_on, doesn't check the length of on. I have added the checks as well as tests for all three cases.

(Actually, only checking in .parse_on would be sufficient to prevent the crash, but this way produces a more useful error message when using merge.)
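The reason the old checks let the empty vector through can be seen in base R alone: `character(0)` is still a character vector, so `is.character()` returns TRUE for it. The snippet below is a minimal sketch of the combined condition; `is_valid_on` is a hypothetical name used only for illustration and is not a data.table function.

```r
# Hypothetical predicate mirroring the condition added in this PR
# (length check first, then type check). Not part of data.table.
is_valid_on = function(on) length(on) > 0L && is.character(on)

# The old condition alone accepts an empty character vector:
old_check_passes = is.character(character(0))

checks = c(
  empty   = is_valid_on(character(0)),  # zero-length character vector
  null    = is_valid_on(NULL),          # no columns supplied at all
  numeric = is_valid_on(1L),            # wrong type
  ok      = is_valid_on("A")            # a usable column name
)
print(old_check_passes)
print(checks)
```

Only the last case should be accepted; the first one is the input that previously reached the join code unchecked.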

I have also taken the liberty of making a grammar fix to the relevant error message of merge, hope that is acceptable.

Now also closes #4499

MichaelChirico · 2020-03-02T00:08:02Z

R/merge.R

@@ -21,8 +21,8 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
  if (!missing(by) && !missing(by.x))
    warning("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.")
  if (!is.null(by.x)) {
-    if ( !is.character(by.x) || !is.character(by.y))
-      stop("A non-empty vector of column names are required for `by.x` and `by.y`.")
+    if (length(by.x) == 0L || !is.character(by.x) || !is.character(by.y))


aha! I was just looking at this code yesterday and something looked funny but I didn't bother stress testing it. nice catch!

MichaelChirico · 2020-03-02T00:10:11Z

R/data.table.R

@@ -3031,7 +3031,7 @@ isReallyReal = function(x) {
    onsub = as.call(c(quote(c), onsub))
  }
  on = eval(onsub, parent.frame(2L), parent.frame(2L))
-  if (!is.character(on))
+  if (length(on) == 0L || !is.character(on))


yes, perfect. we also shouldn't have gotten to checking by.x&by.y separately in the first place because here by.x=by.y so simply by should be used

tlapak · 2020-03-02

I'm not quite sure what you mean. At this point we're not checking separately if we come through merge. merge sets by=by.x and then later calls y[x, on=by]. If we don't check in merge we catch it here but this is the point where it gets caught when using x[y] syntax.

(I would've been really mad if you had pushed a fix yesterday.)
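The flow described in this reply can be sketched in base R. This is an assumption-laden illustration, not merge.data.table's actual code: `resolve_merge_by` is a hypothetical helper, and attaching by.y as names is an assumption about how the two vectors might be combined into the single `by` that is later passed as `on`.

```r
# Hypothetical sketch: fold by.x / by.y into one vector that could be
# handed to a join as `on`. Names and error text are illustrative only.
resolve_merge_by = function(by.x, by.y) {
  if (length(by.x) != length(by.y))
    stop("`by.x` and `by.y` must be of the same length.")
  by = by.x
  names(by) = by.y   # assumed pairing of the two column sets
  by
}

named_on = resolve_merge_by("A", "B")
empty_on = resolve_merge_by(character(0), character(0))

print(named_on)
print(length(empty_on))
```

With two empty inputs the lengths agree, so the equality check passes and a zero-length vector travels on to the join. That is why a length check on either by.x or by.y is enough, and why a check at the join level still catches the case if merge does not.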

codecov · 2020-03-02T05:34:47Z

Codecov Report

Merging #4272 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4272   +/-   ##
=======================================
  Coverage   99.60%   99.60%           
=======================================
  Files          73       73           
  Lines       14027    14029    +2     
=======================================
+ Hits        13972    13974    +2     
  Misses         55       55

Impacted Files	Coverage Δ
R/data.table.R	`100.00% <100.00%> (ø)`
R/merge.R	`100.00% <100.00%> (ø)`
src/bmerge.c	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jangorecki

Could you please fix that in bmerge.c? I just ran into that problem using internal functions. Segfaults are pretty severe issues that should be eliminated, not only from the exported API, but in general.

tlapak · 2020-05-10T20:11:37Z

I'll have a look at it but it may take a bit for me to get the chance to actually write and test the fix. But it should just be the same length check only in the C function and then raise an internal error. I assume you're calling bmerge directly?

jangorecki · 2020-05-10T21:52:35Z

Yes, somewhere around

if (LENGTH(icolsArg) > LENGTH(xcolsArg))

in SEXP bmerge, a check that those are non-zero length should do. If you remove your current fixes, your unit tests can easily reach that point, which is probably good because the problem is then handled in a single place.

tlapak · 2020-06-06T20:18:41Z

Now also closes #4499. I opted to not raise an error there in order to pass 2126.1 and 2126.2/be consistent with the behavior expected there. I do think there is an argument to be made for all those cases to be an error or for joins with empty data.tables to return an empty data.table. The current behavior is close-ish though.

I also think it's better to leave the argument checks for joins with on=character() close to the exposed API as that makes it possible to return more meaningful errors.
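The argument for keeping the checks near the exposed API can be illustrated with a small base-R sketch. Everything here is hypothetical (`require_cols`, `toy_join`, `toy_merge`, and the message text are stand-ins, not data.table code); the point is only that each entry point can name the argument the user actually supplied.

```r
# Hypothetical shared validator; `arg` is the user-facing argument name.
require_cols = function(cols, arg) {
  if (length(cols) == 0L || !is.character(cols))
    stop("A non-empty vector of column names is required for `", arg, "`.",
         call. = FALSE)
  invisible(TRUE)
}

# Stand-in for the x[y, on=] entry point.
toy_join = function(on) {
  require_cols(on, "on")
  "joined"
}

# Stand-in for the merge entry point: validates its own arguments first,
# then delegates to the join.
toy_merge = function(by.x, by.y) {
  require_cols(by.x, "by.x")
  require_cols(by.y, "by.y")
  toy_join(by.x)
}

# Capture the error message instead of aborting.
msg_of = function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

merge_msg = msg_of(toy_merge(character(0), character(0)))
join_msg  = msg_of(toy_join(character(0)))
print(merge_msg)
print(join_msg)
```

A check placed only in the lowest-level routine would still prevent the crash, but it could not tell the user whether the empty vector came in through `by.x` or through `on`.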

jangorecki · 2020-06-06T22:01:23Z

Thanks for incorporating my feedback. It should be safe to put it into coming release.

mattdowle · 2020-06-08T21:44:38Z

Thanks @tlapak! I've invited you to be a project member; please accept using the button that should appear on your GitHub projects or profile page. That way, in future you can create branches in the main project directly. I'll add you to the contributors list as well in a follow-up commit (easier for me than pushing to your fork).

tlapak added 5 commits March 1, 2020 20:52

Fixed crash when attempting to join on character(0)

d0fd094

Minor grammar fix in error message

2e185f6

Added test for crash when attempting to join on character(0)

a3e0e62

Added news entry

e85e836

Reflect grammar change to error message in tests

3e5ae0b

MichaelChirico approved these changes Mar 2, 2020

jangorecki reviewed Mar 2, 2020

Added link to news file linking the PR

bdfb694

tlapak mentioned this pull request May 9, 2020

empty dt, empty on, edge case join segfault #4438

Closed

jangorecki linked an issue May 9, 2020 that may be closed by this pull request

empty dt, empty on, edge case join segfault #4438

Closed

jangorecki requested changes May 10, 2020

tlapak mentioned this pull request May 26, 2020

Keyed join with empty data.table and roll='nearest' segfaults #4499

Closed

tlapak added 5 commits June 6, 2020 21:56

Add check in bmerge.c

cba5c55

Add tests for check in bmerge.c

008e472

Update to pass 2126.*

153f712

Expand tests 2126

04cedfd

Merge branch 'master' into fix_join_crash

91ec306

jangorecki reviewed Jun 6, 2020

jangorecki approved these changes Jun 6, 2020

jangorecki added this to the 1.12.9 milestone Jun 6, 2020

mattdowle merged commit 9fd131d into Rdatatable:master Jun 8, 2020

mattdowle added a commit that referenced this pull request Jun 8, 2020

Added Vaclav to contributor list in DESCRIPTION; #4272

dc5e11a
