Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing crash when attempting to join on character(0) #4272

Merged
merged 11 commits into from
Jun 8, 2020

Conversation

tlapak
Copy link
Contributor

@tlapak tlapak commented Mar 1, 2020

Attempting to join or merge on character(0) currently crashes R in two out of three possible cases. At least on Windows:

library("data.table")

X = data.table(A='a')
Y = data.table(B='b')

X[Y, on=character(0)] #crashes
merge(X, Y, by.x=character(0), by.y=character(0)) #also crashes
merge(X, Y, by=character(0)) #doesn't crash
# Error in merge.data.table(X, Y, by = character(0)) : 
#  A non-empty vector of column names for `by` is required. 

Turns out that merge checks the length of by but does not check the length of by.x or by.y (either is sufficient as the equality is checked). Likewise, [.data.table, or rather .parse_on, doesn't check the length of on. I have added the checks as well as tests for all three cases.

(Actually, only checking in .parse_on would be sufficient to prevent the crash, but this way produces a more useful error message when using merge.)

I have also taken the liberty of making a grammar fix to the relevant error message of merge, hope that is acceptable.

Now also closes #4499

@@ -21,8 +21,8 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
if (!missing(by) && !missing(by.x))
warning("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.")
if (!is.null(by.x)) {
if ( !is.character(by.x) || !is.character(by.y))
stop("A non-empty vector of column names are required for `by.x` and `by.y`.")
if (length(by.x) == 0L || !is.character(by.x) || !is.character(by.y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha! I was just looking at this code yesterday and something looked funny but I didn't bother stress testing it. nice catch!

@@ -3031,7 +3031,7 @@ isReallyReal = function(x) {
onsub = as.call(c(quote(c), onsub))
}
on = eval(onsub, parent.frame(2L), parent.frame(2L))
if (!is.character(on))
if (length(on) == 0L || !is.character(on))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, perfect. we also shouldn't have gotten to checking by.x&by.y separately in the first place because here by.x=by.y so simply by should be used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean. At this point we're not checking separately if we come through merge. merge sets by=by.x and then later calls y[x, on=by]. If we don't check in merge we catch it here but this is the point where it gets caught when using x[y] syntax.

(I would've been really mad if you had pushed a fix yesterday.)

NEWS.md Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 2, 2020

Codecov Report

Merging #4272 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4272   +/-   ##
=======================================
  Coverage   99.60%   99.60%           
=======================================
  Files          73       73           
  Lines       14027    14029    +2     
=======================================
+ Hits        13972    13974    +2     
  Misses         55       55           
Impacted Files Coverage Δ
R/data.table.R 100.00% <100.00%> (ø)
R/merge.R 100.00% <100.00%> (ø)
src/bmerge.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3436568...91ec306. Read the comment docs.

@jangorecki jangorecki linked an issue May 9, 2020 that may be closed by this pull request
Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix that in bmerge.c? I just run into that problem using internal functions. Segfaults are pretty severe issues that should be eliminated, not only from exported API, but in general.

@tlapak
Copy link
Contributor Author

tlapak commented May 10, 2020

I'll have a look at it but it may take a bit for me to get the chance to actually write and test the fix. But it should just be the same length check only in the C function and then raise an internal error. I assume you're calling bmerge directly?

@jangorecki
Copy link
Member

Yes, somewhere around

if (LENGTH(icolsArg) > LENGTH(xcolsArg))

in SEXP bmerge, to check those are non-zero length, should do. If you remove your current fixes, then you can easily reach there with your unit tests. Which will be probably good, to handle that in single place.

@tlapak
Copy link
Contributor Author

tlapak commented Jun 6, 2020

Now also closes #4499. I opted to not raise an error there in order to pass 2126.1 and 2126.2/be consistent with the behavior expected there. I do think there is an argument to be made for all those cases to be an error or for joins with empty data.tables to return an empty data.table. The current behavior is close-ish though.

I also think it's better to leave the argument checks for joins with on=character() close to the exposed API as that makes it possible to return more meaningful errors.

@jangorecki
Copy link
Member

Thanks for incorporating my feedback. It should be safe to put it into coming release.

@jangorecki jangorecki added this to the 1.12.9 milestone Jun 6, 2020
@mattdowle
Copy link
Member

mattdowle commented Jun 8, 2020

Thanks @tlapak! I've invited you to be project member, please accept using the button that should appear on your GitHub projects or profile page. That way in future you can create branches in the main project directly. I'll add you to contributors list as well in a follow up commit (easier for me than pushing to your fork).

@mattdowle mattdowle merged commit 9fd131d into Rdatatable:master Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keyed join with empty data.table and roll='nearest' segfaults empty dt, empty on, edge case join segfault
4 participants