Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing crash when attempting to join on character(0) #4272

Merged
merged 11 commits into from Jun 8, 2020

Conversation

@tlapak
Copy link
Contributor

@tlapak tlapak commented Mar 1, 2020

Attempting to join or merge on character(0) currently crashes R in two out of three possible cases. At least on Windows:

library("data.table")

X = data.table(A='a')
Y = data.table(B='b')

X[Y, on=character(0)] #crashes
merge(X, Y, by.x=character(0), by.y=character(0)) #also crashes
merge(X, Y, by=character(0)) #doesn't crash
# Error in merge.data.table(X, Y, by = character(0)) : 
#  A non-empty vector of column names for `by` is required. 

Turns out that merge checks the length of by but does not check the length of by.x or by.y (either is sufficient as the equality is checked). Likewise, [.data.table, or rather .parse_on, doesn't check the length of on. I have added the checks as well as tests for all three cases.

(Actually, only checking in .parse_on would be sufficient to prevent the crash, but this way produces a more useful error message when using merge.)

I have also taken the liberty of making a grammar fix to the relevant error message of merge, hope that is acceptable.

Now also closes #4499

@@ -21,8 +21,8 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
if (!missing(by) && !missing(by.x))
warning("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.")
if (!is.null(by.x)) {
if ( !is.character(by.x) || !is.character(by.y))
stop("A non-empty vector of column names are required for `by.x` and `by.y`.")
if (length(by.x) == 0L || !is.character(by.x) || !is.character(by.y))

This comment has been minimized.

@MichaelChirico

MichaelChirico Mar 2, 2020
Member

aha! I was just looking at this code yesterday and something looked funny but I didn't bother stress testing it. nice catch!

@@ -3031,7 +3031,7 @@ isReallyReal = function(x) {
onsub = as.call(c(quote(c), onsub))
}
on = eval(onsub, parent.frame(2L), parent.frame(2L))
if (!is.character(on))
if (length(on) == 0L || !is.character(on))

This comment has been minimized.

@MichaelChirico

MichaelChirico Mar 2, 2020
Member

yes, perfect. we also shouldn't have gotten to checking by.x&by.y separately in the first place because here by.x=by.y so simply by should be used

This comment has been minimized.

@tlapak

tlapak Mar 2, 2020
Author Contributor

I'm not quite sure what you mean. At this point we're not checking separately if we come through merge. merge sets by=by.x and then later calls y[x, on=by]. If we don't check in merge we catch it here but this is the point where it gets caught when using x[y] syntax.

(I would've been really mad if you had pushed a fix yesterday.)

NEWS.md Outdated Show resolved Hide resolved
@codecov
Copy link

@codecov codecov bot commented Mar 2, 2020

Codecov Report

Merging #4272 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4272   +/-   ##
=======================================
  Coverage   99.60%   99.60%           
=======================================
  Files          73       73           
  Lines       14027    14029    +2     
=======================================
+ Hits        13972    13974    +2     
  Misses         55       55           
Impacted Files Coverage Δ
R/data.table.R 100.00% <100.00%> (ø)
R/merge.R 100.00% <100.00%> (ø)
src/bmerge.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3436568...91ec306. Read the comment docs.

@jangorecki jangorecki linked an issue that may be closed by this pull request May 9, 2020
Copy link
Member

@jangorecki jangorecki left a comment

Could you please fix that in bmerge.c? I just run into that problem using internal functions. Segfaults are pretty severe issues that should be eliminated, not only from exported API, but in general.

@tlapak
Copy link
Contributor Author

@tlapak tlapak commented May 10, 2020

I'll have a look at it but it may take a bit for me to get the chance to actually write and test the fix. But it should just be the same length check only in the C function and then raise an internal error. I assume you're calling bmerge directly?

@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 10, 2020

Yes, somewhere around

if (LENGTH(icolsArg) > LENGTH(xcolsArg))

in SEXP bmerge, to check those are non-zero length, should do. If you remove your current fixes, then you can easily reach there with your unit tests. Which will be probably good, to handle that in single place.

tlapak added 5 commits May 25, 2020
@tlapak
Copy link
Contributor Author

@tlapak tlapak commented Jun 6, 2020

Now also closes #4499. I opted to not raise an error there in order to pass 2126.1 and 2126.2/be consistent with the behavior expected there. I do think there is an argument to be made for all those cases to be an error or for joins with empty data.tables to return an empty data.table. The current behavior is close-ish though.

I also think it's better to leave the argument checks for joins with on=character() close to the exposed API as that makes it possible to return more meaningful errors.

src/bmerge.c Show resolved Hide resolved
@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 6, 2020

Thanks for incorporating my feedback. It should be safe to put it into coming release.

@jangorecki jangorecki added this to the 1.12.9 milestone Jun 6, 2020
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jun 8, 2020

Thanks @tlapak! I've invited you to be project member, please accept using the button that should appear on your GitHub projects or profile page. That way in future you can create branches in the main project directly. I'll add you to contributors list as well in a follow up commit (easier for me than pushing to your fork).

@mattdowle mattdowle merged commit 9fd131d into Rdatatable:master Jun 8, 2020
4 checks passed
4 checks passed
codecov/patch 100.00% of diff hit (target 99.60%)
Details
codecov/project 99.60% (+0.00%) compared to 3436568
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
mattdowle added a commit that referenced this pull request Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants
You can’t perform that action at this time.