Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed duplicate column names in merge() when by.x in names(y) #2631

Merged
merged 16 commits into from Feb 23, 2018
Merged

Fixed duplicate column names in merge() when by.x in names(y) #2631

merged 16 commits into from Feb 23, 2018

Conversation

@sritchie73
Copy link
Contributor

@sritchie73 sritchie73 commented Feb 17, 2018

When joining two data.tables using merge() the resulting data.table will contain duplicate column names if by.x != by.y and by.x is also in names(y).

An example:

# Create example data.frames
parents <- data.table(name=c("Sarah", "Max", "Qin", "Lex"), 
                      sex=c("F", "M", "F", "M"), 
                      age=c(41, 43, 36, 51))
children <- data.table(parent=c("Sarah", "Max", "Qin"), 
                       name=c("Oliver", "Sebastian", "Kai-lee"),
                       sex=c("M", "M", "F"), 
                       age=c(5,8,7))

# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")

Output:

   name sex.x age.x      name sex.y age.y
1   Max     M    43 Sebastian     M     8
2   Qin     F    36   Kai-lee     F     7
3 Sarah     F    41    Oliver     M     5

This behaviour is also present in base:::merge.data.frame(), but throws an additional warning:

Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
  column name ‘name’ is duplicated in the result

This patch fixes this problem by checking for names shared between by.x and names(y), and adding the appropriate suffixes to those column names.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 17, 2018

Also added an equivalent warning message to base for cases when duplicate column names are returned. With the patch above, the only way this can happen is if the user supplies identical values to the suffixes argument (e.g. suffixes=c("", "")).

joined = merge(parents, children, by.x="name", by.y="parent")
test(1877.1, length(names(joined)), length(unique(names(joined))))
test(1877.2, merge(parents, children, by.x="name", by.y="parent", suffixes=c("",""),
warning = "column names 'name', 'sex', 'age' are duplicated in the result"),
Copy link
Member

@HughParsonage HughParsonage Feb 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing , should be removed.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 17, 2018

RE: failed checks: I can't work out how to write the test that checks for the appropriate warning message - would appreciate advice here since there's no documentation for data.table:::test().

I also can't run the tests on my local machine because the package fails to compile (I don't have OpenMP).

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Feb 17, 2018

Looks good. On test(), search tests.Rraw for warning= to see other examples in other tests to follow. You can also pass output= and/or error= to search for patterns in those messages.
It should compile ok without OpenMP as it should detect if OpenMP is available and switch out. How are you compiling and what is the message it fails with at what point? This is my dev cycle: cc.R and Pasha added a Makefile (both linked in item 8 here).

sritchie73 added 5 commits Feb 18, 2018
This test now fails because the output of merge.data.table does
not match the output of merge.data.frame because base::merge.data.frame
still leads to duplicate column names where by.x is in names(y).
@codecov-io
Copy link

@codecov-io codecov-io commented Feb 18, 2018

Codecov Report

Merging #2631 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2631      +/-   ##
==========================================
+ Coverage   93.13%   93.14%   +<.01%     
==========================================
  Files          61       61              
  Lines       12120    12130      +10     
==========================================
+ Hits        11288    11298      +10     
  Misses        832      832
Impacted Files Coverage Δ
R/merge.R 94.73% <100%> (+0.79%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2386ab...8ad0242. Read the comment docs.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 18, 2018

Thanks @mattdowle, I couldn't get the package to build using RStudio package build system - but I probably have something misconfigured in my environment. I was able to get the tests to work with a bit of trial and error.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Feb 18, 2018

I added a little bit to the Contributing wiki to make it a bit easier to understand how the testing regime works. Thanks for the PR.

@MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Feb 18, 2018

I love it, although it breaks consistency with merge.data.frame. There are quite some other issues with merge.data.table at the moment (#2593). Once these are all fixed (I am working on it), we have to update the documentation to clearly reflect, where we have differences to merge.data.frame.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Feb 18, 2018

@sritchie73 I assume the Rstudio issue is the same as #2585. I've been using the command line to re-install lately

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 18, 2018

Thanks @MarkusBonsch , for what its worth I've also sent a similar patch to the R-devel mailing list for merge.data.frame - but the two people who responded (neither core dev team) were pessimistic about my chances of getting the patch accepted.

@MichaelChirico I get a different error:

==> R CMD INSTALL --no-multiarch --with-keep.source data.table

* installing to library ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library’
* installing *source* package ‘data.table’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG   -I/usr/local/include  -fopenmp -fPIC  -Wall -g -O2  -c assign.c -o assign.o
clang: error: unsupported option '-fopenmp'
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’

Exited with status 1.

In the end I managed to get data.table:::test() working almost the same way in my local R session, by copying the file preamble, then fixed the remaining issues that came up in Travis after pushing the changes out to my branch.

Copy link
Member

@jangorecki jangorecki left a comment

If this PR makes merge.data.table to be less consistent to data.frame method then please describe that in manual. You can link your R-devel patch there also so we (anyone) can track it later on if it happens to be merged to R-devel.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 18, 2018

Here is the thread on R-devel: http://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html

There is now a suggestion of just adding the suffix to the column name in y to keep backwards compatibility (i.e. any by.x column can still be referred to by its original name). If that patch is accepted I can similarly update merge.data.table to have this behaviour also.

Copy link
Member

@jangorecki jangorecki left a comment

LGTM. You sometimes use paste(, sep="") and another time paste0, as we now depends on R 3.1 we can safely use paste0, but this is minor thing. We can wait a little bit and see what R-devel will decide.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 18, 2018

Thanks @jangorecki - I had used paste rather than paste0 in the first instance to match the code style for the existing suffix handling which immediately precedes my patch. Would you prefer I change it to paste0, or keep it consistent with the code preceding it?

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Feb 18, 2018

Lets wait for R-devel, paste is not that important

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Feb 23, 2018

Given Martin Maechler's reply today, it's looking likely to be accepted for R. Wow! Well navigated! I updated manual page accordingly (including link to thread too as Jan suggested) and will merge.
Will invite you to be project member too.

@sritchie73
Copy link
Contributor Author

@sritchie73 sritchie73 commented Feb 23, 2018

Thanks!

I am following up with Martin to clarify the functionality of the proposed no.dups argument so we can mirror the behaviour of merge.data.frame.

@mattdowle mattdowle merged commit b392bfe into Rdatatable:master Feb 23, 2018
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants