allow joins for more than two data frames #1963

bkamins · 2019-09-25T14:44:29Z

Fixes #1962

nalimilan · 2019-09-25T15:05:49Z

Why not. Though can you check that other software allows that? If not, there may be a reason.

I guess it would be more efficient to join data frames by pairs? Not sure how hard it would be to implement.

src/abstractdataframe/join.jl

bkamins · 2019-09-27T07:27:40Z

This PR should be good for a final review. Thank you!

nalimilan

Looks OK, but have you checked whether other implementations allow that?

src/abstractdataframe/join.jl

nalimilan · 2019-10-09T16:53:31Z

src/abstractdataframe/join.jl

+ordering of the left `DataFrame` takes precedence over the ordering of the right `DataFrame`.
+
+If there are more than two data frames passed to `join` the joining is performed
+recursively with left associativity.


What does this imply in practice? I haven't thought about this too deeply, but I guess a future optimization could perform the joins in a different order in some cases to be more efficient. I guess that wouldn't change the result as long as we keep the same column order?

bkamins · 2019-10-09

I have deliberately restricted this feature to :inner, :outer and :cross joins, so that the order in which the joins are performed can only affect:

- efficiency (as you note, but I have left that for later);
- column names when makeunique=true (different columns might get renamed). A future, more efficient implementation can handle this, since we have to guarantee a predictable column naming result anyway. That is why I specify the contract: it lets the user know how the columns might get renamed;
- possibly the ordering of rows in :outer and :cross joins (again, a more efficient implementation can handle this, and again we have to make sure we have a contract here).

On the other hand, the other kinds of joins either make little sense or would produce different results depending on the order of the joins (e.g. a :right join would produce different row values in the output). That is why I excluded these options, and I think no one would really need them in practice.
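To make the order dependence concrete, here is a minimal Python sketch. It is not the DataFrames.jl implementation: tables are modelled as plain dicts of the form `{key: {column: value}}` with unique keys, and `right_join`, `inner_join` and `columns_of` are hypothetical helpers written only for this illustration. Under those assumptions, regrouping a chain of right joins changes a value in the output, while regrouping a chain of inner joins does not.

```python
def columns_of(table):
    """Collect every column name used in a toy table {key: {column: value}}."""
    return sorted({col for row in table.values() for col in row})


def inner_join(left, right):
    """Keep only the keys present in both tables (toy model, unique keys)."""
    return {k: {**left[k], **right[k]} for k in left if k in right}


def right_join(left, right):
    """Keep every key of `right`; left columns are None where nothing matches."""
    left_cols = columns_of(left)
    result = {}
    for key, right_row in right.items():
        left_row = left.get(key, {col: None for col in left_cols})
        result[key] = {**left_row, **right_row}
    return result


a = {1: {"a": "a1"}, 3: {"a": "a3"}}
b = {1: {"b": "b1"}}
c = {1: {"c": "c1"}, 3: {"c": "c3"}}

# Left-associative grouping: (a right-join b) right-join c
right_left_first = right_join(right_join(a, b), c)
# Alternative grouping: a right-join (b right-join c)
right_right_first = right_join(a, right_join(b, c))

print(right_left_first[3])   # key 3 lost its "a" value because b has no key 3
print(right_right_first[3])  # key 3 keeps "a3" because a is joined last

# The same regrouping leaves the inner-join result unchanged.
inner_left_first = inner_join(inner_join(a, b), c)
inner_right_first = inner_join(a, inner_join(b, c))
print(inner_left_first == inner_right_first)
```

Key 3 exists in `a` and `c` but not in `b`, which is exactly the case where the grouping of right joins matters in this toy model.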

nalimilan · 2019-10-14

OK. Any data or thoughts on whether other implementations allow passing several data sets, and why?

bkamins · 2019-10-14

Pandas `join` allows multiple data frames (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html).

In R, the reshape package provides merge_all and merge_recurse.

In general, it is easy to apply foldl to a vector of data frames to achieve the same thing, so this was not a top priority. However, I do not see a problem with supporting it, which is why I think it is OK to have this PR.
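The fold-based workaround mentioned above can be sketched in Python, where `functools.reduce` plays the role of Julia's `foldl`. This is only an illustration under stated assumptions: tables are toy dicts of the form `{key: {column: value}}` with unique keys, and `inner_join` and `join_all` are hypothetical helpers, not part of DataFrames.jl or pandas.

```python
from functools import reduce


def inner_join(left, right):
    """Inner-join two toy tables of the form {key: {column: value}}."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


def join_all(tables):
    """Fold a sequence of tables from the left: join(join(t1, t2), t3), ..."""
    return reduce(inner_join, tables)


people = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
ages = {1: {"age": 31}, 2: {"age": 45}}
cities = {2: {"city": "Oslo"}, 3: {"city": "Rome"}}

joined = join_all([people, ages, cities])
print(joined)  # {2: {'name': 'Bob', 'age': 45, 'city': 'Oslo'}}
```

The left fold matches the "left associativity" contract in the docstring under review: the first two tables are joined, and each further table is joined to the accumulated result.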

bkamins · 2019-10-09T17:17:16Z

Thank you for the fixes!

bkamins mentioned this pull request Sep 25, 2019

can't join more than two dataframes? #1962

Closed

nalimilan reviewed Sep 25, 2019

src/abstractdataframe/join.jl (resolved)

nalimilan reviewed Sep 25, 2019

src/abstractdataframe/join.jl (outdated, resolved)

bkamins added 3 commits October 5, 2019 12:19

allow joins for more than two data frames

56b2320

fix typo

1103262

fix doc

b4a3548

bkamins force-pushed the flexible_join branch from b4090dd to b4a3548 Compare October 5, 2019 10:23

nalimilan reviewed Oct 9, 2019

Apply suggestions from code review

8b8520a

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Oct 14, 2019

bkamins merged commit c3771d4 into JuliaData:master Oct 14, 2019

bkamins deleted the flexible_join branch October 14, 2019 13:42
