allow joins for more than two data frames #1963

Merged
bkamins merged 4 commits into JuliaData:master from flexible_join on Oct 14, 2019

bkamins
Member

@bkamins bkamins commented Sep 25, 2019

Fixes #1962

@nalimilan
Member

Why not. Though can you check that other software allows that? If not, there may be a reason.

I guess it would be more efficient to join data frames by pairs? Not sure how hard it would be to implement.

@bkamins
Member Author

bkamins commented Sep 27, 2019

This PR should be good for a final review. Thank you!

Member

@nalimilan nalimilan left a comment

Looks OK, but have you checked whether other implementations allow that?

src/abstractdataframe/join.jl
ordering of the left `DataFrame` takes precedence over the ordering of the right `DataFrame`.

If there are more than two data frames passed to `join` the joining is performed
recursively with left associativity.
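
To make "left associativity" concrete, here is a minimal sketch. It assumes a recent DataFrames.jl release (1.x), where the `join(...; kind=...)` function discussed in this PR is spelled `innerjoin`, `outerjoin`, and `crossjoin`; the data frames `a`, `b`, and `c` are made up for illustration.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x is installed

# Three illustrative tables sharing the key column :id
a = DataFrame(id = [1, 2, 3], x = ["a1", "a2", "a3"])
b = DataFrame(id = [2, 3, 4], y = ["b2", "b3", "b4"])
c = DataFrame(id = [3, 4, 5], z = ["c3", "c4", "c5"])

# Passing all three data frames in one call ...
multi = innerjoin(a, b, c; on = :id)

# ... should match joining the first two, then joining the result with the third
nested = innerjoin(innerjoin(a, b; on = :id), c; on = :id)

@assert multi == nested
println(multi)
```

Only `id == 3` appears in all three tables, so the result has a single row, and the columns keep the left-to-right order of the arguments.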
Member

What does this imply in practice? I haven't thought about this too deeply, but I guess it could be more efficient for future optimizations to perform joins in a different order in some cases. I guess that wouldn't change the result as long as we use the same order for columns?

Member Author

I deliberately restricted this feature to :inner, :outer and :cross so that the only differences caused by changing the join order would be:

  1. efficiency (as you note, but I have left that for later)
  2. column names when makeunique=true (different columns might get renamed). A future, more efficient implementation can handle this, since we have to guarantee predictable column naming anyway; that is why I specify the contract, so the user knows how columns might get renamed
  3. possibly the ordering of rows in :outer and :cross joins (again, a more efficient implementation can handle this, and again we need a contract here)

On the other hand, the other kinds of joins either make little sense or would produce different results (e.g. a :right join would produce different row values in the output depending on the order of joins). That is why I excluded them (and I think no one would really need them in practice).
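
Point 2 can be illustrated with a small sketch. It assumes DataFrames.jl 1.x, where the multi-argument inner join is spelled `innerjoin` and accepts `makeunique`; the single-row tables and their values are hypothetical.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x is installed

# Each table has the key :id plus a clashing non-key column :v
a = DataFrame(id = [1], v = ["from a"])
b = DataFrame(id = [1], v = ["from b"])
c = DataFrame(id = [1], v = ["from c"])

# Same tables, different argument order
abc = innerjoin(a, b, c; on = :id, makeunique = true)
bac = innerjoin(b, a, c; on = :id, makeunique = true)

# The first argument's :v keeps its name; the later ones get a suffix,
# so which data ends up under the plain name :v depends on the order.
println(names(abc))
println(abc.v)
println(bac.v)
```

The rows are the same in both results, but the column called `v` holds `a`'s values in one and `b`'s in the other, which is why a fixed, documented join order matters to users.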

Member

OK. Any data/thoughts about whether other implementations allow passing several data sets and why?

Member Author

@bkamins bkamins Oct 14, 2019

Pandas join allows multiple data frames: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html.

In R there are merge_all and merge_recurse in the reshape package.

In general, I think it is easy to apply foldl to a vector of data frames to achieve the same thing, so this was not a top priority.
However, I do not see a problem with supporting it, which is why I thought it was OK to have this PR.
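
The `foldl` approach mentioned above might look like the following sketch. It assumes DataFrames.jl 1.x (`outerjoin` instead of `join(...; kind=:outer)`), and the vector `dfs` is invented for illustration.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x is installed

# A vector of tables that share the key column :id
dfs = [DataFrame(id = [1, 2], a = [10, 20]),
       DataFrame(id = [2, 3], b = [30, 40]),
       DataFrame(id = [3, 4], c = [50, 60])]

# Fold from the left: ((dfs[1] ⋈ dfs[2]) ⋈ dfs[3])
joined = foldl((l, r) -> outerjoin(l, r; on = :id), dfs)

# The multi-argument form takes the same tables splatted as arguments
splatted = outerjoin(dfs...; on = :id)

println(joined)
```

Because this is an outer join, every `id` from any table appears in the result, with `missing` where a table has no matching row.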

Member

OK, cool.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Oct 9, 2019

Thank you for the fixes!

@bkamins bkamins merged commit c3771d4 into JuliaData:master Oct 14, 2019
@bkamins bkamins deleted the flexible_join branch October 14, 2019 13:42
Successfully merging this pull request may close these issues.

can't join more than two dataframes?