allow joins for more than two data frames #1963
Why not. Though can you check that other software allows that? If not, there may be a reason. I guess it would be more efficient to join data frames by pairs? Not sure how hard it would be to implement.
This PR should be good for a final review. Thank you!
Looks OK, but have you checked whether other implementations allow that?
ordering of the left `DataFrame` takes precedence over the ordering of the right `DataFrame`.

If there are more than two data frames passed to `join` the joining is performed
recursively with left associativity.
What does this imply in practice? I haven't thought about this too deeply, but I guess it could be more efficient for future optimizations to perform joins in a different order in some cases. I guess that wouldn't change the result as long as we use the same order for columns?
I have deliberately restricted this feature to `:inner`, `:outer` and `:cross` to make sure that the only differences would be:
- efficiency (as you note, but I have left it for later)
- column names in case `makeunique=true` (different columns might get renamed); this can be handled in a future, more efficient implementation, as we have to guarantee a predictable column naming result anyway (that is why I specify the contract: it lets the user know how the columns might get renamed)
- possibly the ordering of rows in `:outer` and `:cross` joins (but again, this can be handled in a more efficient implementation, and again we have to make sure we have a contract here)

On the other hand, other kinds of joins either make little sense or would produce different results (e.g. a `:right` join would produce different values of rows in the output depending on the order of joins). That is why I excluded these options (and I think no one would really need them in practice).
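To make the point about join order concrete, here is a minimal sketch in plain Python (not DataFrames.jl code). The tables, the `id` key and the helpers `inner_join` and `right_join` are all hypothetical and exist only for illustration; tables are lists of dict rows with unique non-key column names. It shows that regrouping an inner join leaves the rows unchanged, while regrouping a right join changes a value.

```python
def inner_join(left, right, on):
    """Inner join of two tables (lists of dict rows); nested loop, illustration only."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


def right_join(left, right, on):
    """Right join: keep every row of `right`; unmatched left columns become None."""
    # Collect the non-key column names of `left`, preserving first-seen order.
    left_cols = list(dict.fromkeys(c for row in left for c in row if c != on))
    out = []
    for r in right:
        matches = [l for l in left if l[on] == r[on]]
        if matches:
            out.extend({**l, **r} for l in matches)
        else:
            out.append({**{c: None for c in left_cols}, **r})
    return out


# Hypothetical data: key 2 is present in `a` and `c` but missing from `b`.
a = [{"id": 1, "a": "a1"}, {"id": 2, "a": "a2"}]
b = [{"id": 1, "b": "b1"}]
c = [{"id": 1, "c": "c1"}, {"id": 2, "c": "c2"}]

# Inner join: both groupings yield the same rows.
inner_left = inner_join(inner_join(a, b, "id"), c, "id")
inner_right = inner_join(a, inner_join(b, c, "id"), "id")

# Right join: the grouping changes the value of column "a" for id == 2.
right_left = right_join(right_join(a, b, "id"), c, "id")
right_right = right_join(a, right_join(b, c, "id"), "id")

row_left = next(r for r in right_left if r["id"] == 2)
row_right = next(r for r in right_right if r["id"] == 2)
print(row_left["a"], row_right["a"])
```

With left-associative grouping the right join loses `a` for `id == 2`, because the intermediate result no longer contains that key; with the other grouping the value survives. The sketch ignores duplicate column names, so it says nothing about the `makeunique=true` renaming contract.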
OK. Any data/thoughts about whether other implementations allow passing several data sets and why?
Pandas `join` allows multiple data frames: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html.
In R there is `merge_all` and `merge_recurse` in reshape.
In general I think it is easy to apply `foldl` on a vector of data frames to achieve what is wanted, so it was not a top priority to have.
However, I do not see a problem with supporting it, and that is why I thought it is OK to have this PR.
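For readers unfamiliar with the fold pattern mentioned here, a rough Python analogue follows, using `functools.reduce` in place of Julia's `foldl`. The helper names `inner_join` and `join_all`, the `id` key and the data are assumptions made for this sketch, not part of the PR.

```python
from functools import reduce


def inner_join(left, right, on):
    """Inner join of two tables (lists of dict rows); illustration only."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


def join_all(tables, on):
    """Left fold: ((t1 join t2) join t3) join ..., i.e. left associativity."""
    return reduce(lambda acc, t: inner_join(acc, t, on), tables)


# Hypothetical input: three small tables sharing the key column "id".
tables = [
    [{"id": 1, "x": 10}, {"id": 2, "x": 20}],
    [{"id": 1, "y": "u"}, {"id": 2, "y": "v"}],
    [{"id": 2, "z": True}],
]

joined = join_all(tables, "id")
print(joined)
```

`reduce` consumes the list from the left, so the accumulated result is always the left operand of the next pairwise join, which is the same grouping the documentation change describes as left associativity.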
OK, cool.
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you for the fixes!
Fixes #1962