-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add unpivot #2204
[FEAT] Add unpivot #2204
Conversation
2f7d12b
to
34290e9
Compare
ec00b2b
to
125666a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2204 +/- ##
==========================================
+ Coverage 85.59% 85.65% +0.05%
==========================================
Files 71 71
Lines 7594 7639 +45
==========================================
+ Hits 6500 6543 +43
- Misses 1094 1096 +2
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 stuff!
Also if you see a bunch of decoupled comments, its cuz i was clicking 'add single comment' instead of 'start a review', only realized after like 5 comments lmao 😢
return [ | ||
PartialPartitionMetadata( | ||
num_rows=None, | ||
size_bytes=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if i'm wrong here, but can you derive num_rows from self.len() * values.len()
? And also size_bytes should be the same as input right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah we can totally get the number of rows here. However the size_bytes is not going to be exactly the same as the values in the id columns are going to be repeated
let value_dtype = values_fields | ||
.iter() | ||
.map(|f| f.dtype.clone()) | ||
.try_reduce(|a, b| try_get_supertype(&a, &b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel like there's a lot of try_get_super_type calls, would it make sense to just do it once, then pipe the type down all the way into the actual unpivot logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree with you. The thing is both the logical plan and the micropartition expect a schema, but micropartitions can also be created on their own, which means it won't make sense to pipe the result from the logical op creation. Moreover I also do schema resolution differently depending on if the micropartition is empty, but it is trivial to derive a schema from a table so I don't see a great reason to combine them.
But that means there are three places in the code where we find the dtype of the value. I'm open to any ideas about how to improve that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha, I think it's fine then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Adds the unpivot dataframe operation