[FEAT] Add unpivot #2204

Merged
merged 10 commits into from
May 6, 2024

Conversation

kevinzwang
Member

Adds the unpivot dataframe operation
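For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what an unpivot (wide-to-long reshape) does. It is an illustration only, not the code from this PR: the function name, the `variable_name`/`value_name` parameters and their defaults, and the row ordering of the output are all assumptions made for the example.

```python
# Illustrative sketch only; not the implementation added in this PR.
# All names, defaults, and the output ordering here are assumptions.

def unpivot(rows, ids, values, variable_name="variable", value_name="value"):
    """Turn each input row into one output row per value column."""
    out = []
    for row in rows:
        for col in values:
            new_row = {k: row[k] for k in ids}  # id columns are copied as-is
            new_row[variable_name] = col        # name of the source column
            new_row[value_name] = row[col]      # value that column held
            out.append(new_row)
    return out


wide = [
    {"id": 1, "x": 10, "y": 100},
    {"id": 2, "x": 20, "y": 200},
]
long = unpivot(wide, ids=["id"], values=["x", "y"])
for r in long:
    print(r)
```

Each input row produces one output row per value column, so two rows with two value columns yield four rows.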

@github-actions github-actions bot added the enhancement New feature or request label Apr 29, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 29, 2024

codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 95.52239%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 85.65%. Comparing base (29d310b) to head (99d7489).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2204      +/-   ##
==========================================
+ Coverage   85.59%   85.65%   +0.05%     
==========================================
  Files          71       71              
  Lines        7594     7639      +45     
==========================================
+ Hits         6500     6543      +43     
- Misses       1094     1096       +2     
Files Coverage Δ
daft/execution/execution_step.py 93.77% <100.00%> (+0.19%) ⬆️
daft/execution/rust_physical_plan_shim.py 96.25% <100.00%> (+0.25%) ⬆️
daft/logical/builder.py 90.97% <100.00%> (+0.35%) ⬆️
daft/table/micropartition.py 90.70% <100.00%> (+0.16%) ⬆️
daft/dataframe/dataframe.py 90.19% <92.30%> (-0.06%) ⬇️

Contributor

@colin-ho colin-ho left a comment

🔥 stuff!

Also, if you see a bunch of decoupled comments, it's cuz I was clicking 'add single comment' instead of 'start a review', only realized after like 5 comments lmao 😢

return [
    PartialPartitionMetadata(
        num_rows=None,
        size_bytes=None,
Contributor

Correct me if I'm wrong here, but can you derive num_rows from self.len() * values.len()? Also, size_bytes should be the same as the input, right?

Member Author

Oh yeah, we can totally get the number of rows here. However, size_bytes is not going to be exactly the same, since the values in the id columns are going to be repeated.
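The arithmetic behind this exchange can be sketched as follows. This is a hedged illustration with hypothetical helper names, not code from the PR: the output row count is the input row count times the number of value columns, while the id columns end up holding more cells than before because every id value is repeated once per value column.

```python
# Illustrative arithmetic only; helper names are hypothetical.

def unpivot_num_rows(input_rows: int, num_value_cols: int) -> int:
    """Each input row yields one output row per value column."""
    return input_rows * num_value_cols


def id_cells(num_rows: int, num_id_cols: int) -> int:
    """Number of cells stored across the id columns."""
    return num_rows * num_id_cols


input_rows, num_id_cols, num_value_cols = 3, 2, 4

output_rows = unpivot_num_rows(input_rows, num_value_cols)
cells_in = id_cells(input_rows, num_id_cols)    # id cells before unpivot
cells_out = id_cells(output_rows, num_id_cols)  # id cells after unpivot

print(output_rows, cells_in, cells_out)
```

With 3 rows, 2 id columns, and 4 value columns, the id columns grow from 6 cells to 24, which is why the input size cannot simply be reused for size_bytes.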

src/daft-plan/src/logical_ops/unpivot.rs (resolved)
let value_dtype = values_fields
    .iter()
    .map(|f| f.dtype.clone())
    .try_reduce(|a, b| try_get_supertype(&a, &b))
Contributor

I feel like there are a lot of try_get_supertype calls. Would it make sense to just do it once, then pipe the type all the way down into the actual unpivot logic?

Member Author

Yeah, I agree with you. The thing is, both the logical plan and the micropartition expect a schema, but micropartitions can also be created on their own, which means it won't make sense to pipe the result from the logical op creation. Moreover, I also do schema resolution differently depending on whether the micropartition is empty, but it is trivial to derive a schema from a table, so I don't see a great reason to combine them.

But that means there are three places in the code where we find the dtype of the value. I'm open to any ideas about how to improve that!
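For context, the value dtype being discussed is found by folding a "supertype" function over the dtypes of the value columns. Below is a rough Python analogue of that fold. The toy type ranking and the `toy_supertype` and `value_dtype` helpers are hypothetical stand-ins written for this sketch; they are not Daft's actual type-promotion rules.

```python
# Rough analogue of reducing dtypes to a common supertype.
# The ranking and helper names are hypothetical, not Daft's real rules.
from functools import reduce
from typing import List, Optional

_RANK = {"int32": 0, "int64": 1, "float64": 2}


def toy_supertype(a: str, b: str) -> str:
    """Return the wider of two toy dtypes, or raise if either is unknown."""
    if a not in _RANK or b not in _RANK:
        raise TypeError(f"no common supertype for {a!r} and {b!r}")
    return a if _RANK[a] >= _RANK[b] else b


def value_dtype(dtypes: List[str]) -> Optional[str]:
    """Fold toy_supertype over all dtypes; None when there are no columns."""
    if not dtypes:
        return None
    return reduce(toy_supertype, dtypes)


print(value_dtype(["int32", "float64", "int64"]))
```

An incompatible pair raises an error, and an empty input yields `None`, loosely mirroring how a fallible reduce over an empty iterator has no result.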

Contributor

Ah gotcha, I think it's fine then.

src/daft-plan/src/partitioning.rs (resolved)
tests/dataframe/test_unpivot.py (resolved)
daft/dataframe/dataframe.py (resolved)
@kevinzwang kevinzwang requested a review from colin-ho May 6, 2024 06:38
Contributor

@colin-ho colin-ho left a comment

LGTM!

@kevinzwang kevinzwang merged commit c57aaad into main May 6, 2024
29 checks passed
@kevinzwang kevinzwang deleted the kevin/unpivot branch May 6, 2024 22:27