[FEAT] Add unpivot #2204

kevinzwang · 2024-04-29T18:42:43Z

Adds the unpivot dataframe operation

codecov · 2024-04-30T18:02:54Z

Codecov Report

Attention: Patch coverage is 95.52239%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 85.65%. Comparing base (29d310b) to head (99d7489).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2204      +/-   ##
==========================================
+ Coverage   85.59%   85.65%   +0.05%     
==========================================
  Files          71       71              
  Lines        7594     7639      +45     
==========================================
+ Hits         6500     6543      +43     
- Misses       1094     1096       +2

Files	Coverage Δ
daft/execution/execution_step.py	`93.77% <100.00%> (+0.19%)`	⬆️
daft/execution/rust_physical_plan_shim.py	`96.25% <100.00%> (+0.25%)`	⬆️
daft/logical/builder.py	`90.97% <100.00%> (+0.35%)`	⬆️
daft/table/micropartition.py	`90.70% <100.00%> (+0.16%)`	⬆️
daft/dataframe/dataframe.py	`90.19% <92.30%> (-0.06%)`	⬇️

src/daft-core/src/series/ops/mod.rs

src/daft-core/src/series/ops/if_else.rs

src/daft-core/src/series/ops/mod.rs

colin-ho

🔥 stuff!

Also if you see a bunch of decoupled comments, it's cuz I was clicking 'add single comment' instead of 'start a review', only realized after like 5 comments lmao 😢

colin-ho · 2024-05-01T23:42:01Z

daft/execution/execution_step.py

+        return [
+            PartialPartitionMetadata(
+                num_rows=None,
+                size_bytes=None,


Correct me if I'm wrong here, but can you derive num_rows from self.len() * values.len()? And size_bytes should be the same as the input, right?

Oh yeah, we can totally get the number of rows here. However, size_bytes is not going to be exactly the same as the input, because the values in the id columns are going to be repeated.
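
To make the row-count argument above concrete, here is a minimal pure-Python sketch of an unpivot over a list of dicts. It is not Daft's implementation: the helper `unpivot_rows`, its parameter names, and the row ordering are assumptions chosen for illustration. It shows that the output has `len(rows) * len(values)` rows and that each id value is repeated once per value column, which is why the output size differs from the input size.

```python
from typing import Any, Dict, List


def unpivot_rows(
    rows: List[Dict[str, Any]],
    ids: List[str],
    values: List[str],
    variable_name: str = "variable",
    value_name: str = "value",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn each value column into its own output row."""
    out = []
    for row in rows:
        for col in values:
            # The id columns are copied into every output row.
            new_row = {k: row[k] for k in ids}
            new_row[variable_name] = col
            new_row[value_name] = row[col]
            out.append(new_row)
    return out


rows = [
    {"year": 2020, "jan": 10, "feb": 20},
    {"year": 2021, "jan": 30, "feb": 40},
]
values = ["jan", "feb"]
long_rows = unpivot_rows(rows, ids=["year"], values=values)

# Row count is known up front: input rows times number of value columns.
assert len(long_rows) == len(rows) * len(values)

# Each id value now appears once per value column, so the data is not the
# same size as the input even though no new information was added.
year_2020_count = sum(1 for r in long_rows if r["year"] == 2020)
assert year_2020_count == len(values)
```

The row order in this sketch (all value columns for the first input row, then the second, and so on) is an arbitrary choice; the count holds regardless of ordering.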

src/daft-plan/src/logical_ops/unpivot.rs

colin-ho · 2024-05-01T23:50:17Z

src/daft-plan/src/logical_ops/unpivot.rs

+        let value_dtype = values_fields
+            .iter()
+            .map(|f| f.dtype.clone())
+            .try_reduce(|a, b| try_get_supertype(&a, &b))


I feel like there are a lot of try_get_supertype calls. Would it make sense to just do it once, then pipe the type down all the way into the actual unpivot logic?

Yeah, I agree with you. The thing is, both the logical plan and the micropartition expect a schema, but micropartitions can also be created on their own, which means it won't make sense to pipe the result from the logical op creation. Moreover, I also do schema resolution differently depending on whether the micropartition is empty, but it is trivial to derive a schema from a table, so I don't see a great reason to combine them.

But that means there are three places in the code where we find the dtype of the value. I'm open to any ideas about how to improve that!

Ah gotcha, I think it's fine then.
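
For readers unfamiliar with the pattern in the snippet above, the following is a rough Python analogue of folding the value columns' dtypes into a single supertype. The dtype names, the `_RANK` widening order, and the functions `toy_supertype` and `value_dtype` are all hypothetical; Daft's actual `try_get_supertype` rules are richer and are not reproduced here. Like Rust's `try_reduce`, the sketch yields `None` for an empty input and stops with an error on the first incompatible pair.

```python
from functools import reduce
from typing import List, Optional

# Hypothetical widening order for a toy numeric lattice (not Daft's rules).
_RANK = {"int32": 0, "int64": 1, "float64": 2}


def toy_supertype(a: str, b: str) -> str:
    """Return the wider of two toy dtypes, or raise if they are incompatible."""
    if a == b:
        return a
    if a in _RANK and b in _RANK:
        return a if _RANK[a] >= _RANK[b] else b
    raise TypeError(f"no common supertype for {a} and {b}")


def value_dtype(dtypes: List[str]) -> Optional[str]:
    """Fold all value-column dtypes into one; None when there are no columns."""
    if not dtypes:
        return None
    return reduce(toy_supertype, dtypes)


resolved = value_dtype(["int32", "int64", "float64"])
assert resolved == "float64"
```

Because the fold depends only on the list of dtypes, it can be recomputed cheaply wherever a schema is available, which fits the conclusion in the thread that repeating the call in a few places is acceptable.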

src/daft-plan/src/partitioning.rs

tests/dataframe/test_unpivot.py

daft/dataframe/dataframe.py

colin-ho

LGTM!

kevinzwang requested review from samster25, jaychia and colin-ho April 29, 2024 18:42

github-actions bot added the enhancement New feature or request label Apr 29, 2024

kevinzwang force-pushed the kevin/unpivot branch from 2f7d12b to 34290e9 Compare April 29, 2024 18:45

github-actions bot added the documentation Improvements or additions to documentation label Apr 29, 2024

kevinzwang added 4 commits April 30, 2024 10:36

add unpivot

dab44d9

undo accidental cargo lock changes

6c984af

add melt and docs

c27fc95

fix things from rebase

125666a

kevinzwang force-pushed the kevin/unpivot branch from ec00b2b to 125666a Compare April 30, 2024 17:48

fix naming

b69e02b

fix empty partition schema

59ddac3