feat: Tolerances for inner lists and arrays#21
feat: Tolerances for inner lists and arrays#21Marius Merkle (MariusMerkleQC) wants to merge 5 commits intomainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #21 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 10 10
Lines 742 759 +17
=========================================
+ Hits 742 759 +17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| assert actual.to_list() == [True, False, False] | ||
| else: | ||
| assert actual.to_list() == [True, True, False] | ||
| assert actual.to_list() == [True, True, False] |
There was a problem hiding this comment.
We now get the desired output 🥳
| """Collect max-list-length scalar expressions for every List level in the type | ||
| tree.""" | ||
| if isinstance(dtype, pl.List): | ||
| return [expr.list.len().max(), *_list_length_exprs(expr.explode(), dtype.inner)] |
There was a problem hiding this comment.
The aliases clash, but it doesn't matter because we take max_horizontal immediately after.
| schema={"pk": pl.Int64, "a_right": rhs_type}, | ||
| ) | ||
|
|
||
| max_list_length: int | None = None |
There was a problem hiding this comment.
Pull request overview
Extends tolerance-based element-wise comparisons to apply to inner list levels nested inside other types (e.g., structs/arrays containing lists) by computing a per-column maximum list length across the full dtype “tree”.
Changes:
- Update
_max_list_lengths_by_columnto detect lists at any nesting level and compute the maximum list length across both frames and all list levels. - Adjust list-vs-list sequence comparison to always unroll element-wise using the computed
max_list_length(and propagate it into recursive comparisons). - Update/extend tests to cover nested inner-list tolerance behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
diffly/comparison.py |
Computes max list lengths for columns with nested lists anywhere in their dtype tree. |
diffly/_conditions.py |
Removes nested-list fallback equality and requires max_list_length for List-vs-List unrolling; propagates length to recursive comparisons. |
tests/test_conditions.py |
Updates expected results and adds a new test for inner-list-only nesting (struct containing list). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Oliver Borchert (borchero)
left a comment
There was a problem hiding this comment.
I actually didn't consider this previously but comparing all list elements independently will cause severe performance regressions when running the comparison on (large) lists with non-floats. Can we benchmark this and use all of this complicated logic only if there is a float somewhere in the "list hierarchy" (-> separate PR)?
| if max_list_length is None: | ||
| raise ValueError( | ||
| "max_list_length must be provided for List-vs-List comparisons " | ||
| "in _compare_sequence_columns()." | ||
| ) | ||
| n_elements = max_list_length |
There was a problem hiding this comment.
Triggering this branch in tests seems artificial. Tbh, I would just assume that this is set correctly (after all, it's a parameter that is exclusively set internally):
| if max_list_length is None: | |
| raise ValueError( | |
| "max_list_length must be provided for List-vs-List comparisons " | |
| "in _compare_sequence_columns()." | |
| ) | |
| n_elements = max_list_length | |
| n_elements = cast(int, max_list_length) |
Motivation
Follow-up to #19, this finally solves #8 to 100%. We so far defaulted to naive comparison for inner lists vs lists, so whenever they were nested within some other data structure (like an array of lists, a struct of struct of lists, etc.). Element-wise comparison accounting for tolerances is now applied instead: whenever two columns contain a list anywhere in their data type "tree", we compute the maximum length of the lists, where maximum is both over
(1) left and right data frame
(2) on any level in the data type tree
In list vs list comparisons, we then traverse all elements up to
max_list_lengthand cover out-of-bounds by returningNone. This doesn't yield false positive matches as we combine the element-wise check with a list-length check.Changes
_max_list_lengths_by_columnto consider all data type levelstest_condition_equal_columns_nested_list_array_with_tolerancetest_condition_equal_columns_lists_only_innerwhere lists are not an outer but inner data type