Skip to content

feat: Tolerances for inner lists and arrays#21

Open
Marius Merkle (MariusMerkleQC) wants to merge 5 commits intomainfrom
nested_list_comparison
Open

feat: Tolerances for inner lists and arrays#21
Marius Merkle (MariusMerkleQC) wants to merge 5 commits intomainfrom
nested_list_comparison

Conversation

@MariusMerkleQC
Copy link
Copy Markdown
Collaborator

@MariusMerkleQC Marius Merkle (MariusMerkleQC) commented Mar 27, 2026

Motivation

Follow-up to #19, this finally solves #8 to 100%. We so far defaulted to naive comparison for inner lists vs lists, so whenever they were nested within some other data structure (like an array of lists, a struct of struct of lists, etc.). Element-wise comparison accounting for tolerances is now applied instead: whenever two columns contain a list anywhere in their data type "tree", we compute the maximum length of the lists, where maximum is both over
(1) left and right data frame
(2) on any level in the data type tree

In list vs list comparisons, we then traverse all elements up to max_list_length and cover out-of-bounds by returning None. This doesn't yield false positive matches as we combine the element-wise check with a list-length check.

Changes

  • Adjusted _max_list_lengths_by_column to consider all data type levels
  • Adjusted expected outcome in test_condition_equal_columns_nested_list_array_with_tolerance
  • Added a new test test_condition_equal_columns_lists_only_inner where lists are not an outer but inner data type

@github-actions github-actions bot added the enhancement New feature or request label Mar 27, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (2a3010b) to head (3d5467a).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #21   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines          742       759   +17     
=========================================
+ Hits           742       759   +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

assert actual.to_list() == [True, False, False]
else:
assert actual.to_list() == [True, True, False]
assert actual.to_list() == [True, True, False]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now get the desired output 🥳

"""Collect max-list-length scalar expressions for every List level in the type
tree."""
if isinstance(dtype, pl.List):
return [expr.list.len().max(), *_list_length_exprs(expr.explode(), dtype.inner)]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aliases clash, but it doesn't matter because we take max_horizontal immediately after.

schema={"pk": pl.Int64, "a_right": rhs_type},
)

max_list_length: int | None = None
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was incorrect in #19. However, I suggest an alternative approach to the tests in #23 to ensure we use the correct max_list_length values.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends tolerance-based element-wise comparisons to apply to inner list levels nested inside other types (e.g., structs/arrays containing lists) by computing a per-column maximum list length across the full dtype “tree”.

Changes:

  • Update _max_list_lengths_by_column to detect lists at any nesting level and compute the maximum list length across both frames and all list levels.
  • Adjust list-vs-list sequence comparison to always unroll element-wise using the computed max_list_length (and propagate it into recursive comparisons).
  • Update/extend tests to cover nested inner-list tolerance behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
diffly/comparison.py Computes max list lengths for columns with nested lists anywhere in their dtype tree.
diffly/_conditions.py Removes nested-list fallback equality and requires max_list_length for List-vs-List unrolling; propagates length to recursive comparisons.
tests/test_conditions.py Updates expected results and adds a new test for inner-list-only nesting (struct containing list).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually didn't consider this previously but comparing all list elements independently will cause severe performance regressions when running the comparison on (large) lists with non-floats. Can we benchmark this and use all of this complicated logic only if there is a float somewhere in the "list hierarchy" (-> separate PR)?

Comment on lines +209 to 214
if max_list_length is None:
raise ValueError(
"max_list_length must be provided for List-vs-List comparisons "
"in _compare_sequence_columns()."
)
n_elements = max_list_length
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triggering this branch in tests seems artificial. Tbh, I would just assume that this is set correctly (after all, it's a parameter that is exclusively set internally):

Suggested change
if max_list_length is None:
raise ValueError(
"max_list_length must be provided for List-vs-List comparisons "
"in _compare_sequence_columns()."
)
n_elements = max_list_length
n_elements = cast(int, max_list_length)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Properly perform floating point comparisons for structs and lists

3 participants