Skip to content

Fix(table_diff): Correctly handle joins with composite keys where one or more of the key fields are null#5007

Merged
erindru merged 2 commits intomainfrom
erin/table-diff-null-grain
Jul 24, 2025
Merged

Fix(table_diff): Correctly handle joins with composite keys where one or more of the key fields are null#5007
erindru merged 2 commits intomainfrom
erin/table-diff-null-grain

Conversation

@erindru
Copy link
Collaborator

@erindru erindru commented Jul 24, 2025

Prior to this, when doing a table diff on tables with a composite key where one or more of the fields contained a null value, the results were confusing.

In particular, rows would show as fully matched but then also show up in both the source / target samples as being entirely unmatched.

The cause was "did the row join" being based on __sqlmesh_join_key which produces a single string to join on and coalesces NULL values to '', but the "exists in source table" and "exists in target table" checks were based on older logic that said "no it doesnt exist if any of the key columns are null".

This PR changes the "exists" checks to just check for the presence of __sqlmesh_join_key being NULL or not, which indicates if the row as joined or not. Since we use a FULL JOIN, it will be NULL on one side of the join if there was no join and the side it's NULL on indicates which dataset it was present in.

assert row_diff.stats["join_count"] == 4
assert row_diff.stats["null_grain_count"] == 4
assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
assert row_diff.full_match_pct == 82.35
Copy link
Collaborator Author

@erindru erindru Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I checked internally with @themisvaltinos and the main thing this was testing was the full_match_count (which hasnt changed).

The other values were just what happened to be getting output at the time, which are not correct. This bug has been around for a while

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth adding assertions for row_diff.partial_match_pctand row_diff.s_only_pct being 0 and row_diff.t_only_pct being 17.65

so that we can pick up with the test if we alter the query logic in the future any possible regression. as amongst other things this fix also fixes the partial matches that before were negative.

for example a wrong output with null columns before this fix would show the same column wrongly in both s and t samples and the percentages would be wrong as well:
Screenshot 2025-07-24 at 12 12 16

after the fix:
Screenshot 2025-07-24 at 12 12 32

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i've added some extra assertions.

Although i'm doing what the original PR did and trusting that the current results are correct because I don't understand some of the math used to calculate the percentages (like why do we multiply full_match_count by 2 to figure out the percentage for full_match_pct?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah there is some magic going on.. we multiply by 2 since inside _pct we divide by source count+target count. if full match count was 10 out of 20 source and 20 target columns this would be 10 / 40 so 25% instead of 50%. so we multiply by 2 to account for that

@erindru erindru force-pushed the erin/table-diff-null-grain branch from f9ae4b2 to f0d3cc9 Compare July 24, 2025 05:19
assert row_diff.stats["join_count"] == 4
assert row_diff.stats["null_grain_count"] == 4
assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
assert row_diff.full_match_pct == 82.35
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was identical to test_grain_check (just parameterized to run across all engine adapters) so had to be modified in the same way

Copy link
Contributor

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR description! :)

@erindru erindru merged commit 47706cf into main Jul 24, 2025
27 checks passed
@erindru erindru deleted the erin/table-diff-null-grain branch July 24, 2025 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants