Fix(table_diff): Correctly handle joins with composite keys where one or more of the key fields are null by erindru · Pull Request #5007 · SQLMesh/sqlmesh

erindru · 2025-07-24T05:10:03Z

Prior to this, when doing a table diff on tables with a composite key where one or more of the fields contained a null value, the results were confusing.

In particular, rows would show as fully matched but then also show up in both the source / target samples as being entirely unmatched.

The cause was that "did the row join" was based on __sqlmesh_join_key, which produces a single string to join on and coalesces NULL values to '', while the "exists in source table" and "exists in target table" checks were based on older logic that said "no, it doesn't exist if any of the key columns are null".

This PR changes the "exists" checks to just check whether __sqlmesh_join_key is NULL, which indicates whether the row was joined or not. Since we use a FULL JOIN, it will be NULL on one side of the join if there was no join, and the side it's NULL on indicates which dataset the row was present in.
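To make the mismatch concrete, here is a minimal pure-Python sketch of the two "exists" rules. It is an illustration under assumptions, not the SQLMesh implementation: the helper names, the "|" separator and the sample rows are all hypothetical, and the FULL JOIN is emulated with dictionaries rather than SQL.

```python
# Hypothetical sketch: helper names and the "|" separator are illustrative only.
KEY_COLS = ("id", "region")


def join_key(row):
    """Build a single string key, coalescing NULL (None) to ''."""
    return "|".join("" if row[c] is None else str(row[c]) for c in KEY_COLS)


def full_join(source, target):
    """Emulate a FULL JOIN on the string key; an absent side gets a None key."""
    s_by_key = {join_key(r): r for r in source}
    t_by_key = {join_key(r): r for r in target}
    return [
        {
            "s_key": k if k in s_by_key else None,
            "t_key": k if k in t_by_key else None,
            "s": s_by_key.get(k),
            "t": t_by_key.get(k),
        }
        for k in sorted(s_by_key.keys() | t_by_key.keys())
    ]


def exists_old(row):
    """Older rule: a side 'exists' only if none of its key columns is NULL."""
    return row is not None and all(row[c] is not None for c in KEY_COLS)


def exists_new(side_key):
    """Rule described in this PR: a side exists iff its join key is not NULL."""
    return side_key is not None


# Made-up data: the first row of each table has a NULL in the composite key.
source = [{"id": 1, "region": None}, {"id": 2, "region": "eu"}]
target = [{"id": 1, "region": None}, {"id": 3, "region": "us"}]

rows = full_join(source, target)
```

In this sketch the row with id 1 joins on both sides, yet the older rule reports it as missing from both tables, which is the contradiction described above. The key-based rule agrees with the join for every row.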

erindru · 2025-07-24T05:12:30Z

tests/core/test_table_diff.py

-    assert row_diff.stats["join_count"] == 4
-    assert row_diff.stats["null_grain_count"] == 4
-    assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
+    assert row_diff.full_match_pct == 82.35


Note: I checked internally with @themisvaltinos and the main thing this was testing was the full_match_count (which hasn't changed).

The other values were just what happened to be getting output at the time, and they are not correct. This bug has been around for a while.

maybe worth adding assertions for row_diff.partial_match_pct and row_diff.s_only_pct being 0 and row_diff.t_only_pct being 17.65

so that the test picks up any possible regression if we alter the query logic in the future. among other things, this fix also corrects the partial match values, which were negative before.

for example a wrong output with null columns before this fix would show the same column wrongly in both s and t samples and the percentages would be wrong as well:

after the fix:

Sure, I've added some extra assertions.

Although I'm doing what the original PR did and trusting that the current results are correct, because I don't understand some of the math used to calculate the percentages (like why do we multiply full_match_count by 2 to figure out the percentage for full_match_pct?)

yeah there is some magic going on.. we multiply by 2 since inside _pct we divide by source count + target count. if the full match count was 10 out of 20 source and 20 target rows, this would be 10 / 40, so 25% instead of 50%. so we multiply by 2 to account for that
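The arithmetic above can be sketched as follows. This is a hedged illustration, not the actual SQLMesh code: the function names, the rounding to two decimals and the zero-total guard are assumptions. The 7 / 10 row counts in the usage lines are also hypothetical, chosen only because they reproduce the percentages asserted in the test; the real fixture may differ.

```python
def pct(numerator, s_count, t_count):
    """Percentage of the combined source + target row count (assumed 2-decimal rounding)."""
    total = s_count + t_count
    return round(numerator / total * 100, 2) if total else 0.0


def full_match_pct(full_match_count, s_count, t_count):
    # Each fully matched pair accounts for one source row AND one target row,
    # so it is counted twice against the combined denominator.
    return pct(2 * full_match_count, s_count, t_count)


# Example from the comment: 10 full matches out of 20 source and 20 target rows.
without_doubling = pct(10, 20, 20)
with_doubling = full_match_pct(10, 20, 20)

# Hypothetical counts: 7 source rows, 10 target rows, 7 full matches, 3 target-only rows.
example_full = full_match_pct(7, 7, 10)
example_t_only = pct(3, 7, 10)
```

With the doubling in place, the full-match, partial-match, source-only and target-only percentages share one denominator and can sum to 100.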


erindru · 2025-07-24T05:20:28Z

tests/core/engine_adapter/integration/test_integration.py

-    assert row_diff.stats["join_count"] == 4
-    assert row_diff.stats["null_grain_count"] == 4
-    assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
+    assert row_diff.full_match_pct == 82.35


This test was identical to test_grain_check (just parameterized to run across all engine adapters), so it had to be modified in the same way.

georgesittas

Great PR description! :)


Fix(table_diff): Correctly handle joins with composite keys where one or more of the key fields are null

f0d3cc9

erindru force-pushed the erin/table-diff-null-grain branch from f9ae4b2 to f0d3cc9 Compare July 24, 2025 05:19


themisvaltinos approved these changes Jul 24, 2025


georgesittas approved these changes Jul 24, 2025


Add extra assertions

9cc8812

erindru merged commit 47706cf into main Jul 24, 2025
27 checks passed

erindru deleted the erin/table-diff-null-grain branch July 24, 2025 21:49

erindru merged 2 commits into main from erin/table-diff-null-grain