tests/core/test_table_diff.py
Outdated
assert row_diff.stats["join_count"] == 4
assert row_diff.stats["null_grain_count"] == 4
assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
assert row_diff.full_match_pct == 82.35
Note: I checked internally with @themisvaltinos and the main thing this was testing was the full_match_count (which hasn't changed).
The other values were just what happened to be output at the time, and they are not correct. This bug has been around for a while.
Maybe worth adding assertions for row_diff.partial_match_pct and row_diff.s_only_pct being 0 and row_diff.t_only_pct being 17.65,
so that the test picks up any regression if we alter the query logic in the future. Among other things, this fix also corrects the partial matches, which were negative before.
For example, before this fix, a wrong output with null columns would show the same column wrongly in both the s and t samples, and the percentages would be wrong as well:

Sure, I've added some extra assertions.
Although I'm doing what the original PR did and trusting that the current results are correct, because I don't understand some of the math used to calculate the percentages (like why do we multiply full_match_count by 2 to figure out the percentage for full_match_pct?)
Yeah, there is some magic going on. We multiply by 2 because inside _pct we divide by source count + target count. If the full match count was 10 out of 20 source and 20 target columns, this would be 10 / 40, so 25% instead of 50%. So we multiply by 2 to account for that.
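To make the arithmetic above concrete, here is a minimal sketch. The `pct` function is a hypothetical stand-in for the `_pct` helper mentioned in the comment, and the rounding to two decimals is an assumption; only the "divide by source count + target count" behaviour comes from the discussion.

```python
def pct(numerator: float, s_count: int, t_count: int) -> float:
    # Hypothetical stand-in for _pct: the denominator is source count + target count.
    denominator = s_count + t_count
    # Rounding to 2 decimals is an assumption made for this sketch.
    return round(100 * numerator / denominator, 2) if denominator else 0.0


# Numbers taken from the comment above: 10 full matches, 20 source, 20 target.
full_match_count = 10
s_count = 20
t_count = 20

# Each full match accounts for one entry on the source side and one on the
# target side, so it has to be counted twice against the combined denominator.
undoubled = pct(full_match_count, s_count, t_count)
doubled = pct(2 * full_match_count, s_count, t_count)

print(undoubled, doubled)
```

Without the doubling, a match that covers half of each side would be reported as a quarter of the combined total.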
… or more of the key fields are null
f9ae4b2 to f0d3cc9
assert row_diff.stats["join_count"] == 4
assert row_diff.stats["null_grain_count"] == 4
assert row_diff.stats["s_count"] != row_diff.stats["distinct_count_s"]
assert row_diff.full_match_pct == 82.35
This test was identical to test_grain_check (just parameterized to run across all engine adapters), so it had to be modified in the same way.
georgesittas left a comment
Great PR description! :)

Prior to this, when doing a table diff on tables with a composite key where one or more of the fields contained a null value, the results were confusing.
In particular, rows would show as fully matched but then also show up in both the source / target samples as being entirely unmatched.
The cause was "did the row join" being based on __sqlmesh_join_key, which produces a single string to join on and coalesces NULL values to '', while the "exists in source table" and "exists in target table" checks were based on older logic that said "no, it doesn't exist if any of the key columns are null".
This PR changes the "exists" checks to just check whether __sqlmesh_join_key is NULL or not, which indicates whether the row joined or not. Since we use a FULL JOIN, it will be NULL on one side of the join if there was no join, and the side it's NULL on indicates which dataset the row was present in.
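The mismatch described above can be simulated in plain Python. This is only a sketch under stated assumptions, not the actual query: the function names, the `|` separator, the two-column key and the sample rows are all hypothetical, and keys are assumed unique per table. Only the coalesce-to-'' join key, the FULL JOIN, and the two flavours of "exists" check come from the description.

```python
from typing import Optional


def join_key(key_cols: tuple) -> str:
    # Coalesce NULL (None) to '' and combine the key columns into one string.
    # The '|' separator is an assumption made for this sketch.
    return "|".join("" if v is None else str(v) for v in key_cols)


def full_join(source: list, target: list) -> list:
    # Simulated FULL JOIN on the join key: each output row carries the
    # source-side and target-side join key, None where that side is absent.
    s_keys = {join_key(r) for r in source}
    t_keys = {join_key(r) for r in target}
    return [
        (k if k in s_keys else None, k if k in t_keys else None)
        for k in sorted(s_keys | t_keys)
    ]


def old_exists(key_cols: tuple) -> bool:
    # Older logic: the row "doesn't exist" if any key column is NULL.
    return all(v is not None for v in key_cols)


def new_exists(side_join_key: Optional[str]) -> bool:
    # New logic: the row exists on a side if that side's join key is not NULL.
    return side_join_key is not None


# Hypothetical data: ("b", None) has a NULL in its composite key.
source = [("a", 1), ("b", None)]
target = [("a", 1), ("b", None), ("c", 3)]

result = {
    (s_key or t_key): (new_exists(s_key), new_exists(t_key))
    for s_key, t_key in full_join(source, target)
}
print(result)
```

In this sketch the row `("b", None)` joins on both sides, yet `old_exists` reports it as missing, which is the kind of disagreement the description attributes to the old checks. The row keyed `c|3` has a NULL source-side join key, so it is classified as present only in the target.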