Remove duplicate code in dataframe comparer #19

Merged (5 commits) on Mar 29, 2021

Conversation

MrPowers (Owner) commented Mar 27, 2021

Fixes #17
Fixes #18
Fixes #10

Comment on lines 21 to 23
if transforms is not None:
    df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
    df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)

Hmm, I would try to combine this into a single statement with the previous if transforms is None. Or, I would just remove the if statement here since it will always run.
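
Something along these lines, assuming the preceding guard is an "if transforms is None" default (that surrounding code is my guess, it isn't shown in this diff):

```python
from functools import reduce

# Sketch of the combined version: default transforms to an empty list up front,
# then apply it unconditionally. reduce() over an empty list just returns the
# original DataFrame, so the second "if transforms is not None" guard adds nothing.
if transforms is None:
    transforms = []
df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)
```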

MrPowers (Owner Author) replied:

Good catch, thanks! Fixed.

t.add_row([r1, r2])
if allRowsEqual == False:
raise DataFramesNotEqualError("\n" + t.get_string())
assert_generic_df_equality(df1, df2, are_rows_equal_enhanced, [True])
else:

Are you planning to also modify this else block to call one of the other utility functions you have? e.g. Call assert_generic_df_equality() but with slightly different arguments so it doesn't consider NaNs equal.
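
Roughly something like this (a sketch that reuses the helpers named elsewhere in this PR; the empty extra-arguments list is my assumption):

```python
# Sketch: reuse the generic comparer in the else branch, but pass are_rows_equal
# (which treats NaN != NaN) instead of are_rows_equal_enhanced, with no extra args.
assert_generic_df_equality(df1, df2, are_rows_equal, [])
```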

MrPowers (Owner Author) replied:

Good catch, created a separate assert_basic_df_equality method to make the intent of the code clearer. What I want to do now is benchmark assert_basic_df_equality vs assert_generic_df_equality(df1, df2, are_rows_equal, []) to see which is faster. Both should give the same result, and I want to give users the fastest option (because slow Spark tests are painful).
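
A rough sketch of what that benchmark could look like (the SparkSession setup, sample DataFrames, iteration count, and the assumed assert_basic_df_equality(df1, df2) signature are all illustrative; chispa import paths are left out on purpose):

```python
import timeit

from pyspark.sql import SparkSession

# Assumes assert_basic_df_equality, assert_generic_df_equality, and are_rows_equal
# from this PR are already in scope.
spark = SparkSession.builder.master("local[1]").appName("chispa-bench").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["num", "letter"])
df2 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["num", "letter"])

basic = timeit.timeit(lambda: assert_basic_df_equality(df1, df2), number=50)
generic = timeit.timeit(lambda: assert_generic_df_equality(df1, df2, are_rows_equal, []), number=50)

print(f"assert_basic_df_equality:   {basic:.2f}s")
print(f"assert_generic_df_equality: {generic:.2f}s")
```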

def assert_df_equality(df1, df2, ignore_nullable = False, transforms = [], allow_nan_equality = False):
    df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
    df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)
def assert_df_equality(df1, df2, ignore_nullable=False, transforms=None, allow_nan_equality=False, ignore_column_order=False, ignore_row_order=False):

ignore_row_order is an interesting option. AFAIK DataFrames are orderless unless they have an explicit .orderBy() tacked on to their plan. This is analogous to tables in SQL, btw.

So wouldn't you want the default to be ignore_row_order=True?

MrPowers (Owner Author) replied:

When ignore_row_order is set to True, both DataFrames are sorted before they're compared. This is slower and potentially less intuitive (because the DataFrames in your test suite appear in a different order than they do in the error message output). I think this is an important feature, especially for operations that can return results in an arbitrary order, but it shouldn't be the default. Let's chat about this more.
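
For reference, a minimal sketch of that sort-then-compare behavior (sorting on all columns is my assumption about how the sort is keyed):

```python
# Sketch: when ignore_row_order=True, sort both DataFrames on all of their columns
# before the row-by-row comparison. The sort forces a shuffle, which is the main
# reason this path is slower.
if ignore_row_order:
    df1 = df1.sort(df1.columns)
    df2 = df2.sort(df2.columns)
```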

Maybe I'm the outlier, but I consider the more intuitive check, especially for testing purposes, to be the one that ignores order. If some function produces a DataFrame that I want to check, I care about the contents. And by default, Spark offers no guarantees on row order unless your plan has an explicit .orderBy(). So relying on the stability of row order in the absence of an explicit ORDER BY clause is a recipe for surprises, much like it is in SQL.

In fact, I don't think .collect() even provides any guarantees that the row order of the resulting array will match the row order of the original DataFrame, again unless the DataFrame has an explicit ordering specified. It's theoretically possible, for example, that you could call spark.range(3).collect() twice and get different row orders each time. So if you're relying on .collect() to preserve order without explicit ordering on the original DataFrames, then I would say that's technically incorrect.
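
Just to illustrate the point (not a suggestion for how the library should implement it), an order-insensitive check can sort the collected rows locally before comparing:

```python
# Illustration: collect both DataFrames and compare the rows after sorting them
# locally, so whatever order Spark happens to return them in doesn't matter.
rows1 = sorted(df1.collect(), key=str)
rows2 = sorted(df2.collect(), key=str)
assert rows1 == rows2
```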

By the way, in your own usages of this library (or the Scala equivalent), how often do you compare DataFrames where you care about the row order? I'm curious to see a few examples of that.

MrPowers (Owner Author) replied:

Created a separate issue to explore this suggestion.
