Remove duplicate code in dataframe comparer #19

Merged (5 commits) on Mar 29, 2021

Conversation

MrPowers (Owner) commented Mar 27, 2021

Fixes #17
Fixes #18
Fixes #10

Comment on lines 21 to 23
if transforms is not None:
    df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
    df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)

Hmm, I would try to combine this into a single statement with the previous if transforms is None. Or, I would just remove the if statement here since it will always run.
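
Something along these lines, assuming the preceding guard is an "if transforms is None" default (that surrounding code is my guess, it isn't shown in this diff):

```python
from functools import reduce

# Sketch of the combined version: default transforms to an empty list up front,
# then apply it unconditionally. reduce() over an empty list just returns the
# original DataFrame, so the second "if transforms is not None" guard adds nothing.
if transforms is None:
    transforms = []
df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)
```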

MrPowers (Owner Author) replied:

Good catch, thanks! Fixed.

t.add_row([r1, r2])
if allRowsEqual == False:
raise DataFramesNotEqualError("\n" + t.get_string())
assert_generic_df_equality(df1, df2, are_rows_equal_enhanced, [True])
else:

Are you planning to also modify this else block to call one of the other utility functions you have? e.g. Call assert_generic_df_equality() but with slightly different arguments so it doesn't consider NaNs equal.
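
Roughly something like this (a sketch that reuses the helpers named elsewhere in this PR; the empty extra-arguments list is my assumption):

```python
# Sketch: reuse the generic comparer in the else branch, but pass are_rows_equal
# (which treats NaN != NaN) instead of are_rows_equal_enhanced, with no extra args.
assert_generic_df_equality(df1, df2, are_rows_equal, [])
```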

MrPowers (Owner Author) replied:

Good catch, created a separate assert_basic_df_equality method to make the intent of the code clearer. What I want to do now is benchmark assert_basic_df_equality vs assert_generic_df_equality(df1, df2, are_rows_equal, []) to see which is faster. Both should give the same result, and I want to give users the fastest option (because slow Spark tests are painful).
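
A rough sketch of what that benchmark could look like (the SparkSession setup, sample DataFrames, iteration count, and the assumed assert_basic_df_equality(df1, df2) signature are all illustrative; chispa import paths are left out on purpose):

```python
import timeit

from pyspark.sql import SparkSession

# Assumes assert_basic_df_equality, assert_generic_df_equality, and are_rows_equal
# from this PR are already in scope.
spark = SparkSession.builder.master("local[1]").appName("chispa-bench").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["num", "letter"])
df2 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["num", "letter"])

basic = timeit.timeit(lambda: assert_basic_df_equality(df1, df2), number=50)
generic = timeit.timeit(lambda: assert_generic_df_equality(df1, df2, are_rows_equal, []), number=50)

print(f"assert_basic_df_equality:   {basic:.2f}s")
print(f"assert_generic_df_equality: {generic:.2f}s")
```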

def assert_df_equality(df1, df2, ignore_nullable = False, transforms = [], allow_nan_equality = False):
    df1 = reduce(lambda acc, fn: fn(acc), transforms, df1)
    df2 = reduce(lambda acc, fn: fn(acc), transforms, df2)
def assert_df_equality(df1, df2, ignore_nullable=False, transforms=None, allow_nan_equality=False, ignore_column_order=False, ignore_row_order=False):

ignore_row_order is an interesting option. AFAIK DataFrames are orderless unless they have an explicit .orderBy() tacked on to their plan. This is analogous to tables in SQL, btw.

So wouldn't you want the default to be ignore_row_order=True?

MrPowers (Owner Author) replied:

When ignore_row_order is set to True, both DataFrames are sorted before they're compared. This is slower and potentially less intuitive (because the DataFrames in your test suite appear in a different order than they do in the error message output). I think this is an important feature, especially for operations that can return results in an arbitrary order, but it shouldn't be the default. Let's chat about this more.
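
For reference, a minimal sketch of that sort-then-compare behavior (sorting on all columns is my assumption about how the sort is keyed):

```python
# Sketch: when ignore_row_order=True, sort both DataFrames on all of their columns
# before the row-by-row comparison. The sort forces a shuffle, which is the main
# reason this path is slower.
if ignore_row_order:
    df1 = df1.sort(df1.columns)
    df2 = df2.sort(df2.columns)
```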

Maybe I'm the outlier, but I consider the more intuitive check, especially for testing purposes, to be the one that ignores order. If some function produces a DataFrame that I want to check, I care about the contents. And by default, Spark offers no guarantees on row order unless your plan has an explicit .orderBy(). So relying on the stability of row order in the absence of an explicit ORDER BY clause is a recipe for surprises, much like it is in SQL.

In fact, I don't think .collect() even provides any guarantees that the row order of the resulting array will match the row order of the original DataFrame, again unless the DataFrame has an explicit ordering specified. It's theoretically possible, for example, that you could call spark.range(3).collect() twice and get different row orders each time. So if you're relying on .collect() to preserve order without explicit ordering on the original DataFrames, then I would say that's technically incorrect.
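
Just to illustrate the point (not a suggestion for how the library should implement it), an order-insensitive check can sort the collected rows locally before comparing:

```python
# Illustration: collect both DataFrames and compare the rows after sorting them
# locally, so whatever order Spark happens to return them in doesn't matter.
rows1 = sorted(df1.collect(), key=str)
rows2 = sorted(df2.collect(), key=str)
assert rows1 == rows2
```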

By the way, in your own usages of this library (or the Scala equivalent), how often do you compare DataFrames where you care about the row order? I'm curious to see a few examples of that.

MrPowers (Owner Author) replied:

Created a separate issue to explore this suggestion.
