Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get all composite key candidates function added #58

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

souvik-databricks
Copy link
Collaborator

Added the function to get all of the possible composite key candidates for a delta table

if exclude_cols is None:
exclude_cols = []
df_col_excluded = df.drop(*exclude_cols)
col_select_condition = df_col_excluded.distinct().count()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@souvik-databricks why does it need to distinct? If .distinct().count() < .count(), there is no unique combination, or?

Copy link
Collaborator Author

@souvik-databricks souvik-databricks Jan 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertkossendey .distinct() is required because if your data has duplicates in them and then the collect_set() in line 401 will matchup and we have only unique combination of values for the columns.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@souvik-databricks still struggling to understand, could you add a test with a completely duplicate line? Because that should return no combination at all right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertkossendey So in that scenario if you have a case of duplicate record then that would indicate that from the columns that you have included to search in, you don't have valid composite key combos. You would have to include more columns to act as part of the composite key.

If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.

And I will add a test with duplicate record with the explanation of the output for that test case in the readme.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.

@souvik-databricks but isn't that the information that there is no valid key candidate in the provided dataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was OOO for last week. @robertkossendey I am back now, I will resume the work now.

@MrPowers
Copy link
Owner

MrPowers commented Jan 1, 2023

This looks like a great start. Can you add more tests for the edge cases like when the result is empty, when the DataFrame contains null values, etc?

column for column in df_col_excluded.columns if
df_col_excluded.select(column).collect()[0][0] == col_select_condition
]
df_col_excluded.select("column_combos ->", *columns).show(truncate=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@souvik-databricks we can remove that show, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can remove the show

@MrPowers
Copy link
Owner

@souvik-databricks - let me know when you've addressed the feedback from @robertkossendey and this is ready for another review, thank you!

@MrPowers
Copy link
Owner

MrPowers commented Mar 5, 2023

@souvik-databricks - following up on this one. Are you planning to address this?

@souvik-databricks
Copy link
Collaborator Author

@MrPowers Let's go through this details over our sync.

@MrPowers
Copy link
Owner

MrPowers commented Oct 1, 2023

@souvik-databricks - anything else we need to push this over the finish line?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

find_composite_key_candidates should allow users to return all possible candidate keys
3 participants