
Should min_hash and min_hash_spark return the same result? #28

Closed
nguyenhuuthuat09 opened this issue Jul 26, 2023 · 2 comments

nguyenhuuthuat09 commented Jul 26, 2023

Hi, thank you for your great work!

As the title says, I wonder whether these two scripts return the same result.

I tested both scripts with the default configuration on the same dataset (Oscar 2201 - gl), but their output datasets differ: they have the same number of rows after removing duplicates, but the content is different.

Also, the number of rows I get after deduplication is different from your example. But maybe that isn't a problem, because I tried the code on several different machines and they all output the same number of rows.

ChenghaoMou (Owner) commented

Great question!

Unfortunately, there is some randomness I cannot control, especially in the Spark version:

```python
df = df.withColumn("__id__", F.monotonically_increasing_id()).cache()
```

The above line in the Spark version assigns ids non-deterministically (e.g. [1, 2, 50, 51, 53, 54]), although they are still monotonically increasing. This means which duplicate within a cluster gets removed can vary from run to run. The number of removed duplicate documents should still be the same for the same algorithm, though.
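To make the non-determinism concrete, here is a minimal, self-contained sketch (not from this repository; the toy data and local master are assumptions) showing that the generated ids encode the partition, so they are non-contiguous and depend on how the data is partitioned:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Hypothetical toy data split across two partitions.
df = spark.range(6).repartition(2)
df = df.withColumn("__id__", F.monotonically_increasing_id())
df.show()
# monotonically_increasing_id() puts the partition index in the upper bits,
# so the ids look like [0, 1, 2, 8589934592, 8589934593, 8589934594]:
# monotonically increasing within a partition, but non-contiguous overall,
# and they change whenever the partitioning changes.
```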

If your dataset has an index column, you can modify the code to use that index, in which case, it should give you the same results.
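For example, a hypothetical one-line change (assuming the dataset already carries a stable index column named `id`) could look like:

```python
from pyspark.sql import functions as F

# Reuse the dataset's own stable index instead of generating one, so the
# tie-breaking within each duplicate cluster is deterministic across runs.
df = df.withColumn("__id__", F.col("id").cast("long")).cache()
```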

nguyenhuuthuat09 (Author) commented

Thank you so much for the suggestion.

Actually, I realized that I was wrong. When I use your original code, both minhash and minhash_spark give the same result. Previously I didn't sort the results of minhash_spark by id, so I thought the two scripts gave different results.

With other, larger datasets or a different Spark cluster, the results may differ because, as you mentioned, the monotonically_increasing_id function is non-deterministic. I just want to correct my mistake so that others are not misled.

I also tried switching to the dataset's id column as you suggested, and the results of the two scripts are identical.
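For anyone comparing the two outputs, a sketch of the order-insensitive check described above (the column name "id" and the pandas export are assumptions, not part of the repository) might look like:

```python
import pandas as pd

def same_content(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Compare two deduplicated outputs, ignoring row order.

    Assumes both outputs were exported to pandas with a shared "id" column;
    the Spark output's row order is not deterministic, so sort first.
    """
    a = a.sort_values("id").reset_index(drop=True)
    b = b.sort_values("id").reset_index(drop=True)
    return a.equals(b)
```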

Once again, thank you so much for your great work!
