
Should min_hash and min_hash_spark return the same result? #28

Closed
nguyenhuuthuat09 opened this issue Jul 26, 2023 · 2 comments

nguyenhuuthuat09 commented Jul 26, 2023

Hi, thank you for your great work!

As the title says, I wonder whether these two scripts return the same result.

I tested both scripts with the default configuration on the same dataset (Oscar 2201 - gl), but their output datasets differ: they have the same number of rows after removing duplicates, but the content is different.

Also, the number of rows I get after deduplication is different from your example. But maybe that isn't a problem, because I tried the code on several different machines and they all output the same number of rows.

ChenghaoMou (Owner) commented

Great question!

Unfortunately, there is some randomness I cannot control, especially in the Spark version:

```python
df = df.withColumn("__id__", F.monotonically_increasing_id()).cache()
```

The above line in the Spark version assigns ids non-deterministically (e.g. [1, 2, 50, 51, 53, 54]), although they are still monotonically increasing. This means which duplicate within a cluster gets removed can vary from run to run. The number of removed duplicate documents should still be the same for the same algorithm, though.
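To make the non-determinism concrete, here is a minimal, self-contained sketch (not from this repository; the toy data and local master are assumptions) showing that the generated ids encode the partition, so they are non-contiguous and depend on how the data is partitioned:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Hypothetical toy data split across two partitions.
df = spark.range(6).repartition(2)
df = df.withColumn("__id__", F.monotonically_increasing_id())
df.show()
# monotonically_increasing_id() puts the partition index in the upper bits,
# so the ids look like [0, 1, 2, 8589934592, 8589934593, 8589934594]:
# monotonically increasing within a partition, but non-contiguous overall,
# and they change whenever the partitioning changes.
```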

If your dataset has an index column, you can modify the code to use that index, in which case, it should give you the same results.
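For example, a hypothetical one-line change (assuming the dataset already carries a stable index column named `id`) could look like:

```python
from pyspark.sql import functions as F

# Reuse the dataset's own stable index instead of generating one, so the
# tie-breaking within each duplicate cluster is deterministic across runs.
df = df.withColumn("__id__", F.col("id").cast("long")).cache()
```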

nguyenhuuthuat09 (Author) commented

Thank you so much for the suggestion.

Actually, I realized that I was wrong. When I use your original code, both minhash and minhash_spark give the same result. Previously I didn't sort the results of minhash_spark by id, so I thought the two scripts gave different results.

With other, larger datasets or a different Spark cluster, the results may differ because, as you mentioned, the monotonically_increasing_id function is non-deterministic. I just want to correct my mistake so that others are not misled.

I also tried switching to the dataset's id column as you suggested, and the results of the two scripts are identical.
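For anyone comparing the two outputs, a sketch of the order-insensitive check described above (the column name "id" and the pandas export are assumptions, not part of the repository) might look like:

```python
import pandas as pd

def same_content(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Compare two deduplicated outputs, ignoring row order.

    Assumes both outputs were exported to pandas with a shared "id" column;
    the Spark output's row order is not deterministic, so sort first.
    """
    a = a.sort_values("id").reset_index(drop=True)
    b = b.sort_values("id").reset_index(drop=True)
    return a.equals(b)
```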

Once again, thank you so much for your great work!
