As the title says, I wonder whether these two scripts return the same result.
I tested both scripts with the default configuration on the same dataset (OSCAR 2201, gl subset), but their outputs differ: the output datasets have the same number of rows after removing duplicates, but the content is different.
Also, the number of rows I get after deduplication differs from your example. Maybe that isn't a problem, though, since I ran the code on several different machines and they all produced the same row count.
The above line in the Spark version assigns IDs non-deterministically (e.g. [1, 2, 50, 51, 53, 54]), although they are still monotonically increasing. As a result, which duplicate within a cluster gets removed is effectively random, but the number of removed duplicate documents should be the same for the same algorithm.
If your dataset has an index column, you can modify the code to use that index instead, in which case it should give you the same results.
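To make the effect concrete, here is a minimal plain-Python sketch (not PySpark, and not the repository's actual code) under the assumption that deduplication keeps the row with the smallest assigned ID in each cluster of near-duplicates. When row order varies between runs, `monotonically_increasing_id`-style IDs land on different rows, so a different member of each cluster survives, while the number of removed rows stays the same:

```python
def dedup(rows_with_ids):
    """Keep, per cluster, the row with the smallest assigned ID.

    rows_with_ids: list of (assigned_id, text, cluster_label) tuples.
    Returns (sorted surviving texts, number of rows removed).
    """
    best = {}  # cluster_label -> (assigned_id, text)
    for rid, text, cluster in rows_with_ids:
        if cluster not in best or rid < best[cluster][0]:
            best[cluster] = (rid, text)
    kept = sorted(text for _, text in best.values())
    return kept, len(rows_with_ids) - len(best)

# Same three rows; "a1" and "a2" are near-duplicates (cluster 0).
# Run 2 reads the rows in a different order, so the IDs attach differently.
run1 = [(0, "a1", 0), (1, "a2", 0), (2, "b1", 1)]
run2 = [(0, "a2", 0), (1, "a1", 0), (2, "b1", 1)]

kept1, removed1 = dedup(run1)  # kept1 == ["a1", "b1"], removed1 == 1
kept2, removed2 = dedup(run2)  # kept2 == ["a2", "b1"], removed2 == 1
```

The removed-row count matches across runs, but the surviving content differs — exactly the symptom described in this issue. Using a stable index column pins each ID to a fixed row, which removes the run-to-run variation.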
Actually, I realized that I was wrong. When I use your original code, both minhash and minhash_spark give the same result. Previously I didn't sort the results of minhash_spark by id, so I thought the two scripts produced different results.
With other, larger datasets or a different Spark cluster, the results may still differ because, as you mentioned, the monotonically_increasing_id function is non-deterministic. I just want to correct my mistake so that others aren't misled.
I also tried using the dataset's id column as you suggested, and the results of the two scripts are the same.
Once again, thank you so much for your great work!