-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve CI failure by converting X_np to X_pylist and mapping indices to spark ids #670
Conversation
…ndices to spark ids Signed-off-by: Jinfeng <jinfengl@nvidia.com>
build |
@@ -81,7 +86,6 @@ def py_func(id: int) -> List[int]: | |||
assert_knn_equal(knn_df, "id", sg_distances, sg_indices) | |||
|
|||
|
|||
# @pytest.mark.slow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a comment and the test case does not take too long (10s). Let me know if marking it slow is better.
assert array_equal(mg_distances, distances) | ||
|
||
# set total_tol because two nearest neighbors may have the same distance to the query | ||
assert array_equal(mg_indices, indices, total_tol=total_tol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, if indices differ should check that the distances are the same. This tol could let a bug slip through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a follow-up PR to check the returned distances. The checking seemed redundant due to the line "assert array_equal(mg_distances, distances)" which already gets returned distances checked. Perhaps we should recalculate the distance according to indices and then check whether the recalculated distances are equal. Quite troublesome if so but let me know if your comments meant something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. I agree that the returned minimum distances matching is a good check and may be sufficient. Is there then a need to check the indices at all? Do we have confidence that the distances are correct for the corresponding indices?
@@ -98,7 +102,9 @@ def test_cpunn_noid() -> None: | |||
|
|||
with CleanSparkSession({}) as spark: | |||
|
|||
df = spark.createDataFrame(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this ever work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it works in pyspark 3.4.3, though it does not work in 3.3.4.
distances, indices = get_sgnn_res(X_vec, X_vec, n_neighbors) | ||
|
||
X_sparkid = np.array(pdf[alias.row_number].to_list()) | ||
indices_maped_to_sparkid = X_sparkid[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maped -> mapped
|
||
distances, indices = get_sgnn_res(X, X, n_neighbors) | ||
assert_knn_equal(knn_df, alias.row_number, distances, indices) | ||
assert_knn_equal(knn_df, alias.row_number, distances, indices_maped_to_sparkid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maped -> mapped
The CI failure appears in pyspark 3.3.4 and python 3.9.
Seems pyspark.createDataFrame does not recognize numpy array when in 3.3.4.
Also fix an error related to non-continuous spark ids.
And improve the tolerance of two neighbors having the same distance to a query.