Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve CI failure by converting X_np to X_pylist and mapping indices to spark ids #670

Merged
merged 1 commit into from
Jun 12, 2024

Conversation

lijinf2
Copy link
Collaborator

@lijinf2 lijinf2 commented Jun 11, 2024

The CI failure appears in pyspark 3.3.4 and python 3.9.

Seems pyspark.createDataFrame does not recognize numpy array when in 3.3.4.

Also fix an error related to non-continuous spark ids.

And improve the tolerance of two neighbors having the same distance to a query.

…ndices to spark ids

Signed-off-by: Jinfeng <jinfengl@nvidia.com>
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Jun 11, 2024

build

@@ -81,7 +86,6 @@ def py_func(id: int) -> List[int]:
assert_knn_equal(knn_df, "id", sg_distances, sg_indices)


# @pytest.mark.slow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was a comment and the test case does not take too long (10s). Let me know if marking it slow is better.

assert array_equal(mg_distances, distances)

# set total_tol because two nearest neighbors may have the same distance to the query
assert array_equal(mg_indices, indices, total_tol=total_tol)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, if indices differ should check that the distances are the same. This tol could let a bug slip through.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a follow-up PR to check the returned distances. The checking seemed redundant due to the line "assert array_equal(mg_distances, distances)" which already gets returned distances checked. Perhaps we should recalculate the distance according to indices and then check whether the recalculated distances are equal. Quite troublesome if so but let me know if your comments meant something else.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. I agree that the returned minimum distances matching is a good check and may be sufficient. Is there then a need to check the indices at all? Do we have confidence that the distances are correct for the corresponding indices?

@@ -98,7 +102,9 @@ def test_cpunn_noid() -> None:

with CleanSparkSession({}) as spark:

df = spark.createDataFrame(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this ever work?

Copy link
Collaborator Author

@lijinf2 lijinf2 Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works in pyspark 3.4.3, though it does not work in 3.3.4.

distances, indices = get_sgnn_res(X_vec, X_vec, n_neighbors)

X_sparkid = np.array(pdf[alias.row_number].to_list())
indices_maped_to_sparkid = X_sparkid[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maped -> mapped


distances, indices = get_sgnn_res(X, X, n_neighbors)
assert_knn_equal(knn_df, alias.row_number, distances, indices)
assert_knn_equal(knn_df, alias.row_number, distances, indices_maped_to_sparkid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maped -> mapped

@lijinf2 lijinf2 merged commit ea0eb9b into NVIDIA:branch-24.06 Jun 12, 2024
2 checks passed
@lijinf2 lijinf2 deleted the ci_fix branch June 14, 2024 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants