Support CPUNearestNeighbor for benchmarking exact nearest neighbors. #655
Conversation
from pyspark.sql.functions import udf

spark_func = udf(py_func, "array<float>")
df = spark.range(len(X)).select("id", spark_func("id").alias("features"))
Any advantage to doing it this way vs. createDataFrame from a pandas df?
It seems this approach does not trigger the "task size larger than 1000k" warning on large datasets, but it looks no different on small datasets.
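To illustrate the pattern under discussion, here is a minimal local sketch of the udf-based construction. `py_func`, `X`, and the float32 feature matrix are assumptions about the benchmark code, and the Spark-specific behavior is described only in comments so the closure logic can be checked without a cluster.

```python
import numpy as np

# Assumption: X is the benchmark's feature matrix and py_func maps a row id to
# its feature vector, mirroring udf(py_func, "array<float>") in the diff above.
X = np.random.rand(100, 3).astype(np.float32)

def py_func(row_id):
    # Return the feature vector for one id, as the Spark UDF would per row.
    return X[row_id].tolist()

# Under Spark, wrapping py_func with udf(...) ships X to executors inside the
# serialized closure, rather than encoding every row into the tasks the way
# spark.createDataFrame(pandas_df) does -- plausibly why the "task size"
# warning disappears on large datasets.
vec = py_func(0)
```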
def cache_df(dfA: DataFrame, dfB: DataFrame) -> Tuple[DataFrame, DataFrame]:
    dfA = dfA.cache()
    dfB = dfB.cache()
    dfA.count()
Did you verify that count actually caches the dataframe? I think sometimes it can be short-circuited via metadata (e.g. parquet files).
Revised.
Signed-off-by: Jinfeng <jinfengl@nvidia.com>
    yield pd.DataFrame({"dummy": [1]})

dfA.mapInPandas(func_dummy, schema="dummy int").count()
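A runnable sketch of the mapInPandas-based materialization, without Spark: the full body of `func_dummy` is not shown in the diff, so the iterator-draining loop below is an assumption, and a plain iterator of pandas DataFrames stands in for a Spark partition.

```python
import pandas as pd

# Sketch of the mapInPandas callable above. func_dummy receives each partition
# as an iterator of pandas DataFrames; draining the iterator (an assumption
# about the elided body) forces every batch to be computed, and one dummy row
# is yielded so the outer count() stays cheap.
def func_dummy(pdf_iter):
    for _ in pdf_iter:
        pass  # consume each batch to materialize the partition
    yield pd.DataFrame({"dummy": [1]})

# Stand-in for one Spark partition: an iterator of two pandas batches.
batches = iter([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})])
result = list(func_dummy(batches))
```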
Better to avoid Python UDFs for this kind of thing, but this is probably ok.
👍
The CPU LSH will be moved to bench_approx_nearest_neighbors.py for benchmarking against GPU IVFFlat.