Support CPUNearestNeighbor for benchmarking exact nearest neighbors. #655
Conversation
from pyspark.sql.functions import udf

spark_func = udf(py_func, "array<float>")
df = spark.range(len(X)).select("id", spark_func("id").alias("features"))
Any advantage to doing it this way vs. createDataFrame from a pandas df?
It seems this approach does not trigger the "task size larger than 1000k" warning on large datasets, but it looks no different on small datasets.
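To illustrate the pattern under discussion, here is a minimal local sketch of the udf-based construction. `py_func`, `X`, and the float32 feature matrix are assumptions about the benchmark code, and the Spark-specific behavior is described only in comments so the closure logic can be checked without a cluster.

```python
import numpy as np

# Assumption: X is the benchmark's feature matrix and py_func maps a row id to
# its feature vector, mirroring udf(py_func, "array<float>") in the diff above.
X = np.random.rand(100, 3).astype(np.float32)

def py_func(row_id):
    # Return the feature vector for one id, as the Spark UDF would per row.
    return X[row_id].tolist()

# Under Spark, wrapping py_func with udf(...) ships X to executors inside the
# serialized closure, rather than encoding every row into the tasks the way
# spark.createDataFrame(pandas_df) does -- plausibly why the "task size"
# warning disappears on large datasets.
vec = py_func(0)
```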
def cache_df(dfA: DataFrame, dfB: DataFrame) -> Tuple[DataFrame, DataFrame]:
    dfA = dfA.cache()
    dfB = dfB.cache()
    dfA.count()
Did you verify that count actually caches the dataframe? I think sometimes it can be short-circuited via metadata (e.g. parquet files).
Revised.
Signed-off-by: Jinfeng <jinfengl@nvidia.com>
    yield pd.DataFrame({"dummy": [1]})

dfA.mapInPandas(func_dummy, schema="dummy int").count()
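A runnable sketch of the mapInPandas-based materialization, without Spark: the full body of `func_dummy` is not shown in the diff, so the iterator-draining loop below is an assumption, and a plain iterator of pandas DataFrames stands in for a Spark partition.

```python
import pandas as pd

# Sketch of the mapInPandas callable above. func_dummy receives each partition
# as an iterator of pandas DataFrames; draining the iterator (an assumption
# about the elided body) forces every batch to be computed, and one dummy row
# is yielded so the outer count() stays cheap.
def func_dummy(pdf_iter):
    for _ in pdf_iter:
        pass  # consume each batch to materialize the partition
    yield pd.DataFrame({"dummy": [1]})

# Stand-in for one Spark partition: an iterator of two pandas batches.
batches = iter([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})])
result = list(func_dummy(batches))
```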
Better to avoid Python UDFs for this kind of thing, but this is probably ok.
👍
The CPU LSH will be moved to bench_approx_nearest_neighbors.py for benchmarking against GPU IVFFlat.