[SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side

### What changes were proposed in this pull request?
Always set the seed of `Dataframe.sample` on the client side.

### Why are the changes needed?
Bug fix.

If the seed is not set in the client, it is set on the server side with a random int

https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386

which causes inconsistent results across multiple executions of the same DataFrame.

In Spark Classic:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect:

before:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
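
### Illustrative sketch (not part of the patch)
The behavior described above can be reproduced without Spark. The following is a minimal sketch under stated assumptions, not the actual PySpark change: `SamplePlan`, `make_plan`, and `execute_count` are hypothetical names, and the "server" is simulated by a plain function that draws a fresh seed on every execution when the plan carries none. It only illustrates why fixing the seed when the plan is built makes repeated executions agree.

```python
import random
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplePlan:
    """Hypothetical stand-in for a sample relation sent from client to server."""
    n_rows: int
    fraction: float
    seed: Optional[int]  # None means the executor has to pick one


def make_plan(n_rows: int, fraction: float, seed: Optional[int] = None,
              fix_seed_on_client: bool = True) -> SamplePlan:
    # Idea of the fix: choose the seed once, at plan construction time.
    if seed is None and fix_seed_on_client:
        seed = random.randint(0, sys.maxsize)
    return SamplePlan(n_rows, fraction, seed)


def execute_count(plan: SamplePlan) -> int:
    # Simulated server: with no seed in the plan, a new one is drawn per execution.
    seed = plan.seed if plan.seed is not None else random.randint(0, sys.maxsize)
    rng = random.Random(seed)
    # Bernoulli sampling: keep each row with probability `fraction`.
    return sum(1 for _ in range(plan.n_rows) if rng.random() < plan.fraction)


# Seed fixed when the plan is built: every execution sees the same seed.
seeded = make_plan(10000, 0.1)
seeded_counts = [execute_count(seeded) for _ in range(10)]

# Seed left unset: each execution draws its own seed.
unseeded = make_plan(10000, 0.1, fix_seed_on_client=False)
unseeded_counts = [execute_count(unseeded) for _ in range(10)]

print(seeded_counts)    # ten identical values
print(unseeded_counts)  # values typically vary from one execution to the next
```

The first list mirrors the "after" output and the second mirrors the "before" output; the exact counts depend on the seeds drawn and will not match the numbers in the description.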