[SPARK-52588][SQL] Approx_top_k: accumulate and estimate #51308

yhuang-db · 2025-06-27T18:36:41Z

What changes were proposed in this pull request?

This PR adds two SQL functions: approx_top_k_accumulate, an aggregate function that accumulates input data into a sketch, and approx_top_k_estimate, an expression that estimates the top k most frequent items from a sketch.

1. approx_top_k_accumulate

Syntax

approx_top_k_accumulate(expr[, maxItemsTracked])

Arguments

expr: An expression of BOOLEAN, BINARY, STRING, DATE, TIMESTAMP or numeric type.
maxItemsTracked: An optional INTEGER literal. If maxItemsTracked is not specified, it defaults to 10000. This is the maximum number of distinct values that can be tracked by the sketch.

Returns

This function returns a STRUCT with three fields: (1) Sketch, the BINARY form of the sketch state; (2) ItemTypeNull, a null value whose type records the original type of expr; and (3) MaxItemsTracked, the value of the maxItemsTracked argument.
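To make the shape of the returned STRUCT concrete, here is a minimal sketch of the state type built with Spark's public type classes. The field names follow the description above and may differ from the merged code (a review comment on this PR asks for camelCase names). The object name SketchStateSchema and the helper looksLikeState are hypothetical and are not part of this PR.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper object, for illustration only.
object SketchStateSchema {
  // The three-field state described above. The second field is always null
  // and exists only so that its type records the original type of expr.
  def stateType(itemDataType: DataType): StructType =
    StructType(
      StructField("Sketch", BinaryType, nullable = false) ::
      StructField("ItemTypeNull", itemDataType) ::
      StructField("MaxItemsTracked", IntegerType, nullable = false) :: Nil)

  // A loose structural check: three fields, the first BINARY, the last INTEGER.
  // Field names are deliberately not compared here.
  def looksLikeState(dt: DataType): Boolean = dt match {
    case StructType(Array(sketch, _, max)) =>
      sketch.dataType == BinaryType && max.dataType == IntegerType
    case _ => false
  }
}
```

A check along these lines is one way approx_top_k_estimate could reject a state argument that did not come from approx_top_k_accumulate.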

2. approx_top_k_estimate

Syntax

approx_top_k_estimate(state[, k])

Arguments

state: An expression for the sketch STRUCT that is generated by approx_top_k_accumulate or approx_top_k_combine.
k: An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5.

Returns

Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item field for the value (with its original input type) and a count field (of type LONG) with the approximate number of occurrences. The array is sorted by count descending.
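The two-phase contract described above (accumulate into a state, then estimate the top k from that state) can be modeled in plain Scala. This is only an illustration under stated assumptions: exact counting in a Map stands in for the approximate sketch, and the names TopKState and ApproxTopKModel are hypothetical rather than taken from this PR.

```scala
// Hypothetical stand-in for the sketch state; the PR stores a BINARY sketch instead.
final case class TopKState[T](counts: Map[T, Long], maxItemsTracked: Int)

object ApproxTopKModel {
  // Accumulate phase: fold all input items into a single state.
  // Because this model counts exactly, it simply rejects inputs with more
  // distinct values than maxItemsTracked instead of approximating.
  def accumulate[T](items: Seq[T], maxItemsTracked: Int = 10000): TopKState[T] = {
    require(maxItemsTracked > 0, "maxItemsTracked must be positive")
    val counts = items.groupBy(identity).map { case (item, group) =>
      (item, group.size.toLong)
    }
    require(counts.size <= maxItemsTracked, "too many distinct items for this exact model")
    TopKState(counts, maxItemsTracked)
  }

  // Estimate phase: return at most k (item, count) pairs, sorted by count descending.
  // The relative order of items with equal counts is not specified.
  def estimate[T](state: TopKState[T], k: Int = 5): Seq[(T, Long)] = {
    require(k > 0, "k must be greater than 0")
    state.counts.toSeq.sortBy { case (_, count) => -count }.take(k)
  }
}
```

Calling ApproxTopKModel.estimate(ApproxTopKModel.accumulate(Seq(0, 0, 1, 1, 2, 3, 4, 4), 100), 10) yields five pairs, with the three items of count 2 ahead of the two items of count 1.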

Summary of changes:

Tests:

DataFrameAggregateSuite.scala
- End-to-end SQL query tests that chain the two functions as approx_top_k_estimate(approx_top_k_accumulate(expr, maxItemsTracked), k).
ApproxTopKSuite.scala
- Negative expression tests for invalid parameters.

Implementation:

ApproxTopKAggregates.scala
- approx_top_k_accumulate
ApproxTopKExpressions.scala
- approx_top_k_estimate

Why are the changes needed?

They are useful sibling functions for approx_top_k queries.

Does this PR introduce any user-facing change?

Yes, this PR introduces two new user-facing SQL functions. See the example below.

> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr, 100), 10) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
 [{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]
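Because the accumulated state is an ordinary STRUCT value, the two steps can also be separated, for example by naming the state in a CTE before estimating from it. The following is a minimal sketch of that pattern from Scala. It assumes a Spark build that already includes the functions added by this PR; the object name, the CTE name sketches, and the column alias top_items are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver program, for illustration only.
object ApproxTopKExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("approx-top-k-example")
      .getOrCreate()
    try {
      // Step 1 (CTE): accumulate the input column into a sketch state.
      // Step 2 (outer query): estimate the top 3 items from that state.
      val topItems = spark.sql(
        """
          |WITH sketches AS (
          |  SELECT approx_top_k_accumulate(expr, 100) AS state
          |  FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr)
          |)
          |SELECT approx_top_k_estimate(state, 3) AS top_items FROM sketches
          |""".stripMargin)
      topItems.show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```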

How was this patch tested?

Unit tests for end-to-end SQL queries and invalid input for expressions.

Was this patch authored or co-authored using generative AI tooling?

N/A

gengliangwang · 2025-07-08T22:35:29Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ApproxTopKExpressions.scala

+      defaultCheck
+    } else if (!k.foldable) {
+      TypeCheckFailure("K must be a constant literal")
+    } else {


shall we also check the StructType of state?

Also, let's add test for this.

gengliangwang · 2025-07-08T22:59:13Z

...rc/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKAggregates.scala

+
+  def getSketchStateDataType(itemDataType: DataType): StructType =
+    StructType(
+      StructField("Sketch", BinaryType, nullable = false) ::


Sketch => sketch. let's use camelCase

gengliangwang · 2025-07-08T23:15:54Z

...rc/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKAggregates.scala

+ * and the maximum number of items tracked by the sketch.
+ *
+ * @param expr            the child expression to accumulate items from
+ * @param maxItemsTracked the maximum number of items to track in the sketch


Let's also add doc for mutableAggBufferOffset and inputAggBufferOffset

gengliangwang · 2025-07-08T23:16:48Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKSuite.scala

+        k = Literal(10),
+        maxItemsTracked = Literal(10000)
+      )
+      assert(agg.checkInputDataTypes().isFailure)


let's also check the failure message

gengliangwang · 2025-07-08T23:16:55Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKSuite.scala

+      k = Sum(BoundReference(1, IntegerType, nullable = true)),
+      maxItemsTracked = Literal(10)
+    )
+    assert(badAgg.checkInputDataTypes().isFailure)


let's also check the failure message

gengliangwang · 2025-07-09T00:12:38Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

@@ -2931,6 +2904,112 @@ class DataFrameAggregateSuite extends QueryTest
      res,
      Row(LocalTime.of(22, 1, 0), LocalTime.of(3, 0, 0)))
  }
+
+  test("SPARK-52588: accumulate and estimate of Integer with default parameters") {


shall we move the new tests related to approx_top_k_* to a new test suite?

gengliangwang · 2025-07-09T00:17:47Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

@@ -2931,6 +2904,112 @@ class DataFrameAggregateSuite extends QueryTest
      res,
      Row(LocalTime.of(22, 1, 0), LocalTime.of(3, 0, 0)))
  }
+
+  test("SPARK-52588: accumulate and estimate of Integer with default parameters") {


Let's also test inputs with different data types, similar to the approx_top_k function.

github-actions bot added the SQL label Jun 27, 2025

yhuang-db changed the title ~~Spark 52588~~ [SPARK-52588][SQL] Approx_top_k: accumulate, combine, estimate Jun 27, 2025

yhuang-db added 9 commits July 3, 2025 12:48

init ApproxTopKAccumulate

bbd8ae4

init ApproxTopKCombine, combineSizeSpecified undone

9015bbb

init ApproxTopKCombine, combineSizeSpecified undone

f369de7

init ApproxTopKEstimate

1f319cf

init accumulate and estimate tests

c09fd83

unfinished estimate null check

612f8d8

fix estimate null check

03e432d

estimate and accumulate invalid parameter test

88cae20

remove combine for PR

ccfc661

yhuang-db force-pushed the SPARK-52588 branch from ba2d16d to ccfc661 Compare July 3, 2025 19:51

yhuang-db added 5 commits July 3, 2025 12:53

remove combine for PR

d89c121

separate expression suite and query suite

b95ff7a

add expression doc

edaf18e

add accumulation doc

b3811a7

nit doc

7e1e519

yhuang-db changed the title ~~[SPARK-52588][SQL] Approx_top_k: accumulate, combine, estimate~~ [SPARK-52588][SQL] Approx_top_k: accumulate and estimate Jul 3, 2025

yhuang-db marked this pull request as ready for review July 3, 2025 21:15

yhuang-db added 7 commits July 3, 2025 17:25

update expression type check test

a9153aa

remove k and max type check test

ae6cc81

add upper limit test for accumulate

d60702d

add invalid value tests

b616e7c

fix ApproxTopKAccumulate doc

fa8d569

Merge branch 'master' into SPARK-52588

2f69184

fix sql test

2d24c68

gengliangwang reviewed Jul 8, 2025


gengliangwang reviewed Jul 9, 2025
