Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support first_value/last_value in range query #3448

Merged
merged 3 commits into from Mar 11, 2024

Conversation

Taylor-lagrange
Copy link
Contributor

@Taylor-lagrange Taylor-lagrange commented Mar 7, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

  • Write a special Accumulator to support first/last aggregate function (In DataFusion, the function call first_value/last_value) in range query. When user specify order by param in first_value/last_value, the data will be sort by order by param, else time index will be default order by param to sort the data.
  • Fix the problem when using count(*) in range query.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

Closes #3460

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Mar 7, 2024
Copy link

codecov bot commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 42.78351% with 111 lines in your changes are missing coverage. Please review.

Project coverage is 85.01%. Comparing base (2a675e0) to head (4b9c726).
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3448      +/-   ##
==========================================
+ Coverage   85.00%   85.01%   +0.01%     
==========================================
  Files         889      896       +7     
  Lines      146596   147711    +1115     
==========================================
+ Hits       124613   125579     +966     
- Misses      21983    22132     +149     

@MichaelScofield
Copy link
Collaborator

Datafusion already has these two functions, see https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html#last-value, why implement our own?

@Taylor-lagrange
Copy link
Contributor Author

Because the DataFusion's first_value/last_value implementation depend on aggregate plan to sort input RecordBatch in aggregate plan execution, our range plan don't have such mechanism. Besides, the data update logic in first_value/last_value is different with range query data update logic, so I choose to implement our own first_value/last_value @MichaelScofield

@MichaelScofield
Copy link
Collaborator

@Taylor-lagrange Users are already able to execute "last_value" in range query like this:

select last_value(`c` order by `ts`) range '1s' as `c`, ts from test
where a = 'x' and ts > 1706716800000000 and ts < 1706803200000000 
align '1s' to '2023-12-01T00:00:00' by (a) order by ts asc

What's the difference with the new "last/first" functions?

@waynexia
Copy link
Member

waynexia commented Mar 7, 2024

The point is to provide a gramma candy inside range query, where the time index column is considered special. So in range query the order by ts can be omitted.

@Taylor-lagrange
Copy link
Contributor Author

The execution logic of DataFusion first_value/last_value is to directly obtain the last value of the input batch, and if a new batch enters later, the value of last_value will not be updated.

If DataFusion's last_value logic is to be executed correctly, it must satisfy:

  • The input data already sorted (It's guaranteed by aggregate plan order requirement, which not provided in range plan)
  • We can only input one batch in last_value (Aggregate plan will call merge_batch to get final last_value , which not have in range plan too)

If user use last_value before this pr, unless the data input happens to meet the two conditions I mentioned above, the results of the user query should be wrong

@Taylor-lagrange
Copy link
Contributor Author

The implementation of range plan only reuses the Accumulator of aggregate plan as the calculation method to adapt most of aggregate functions. But the Accumulator implementation of some functions itself relies on some guarantees provided by the aggregate plan. However, we do not provide these guarantees in the Range plan, so although some functions can be executed, the results may be incorrect.

@Taylor-lagrange
Copy link
Contributor Author

The Accumulator implementation of these functions inherently relies on some guarantees provided by the aggregate plan.

These guarantees are mainly about the orderliness of the input, but there are only a few of these functions, only first_value / last_value / array_agg. I will look at the corresponding implementation of the array_agg function later to see if we need to manually adapt the new Accumulator to this function.

Copy link
Collaborator

@MichaelScofield MichaelScofield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show an example of wrong result of Datafusion's last_value in range query?

src/query/src/range_select/plan.rs Outdated Show resolved Hide resolved
@Taylor-lagrange
Copy link
Contributor Author

Taylor-lagrange commented Mar 8, 2024

image

Because most of the time when use first_value, the param is order by time index, happen the data we store at the storage is also sorted according to the timeline, and when the amount of data is not large, it is easy to meet the two conditions I mentioned before. But logically the implementation of first_value is incorrect @MichaelScofield

@MichaelScofield
Copy link
Collaborator

@Taylor-lagrange Please add that case in your sqlness test.

@github-actions github-actions bot added docs-required This change requires docs update. and removed docs-not-required This change does not impact docs. labels Mar 8, 2024
@waynexia
Copy link
Member

waynexia commented Mar 8, 2024

The Accumulator implementation of these functions inherently relies on some guarantees provided by the aggregate plan.

These guarantees are mainly about the orderliness of the input, but there are only a few of these functions, only first_value / last_value / array_agg. I will look at the corresponding implementation of the array_agg function later to see if we need to manually adapt the new Accumulator to this function.

Considering the List type is not (well) supported in our system, we can postpone array_agg later after the type system is ready for those compound types.

@waynexia waynexia enabled auto-merge March 8, 2024 09:24
@waynexia waynexia added this pull request to the merge queue Mar 11, 2024
Merged via the queue into GreptimeTeam:main with commit 8c37c3f Mar 11, 2024
24 checks passed
@Taylor-lagrange Taylor-lagrange deleted the feat-range-first-last branch March 11, 2024 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-required This change requires docs update.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RANGE QUERY cannot support COUNT(*)
3 participants