Add new autoregression function: autoregress(lambda, offset, initial value) #65169

Closed
wants to merge 10 commits into from

Conversation

Alex-Cheng
Contributor

@Alex-Cheng Alex-Cheng commented Jun 12, 2024

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add new autoregression function: autoregress(lambda, offset, initial value). It implements feature request #64884.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Motivation

The autoregressive (AR) model is a fundamental component of time series analysis and forecasting.
It is a form of multiple regression that uses the relationship between the observed value Yt and the values observed in previous periods to predict the value of Y. The dependent variable is the observed value Yt, and the independent variables are its lagged values Yt-1, Yt-2, ...
The steps of the autoregressive prediction method are as follows:

  • Determine the autocorrelation sequence: According to the prediction purpose and requirements, organize the time series data (month, quarter, year) of the prediction target to make it comparable, and divide these series into the dependent variable and independent variable series.
  • Determine the regression model: Calculate the autocorrelation coefficient of each independent variable series, and determine the independent variable according to the size of the autocorrelation coefficient, that is, select the independent variable series with a larger autocorrelation coefficient to fit the regression model.
  • Estimate the parameters and use the model to predict: The method of finding the value of the model parameters is the same as that of other regression models. The independent variable in the prediction period is the next value of the independent variable series, which can be found in the original time series and used for prediction.
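The steps above can be sketched for the simplest case, an AR(1) model Yt = c + phi * Yt-1. This is only an illustration under stated assumptions: the helper names `autocorrelation`, `fit_ar1`, and `forecast_next` are hypothetical, the series is synthetic, and none of this code is part of the PR.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((y - mean) ** 2 for y in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var


def fit_ar1(series):
    """Ordinary least-squares fit of Y_t = c + phi * Y_{t-1}; returns (c, phi)."""
    x = series[:-1]  # independent variable: lagged values Y_{t-1}
    y = series[1:]   # dependent variable: current values Y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_next(series, c, phi):
    """One-step-ahead prediction from the last observed value."""
    return c + phi * series[-1]


# Synthetic series generated exactly by Y_t = 2 + 0.5 * Y_{t-1}.
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
print(autocorrelation(series, 1), c, phi, forecast_next(series, c, phi))
```

Because the synthetic series follows the model exactly, the fit should recover c close to 2 and phi close to 0.5; real data would add noise and usually more lags.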

The syntax of the function is autoregress(x -> {expression}, backward_offset, initial_value). An example of usage:

select autoregress(x -> toFloat64(column1 + column2 - x), 1, toFloat64(0.6)); -- argument 2 being 1 means the lambda receives the T-1 result.

Parameters

  • argument 1 - expression: the autoregressive expression, given as a lambda.
  • argument 2 - backward_offset: autoregression needs the previously calculated T-n result; this argument specifies n.
  • argument 3 - initial_value: the value used when no previously calculated T-n result exists.
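The intended semantics of the three arguments can be approximated in plain Python. This is a sketch of one reading of the parameter list, not the PR's C++ implementation: it assumes that `initial_value` is passed to the lambda as x whenever no T-n result exists, and, unlike the SQL lambda (which captures columns from the query), the Python lambda receives the current row explicitly. The `rows` data is made up.

```python
def autoregress(expr, rows, backward_offset, initial_value):
    """Evaluate expr(row, x) row by row, where x is the result computed
    backward_offset rows earlier, or initial_value if there is none yet
    (assumed reading of the parameter description)."""
    results = []
    for t, row in enumerate(rows):
        if t >= backward_offset:
            x = results[t - backward_offset]
        else:
            x = initial_value
        results.append(float(expr(row, x)))
    return results


# Mirrors: autoregress(x -> toFloat64(column1 + column2 - x), 1, toFloat64(0.6))
rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]  # hypothetical (column1, column2) values
out = autoregress(lambda row, x: row[0] + row[1] - x, rows, 1, 0.6)
print(out)  # approximately [2.4, 4.6, 6.4]
```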

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: Normal Builds
  • Allow: Special Builds
  • Allow: All NOT Required Checks
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 12, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-feature Pull request with new product feature label Jun 12, 2024
@robot-ch-test-poll
Contributor

robot-ch-test-poll commented Jun 12, 2024

This is an automated comment for commit c495be8 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
| --- | --- | --- |
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure |

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

@Alex-Cheng
Contributor Author

AST fuzzer reports an error:

2024.06.13 07:16:29.093700 [ 169 ] {a8e29bc2-60a6-4c64-9dac-021dc4df5095} <Fatal> : Logical error: 'Function 'autoregress' function data type for lambda argument with index 1 arguments size mismatch. Actual 2. Expected Function(Float64 -> Float64). In scope SELECT autoregress((auto_regress_0001__fuzz_44, x) -> toFloat64((col1 * col2) + x), 1, toFloat64(col1 + toLowCardinality(101))), col1, col2, col1 * col2, col1 + 101 FROM auto_regress_0001 WHERE -2147483647 ORDER BY id DESC NULLS FIRST'.

The function I implemented only allows the lambda expression to accept one argument of type Float64, but the AST fuzzer passes two arguments to the lambda. How should I fix the AST fuzzer error? Is it okay to check all arguments and throw a BAD_ARGUMENTS exception when the input arguments violate the function's requirements?

@Alex-Cheng
Contributor Author

@alexey-milovidov Is the documentation I wrote in the description of the PR enough? Following the instructions, I wrote the text for the function in the PR description. I assume a professional technical writer will turn it into official documentation, is that right?

@Alex-Cheng
Contributor Author

@rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.

@Alex-Cheng
Contributor Author

@nikitamikhaylov shall I write the documentation in the PR? Are there any instructions for it? BTW, I noticed the failure of the integration test "test_query_is_canceled_with_inf_retries", but it is not related to my PR. What should I do about it? Thank you.

@rschu1ze
Member

@Alex-Cheng I did not look at this PR yet but since it is implemented as a regular function, #60555 may apply. In other words: Does the AR model need to see the data in the entire column to work? If yes, then a regular function is not the right choice, a window function is.

> @rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.

About docs: docs/en/sql-reference/functions/time-series-functions.md would be the right place. Only English is mandatory, Russian and Chinese docs are optional.

@Alex-Cheng
Contributor Author

Alex-Cheng commented Jun 20, 2024

> @Alex-Cheng I did not look at this PR yet but since it is implemented as a regular function, #60555 may apply. In other words: Does the AR model need to see the data in the entire column to work? If yes, then a regular function is not the right choice, a window function is.
>
> > @rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.
>
> About docs: docs/en/sql-reference/functions/time-series-functions.md would be the right place. Only English is mandatory, Russian and Chinese docs are optional.

Autoregression looks at row t-n when computing the value for row t; n is usually 1. The values the function needs to see do not exist when function execution begins; the function computes them row by row. The computation on row i depends on the result of the computation on row i-n. It is not a kind of aggregate function, and as I see it a window function is similar to an aggregate function (it aggregates rows in a window frame). Based on this understanding, I think it is not a window function.
For n = 1 and the autoregression expression x + 1, it works as follows:
row 1 - initial value.
row 2 - evaluate the expression x + 1 with x being the t-1 row's result, that is, row 1's value, which is the initial value.
row 3 - evaluate the expression x + 1 with x being row 2's value.
...
row k - evaluate the expression x + 1 with x being the value of row k-1.
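The row-by-row trace above can be written out as a small Python sketch. It follows the trace literally, so the first n rows hold the initial value itself; whether the PR's implementation does exactly this is not stated in the thread, and the name `trace_autoregression` is hypothetical.

```python
def trace_autoregression(step, initial_value, row_count, n=1):
    """Rows 1..n hold initial_value; every later row i applies `step`
    to the value already computed for row i-n."""
    values = []
    for i in range(row_count):
        if i < n:
            values.append(initial_value)        # no row i-n exists yet
        else:
            values.append(step(values[i - n]))  # depends on an earlier result
    return values


print(trace_autoregression(lambda x: x + 1, 0.0, 5))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Each value depends on one the function itself produced earlier, which is why the computation is inherently sequential.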

@rschu1ze
Member

@Alex-Cheng The problem is that ClickHouse processes data based on independent chunks (blocks) of arbitrary size, controlled by setting max_block_size. These chunks could (in theory) be as small as a single row. Each call to FunctionAutoregress will then be passed a single row, meaning that even the default case n=1 (compute based on previous row) won't work. This is an extreme example and I think that in practice chunks are typically larger than 1 row but the general issue remains: Time series functions which need to see (some or all) past values cannot be meaningfully implemented as regular functions :-(
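The effect described here can be reproduced with a small Python simulation. It is not ClickHouse code: `autoregress_block` is a hypothetical stand-in for a regular function that sees only the block it is handed, and `split_into_blocks` merely imitates splitting a column according to a block-size limit.

```python
def autoregress_block(block, initial_value, n=1):
    """Stateless evaluation of x + value over one block. It cannot see
    results from earlier blocks, so it restarts from initial_value."""
    out = []
    for i, value in enumerate(block):
        x = out[i - n] if i >= n else initial_value
        out.append(x + value)
    return out


def split_into_blocks(column, max_block_size):
    """Cut a column into consecutive chunks of at most max_block_size rows."""
    return [column[i:i + max_block_size]
            for i in range(0, len(column), max_block_size)]


column = [1.0, 2.0, 3.0, 4.0]

# Whole column in one block: a running sum.
whole = autoregress_block(column, 0.0)
# Same data, one row per block: every call starts over.
chunked = [y for block in split_into_blocks(column, 1)
           for y in autoregress_block(block, 0.0)]

print(whole)    # [1.0, 3.0, 6.0, 10.0]
print(chunked)  # [1.0, 2.0, 3.0, 4.0]
```

The same query would therefore return different results depending on how the data happens to be split, which is the core of the objection.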

@Alex-Cheng
Contributor Author

> @Alex-Cheng The problem is that ClickHouse processes data based on independent chunks (blocks) of arbitrary size, controlled by setting max_block_size. These chunks could (in theory) be as small as a single row. Each call to FunctionAutoregress will then be passed a single row, meaning that even the default case n=1 (compute based on previous row) won't work. This is an extreme example and I think that in practice chunks are typically larger than 1 row but the general issue remains: Time series functions which need to see (some or all) past values cannot be meaningfully implemented as regular functions :-(

I will think of a way to implement the function so that it works across data chunks.

@rschu1ze
Member

The code got quite a bit longer but I don't see how the problem (#65169 (comment)) was addressed. In fact, it is impossible to address in regular functions. I am afraid you will need to implement the functionality as a window function.
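To illustrate why per-stream state matters, here is a minimal Python sketch of an evaluator that keeps the last n results between blocks. It is not ClickHouse's window function interface, and the class name `StatefulAutoregress` is hypothetical; it assumes the blocks of one ordered stream arrive in sequence at a single stateful object, which, per the comment above, a regular function cannot rely on.

```python
from collections import deque


class StatefulAutoregress:
    """Carries the last n results across blocks of one ordered stream."""

    def __init__(self, step, initial_value, n=1):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.step = step
        self.initial_value = initial_value
        self.n = n
        self.history = deque(maxlen=n)  # oldest entry is the t-n result once full

    def process_block(self, block):
        out = []
        for value in block:
            if len(self.history) == self.n:
                x = self.history[0]      # result from n rows back
            else:
                x = self.initial_value   # fewer than n results so far
            y = self.step(value, x)
            self.history.append(y)       # drops the oldest entry when full
            out.append(y)
        return out


column = [1.0, 2.0, 3.0, 4.0]
evaluator = StatefulAutoregress(lambda value, x: x + value, 0.0)
result = []
for start in range(0, len(column), 1):   # one row per block
    result.extend(evaluator.process_block(column[start:start + 1]))
print(result)  # [1.0, 3.0, 6.0, 10.0]
```

With the state kept outside the per-block call, the output no longer depends on how the column is split.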

@rschu1ze rschu1ze closed this Jun 27, 2024
@Alex-Cheng
Contributor Author

Alex-Cheng commented Jun 28, 2024

> The code got quite a bit longer but I don't see how the problem (#65169 (comment)) was addressed. In fact, it is impossible to address in regular functions. I am afraid you will need to implement the functionality as a window function.

I plan to convert it to a draft PR.

Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants