Add new autoregression function: autoregress(lambda, offset, initial value) #65169

Alex-Cheng · 2024-06-12T13:29:57Z

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add new autoregression function: autoregress(lambda, offset, initial value). It implements feature request #64884.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Motivation

The Autoregressive (AR) model is a fundamental component in the realm of time series analysis and forecasting.
It is a kind of multiple regression that uses the relationship between the observed value Y_t and the values observed in previous periods to predict Y. The dependent variable is the observed value Y_t, and the independent variables are its lagged values Y_{t-1}, Y_{t-2}, ...
The steps of the autoregressive prediction method are as follows:

1. Determine the autocorrelation sequence: According to the prediction purpose and requirements, organize the time series data (month, quarter, year) of the prediction target to make it comparable, and divide these series into the dependent variable and independent variable series.
2. Determine the regression model: Calculate the autocorrelation coefficient of each independent variable series, and determine the independent variable according to the size of the autocorrelation coefficient, that is, select the independent variable series with a larger autocorrelation coefficient to fit the regression model.
3. Estimate the parameters and use the model to predict: The method of finding the value of the model parameters is the same as that of other regression models. The independent variable in the prediction period is the next value of the independent variable series, which can be found in the original time series and used for prediction.
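
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not part of the PR: the series is made up, the model is restricted to a single lag (AR(1)), and the helper names `autocorrelation` and `fit_ar1` are hypothetical.

```python
from typing import List, Tuple


def autocorrelation(series: List[float], lag: int) -> float:
    """Sample autocorrelation coefficient of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of Y_t = c + phi * Y_{t-1}; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - phi * mean_x, phi


# Made-up series that follows Y_t = 2 + 0.5 * Y_{t-1} exactly (assumption).
series = [10.0, 7.0, 5.5, 4.75, 4.375]

# Step 2: choose the lag with the largest absolute autocorrelation.
best_lag = max((1, 2), key=lambda k: abs(autocorrelation(series, k)))

# Step 3: estimate the parameters, then predict from the latest observation.
c, phi = fit_ar1(series)
next_value = c + phi * series[-1]
```

Because the example series is generated from an exact linear recurrence, the fit recovers c close to 2 and phi close to 0.5; real data would of course leave residual error.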

The syntax of the function is autoregress(x -> {expression}, backward_offset, initial_value). For example:

select autoregress(x -> toFloat64(column1 + column2 - x), 1, toFloat64(0.6) ); -- argument 2 being 1 means fetch T-1 result.

Parameters

argument 1 - expression: the autoregressive expression, a lambda that receives a previously computed result.
argument 2 - backward_offset: the offset n; the expression for row T uses the result previously computed for row T-n.
argument 3 - initial_value: the value used when no previously computed T-n result exists.
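
To make the three parameters concrete, here is a small Python model of the intended row-by-row evaluation. It is a sketch, not the ClickHouse implementation: the `expr(row, x)` signature and the `rows` argument are hypothetical stand-ins for the lambda and the table columns, and it assumes that the expression is still evaluated with x set to initial_value when no T-n result exists.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, float]  # stand-in for (column1, column2)


def autoregress(
    expr: Callable[[Row, float], float],
    backward_offset: int,
    initial_value: float,
    rows: Sequence[Row],
) -> List[float]:
    """Evaluate `expr` per row, feeding it the result computed `backward_offset` rows earlier."""
    results: List[float] = []
    for t, row in enumerate(rows):
        # Assumption: fall back to initial_value when row T-n has no result yet.
        if t >= backward_offset:
            previous = results[t - backward_offset]
        else:
            previous = initial_value
        results.append(float(expr(row, previous)))
    return results


# Mirrors the SQL example: x -> column1 + column2 - x, offset 1.
rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
out = autoregress(lambda row, x: row[0] + row[1] - x, 1, 0.5, rows)
# out == [2.5, 4.5, 6.5]
```

With backward_offset set to 2, the first two rows would both use initial_value, and row 3 would use the result of row 1.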

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

Exclude: Style check
Exclude: Fast test
Exclude: All with ASAN
Exclude: All with TSAN, MSAN, UBSAN, Coverage
Exclude: All with aarch64, release, debug

Do not test
Upload binaries for special builds
Disable merge-commit
Disable CI cache

robot-ch-test-poll · 2024-06-12T15:46:56Z

This is an automated comment for commit c495be8 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name	Description	Status
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	❌ failure

Successful checks

Check name	Description	Status
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success

Alex-Cheng · 2024-06-13T07:51:57Z

AST fuzzer reports an error:

2024.06.13 07:16:29.093700 [ 169 ] {a8e29bc2-60a6-4c64-9dac-021dc4df5095} <Fatal> : Logical error: 'Function 'autoregress' function data type for lambda argument with index 1 arguments size mismatch. Actual 2. Expected Function(Float64 -> Float64). In scope SELECT autoregress((auto_regress_0001__fuzz_44, x) -> toFloat64((col1 * col2) + x), 1, toFloat64(col1 + toLowCardinality(101))), col1, col2, col1 * col2, col1 + 101 FROM auto_regress_0001 WHERE -2147483647 ORDER BY id DESC NULLS FIRST'.

The function I implemented only allows the lambda expression to accept one argument of type Float64, but the AST fuzzer passes two arguments to the lambda. How should I fix this AST fuzzer error? Is it okay to check all arguments and throw a BAD_ARGUMENTS exception when the input arguments violate the function's requirements?

Alex-Cheng · 2024-06-13T09:19:30Z

@alexey-milovidov Is the documentation I wrote in the description of the PR enough? Following the instructions, I wrote the text for the function in the PR description. I assume a professional technical writer will turn it into official documentation; is that right?

Alex-Cheng · 2024-06-13T16:31:00Z

@rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.

Alex-Cheng · 2024-06-14T02:09:02Z

@nikitamikhaylov shall I write the documentation in the PR? Are there any instructions for it? BTW, I noticed that the integration test "test_query_is_canceled_with_inf_retries" fails, but it is not related to my PR. What should I do about it? Thank you.

src/Functions/FunctionAutoregress.cpp

rschu1ze · 2024-06-18T20:45:32Z

@Alex-Cheng I did not look at this PR yet but since it is implemented as a regular function, #60555 may apply. In other words: Does the AR model need to see the data in the entire column to work? If yes, then a regular function is not the right choice, a window function is.

> @rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.

About docs: docs/en/sql-reference/functions/time-series-functions.md would be the right place. Only English is mandatory, Russian and Chinese docs are optional.

Alex-Cheng · 2024-06-20T15:39:07Z

> @Alex-Cheng I did not look at this PR yet but since it is implemented as a regular function, #60555 may apply. In other words: Does the AR model need to see the data in the entire column to work? If yes, then a regular function is not the right choice, a window function is.
>
> > @rschu1ze could you please give some instructions about writing the documentation? I am not sure whether I need to write EN, RU, and ZH documents.
>
> About docs: docs/en/sql-reference/functions/time-series-functions.md would be the right place. Only English is mandatory, Russian and Chinese docs are optional.

Autoregression looks at row t-n when computing the value for row t; n is usually 1. The values the function needs do not exist when execution begins; the function computes them row by row, so the computation for row i depends on the result for row i-n. It is not an aggregate function, and as I understand it a window function is similar to an aggregate function (it aggregates the rows in a window frame). Based on that understanding, I do not think it is a window function.
With n = 1 and the expression x + 1, the autoregression works as follows:
row 1 - the initial value.
row 2 - evaluate the expression x + 1 with x being the result of row t-1, that is, row 1's value, which is the initial value.
row 3 - evaluate the expression x + 1 with x being row 2's value.
...
row k - evaluate the expression x + 1 with x being the value of row k-1.

rschu1ze · 2024-06-23T20:29:04Z

@Alex-Cheng The problem is that ClickHouse processes data based on independent chunks (blocks) of arbitrary size, controlled by setting max_block_size. These chunks could (in theory) be as small as a single row. Each call to FunctionAutoregress will then be passed a single row, meaning that even the default case n=1 (compute based on previous row) won't work. This is an extreme example and I think that in practice chunks are typically larger than 1 row but the general issue remains: Time series functions which need to see (some or all) past values cannot meaningfully be implemented as regular functions :-(
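
The effect described here can be reproduced with a toy Python model. It is only a sketch of the reasoning, not ClickHouse code: `autoregress_block` is a hypothetical stand-in for one stateless call to the function, `block_size` plays the role of max_block_size, and the expression is fixed to x + 1 with offset 1, assuming initial_value is used whenever no earlier result is available.

```python
from typing import List


def autoregress_block(block: List[float], initial_value: float) -> List[float]:
    """Evaluate x -> x + 1 (offset 1) over one block, with no memory of earlier blocks."""
    results: List[float] = []
    for _ in block:
        previous = results[-1] if results else initial_value
        results.append(previous + 1)
    return results


def run_in_blocks(
    column: List[float], block_size: int, initial_value: float
) -> List[float]:
    """Split the column into independent blocks and call the function once per block."""
    output: List[float] = []
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        output.extend(autoregress_block(block, initial_value))
    return output


column = [0.0] * 4  # the values are irrelevant here; only the row count matters

whole = run_in_blocks(column, 4, 0.0)   # [1.0, 2.0, 3.0, 4.0]
pairs = run_in_blocks(column, 2, 0.0)   # [1.0, 2.0, 1.0, 2.0]
single = run_in_blocks(column, 1, 0.0)  # [1.0, 1.0, 1.0, 1.0]
```

The same column yields three different results depending only on the block size, because each call restarts from the initial value.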

Alex-Cheng · 2024-06-26T08:08:24Z

> @Alex-Cheng The problem is that ClickHouse processes data based on independent chunks (blocks) of arbitrary size, controlled by setting max_block_size. These chunks could (in theory) be as small as a single row. Each call to FunctionAutoregress will then be passed a single row, meaning that even the default case n=1 (compute based on previous row) won't work. This is an extreme example and I think that in practice chunks are typically larger than 1 row but the general issue remains: Time series functions which need to see (some or all) past values cannot meaningfully be implemented as regular functions :-(

I will think of a way to implement the function so that it works across data chunks.

rschu1ze · 2024-06-27T20:00:44Z

The code got quite a bit longer but I don't see how the problem (#65169 (comment)) was addressed. In fact, it is impossible to address in regular functions. I am afraid you will need to implement the functionality as a window function.

Alex-Cheng · 2024-06-28T05:33:08Z

> The code got quite a bit longer but I don't see how the problem (#65169 (comment)) was addressed. In fact, it is impossible to address in regular functions. I am afraid you will need to implement the functionality as a window function.

I plan to convert it to a draft PR.

Alex-Cheng added 2 commits June 12, 2024 21:17

feat: add new autoregression function: autoregress(lambda, backward_offset, initial_value)

b6541fd

refactor: include lines.

d4f23b0

nikitamikhaylov added the "can be tested" label (Allows running workflows for external contributors) Jun 12, 2024

robot-ch-test-poll added the "pr-feature" label (Pull request with new product feature) Jun 12, 2024

feat: add doc.

f350d6e

fix: fix fuzzer test error.

19a47b6

Merge branch 'ClickHouse:master' into xch/feat_auto_regress

b055495

superdiaodiao reviewed Jun 18, 2024

src/Functions/FunctionAutoregress.cpp (outdated)

superdiaodiao reviewed Jun 18, 2024

src/Functions/FunctionAutoregress.cpp (outdated)

Merge branch 'ClickHouse:master' into xch/feat_auto_regress

cc3b2f2

Alex-Cheng and others added 4 commits June 26, 2024 16:14

feat: support multiple previous calucation results in autoregress expression.

d2c1b83

doc: add doc for autoregress.

c963834

fix typo and add words to aspell ignore list.

2430f44

Merge branch 'ClickHouse:master' into xch/feat_auto_regress

c495be8

rschu1ze closed this Jun 27, 2024
