Add function seriesDecomposeSTL() for time series decomposition into seasonal, trend and residual components #57078

Merged
merged 12 commits into from Jan 9, 2024

Conversation

bhavnajindal
Contributor

@bhavnajindal bhavnajindal commented Nov 21, 2023

Implemented function seriesDecomposeSTL() for seasonal decomposition of time series data. This function uses the (MIT-licensed) stl-cpp library.

Decomposing a time series into seasonal, trend, and residual components is a statistical method used to analyze data that changes over time. Users can apply it to make informed decisions about their services by identifying patterns in the data and understanding its overall behaviour. Applications of seriesDecomposeSTL() include:

  • Forecasting
  • Anomaly detection
  • Outlier detection

Parameters: An array containing the time series data, and the period of the time series.

Returns: An array of three arrays: the first contains the seasonal component, the second the trend, and the third the residual component.

Example:

SELECT seriesDecomposeSTL([10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34, 10.1, 20.45, 40.34], 3);

Output:

[[-13.529999,-3.1799996,16.71,-13.53,-3.1799996,16.71,-13.53,-3.1799996,16.71,-13.530001,-3.18,16.710001,-13.530001,-3.1800003,16.710001,-13.530001,-3.1800003,16.710001,-13.530001,-3.1799994,16.71,-13.529999,-3.1799994,16.709997],[23.63,23.63,23.630003,23.630001,23.630001,23.630001,23.630001,23.630001,23.630001,23.630001,23.630001,23.63,23.630001,23.630001,23.63,23.630001,23.630001,23.63,23.630001,23.630001,23.630001,23.630001,23.630001,23.630003],[0,0.0000019073486,-0.0000019073486,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.0000019073486,0,0]]
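To make the three returned arrays easier to read, here is a minimal Python sketch of a classical additive decomposition. It is an assumption-laden illustration, not the Loess-based STL algorithm that this PR implements: the helper name `naive_decompose` is hypothetical, it only handles odd periods, and it pads the trend at the edges. For the perfectly periodic example input above it happens to give the same seasonal values (about -13.53, -3.18, 16.71) and the same flat trend (about 23.63).

```python
from statistics import fmean


def naive_decompose(series, period):
    """Classical additive decomposition, for illustration only (not STL/Loess).

    Returns [seasonal, trend, residual], the same order as the function output above.
    """
    n = len(series)
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch only handles odd periods >= 3")
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")

    half = period // 2
    # A centered moving average over one full period averages out the seasonal swing.
    interior = [fmean(series[i - half:i + half + 1]) for i in range(half, n - half)]
    # Pad both edges with the nearest interior value so the trend has length n.
    trend = [interior[0]] * half + interior + [interior[-1]] * half

    detrended = [x - t for x, t in zip(series, trend)]
    # Average the detrended values at each position within the period,
    # then shift them so the seasonal component sums to zero over one period.
    phase_means = [fmean(detrended[p::period]) for p in range(period)]
    offset = fmean(phase_means)
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    # Whatever is left after removing trend and seasonality is the residual.
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return [seasonal, trend, residual]


series = [10.1, 20.45, 40.34] * 8  # same 24 values as the SQL example
seasonal, trend, residual = naive_decompose(series, 3)
print([round(v, 2) for v in seasonal[:3]])  # [-13.53, -3.18, 16.71]
print(round(trend[0], 2))                   # 23.63
```

In both this sketch and the output above, adding the three components element by element gives back the original series (up to rounding).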

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added function seriesDecomposeSTL() which decomposes a time series into a seasonal, a trend and a residual component.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Nov 21, 2023
@robot-ch-test-poll2 robot-ch-test-poll2 added pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR. labels Nov 21, 2023
@robot-ch-test-poll2
Contributor

robot-ch-test-poll2 commented Nov 21, 2023

This is an automated comment for commit b2434d0 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker image for servers | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs Check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ✅ success |
| Mergeable Check | Checks if all other necessary checks are successful | ✅ success |
| SQLTest | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with SQLancer tool | ✅ success |
| Sqllogic | Run clickhouse on the sqllogic test set against sqlite and checks that all statements are passed | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style Check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

| Check name | Description | Status |
|---|---|---|
| Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ❌ failure |
| Upgrade check | Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts | ❌ failure |

@jkartseva jkartseva self-assigned this Nov 22, 2023
docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
src/Functions/seriesDecomposeSTL.cpp Outdated Show resolved Hide resolved
src/Functions/seriesDecomposeSTL.cpp Outdated Show resolved Hide resolved
src/Functions/seriesDecomposeSTL.cpp Show resolved Hide resolved
src/configure_config.cmake Outdated Show resolved Hide resolved
@jkartseva
Contributor

@bhavnajindal

Please describe how the function will be used and who will use it. What is the justification for adding another submodule dependency?

@bhavnajindal
Contributor Author

Decomposing a time series into seasonal, trend, and residual components is a statistical method used to analyze data that changes over time. Customers and businesses can use it to make informed decisions about their services by identifying patterns in the data and understanding its overall behaviour. The applications of series_decompose include:

  • Forecasting
  • Anomaly detection
  • Outlier detection

The new dependency, stl-cpp, is a lightweight, header-only library that uses Seasonal and Trend decomposition using Loess (STL) to compute the above components.
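To illustrate the anomaly and outlier use case listed above, here is a minimal Python sketch that flags suspicious points in a residual component. Everything in it is an assumption made for illustration: the helper name `flag_outliers`, the modified z-score rule based on the median absolute deviation (MAD), the conventional 3.5 cutoff, and the made-up residual values. It is not part of this PR.

```python
from statistics import median


def flag_outliers(residual, threshold=3.5):
    """Return indices of residuals whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * (r - median) / MAD; it is robust because
    a single large spike barely moves the median or the MAD.
    """
    med = median(residual)
    mad = median(abs(r - med) for r in residual)
    if mad == 0:
        # Degenerate case: more than half of the residuals are identical, so
        # the MAD gives no spread estimate. This sketch flags nothing then.
        return []
    return [
        i for i, r in enumerate(residual)
        if abs(0.6745 * (r - med) / mad) > threshold
    ]


# Made-up residuals for illustration only: small noise plus one spike at index 5.
example_residual = [0.2, -0.1, 0.3, -0.2, 0.1, 9.5, -0.3, 0.2, -0.1, 0.1]
print(flag_outliers(example_residual))  # [5]
```

Working on the residual rather than the raw series means that regular seasonal peaks are not reported as outliers, since they have already been absorbed by the seasonal component.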

@jkartseva
Contributor

My question is whether this use case is broad enough to be considered for a CH function, and what the anticipated number of potentially interested customers might be. If the application is too niche, I believe we should try another workflow, for instance, exporting the data and applying an external library specializing in statistics, e.g. SciPy or statsmodels (the latter also supports STL decomposition).
I'll bring this up for internal discussion. This is the 2nd PR introducing a dependency in the form of a submodule. Will more follow, and if so, which dependencies or features are anticipated?

@bkuschel
Contributor

Hi @jkartseva, we require these functions in order to increase coverage of the KQL query front end, which is used by our customer to find outliers in log trace data. It is very important.

These features will be used in a high-load environment with a large amount of ingest and query traffic on hundreds of terabytes of data. Keeping costs down while maintaining acceptable query performance is a challenge even without introducing Python runtimes into the already busy ClickHouse pods running on Kubernetes. In our experience, having a "performance-challenged" Python runtime fighting for resources to service queries won't end well. Using a remote process runtime through a URL increases complexity and has the same performance concerns, adding load to a network that is already consumed by ingest, replication and queries.

Why would you not want this functionality? It would appear to increase ClickHouse's value for log analytics use cases. You could raise the same objection against the statistical functions that already exist, for example.

@rschu1ze
Member

The main concern is about

  • introducing obscure/unmaintained/buggy 3rd party dependencies and/or
  • obscure functions that will be relevant to only a very small subset of users.

Anyway, I took a look, and seasonal and trend decomposition using Loess (STL) seems to be quite popular: Hyndman's popular textbook references it, which is a good sign, and the original paper has 3800 citations, which indicates that it is significant. The C++ library was written by just one author (bad sign), but the code itself is not terribly complex (mostly procedural, good sign) and it is self-contained (header-only, good sign). That it is needed by KQL is also a good sign. So, green light from my side (though others may weigh in).

@jkartseva
Contributor

I believe we should either copy it or write our own implementation based on that library. BTW, it is a Fortran -> C++ port of https://www.netlib.org/a/stl, as mentioned.
If we copy it, it will be easier to maintain, and this way we can also address the float32 precision issue.
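As a hedged illustration of what a float32 precision issue can look like here (an interpretation on my part, not something stated in the thread): the non-zero residuals in the example output, ±0.0000019073486, are the size of one float32 rounding step for values near the trend level of 23.63. The Python sketch below measures that step with the standard `struct` module; the helper names are hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_ulp(x: float) -> float:
    """Gap between float32(x) and the next larger float32 (x positive and finite)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # For positive floats, adding 1 to the bit pattern gives the next float32.
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - to_float32(x)


ulp = float32_ulp(23.63)
print(ulp)  # 1.9073486328125e-06
```

For values between 16 and 32, neighbouring float32 numbers are 2^-19 apart, which is about 0.0000019073486. Residuals of that size are therefore consistent with single-precision rounding rather than with structure in the data.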

@robot-ch-test-poll4 robot-ch-test-poll4 removed the submodule changed At least one submodule changed in this PR. label Jan 3, 2024
@bhavnajindal

This comment was marked as outdated.

docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/time-series-functions.md Outdated Show resolved Hide resolved
src/Common/config.h.in Outdated Show resolved Hide resolved
src/Functions/seriesDecomposeSTL.cpp Outdated Show resolved Hide resolved
src/Functions/stl.hpp Show resolved Hide resolved
tests/queries/0_stateless/02813_series_decompose.sql Outdated Show resolved Hide resolved
tests/queries/0_stateless/02813_series_decompose.sql Outdated Show resolved Hide resolved
tests/queries/0_stateless/02813_series_decompose.sql Outdated Show resolved Hide resolved
tests/queries/0_stateless/02813_series_decompose.sql Outdated Show resolved Hide resolved
rschu1ze

This comment was marked as outdated.

@rschu1ze rschu1ze merged commit 64ab30f into ClickHouse:master Jan 9, 2024
262 of 264 checks passed
@rschu1ze rschu1ze changed the title Seasonal decompose of time series into seasonal, trend and residue components Add function seriesDecomposeSTL() for time series decomposition into seasonal, trend and residual components Jan 9, 2024
bhavnajindal pushed a commit to ClibMouse/ClickHouse that referenced this pull request Feb 26, 2024
Seasonal decompose of time series into seasonal, trend and residue components
Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature
7 participants