Add function seriesDecomposeSTL()
for time series decomposition into seasonal, trend and residual components
#57078
Please describe how the function will be used and who will use it. What is the justification for adding another submodule dependency?
Decomposing a time series into seasonal, trend, and residual components is a statistical method used to analyze data that changes over time. This method can be used by customers/businesses to make informed decisions about their services by identifying patterns in data and understanding its overall behaviour. The applications of series_decompose include:
The newer dependency |
My question is whether this use case is broad enough to be considered for a CH function and what the anticipated number of potentially interested customers might be. If the application is too niche, I believe we should try another workflow, for instance exporting the data and applying an external library specializing in statistics, e.g. SciPy or statsmodels (the latter also supports STL decomposition).
Hi @jkartseva, we require these functions to increase coverage of the KQL query front end, which our customer uses to find outliers in log trace data. It is very important. These features will be used in a high-load environment with a large amount of ingest and query traffic on hundreds of terabytes of data. Keeping costs down while maintaining acceptable query performance is a challenge even without introducing Python runtimes into the already busy ClickHouse pods running on Kubernetes. In our experience, having a "performance-challenged" Python runtime fighting for resources to service queries won't end well. Using a remote process runtime through a URL increases complexity and has the same performance concerns, putting load on a network that is already consumed by ingest, replication and queries. Why would you not want this functionality? It would appear to increase ClickHouse's value for log analytics use cases. There are already existing statistical functions, for example, about which you could say the same thing.
The main concern is about
Anyway, I took a look, and seasonal and trend decomposition using Loess (STL) seems to be quite popular: Hyndman's popular textbook references it, which is a good sign, and the original paper has 3800 citations, which indicates that it is significant. The C++ library was written by just one author (bad sign), but the code itself is not terribly complex (mostly procedural, good sign) and it is self-contained (header-only, good sign). That it is needed by KQL is also a good sign. So, green light from my side (though others may weigh in).
I believe we should either copy it or write our own implementation based on that library. BTW, it is a Fortran -> C++ port of https://www.netlib.org/a/stl, as mentioned.
Seasonal decompose of time series into seasonal, trend and residue components
Implemented function seriesDecomposeSTL() for seasonal decomposition of time series data. This function uses the (MIT-licensed) stl-cpp library.
Decomposing a time series into seasonal, trend, and residual components is a statistical method used to analyze data that changes over time. This method can be used to make informed decisions about services by identifying patterns in data and understanding its overall behaviour. Applications of seriesDecomposeSTL() include:
Parameters: An array containing time series data and the period of the time series.
Returns: An array of arrays where the first array contains the seasonal component, the second array the trend, and the third array the residual component.
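To make the return shape concrete, here is a minimal Python sketch of an additive decomposition that returns `[seasonal, trend, residual]`. It is not STL and not the stl-cpp code used by this PR: it swaps in a classical moving-average decomposition purely for illustration, and the name `decompose_additive` is hypothetical. The only property it is meant to show is that the three arrays have the input's length and sum back to the original series.

```python
from statistics import fmean


def decompose_additive(series, period):
    """Illustrative (non-STL) additive decomposition.

    Returns [seasonal, trend, residual], each as long as `series`,
    such that seasonal[i] + trend[i] + residual[i] equals series[i]
    up to floating-point rounding.
    """
    n = len(series)
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods")

    # Trend: centered moving average, roughly one period wide,
    # with the window clipped at both ends of the series.
    half = period // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(fmean(series[lo:hi]))

    # Seasonal: average detrended value for each phase of the period,
    # shifted so that one full period of seasonal values sums to zero.
    detrended = [x - t for x, t in zip(series, trend)]
    phase_means = [fmean(detrended[p::period]) for p in range(period)]
    offset = fmean(phase_means)
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    # Residual: whatever is left after removing seasonal and trend.
    residual = [x - s - t for x, s, t in zip(series, seasonal, trend)]
    return [seasonal, trend, residual]


if __name__ == "__main__":
    data = [10, 12, 14, 10, 12, 14, 10, 12, 14, 10, 12, 14]
    seasonal, trend, residual = decompose_additive(data, 3)
    print(seasonal)
    print(trend)
    print(residual)
```

STL replaces the moving average and per-phase means above with iterated Loess smoothing, which makes it more robust to outliers and lets the seasonal pattern vary over time; the sketch only mirrors the input and output contract described in the PR.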
Example:
Output:
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added function seriesDecomposeSTL() which decomposes a time series into a seasonal, a trend and a residual component.
Documentation entry for user-facing changes