Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponentially smoothed moving average as aggregate function. #27511

Closed
alexey-milovidov opened this issue Aug 10, 2021 · 10 comments · Fixed by #28914
Closed

Exponentially smoothed moving average as aggregate function. #27511

alexey-milovidov opened this issue Aug 10, 2021 · 10 comments · Fixed by #28914
Assignees
Labels
easy task Good for first contributors feature

Comments

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Aug 10, 2021

The function will take two arguments: value and time and also parameter - half-decay period.

Example: exponentialMovingAverage(300)(temperature, timestamp)
- exponentially smoothed moving average of the temperature for the past five minutes at the latest point of time.

The state of the aggregate function is current averaged value and the latest time: (v, t).
Whenever new value or new state is appeared, the state is updated as:

t_new = max(t_old, t)
v_new = v_old * (1 / exp2((t_new - t_old) / half_decay)) + v * (1 - 1 / exp2((t_new - t_old) / half_decay))

(a sort of - did I write the formula correctly?)
(does this way of calculation depend on the order of updates?)

@alexey-milovidov alexey-milovidov added feature easy task Good for first contributors labels Aug 10, 2021
@mathalex mathalex self-assigned this Aug 10, 2021
@UnamedRus
Copy link
Contributor

UnamedRus commented Aug 11, 2021

The state of the aggregate function is current averaged value and the latest time: (v, t).
(does this way of calculation depend on the order of updates?)

I doubt that it's possible to combine multiple states (which holds only last value + timestamp) and get some reasonable answer.
What if second state was calculated only from single value, so it's equals to this single value (or to zero), it would mess up result quite a bit.

And just aggregate function, which returns only last value would have limited usage due:
Last value could be calculated via regular arraySum and arithmetic functions without huge performance impact.

Main problem right now, is: without recursive functions like (arrayScan) it's impossible to calculate multiple values efficiently. (W/O calculation of whole sequence for each value)

@alexey-milovidov
Copy link
Member Author

I doubt that it's possible to combine multiple states (which holds only last value + timestamp) and get some reasonable answer.

No doubt. This is how exponential smoothing typically works.
Exponential smoothing is basically: sum(value * exp(-elapsed_time * k))

And just aggregate function, which returns only last value would have limited usage

It will be used for aggregated materialized views (antifraud, user summaries) and with window functions (smoothing over a window).

@alexey-milovidov
Copy link
Member Author

I've received this reply in email but cannot see it on GitHub (probably a bug on GitHub).
By Kenji Noguchi:

It's nice to have the dedicated function. I'm looking forward to it.

In the mean time, CH can do ema using window. N is the number of elements, and k is the decay factor.

SET allow_experimental_window_functions = 1;
WITH
 3   AS n,
 0.05 AS k,
 reverse(arrayMap(v->exp(-v * k), range(n))) as weights
SELECT
  number as original,
  arraySum(arrayMap(pair -> pair.1 * pair.2, arrayZip(arrayResize(grouped_numbers, n, 0), weights))) / n as ema
FROM (
 SELECT
   number,
   groupArray(number) OVER (ORDER BY number ASC ROWS BETWEEN n - 1 PRECEDING AND CURRENT ROW) AS grouped_numbers
 FROM (select number from system.numbers limit 10)
)

┌─original─┬─────────────────ema─┐
│        0 │                   0 │
│        1 │ 0.31707647483350115 │
│        2 │   0.983743141500168 │
│        3 │  1.9357654223454428 │
│        4 │  2.8877877031907175 │
│        5 │   3.839809984035993 │
│        6 │  4.7918322648812675 │
│        7 │   5.743854545726542 │
│        8 │   6.695876826571818 │
│        9 │   7.647899107417093 │
└──────────┴─────────────────────┘

Off topic but SMA is much simpler

WITH 3 as n
SELECT
  number as original,
  avg(number) OVER (ORDER BY number ASC ROWS BETWEEN n - 1 PRECEDING AND CURRENT ROW) AS sma
FROM (select number from system.numbers limit 10)

┌─original─┬─sma─┐
│        0 │   0 │
│        1 │ 0.5 │
│        2 │   1 │
│        3 │   2 │
│        4 │   3 │
│        5 │   4 │
│        6 │   5 │
│        7 │   6 │
│        8 │   7 │
│        9 │   8 │
└──────────┴─────┘

@akuzm
Copy link
Contributor

akuzm commented Aug 15, 2021

We should probably have fold aggregate function that would take a lambda: fold(curr, acc -> curr * k + acc * (1 - k), values, 0 /* initial value */) over (order by timestamp).
It would be similar to arrayFold #21589

@excitoon
Copy link
Contributor

excitoon commented Aug 25, 2021

@akuzm @alexey-milovidov @alz @filimonov , what about :
fold(c0, c1, ..., (c0, c0_acc, c1, c1_acc, ...) -> (c0 * (exp(-(c1-c1_acc)/C) as k) + c0_acc * (1 - k)), c1), (0, 0) /* initial value */) over (order by timestamp)
?

So, we have N columns, which are calculated by:

Xn' = f(X, Xn, Y, Yn...)[0]
Yn' = f(X, Xn, Y, Yn...)[1]
...
Zn' = f(X, Xn, Y, Yn...)[...]

In case of exponential smoothing,

F[x](x, x_acc, t, t_acc) = C * x + (1-C) * x_acc
F[t](x, x_acc, t, t_acc) = t

where C is exp(-(t-t_acc)/T).

@excitoon
Copy link
Contributor

BTW, aggregate functions can not use lambdas in its arguments currently 😞

@alexey-milovidov
Copy link
Member Author

Yes, it cannot be calculated deterministically if data is arrived in arbitrary order.
But it's still usable for window functions or aggregation after ORDER BY.
And even with non-deterministic calculation it gives decent results.

@alexey-milovidov
Copy link
Member Author

Exponential smoothing moving sum can be calculated independent on the order of values.

@filimonov
Copy link
Contributor

filimonov commented Sep 12, 2021

BTW, aggregate functions can not use lambdas in its arguments currently 😞

See laginframe - it is not really an agg function

@Alpha-su
Copy link

Alpha-su commented Feb 9, 2022

I've received this reply in email but cannot see it on GitHub (probably a bug on GitHub). By Kenji Noguchi:

It's nice to have the dedicated function. I'm looking forward to it.

In the mean time, CH can do ema using window. N is the number of elements, and k is the decay factor.

SET allow_experimental_window_functions = 1;
WITH
 3   AS n,
 0.05 AS k,
 reverse(arrayMap(v->exp(-v * k), range(n))) as weights
SELECT
  number as original,
  arraySum(arrayMap(pair -> pair.1 * pair.2, arrayZip(arrayResize(grouped_numbers, n, 0), weights))) / n as ema
FROM (
 SELECT
   number,
   groupArray(number) OVER (ORDER BY number ASC ROWS BETWEEN n - 1 PRECEDING AND CURRENT ROW) AS grouped_numbers
 FROM (select number from system.numbers limit 10)
)

┌─original─┬─────────────────ema─┐
│        0 │                   0 │
│        1 │ 0.31707647483350115 │
│        2 │   0.983743141500168 │
│        3 │  1.9357654223454428 │
│        4 │  2.8877877031907175 │
│        5 │   3.839809984035993 │
│        6 │  4.7918322648812675 │
│        7 │   5.743854545726542 │
│        8 │   6.695876826571818 │
│        9 │   7.647899107417093 │
└──────────┴─────────────────────┘

Off topic but SMA is much simpler

WITH 3 as n
SELECT
  number as original,
  avg(number) OVER (ORDER BY number ASC ROWS BETWEEN n - 1 PRECEDING AND CURRENT ROW) AS sma
FROM (select number from system.numbers limit 10)

┌─original─┬─sma─┐
│        0 │   0 │
│        1 │ 0.5 │
│        2 │   1 │
│        3 │   2 │
│        4 │   3 │
│        5 │   4 │
│        6 │   5 │
│        7 │   6 │
│        8 │   7 │
│        9 │   8 │
└──────────┴─────┘

ema is like avgweighted, so the sql sentence 'arraySum(arrayMap(pair -> pair.1 * pair.2, arrayZip(arrayResize(grouped_numbers, n, 0), weights))) / n as ema' not divided by n,but arraySum(weights), like this: arraySum(arrayMap(pair -> pair.1 * pair.2, arrayZip(arrayResize(grouped_numbers, n, 0), weights))) / arraySum(weights) as ema

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
easy task Good for first contributors feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants