Adjust eweights calculation to avoid precision issues #509

Merged
merged 10 commits into master from rf/normalize-eweights on Nov 19, 2021

Conversation

Member

@rofinn rofinn commented Aug 2, 2019

The modified function is equivalent to dividing all weights by the largest value.

julia> x = [
        0.3
        0.42857142857142855
        0.6122448979591837
        0.8746355685131197
        1.249479383590171
        1.7849705479859588
        2.549957925694227
        3.642797036706039
        5.203995766722913
        7.434279666747019
        ]
10-element Array{Float64,1}:
 0.3
 0.42857142857142855
 0.6122448979591837
 0.8746355685131197
 1.249479383590171
 1.7849705479859588
 2.549957925694227
 3.642797036706039
 5.203995766722913
 7.434279666747019

julia> x ./ last(x)
10-element Array{Float64,1}:
 0.04035360699999998
 0.057648009999999965
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.34299999999999997
 0.49
 0.7
 1.0

julia> using StatsBase
[ Info: Recompiling stale cache file /Users/rory/.julia/compiled/v1.0/StatsBase/EZjIG.ji for StatsBase [2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91]

julia> eweights(1:10, 0.3)
10-element Weights{Float64,Float64,Array{Float64,1}}:
 0.04035360699999998
 0.05764800999999997
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.3429999999999999
 0.48999999999999994
 0.7
 1.0
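For reference, the new scaled formulation boils down to something like the following (a minimal sketch with a hypothetical name, not the exact PR code, which also handles the `t`/`r` methods):

function eweights_scaled(n::Integer, λ::Real)
    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
    0 < λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
    # The largest weight is (1 - λ)^0 == 1.0, so all weights lie in (0, 1].
    [(1 - λ)^(n - i) for i in 1:n]
end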

@codecov

codecov bot commented Aug 2, 2019

Codecov Report

Merging #509 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage    90.1%   90.11%   +0.01%     
==========================================
  Files          21       21              
  Lines        2021     2024       +3     
==========================================
+ Hits         1821     1824       +3     
  Misses        200      200
Impacted Files Coverage Δ
src/weights.jl 84.69% <100%> (+0.25%) ⬆️


@rofinn rofinn requested a review from ararslan August 2, 2019 21:08
Member

@ararslan ararslan left a comment


Stylistic issues aside (I don't like spaces around ^ or explicit returns inside of do blocks), this seems fine, I guess?

@nickrobinson251

Why the double negation in (1 - λ) ^ (n - i) and not just λ^i?

Member Author

rofinn commented Aug 2, 2019

Cause those are different things? The goal was to minimize the potential impact on existing code. This approach simply scales the weights without changing the relative size, as noted above.

  • (1 - λ) ensures that λ holds the same meaning
  • (n - i) means we can still iterate forward through the elements.
julia> λ = 0.3
0.3

julia> [(1 - λ) ^ (10 - i) for i in 1:10]
10-element Array{Float64,1}:
 0.04035360699999998
 0.05764800999999997
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.3429999999999999
 0.48999999999999994
 0.7
 1.0

julia> [λ ^ i for i in 1:10]
10-element Array{Float64,1}:
 0.3
 0.09
 0.026999999999999996
 0.0081
 0.0024299999999999994
 0.0007289999999999998
 0.00021869999999999995
 6.560999999999998e-5
 1.9682999999999994e-5
 5.9048999999999975e-6

@ararslan ararslan requested a review from nalimilan August 2, 2019 22:09

nickrobinson251 commented Aug 2, 2019

  • (1 - λ) ensures that λ holds the same meaning

I strongly dislike this meaning. λ^i is much easier to reason about... every previous observation is λ times as important. (I don't mind about the forward/reverse ordering, whatever makes sense for the most common use-case)

Member Author

rofinn commented Aug 2, 2019

every previous observation is λ times as important

FWIW, since λ should always be between 0 and 1, I think this is a fairly intuitive way to think about that parameter:

- `λ::Real`: a smoothing factor or rate parameter such that ``0 < λ ≤ 1``.
  As this value approaches 0, the resulting weights will be almost equal,
  while values closer to 1 will put greater weight on the tail elements of the vector.

I don't think it's a big deal, but it seems weird to me that you'd want to have a lower λ for a faster decay.
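A quick way to see that behavior, reusing the comprehension from above (approximate values in the comments):

λ = 0.05; [(1 - λ)^(10 - i) for i in 1:10]  # smallest weight ≈ 0.63: nearly equal
λ = 0.9;  [(1 - λ)^(10 - i) for i in 1:10]  # smallest weight ≈ 1.0e-9: tail-heavy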

@nalimilan
Member

How does this fix precision issues? Note that this is breaking, so we need a strong rationale to change the existing definition. Is the new one more standard?

Member Author

rofinn commented Aug 6, 2019

How does this fix precision issues?

By limiting the range of values to between 0.0 and 1.0, we minimize the absolute floating-point error. It also gives users a more consistent range of weights to work with, regardless of λ or n.

Note that this is breaking, so we need a strong rationale to change the existing definition. Is the new one more standard?

That's true. Unfortunately, I've managed to find at least 3-4 different formulations, and it seems like the preference is largely dependent on the application/field. I'd argue that compressing the weights between 0.0 and 1.0 is better for the reasons mentioned above. Also, most people won't notice the change because the relative strength of the weights hasn't changed, so all the existing weighted statistics methods will return the same result:

julia> wv1 = eweights(1:10, 0.3)
10-element Weights{Float64,Float64,Array{Float64,1}}:
 0.04035360699999998
 0.05764800999999997
 0.08235429999999996
 0.11764899999999996
 0.16806999999999994
 0.24009999999999995
 0.3429999999999999
 0.48999999999999994
 0.7
 1.0

julia> wv2 = Weights([
               0.3
               0.42857142857142855
               0.6122448979591837
               0.8746355685131197
               1.249479383590171
               1.7849705479859588
               2.549957925694227
               3.642797036706039
               5.203995766722913
               7.434279666747019
               ])
10-element Weights{Float64,Float64,Array{Float64,1}}:
 0.3
 0.42857142857142855
 0.6122448979591837
 0.8746355685131197
 1.249479383590171
 1.7849705479859588
 2.549957925694227
 3.642797036706039
 5.203995766722913
 7.434279666747019

julia> X = rand(10, 5)
10×5 Array{Float64,2}:
 0.613516   0.478427  0.278105   0.971781   0.898878
 0.501727   0.48274   0.398964   0.555583   0.309536
 0.279534   0.431118  0.15033    0.871819   0.211906
 0.0699309  0.96875   0.307077   0.0186145  0.537571
 0.139164   0.75598   0.576918   0.10567    0.820161
 0.994798   0.936011  0.256395   0.0565743  0.789837
 0.766728   0.284988  0.310023   0.240307   0.867523
 0.943843   0.448423  0.913451   0.887282   0.83511
 0.827083   0.689755  0.932156   0.912732   0.860018
 0.534114   0.733628  0.0328417  0.447575   0.257983

julia> mean(X, wv1; dims=1)
1×5 Array{Float64,2}:
 0.674775  0.642861  0.457071  0.549602  0.626412

julia> mean(X, wv2; dims=1)
1×5 Array{Float64,2}:
 0.674775  0.642861  0.457071  0.549602  0.626412

julia> cor(X, wv1)
5×5 Array{Float64,2}:
  1.0       -0.305422   0.518979   0.445195   0.568971
 -0.305422   1.0       -0.254187  -0.326943  -0.325702
  0.518979  -0.254187   1.0        0.677978   0.800088
  0.445195  -0.326943   0.677978   1.0        0.246341
  0.568971  -0.325702   0.800088   0.246341   1.0

julia> cor(X, wv2)
5×5 Array{Float64,2}:
  1.0       -0.305422   0.518979   0.445195   0.568971
 -0.305422   1.0       -0.254187  -0.326943  -0.325702
  0.518979  -0.254187   1.0        0.677978   0.800088
  0.445195  -0.326943   0.677978   1.0        0.246341
  0.568971  -0.325702   0.800088   0.246341   1.0

If we want to consider this a fully breaking change then I'm open to switching the meaning of λ.
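To put a number on the precision argument: under the old unscaled formulation (λ * (1 - λ)^(1 - i), which matches the values in the description above), the largest weight grows without bound as n increases, while the scaled form is always capped at 1.0. A minimal sketch:

λ, n = 0.3, 200
maximum(λ * (1 - λ)^(1 - i) for i in 1:n)  # ≈ 2.0e30 with the unscaled form
maximum((1 - λ)^(n - i) for i in 1:n)      # == 1.0 with the scaled form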

@nalimilan
Member

OK. I don't have a strong preference, but can you check what's the most common behavior and definition of λ in other implementations?

Member Author

rofinn commented Aug 8, 2019

From what I can tell, the current meaning of λ (sometimes called α) is more standard?

The parameter λ determines the rate at which "older" data enter into the calculation of the EWMA statistic. A value of λ=1 implies that only the most recent measurement influences the EWMA (degrades to Shewhart chart). Thus, a large value of λ (closer to 1) gives more weight to recent data and less weight to older data; a small value of λ (closer to 0) gives more weight to older data. The value of λ is usually set between 0.2 and 0.3 (Hunter) although this choice is somewhat arbitrary. Lucas and Saccucci (1990) give tables that help the user select λ.

https://www.itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm

For any α between 0 and 1, the weights attached to the observations decrease exponentially as we go back in time, hence the name “exponential smoothing”. If α is small (i.e., close to 0), more weight is given to observations from the more distant past. If α is large (i.e., close to 1), more weight is given to the more recent observations.

https://otexts.com/fpp2/ses.html

A few other implementations:

@nalimilan
Member

OK, cool. What about the normalization to 0:1 that this PR introduces?

Member Author

rofinn commented Aug 9, 2019

I'm having a harder time finding a consistent pattern for that. Many algorithms seem to just have large numbers and others normalize the weights so that they sum to 1.0. I'm not sure which is the best approach, but I feel like having large floating point weights probably isn't a good idea if we can compress them. I could be convinced that having them sum to 1.0 is a good idea.

@nalimilan
Member

Mmm, hard to decide then. Is there disagreement among major implementations (R, NumPy...)? An alternative possibility is to have a keyword argument to normalize, but keep it false by default.

Member Author

rofinn commented Aug 12, 2019

The issue is that most other libraries just have an ewma-style method which combines the data with the weighting, rather than exposing independent weights. I feel like bounding the weights to 0:1 results in more understandable output and doesn't cost us anything, but maybe I'm biased by the fact that I think of eweights as being analytic/reliability weights. I'd be okay with using the keyword as a deprecation path though.

@nalimilan
Member

OK, as you prefer. Having a keyword argument sounds good, whatever the default we retain.

@Nosferican
Contributor

It is important to document how the sum of the weights is impacted by the normalization. If I'm not mistaken, the sum of the weights is not consistent and will depend on whether that weight type normalizes or not. For now, it should at least be mentioned in the docs.
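For reference, with the scaled form the sum of the weights is a finite geometric series rather than 1.0 or n, so it depends on both λ and n (a small check):

λ, n = 0.3, 10
sum((1 - λ)^(n - i) for i in 1:n)  # ≈ 3.2392, i.e. (1 - (1 - λ)^n) / λ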

Member Author

rofinn commented Aug 23, 2019

FWIW, the initial version of this, which I wrote for our internal code base, would produce weights that summed to 1.0. I think @ararslan was the one who thought that was wrong, so maybe I can find the discussion?

Member Author

rofinn commented Aug 23, 2019

Alright, I couldn't find any discussion, but the original code we had internally looked like this:

function ExponentialWeights(n::Integer, λ::Real)
    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
    0 < λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
    w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n)
    s = sum(w0)
    ExponentialWeights(w0 / s)
end
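As a sanity check, that sum-to-1 version and this PR's max-to-1 scaling carry identical relative weights; only the overall scale differs:

λ = 0.3
w0 = map(i -> λ * (1 - λ)^(1 - i), 1:10)  # the unscaled weights
w1 = w0 / sum(w0)                         # sums to 1.0 (the internal version)
w2 = w0 / maximum(w0)                     # max of 1.0 (this PR's scaling)
w1 ./ w1[end] ≈ w2 ./ w2[end]             # true: identical relative weights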

@kleinschmidt kleinschmidt added the breaking A breaking change that needs a semver major release label Feb 16, 2020
@nalimilan
Member

Do you want to merge this in the end or not? In the perspective of moving features to Statistics (https://github.com/JuliaLang/Statistics.jl/issues/87) I'm checking which breaking changes we may want to do (as after that it will be too late).

Member Author

rofinn commented Oct 1, 2021

Sorry, I wasn't clear on whether further consensus was needed for this PR. Sure, I can fix this up for a breaking release. I think it's probably still worth having both the values and the smoothing factor stay between 0 and 1, given the links I posted above. That being said, we could do something like what pandas does and accept multiple parameters: either the normal smoothing parameter 0 < α <= 1 (the ratio to increase by at each step) or a growth rate 𝑟 > 1, which is equivalent to 𝑟^i normalized by 𝑟^n (i.e., α = 1 - 1/𝑟). Supporting both might be a more breaking change, but it would also provide a clearer deprecation path.
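A quick check of that equivalence (variable names hypothetical, following the description above):

λ, n = 0.3, 10
r = 1 / (1 - λ)                    # growth rate; α = 1 - 1/r recovers λ
[(1 - λ)^(n - i) for i in 1:n] ≈ [r^i / r^n for i in 1:n]  # true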

@nalimilan
Member

AFAICT there are two different issues here, right?

  • Whether to normalize between 0 and 1. There doesn't seem to be any opposition to this, though it can't be merged before a breaking release (it would still be good to have the PR ready for the day we want to make such a release).
  • What parameterization to use for λ. Changing this would be really breaking, so instead what you propose about adding multiple keyword arguments like pandas sounds better. But this is better done in another PR, which could be merged before this one as it wouldn't be breaking at all.

Member Author

rofinn commented Oct 4, 2021

Alright, I've revised this PR to be non-breaking by using a scaled::DepBool=nothing kwarg. I'm just gonna rebase with master and resolve the conflicts.
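The deprecation hook could look something like this (a rough sketch of the DepBool pattern; the exact helper signature, message, and interim default here are assumptions, not the PR's actual code):

const DepBool = Union{Bool, Nothing}

function depcheck(fname::Symbol, b::DepBool)
    if b === nothing
        # Warn and keep the old (unscaled) behavior as the interim default.
        Base.depwarn("`scaled` will default to `true` in a future release; " *
                     "pass `scaled` explicitly to silence this warning.", fname)
        false
    else
        b
    end
end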

src/weights.jl Outdated
- https://en.wikipedia.org/wiki/Exponential_smoothing
```
"""
function eweights(t::AbstractVector{<:Integer}, λ::Real, n::Integer; scaled::DepBool=nothing)
Member

Maybe call this scale? We already use this term for transformations (https://juliastats.org/StatsBase.jl/latest/transformations/).

src/weights.jl Outdated
eweights(t::AbstractVector{<:Integer}, λ::Real)
eweights(t::AbstractVector{T}, r::StepRange{T}, λ::Real) where T
eweights(n::Integer, λ::Real)
eweights(t::AbstractVector{<:Integer}, λ::Real, [n::Integer]; scaled=false)
Member

Why do we need to allow passing n separately here? Isn't it always equal to length(t)?

Member Author

@rofinn rofinn Oct 4, 2021

That doesn't always hold for cases where you're skipping elements (i.e., [1, 2, 4, 5], but you want n to be 5). The most common use case for us is passing a set of actual dates t and specifying our expected date range r. The previous code wouldn't handle missing entries in t correctly, which is why I added this and didn't document it on the first pass (used with eweights(t, r, λ)).
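For instance, a sketch of that use case in terms of the comprehension form (using the same indexin composition mentioned further down this thread):

λ, t, r = 0.3, [1, 2, 4, 5], 1:5
n = length(r)                    # 5, even though length(t) == 4
idx = something.(indexin(t, r))  # positions of the observed entries in r
[(1 - λ)^(n - i) for i in idx]   # weights only at the observed positions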

Member Author

rofinn commented Oct 4, 2021

NOTE: Nightly seems to be failing for an unrelated reason.

src/weights.jl Outdated
@@ -227,37 +231,55 @@ For each element `i` in `t` the weight value is computed as:
As this value approaches 0, the resulting weights will be almost equal,
while values closer to 1 will put greater weight on the tail elements of the vector.

# Keyword arguments

- `scale::Bool`: Whether are not to return the scaled
Member

Incomplete sentence. Also repeat the default value.

src/weights.jl Outdated
Comment on lines 214 to 215
If an integer `n` is provided, weights are generated for values from 1 to `n`
(equivalent to `t = 1:n`).
Member

Suggested change
If an integer `n` is provided, weights are generated for values from 1 to `n`
(equivalent to `t = 1:n`).
If an integer `n` is provided and `t` is omitted, weights are generated for values from 1 to `n`
(equivalent to `t = 1:n`).

Should also describe what's the default value for n when t is provided.

BTW, the order of t, n and λ is inconsistent across signatures, but I guess we can't do much about it without breaking backward compatibility?

Member Author

@rofinn rofinn Oct 21, 2021

Hmm, good point. I added the (t, λ, n) method more out of convenience, but if you think it would be more consistent I could merge it with the existing (t, λ) method which just calls extrema to get the default n value?
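i.e., something like this sketch of the default:

lo, hi = extrema(t)  # span of the observed times
n = hi - lo + 1      # default number of past observations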

Member Author

Okay, I believe the signatures are the same now, apart from the scaled kwarg.

Member

OK, thanks. The value of n when only t is provided still needs documentation though.

Member Author

Sure. FWIW, I'm starting to dislike the n and r convenience constructors. The range syntax is pretty terse already, and something.(indexin(t, r)) doesn't seem that verbose.

Member

Feel free to make a deprecation PR if you would prefer dropping this in the next breaking release. Though I'm not sure they really hurt, as long as they don't introduce inconsistencies or ambiguities.

Member Author

No, they don't hurt, and they are logical extensions of the primary method; it's just that this docstring feels a little verbose now :)

Member Author

rofinn commented Nov 8, 2021

FWIW, I think we can remove the breaking label for this PR. I'm pretty sure we're just introducing a deprecation now.

src/weights.jl Outdated
Comment on lines 214 to 218
The integer value of `n` represents the number of past observations to consider, which defaults to:

1. `maximum(t) - minimum(t) + 1` if only `t` is passed in and the elements are integers
2. `length(r)` if a superset range `r` is passed in
3. `n` is explicitly passed instead of `t` (equivalent to `t = 1:n`)
Member

This describes default values for both n and t, so I suggest changing the wording a bit. For example:

Suggested change
The integer value of `n` represents the number of past observations to consider, which defaults to:
1. `maximum(t) - minimum(t) + 1` if only `t` is passed in and the elements are integers
2. `length(r)` if a superset range `r` is passed in
3. `n` is explicitly passed instead of `t` (equivalent to `t = 1:n`)
The integer value `n` represents the number of past observations to consider.
`n` defaults to `maximum(t) - minimum(t) + 1` if only `t` is passed in
and the elements are integers, and to `length(r)` if a superset range `r`
is also passed in.
If `n` is explicitly passed instead of `t`, `t` defaults to `1:n`.

@nalimilan nalimilan removed the breaking A breaking change that needs a semver major release label Nov 8, 2021
Member

@nalimilan nalimilan left a comment


Thanks, let's merge this and then we can file an issue/PR making the breaking change for the next breaking release.

Member Author

rofinn commented Nov 18, 2021

Okay, if there aren't any other comments I can merge this tomorrow. Do we typically bump the release version in the PR here? I also appear to have merge permissions, but I don't know if I can register the change?

@nalimilan
Member

Feel free to bump the patch version in the PR. Registering it should also work.

@rofinn rofinn merged commit 1678fd1 into master Nov 19, 2021
@rofinn rofinn deleted the rf/normalize-eweights branch November 19, 2021 21:06