
autocov incorrect? #273

Closed
tpapp opened this issue Jun 10, 2017 · 8 comments

@tpapp
Contributor

tpapp commented Jun 10, 2017

Consider an alternating sequence of 1 and -1, which has autocovariances 1 and -1 at even and odd lags.

using StatsBase
x = ones(20)
x[1:2:end] .= -1.0                      # broadcast assignment for the odd entries
ac = autocov(x, 0:length(x)-1; demean = false)

But

julia> showcompact(ac)
[1.0, -0.95, 0.9, -0.85, 0.8, -0.75, 0.7, -0.65, 0.6, -0.55, 0.5, -0.45, 0.4, -0.35, 0.3, -0.25, 0.2, -0.15, 0.1, -0.05]                                             

I wonder if one should divide by the number of elements in the calculation instead of the length of the whole sequence.
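For concreteness, here is a hand-rolled sketch of the two candidate denominators on the alternating sequence (with `demean = false`; the helper names `acov_n` and `acov_nk` are mine, not StatsBase API):

```julia
# Autocovariance at lag k of the alternating ±1 sequence, no demeaning,
# comparing the two candidate denominators.
function acov_n(x, k)        # StatsBase-style: divide by N
    N = length(x)
    sum(x[1:N-k] .* x[1+k:N]) / N
end

function acov_nk(x, k)       # divide by the number of pairs, N - k
    N = length(x)
    sum(x[1:N-k] .* x[1+k:N]) / (N - k)
end

x = [isodd(i) ? -1.0 : 1.0 for i in 1:20]
acov_n(x, 1)    # → -0.95: shrinks toward 0 as the lag grows
acov_nk(x, 1)   # → -1.0: matches the expected alternating pattern
```

Only the denominator differs; at lag 1 there are 19 pairs, each contributing -1, hence -19/20 versus -19/19.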

@tpapp
Contributor Author

tpapp commented Jun 10, 2017

I have looked at some textbooks, and both forms are in use (i.e. with denominators N and N-k, where N is the length of the sequence and k is the lag). The form with N has some nice definiteness properties when written as a matrix, but it is not unbiased. Some texts argue that since k << N, the difference should not matter. Variograms and the related literature tend to use N-k. Perhaps the documentation could clarify this.

@andreasnoack
Member

This is on purpose. I think it is the standard thing to do, e.g. it is the estimator in Brockwell and Davis and it is what R does. Notice that the N-k version isn't unbiased. In general, autocovariance (or any other time series) estimators aren't unbiased. PRs with improvements to the documentation are always welcome.

@baggepinnen

I noticed this weird (to me) behavior just today. The normalization has a large effect on estimates of the spectrum derived from the estimated ACF. Compare the two ACF estimates below:
[figure: the two ACF estimates]
The white line is the StatsBase behavior and the orange one is normalized by (N-k+1).
Estimating spectra from these two ACFs, we obtain the following result:
[figure: spectra derived from the two ACF estimates]
I find the normalization by N as opposed to N-k odd, since the function is named autocov but it does not give you the covariance between the signal and its lagged version. It's also not mentioned in the docstrings, so I think this issue deserves to remain open.

@andreasnoack
Member

What is the data generating process and what is the sample size in those examples?

@baggepinnen

T = 100                # number of samples
fs = 10                # sampling frequency
F = [1, 3]             # sinusoid frequencies
t = range(0, step = 1/fs, length = T)
y = sum(sin.(2π*f .* t .+ 2π*rand()) for f in F) .+ 0.1 .* randn.()

@quebbs

quebbs commented Oct 4, 2020

Brockwell and Davis do discuss this, and they propose the use of n in their covariance functions. They also mention n-k, which is still biased, but less so. For smaller datasets, this makes a significant difference. If you were doing this by hand, you would only have n-k pairs, so using n wouldn't make sense.
The question is whether retaining n to guarantee the positive definite covariance matrix is critical. I'm having to roll my own cov functions to allow the n-k approach.
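A hand-rolled version with the n-k denominator might look like the following (a sketch only; `autocov_nk` is my name, and it demeans once with the full-series mean, matching StatsBase's convention):

```julia
# Autocovariance with the n-k denominator, one value per requested lag.
# Demeans with the full-series mean before forming the lagged products.
function autocov_nk(x::AbstractVector, lags)
    n = length(x)
    d = x .- sum(x) / n
    [sum(d[1:n-k] .* d[1+k:n]) / (n - k) for k in lags]
end

autocov_nk([1.0, -1.0, 1.0, -1.0], 0:2)   # → [1.0, -1.0, 1.0]
```

Note the edge case: the largest admissible lag is n-1, at which point the estimate rests on a single pair.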

@ParadaCarleton
Contributor

The n-k-1 estimator is unbiased. The current implementation shrinks large lags toward 0 by dividing by n, which usually reduces mean squared error. The major problem with the n-k estimator is that it is not even consistent -- as the dataset grows, the number of lags estimated increases as well, and the second-to-last lag is always estimated from just two pairs, so its variance never shrinks (the last lag, with a single pair, is worse still). By contrast, dividing by n yields a consistent estimator whenever the true autocovariance goes to 0 (the usual case). The n-k version is especially bad when the goal is not particular lags but the variance of a time series' mean: in that case it will frequently give nonsensical estimates, e.g. that a series' total autocorrelation exceeds 1.
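The non-vanishing variance at the largest lags is easy to see with white noise, whose true autocovariance is 0 at every lag k ≥ 1 (a simulation sketch under my own naming; demeaning is skipped for simplicity):

```julia
using Random, Statistics

# Estimate lag k = n - 2 (only two pairs exist) over many white-noise
# series with each denominator, and compare the spread of the estimates.
function lag_spread(n, reps; rng = MersenneTwister(1))
    k = n - 2
    by_n  = Float64[]
    by_nk = Float64[]
    for _ in 1:reps
        x = randn(rng, n)
        s = x[1] * x[1+k] + x[2] * x[2+k]   # the two available products
        push!(by_n,  s / n)        # shrinks to 0 as n grows
        push!(by_nk, s / (n - k))  # spread stays ≈ √2/2 for any n
    end
    std(by_n), std(by_nk)
end
```

Since s is a sum of two independent products of standard normals, its standard deviation is √2; dividing by n-k = 2 leaves a spread of about 0.71 no matter how large n is, while dividing by n drives it to 0.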

The only real solution is something like #774, I think.

@ParadaCarleton
Contributor

Closed as duplicate of #774

@ParadaCarleton closed this as not planned (duplicate) on Oct 28, 2023