Understanding correlation dimension #20

Closed

aleksejs-fomins opened this issue Oct 23, 2020 · 4 comments

Labels: epic (This is not a normal issue but an epic, containing multiple sub-issues)

Comments

@aleksejs-fomins

Hi. I installed your library (python3) and wrote a basic script to calculate the correlation dimension, plugging in some random parameters that came to mind:

import numpy as np
from nolds.measures import corr_dim

# 1000 samples of uncorrelated Gaussian noise
data = np.random.normal(0, 3, 1000)
corr_dim(data, emb_dim=5, debug_plot=True)

I get an answer of 0.01. What does that mean? I thought the correlation dimension is supposed to estimate the true dimension in which the points are located. Since an uncorrelated Gaussian blob should have full rank, I expected a result somewhere close to the embedding dimension, namely 5 in this case. Am I doing something wrong? When I try higher embedding dimensions, like 10, I get answers on the order of 10^{-15}; is this expected? Also, do you have a version of this algorithm that works for arbitrary dimensions?

Below I provide an algorithm I wrote myself to calculate the correlation dimension for arbitrary-dimensional inputs (no embedding dimension, just a 2D array as input). It behaves more like what I expect, but still underestimates the true dimension by a factor of ~2. How stable are such estimators in general?

import numpy as np
import matplotlib.pyplot as plt

# Return a flat 1D array of the off-diagonal entries of a square matrix
def offdiag_1D(M):
    idxs = ~np.eye(M.shape[0], dtype=bool)
    return M[idxs]

def corr_dim_homebrew(data2D):
    # Get all pairwise Euclidean distances in N-dimensional space
    nPoint, nDim = data2D.shape
    data3D = np.repeat(data2D[..., None], nPoint, axis=2).transpose((0, 2, 1))
    diff = offdiag_1D(np.linalg.norm(data3D - data3D.transpose((1, 0, 2)), axis=2))

    # Sort distances and compute the empirical correlation integral
    diffSorted = np.sort(diff)
    corrIntegral = np.arange(1, len(diff) + 1) / nPoint**2

    # Convert to log scale
    diffSortedLog = np.log10(diffSorted)
    corrIntegralLog = np.log10(corrIntegral)

    # Fit a line; its slope estimates the correlation dimension
    coeff = np.polyfit(diffSortedLog, corrIntegralLog, 1)
    plin = lambda x, c: (x, c[1] + c[0] * x)

    plt.figure()
    plt.plot(diffSortedLog, corrIntegralLog)
    plt.plot(*plin(diffSortedLog, coeff), 'r--')

    print("Estimated dimension", coeff[0])
    plt.show()

# aaa[...] is the caller's 2D data array of shape (nPoint, nDim)
corr_dim_homebrew(aaa[...])

Best,
Aleksejs

@CSchoel (Owner) commented Nov 7, 2020

Hi Aleksejs,

I have to admit that of all the measures in nolds, the correlation dimension is the one whose results I am least confident in interpreting. This is also reflected by the fact that I currently have no good unit tests against "real" applications for this measure. This is one of the TODOs for version 1.0.

It took me so long to answer because I wanted to at least run some small experiments so that I could give you meaningful insight. However, I am currently struggling to find the time for that, so I will just share my current understanding.

I wrote a test script calculating the correlation dimension following the Grassberger-Procaccia algorithm that is also implemented in nolds, but along the way I noticed some discrepancies between the nolds implementation and the Scholarpedia article, which was written by Grassberger himself. I found two potential sources of error in the nolds implementation (see the sketch after the list):

  • The GP article excludes self-matches, which nolds currently does not.
  • The N used for normalizing the sum should be the number of orbit vectors whose distances are compared. However, nolds uses the length of the input array rather than the number of orbit vectors. This should not introduce a large error, but it should be corrected nonetheless.
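
To make the two points above concrete, here is a minimal sketch of the correlation sum as I read it from the Scholarpedia article; the function name corr_sum_sketch and its lag parameter are my own illustrative choices, not part of the nolds API:

import numpy as np

def corr_sum_sketch(data, emb_dim, lag, eps):
    # Number of orbit (delay-embedding) vectors; this, not len(data),
    # is the N used in the normalization below.
    n = len(data) - (emb_dim - 1) * lag
    orbit = np.array([data[i:i + (emb_dim - 1) * lag + 1:lag] for i in range(n)])
    # All pairwise distances between orbit vectors
    dists = np.linalg.norm(orbit[:, None, :] - orbit[None, :, :], axis=2)
    # Strictly upper triangle: every pair counted once, self-matches excluded
    iu = np.triu_indices(n, k=1)
    return 2.0 * np.sum(dists[iu] < eps) / (n * (n - 1))

The correlation dimension is then the slope of log C(eps) against log eps over a suitable range of eps values.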

So in conclusion I think there is definitely something off here, which I will need to investigate further when I find the time. In the meantime, please let me know if you find out more about this issue or if you come across a nice test case with a known correlation dimension that I could use for debugging.

I'll also drop a quick note to @DominiqueMakowski here, since he uses the correlation dimension from nolds in NeuroKit2 and has proven to be excellent at spotting discrepancies between different implementations of fractal measures. 😉

@aleksejs-fomins (Author)

Hi Christopher,

Thanks for paying attention to this. For me it is not urgent; I just read about the correlation dimension and was fascinated by it. I really like the idea of metrics that can indicate that multivariate data may occupy a low-dimensional subdomain/manifold, especially when it is nonlinear and does not get picked up by PCA. Unsurprisingly, I also work in neuroscience at the moment: where else would you find high-dimensional time series that almost certainly implement some sophisticated nonlinear low-dimensional code (for reasons like error correction and speed), while at the same time there is little prior knowledge about how this is actually done, especially in highly plastic parts of the brain such as the cortex and hippocampus? So input from Dominique is highly appreciated :)

As I understand it, the tests for this metric should check convergence to the theoretical result, namely that the correlation dimension approaches the true dimension of the subdomain/manifold as the number of data points grows (a small data-generation sketch follows the list):

  • Data sampled from a full-rank distribution (such as a multivariate Gaussian ball) should result in an estimated dimension $\nu$ approximately equal to the dimension of the domain.
  • One can generate some low-dimensional data (e.g. on a 2D plane), rotate that plane into a high-dimensional space (e.g. 10D), and then add a small amount of 10D noise. That should result in a correlation dimension of approximately 2.
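
A minimal sketch of how such test data could be generated (the sample sizes, noise level, and variable names are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(42)

# Test case 1: full-rank 5D Gaussian ball -> expected correlation dimension close to 5
ball = rng.normal(size=(5000, 5))

# Test case 2: 2D plane rotated into 10D plus a small amount of 10D noise
#              -> expected correlation dimension close to 2
plane2d = rng.normal(size=(5000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # random orthonormal 2D basis in 10D
embedded = plane2d @ basis.T + 0.01 * rng.normal(size=(5000, 10))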

@CSchoel (Owner) commented Nov 8, 2020

Thank you very much for sharing your understanding and ideas for tests! That does indeed sound like a good approach. I will also go through the literature and see if there are plots and/or results that I can try to replicate. I will report back here when I have found the time to do so.

As a side note, since you also asked about the stability of these measures: from my limited experience I get the impression that one has to be very careful to come up with an objective and well-justified way of determining the parameters for these algorithms. You can get vastly different answers with different parameter settings, which allows you to tweak the parameters until you get a result that "feels" right, but that is of course not scientific or objective at all. This makes me quite skeptical of papers that use these algorithms but do not provide a rationale for their choice of parameter values.

One possible counter-measure is to look at intermediate results, especially the plot in which the line fitting takes place. In nolds this can be done by setting the parameter debug_plot to True. More often than not this will show that the data does not form a line at all, or that you can obtain two entirely different slopes depending on whether you focus on small or large values on the x-axis.
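
For example, one could inspect the debug plot first and then restrict the fit to the scaling region by passing an explicit range of radii. I believe corr_dim accepts an rvals parameter and nolds provides a logarithmic_r helper for this, but please treat the exact names and values below as assumptions to be checked against your nolds version:

import numpy as np
from nolds.measures import corr_dim, logarithmic_r

data = np.random.normal(0, 3, 1000)

# First look at the debug plot to find the region where log(C) vs log(r) is a straight line
corr_dim(data, emb_dim=5, debug_plot=True)

# Then restrict the fit to that region (rvals and logarithmic_r are assumptions
# about the nolds API; adjust the min/max radii to what the plot shows)
rvals = logarithmic_r(0.1, 10, 1.1)
corr_dim(data, emb_dim=5, rvals=rvals, debug_plot=True)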

@aleksejs-fomins (Author)

Yes, I experienced exactly what you describe: the curve was not a line. What follows is speculation, but I suspect it might be true. I think there may actually be a fundamental problem in the underlying theory. I assume the theory is correct in that $c \sim \epsilon^\nu$ for all reasonable $\epsilon$ given an infinite number of measurements. But what I suspect is that the rate at which $c$ converges to its asymptotic value strongly depends on the value of $\epsilon$. As a result, using all available values of $\epsilon$ for estimating $\nu$ is inefficient, as the fit is biased by the many regimes in which convergence has not yet been reached. I hope some theoretician pays attention to this metric and derives some kind of stability criterion, i.e. only fitting over a narrow range of $\epsilon$ where the estimates are expected to be most stable.
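
As an illustration of what such a restricted fit could look like with the homebrew approach above, one could simply fit only over a chosen window of distances; the window bounds here are hypothetical placeholders, not a derived stability criterion:

import numpy as np

def fit_dimension_in_range(diffSortedLog, corrIntegralLog, log_r_min, log_r_max):
    # Fit the log-log curve only where log10(distance) lies in [log_r_min, log_r_max]
    mask = (diffSortedLog >= log_r_min) & (diffSortedLog <= log_r_max)
    slope, intercept = np.polyfit(diffSortedLog[mask], corrIntegralLog[mask], 1)
    return slope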

@CSchoel CSchoel closed this as completed May 30, 2021
@CSchoel CSchoel added the epic This is not a normal issue but an epic, containing multiple sub-issues label Oct 29, 2023