
Compatibility with Entropies v2: passing ProbabilitiesEstimators is broken #183

Closed
kahaaga opened this issue Dec 27, 2022 · 16 comments · Fixed by JuliaDynamics/ComplexityMeasures.jl#229

@kahaaga (Member) commented Dec 27, 2022

@Datseris

Because input data are now explicitly required to construct a ProbabilitiesEstimator, passing ProbabilitiesEstimators to upstream methods is no longer possible. This completely got lost on me amid the many rapid changes of the past weeks.

This currently breaks all upstream methods (on the dev branch) that depend on ProbabilitiesEstimators. BUT, it is not a problem for CausalityTools v1.X, so no worries there.

Example

x, y = rand(100), rand(100)
est = ValueHistogram(FixedRectangularBinning(0, 1, 5))
mutualinfo(est, x, y)

The above won't work any longer, because the estimator requires input data. Behind the scenes, mutualinfo computes three separate entropies, each now requiring its own, uniquely initialised instance of ValueHistogram. Therefore, naively using ProbabilitiesEstimators directly, the user would have to supply three estimators, which makes no sense.
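To spell out what happens behind the scenes, here's a self-contained plain-Julia sketch of the three-entropy decomposition I(X; Y) = H(X) + H(Y) - H(X, Y) (illustrative only; no Entropies.jl involved, and all helper names are made up):

shannon_entropy(probs) = -sum(p * log(p) for p in probs if p > 0)

# Map a value in [0, 1) to one of `nbins` bin indices.
binindex(v, nbins) = clamp(floor(Int, v * nbins) + 1, 1, nbins)

# Histogram-based probabilities; works for scalars and for tuples (joint space).
function hist_probs(points, nbins)
    counts = Dict{Any,Int}()
    for p in points
        k = binindex.(p, nbins)  # broadcasts over tuples for joint data
        counts[k] = get(counts, k, 0) + 1
    end
    return collect(values(counts)) ./ length(points)
end

x, y = rand(100), rand(100)
Hx  = shannon_entropy(hist_probs(x, 5))
Hy  = shannon_entropy(hist_probs(y, 5))
Hxy = shannon_entropy(hist_probs(zip(x, y), 5))  # joint histogram
mi = Hx + Hy - Hxy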

I absolutely do not have time to do a re-write of anything Entropies-related in the foreseeable future, and I don't think we should, because it is in a very good state. So here's what I propose:

Solution

There are only three or four (I think) probabilities estimators that handle multivariate data. Therefore, I will make some basic CausalityTools-only estimators that just wrap relevant concepts from Entropies (names below are completely arbitrary):

abstract type DiscreteEstimator end

struct HistogramEstimator{B} <: DiscreteEstimator
    binning::B
    type::Symbol # e.g. :valuehistogram or :transferoperator
end

struct KernelEstimator{K} <: DiscreteEstimator
    kernel::K # indicates which type of kernel to use
end

and so on

These structs accept the relevant parameters and initialise the needed entropy estimators behind the scenes, removing any confusion on the user side and any need for the user to know about Entropies.jl.
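For instance, mutualinfo could then take the wrapper directly. A rough sketch (the ValueHistogram constructor argument order and the entropy call signature here are assumptions based on this thread, not documented behaviour):

function mutualinfo(est::HistogramEstimator, x, y)
    # The wrapper initialises the three required estimators behind the scenes:
    # one per marginal, and one for the joint dataset.
    xy = Dataset(x, y)
    est_x  = ValueHistogram(est.binning, x)  # argument order assumed
    est_y  = ValueHistogram(est.binning, y)
    est_xy = ValueHistogram(est.binning, xy)
    return entropy(est_x, x) + entropy(est_y, y) - entropy(est_xy, xy)
end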

This is not an ideal solution, and it might have to change in the long run, but I really don't have time to start fresh if I want to get any of my applications ready before the deadlines in February. We also need to focus on getting a working v2 of CausalityTools out, so workshop preparations can start.

Therefore, I'm almost tempted to veto anything but minor changes to the proposal in this issue. If we redo more stuff now, it will be the fourth time I'm rewriting everything 2.0-related based on Entropies changes, and I simply don't have more time to waste.

Take-away: introduce simple probability-type wrappers that are compatible with higher-level measures.
What do you think? Any immediate thoughts?

@kahaaga (Member, Author) commented Dec 27, 2022

To demonstrate the concept, I will implement it solely for the mutual information and conditional mutual information and keep everything else for future PRs. That will be more efficient.

@Datseris (Member)

The above won't work any longer, because the estimator requires input data

Fixed binning doesn't require input data; I am not sure what you are saying.

@Datseris (Member)

Is the problem that you need one binning for x or y, and another binning for the joint x and y histogram?

@Datseris (Member)

By the way, this was exactly what I was trying to tell you in JuliaDynamics/ComplexityMeasures.jl#107: to actually test the new interface in upstream packages, so that something like this would catch us not after we finished writing a new API, but while writing it, when we could still think of alternatives.

@Datseris (Member) commented Dec 27, 2022

An alternative is to revert the decision that estimators need a well-defined outcome space, and have the probabilities function instantiate the binning encoding, instead of the binning being a field of ValueHistogram. NaiveKernel should work out of the box, right? It only uses eachindex(x), which is the same for multidimensional data. Which other estimators do you use that wouldn't work out of the box?

Alternatively, just make a special dispatch method mutualinformation(bin::Binning, x, y) that makes two encodings, one for 1D and one for 2D data?
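Roughly like this (a sketch only; the RectangularBinEncoding constructor signature here is an assumption, not the documented API):

function mutualinformation(bin::RectangularBinning, x, y)
    enc1d = RectangularBinEncoding(bin, x)             # encoding for the marginals
    enc2d = RectangularBinEncoding(bin, Dataset(x, y)) # encoding for the joint
    # ...estimate the three probability distributions from the two encodings,
    # then combine the entropies as H(X) + H(Y) - H(X, Y).
end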

@kahaaga (Member, Author) commented Dec 27, 2022

By the way, this was exactly what I was trying to tell you in JuliaDynamics/ComplexityMeasures.jl#107: to actually test the new interface in upstream packages, so that something like this would catch us not after we finished writing a new API, but while writing it, when we could still think of alternatives.

That completely went over my head. Sorry about that.

Is the problem that you need one binning for x or y, and another binning for the joint x and y histogram?

Yes. To compute mutual information as a sum of entropies, one would explicitly have to provide three pre-initialised ValueHistogram instances: one for x, one for y, and one for the joint (x, y). This will happen behind the scenes anyway. However, as it is now, the user has no way of generically telling mutualinfo (or some other more complicated method) that they want to use ValueHistogram with a certain binning, because doing so also requires them to provide input data.
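Explicitly, that would be something like this (constructor argument order assumed from the candidate list in the error below):

b = RectangularBinning(4)
est_x  = ValueHistogram(b, x)
est_y  = ValueHistogram(b, y)
est_xy = ValueHistogram(b, Dataset(x, y))  # a third estimator, just for the joint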

Nope, just one. Estimators need input data to deduce outcome space, and not all estimators need that. I still don't get the problem.

The problem is that the following doesn't work any longer.

julia> est = ValueHistogram(RectangularBinning(4))
ERROR: MethodError: no method matching ValueHistogram(::RectangularBinning{Int64})
Closest candidates are:
  ValueHistogram(::RectangularBinning, ::Any) at ~/.julia/packages/Entropies/BQq7h/src/probabilities_estimators/value_histogram.jl:43
  ValueHistogram(::FixedRectangularBinning) at ~/.julia/packages/Entropies/BQq7h/src/probabilities_estimators/value_histogram.jl:47
  ValueHistogram(::Union{Real, Vector}, ::Any) at ~/.julia/packages/Entropies/BQq7h/src/probabilities_estimators/value_histogram.jl:42
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[3]:1

This is not a problem for FixedRectangularBinning, because input data are not required in that case.

Alternatively, just make a special dispatch method mutualinformation(bin::Binning, x, y) that makes two encodings, one for 1D and one for 2D data?

Yes, dispatching here in a clever way is the idea, so that the example above becomes a non-issue. However, we need to keep it generic. Dispatching on RectangularBinning is not sufficient, because RectangularBinning is not synonymous with ValueHistogram. It can also mean TransferOperator, or any other binning-based entropy estimator that gets implemented in the future.

Also, mutual info is not the only method that can use binning-based entropies. There is conditional mutual info, relative entropy, conditional entropy, extropy - the list goes on. So the extra code is then N_methods * M_variants more specialized dispatch methods. That will become a lot as the package grows.

Therefore we either need to:

  • revert the decision to demand input data for probabilities estimators, or
  • make simple wrappers here that eliminate the issue.

I propose doing the latter here.

@kahaaga (Member, Author) commented Dec 27, 2022

NaiveKernel should work out of the box, right?

Nope, not any longer.

julia> est = NaiveKernel(0.1)
ERROR: ArgumentError: NaiveKernel constructor requires input data as the first argument. Do `NaiveKernel(x, ϵ).`
Stacktrace:
 [1] NaiveKernel(ϵ::Float64; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Entropies ~/.julia/packages/Entropies/BQq7h/src/probabilities_estimators/kernel_density.jl:52
 [2] NaiveKernel(ϵ::Float64)
   @ Entropies ~/.julia/packages/Entropies/BQq7h/src/probabilities_estimators/kernel_density.jl:52
 [3] top-level scope
   @ REPL[5]:1

@Datseris (Member)

NaiveKernel should work out of the box, right?

It does work, but you do have to provide NaiveKernel(x, 0.1). x and y must be the same length anyway, and so will be their joint histogram, so actually calling the dispatch of probabilities(est, anything) will work with est = NaiveKernel(x, 0.1).

Anyway, let me think about this for a moment before implementing anything; there might be an easy solution.

@kahaaga (Member, Author) commented Dec 27, 2022

Which other estimators do you use that wouldn't work out of the box?

Any probabilities estimator, now or in the future, that requires input data to be instantiated will not work out of the box.

But that is not necessarily a problem in itself. For example, if any methods here are going to work with the spatial entropy estimators, we would have to figure out a way around it anyway, because those demand input data for performance reasons.

An alternative is to revert the decision that estimators need a well-defined outcome space, and have the probabilities function instantiate the binning encoding, instead of the binning being a field of ValueHistogram. NaiveKernel should work out of the box, right? It only uses eachindex(x), which is the same for multidimensional data.

I don't think we should revert this decision. It is elegant, intuitive and there are good reasons for keeping it this way.

We can easily solve the issue by providing wrappers here.

It does work, but you do have to provide NaiveKernel(x, 0.1). x and y must be the same length anyway, and so will be their joint histogram, so actually calling the dispatch of probabilities(est, anything) will work with est = NaiveKernel(x, 0.1).

Not all information methods necessarily demand equal-length input for the input variables.

@kahaaga (Member, Author) commented Dec 27, 2022

If we both brainstorm a bit, we should be able to come up with a good solution. Let's use mutual info as an example, because it extends easily to other methods.

I'll have to do this tomorrow though. My brain is fried, and I need to let this sink in and experiment a bit.

@kahaaga (Member, Author) commented Dec 27, 2022

I'm just listing potential solutions here as they come to mind. I have no opinion on which one (if any) is the best yet:

  • Decide that ProbabilitiesEstimators have a well-defined outcome space if they are initialised with input data; if not, the outcome space is not well-defined, but both construction methods work. This doesn't even need to change the public API for Entropies: we can simply allow the outcome space to be undefined, for upstream use here. But the only documented behaviour in the public API for Entropies is that you need to provide input data. Not sure how this translates to actual code, though (see the sketch below).
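Something like this, perhaps (a minimal self-contained sketch; MaybeDataEstimator and its fields are made-up names, not Entropies types):

abstract type ProbabilitiesEstimator end  # stand-in for the Entropies type

struct MaybeDataEstimator{B,X} <: ProbabilitiesEstimator
    binning::B
    data::X  # `nothing` if constructed without input data
end
MaybeDataEstimator(binning) = MaybeDataEstimator(binning, nothing)

# The outcome space is only well-defined when data was supplied at construction.
outcome_space(est::MaybeDataEstimator{<:Any,Nothing}) =
    error("Outcome space undefined: estimator was initialised without input data.")
outcome_space(est::MaybeDataEstimator) = sort!(unique(est.data))  # illustrative only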

@Datseris (Member)

Alright, I've thought about this extensively for the last hour, and the solution appears simple to me.

First some general comments:

(from #183 (comment)) But the only documented behaviour in the public API for Entropies is that you need to provide input data

I believe that during package development, downstream packages should use only the public API, as if they know nothing about the internals of a dependency. So, try to use a package as if you hadn't written it yourself. Over time this will lead to clearer design and more contributions.

I don't think we should revert this decision of estimators needing a well-defined outcome space. It is elegant, intuitive and there are good reasons for keeping it this way.

Well, as a user, I actually never found it elegant... I was always weirded out by having to provide input data twice in practically the same function call: entropy(ValueHistogram(x, 0.1), x). It's weird to give x twice, and doesn't seem very elegant. I thought it was a good idea when I first made this change...

The second question to ask is: are the reasons for making this change really that good...? I think arguments favor both approaches. On the one hand you have the specificity of the outcome space. On the other, one would argue that a probability estimator may be a more abstract concept. NaiveKernel highlights this well: you get as probabilities the nearest neighbors. This is a solid concept that shouldn't care about the length of the input trajectory.


I now realize that the change I implemented, of estimators forcing a known outcome space, was just a bad choice. It clearly clutters usage simplicity in realistic application scenarios, given the discussion in this issue.

So here is the solution I favor: we revert estimators demanding a well-defined outcome space. Instead, we shift the focus of a well-defined outcome space to the outcome_space and total_outcomes functions. They get a generic top-level dispatch

total_outcomes(est::ProbabilitiesEstimator, x) = total_outcomes(est)
total_outcomes(est::ProbabilitiesEstimator) =
    error("total_outcomes is not defined for this estimator alone; try total_outcomes(est, x) in case input data are needed to know the outcome space concretely.")

as before. This means that for estimators that have a well-defined outcome space by construction, things work as they do now. For the others, one has to extend the two-argument method. If you think about it, only two functions of Entropies.jl are actually affected by this: outcome_space and total_outcomes. Neither is as important as probabilities and entropy, which are the main functions used from this package.
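For instance (a self-contained sketch with made-up estimator types, just to show the two extension patterns):

abstract type ProbEst end  # stand-in for ProbabilitiesEstimator

total_outcomes(est::ProbEst, x) = total_outcomes(est)
total_outcomes(est::ProbEst) =
    error("outcome space needs input data; use total_outcomes(est, x)")

# Well-defined outcome space by construction: extend the one-argument method.
struct FixedBinningEst <: ProbEst
    nbins::Int
end
total_outcomes(est::FixedBinningEst) = est.nbins

# Data-dependent outcome space: extend the two-argument method instead.
struct DataBinningEst <: ProbEst
    binwidth::Float64
end
total_outcomes(est::DataBinningEst, x) =
    ceil(Int, (maximum(x) - minimum(x)) / est.binwidth)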

This will allow everything you want here to work.

The changes for implementing this solution are very small, and will take me at most half an hour, so I'll go ahead and do this now.

@Datseris (Member)

For ValueHistogram, it becomes like before: it doesn't instantiate a binning encoding unless it is called by probabilities. Naturally, one shouldn't loop over calls to probabilities with ValueHistogram then, but directly instantiate a binning encoding instead.

@kahaaga (Member, Author) commented Dec 28, 2022

The second question to ask is: are the reasons for making this change really that good...? I think arguments favor both approaches. On the one hand you have the specificity of the outcome space. On the other, one would argue that a probability estimator may be a more abstract concept.

I guess it seemed like a good idea in the heat of the moment to me too. But perhaps it got a bit too heated. I see both sides, but the practical argument should severely outweigh any theoretical considerations, unless there are extremely good reasons otherwise. The major pros and cons of this approach are, in my opinion:

  • Pros: A consistent way of getting the outcome space of ProbabilitiesEstimators.
  • Cons: Completely breaks the modularity of the envisioned information theoretic measure approach.

It seems obvious that it is better, both for users and package maintainers, to keep the simplicity of the modular approach. If that comes at the cost of having to manage two signatures of outcome_space, so be it. The latter is much less important in my eyes.

I now realize that the change I implemented, of estimators forcing a known outcome space, was just a bad choice. It clearly clutters usage simplicity in realistic application scenarios, given the discussion in this issue.

Yes. And this basic issue probably is only the start. I haven't even started trying to incorporate the meta-methods, such as the automatic embedding stuff and null hypothesis testing (e.g. surrogates).

@kahaaga (Member, Author) commented Dec 28, 2022

The changes for implementing this solution are very small, and will take me at most half an hour, so I'll go ahead and do this now.

Alright. I'll revert the state of the codebase again, and try out JuliaDynamics/ComplexityMeasures.jl#229.

@kahaaga (Member, Author) commented Dec 28, 2022

Ok, I fixed some issues in the PR you linked, and everything works again. Thanks for the effort!

kahaaga closed this as completed on Dec 28, 2022.