Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust GaussianCDFEncoding and related stuff for new encoding API #199

Merged
merged 2 commits into from
Dec 20, 2022

Conversation

kahaaga
Copy link
Member

@kahaaga kahaaga commented Dec 20, 2022

I have now adjusted the codebase so that GaussianCDFEncoding obeys the encoding API. This addresses another point in #190.

  • Previously, GaussianCDFEncoding was only applicable to vector-valued input, meaning that we had to encode an entire timeseries in bulk. By adjusting the type from GaussianCDFEncoding(; c::Int = 3) to GaussianCDFEncoding(; μ, σ, c::Int = 3), we now require the user to explicitly specify the mean and standard deviation of the desired normal CDF. Thus, encode and decode now both handle scalar-valued inputs (i.e. no need for entire timeseries, because μ and σ must be pre-computed).
  • In the Dispersion struct, instead of explicitly constructing an encoding (i.e. GaussianCDFEncoding), we just use c::Int, which dictates how many bins to divide the range of the CDF into. However, we add a field encoding, which specifies a type of encoding, and can be any encoding type that accepts the keyword c. The default is GaussianCDFEncoding. It is now trivial to replace GaussianCDFEncoding with WhateverCDFEncoding, by just doing e.g. Dispersion(; encoding = WhateverCDFEncoding, c = 5). Internally, this will instantiate a WhateverCDFEncoding(; c = 5) in the relevant location in the code. However, the encoding field is not yet part of the public API. I don't think it should be public yet either, because I have an idea to make this even more generic.
  • Empirical mean and standard deviation computations for Dispersion are delegated to symbolize_for_dispersion, which also instantiates the encoding.
  • I let the linear mapping from the (0, 1) range of the CDF to the integers 1, 2, ..., c to just use an equidistant binning. This is essentially a FixedRectangularBinning(0, 1, c) (but without explicitly constructing a `RectangularBinEncoding, since it would have to be constructed once per encoding, which is expensive). This differs slightly from the original paper, but makes much more sense. This choice is now well documented, and makes much more intuitive sense than the previous mapping.

The alternative to this entire approach would be to extend the encoding API to be defined for vector-valued inputs, which I don't want to do, because it complicates things unnecessarily. I think the approach here is more elegant anyways, because it is now trivial to user some other encoding by just specifying the encoding keyword to Dispersion (again, this is not in the public API atm).

Other stuff:

  • Fixed some more NaiveKernel tests.

  • Test pass.

  • Documentation generation is successful.

@Datseris I will merge this immediately when CI is done, so I can finish up #126 (which depends on this stuff). If you have any comments, just leave them here (or open issues if necessary), and I will address them while working on #126.

@kahaaga kahaaga merged commit 93686aa into main Dec 20, 2022
@kahaaga kahaaga deleted the update_gaussiancdfencoding branch December 20, 2022 18:05
@kahaaga kahaaga mentioned this pull request Dec 20, 2022
6 tasks
@Datseris
Copy link
Member

it's okay, I have planned that we have to do a full review of the entire code base before publish anyways. I just haven't told you about this plan yet xD xD xD

@kahaaga
Copy link
Member Author

kahaaga commented Dec 20, 2022

I figured as much 🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants