MLJ Integration #3

Open
ablaom opened this issue Sep 16, 2020 · 10 comments

@ablaom

ablaom commented Sep 16, 2020

Continuing the discussion in #2 (and on slack). Some thoughts on what the issues might be.

As I understand it, a point on an arbitrary Manifold object does not generally know which manifold it belongs to, correct? This is fine as far as working with these points internally in your manifold-specific algorithms goes, but not ideal from the point of view of integration with the rest of the ML ecosystem.

The problem is roughly analogous to categorical variables. Internally these are usually represented as integers, but algorithms still need to know the total number of possible classes to avoid problems, such as certain classes disappearing on resampling. Passing this information around is not as easy as it first appears. Life is much easier (for a toolbox like MLJ) if we simply assume every point knows all the classes - and that is why we (and other packages) insist on the use of CategoricalArrays for representing such data (although ordinary arrays of some "categorical value" type would also have sufficed).

In the future, we might have algorithms which deal with mixed data types, one or more of which is a manifold type (think of geophysical applications), and having to keep track of metadata for a subset of variables gets messy.

So my tentative suggestion would be that MLJ users present input data for a supervised learning algorithm from the ManifoldML package as an abstract vector of "manifold points", where a "manifold point" combines the manifold to which the point belongs with some internal representation. This could be as simple as a tuple (M, p), for example. We define a new scientific type ManifoldPoint{M}, where M is the concrete manifold type, and declare scitype((M, p)) = ManifoldPoint{typeof(M)}. Then your input type declarations in the implementation of the MLJ interface would look something like:

input_scitype(::ManifoldKNNRegressor) = AbstractVector{<:ManifoldPoint{<:MetricManifold}}

And the rest would be straightforward, I should think.
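To make the proposal concrete, the scitype wiring above might be sketched as follows. This is a minimal, self-contained illustration: `Manifold`, `Sphere`, `ManifoldPoint`, and `my_scitype` are toy stand-ins, not existing ScientificTypes.jl or ManifoldsBase.jl definitions.

```julia
# Toy stand-ins for ManifoldsBase.jl types (assumptions, not the real API).
abstract type Manifold end
struct Sphere <: Manifold end

# Proposed scientific type: a point tagged with its concrete manifold type.
abstract type ManifoldPoint{TM<:Manifold} end

# A point presented as a tuple (M, p) gets its scitype from the first element;
# `my_scitype` is a placeholder for ScientificTypes.scitype.
my_scitype(::Tuple{TM,TP}) where {TM<:Manifold,TP} = ManifoldPoint{TM}

my_scitype((Sphere(), [0.0, 0.0, 1.0]))  # ManifoldPoint{Sphere}
```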

Other random thoughts:

  • Maybe there is some way to "decorate" existing manifolds to enforce the kind of point representation we want. I don't really understand this decorating business enough to say, or if this is really an advantage.

  • Maybe we want to refine the scitype to include the number_type as type parameter

@kellertuer @mateuszbaran Your thoughts?

@mateuszbaran
Member

I think this sounds like a good plan. We can definitely make a wrapper so that each point knows its manifold. In fact that's how my early prototypes worked, but it turned out to be very inefficient for many algorithms. An MLJ <-> Manifolds compatibility layer could, however, just unwrap the input and wrap the result; this is fine.
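A compatibility layer along those lines could be as small as the following sketch, assuming the tuple representation `(M, p)` discussed above (the helper names and the string stand-in for a manifold are hypothetical):

```julia
# Hypothetical wrap/unwrap helpers for an MLJ <-> Manifolds.jl layer.
wrap(M, p) = (M, p)              # attach the manifold when presenting data to MLJ
unwrap(x::Tuple) = x[2]          # recover the raw representation for internal algorithms
manifold_of(x::Tuple) = x[1]

# A fit method would unwrap once up front and work with raw representations:
X = [wrap("Sphere(2)", [0.0, 0.0, 1.0]), wrap("Sphere(2)", [1.0, 0.0, 0.0])]
ps = unwrap.(X)                  # raw arrays, efficient for the inner loops
M = manifold_of(first(X))
```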

  • Maybe there is some way to "decorate" existing manifolds to enforce the kind of point representation we want. I don't really understand this decorating business enough to say, or if this is really an advantage.

That decorator thing we have isn't particularly intuitive but works fine for our purposes. In this case, however, the representation needs to be enforced at a different level. We will figure something out.

  • Maybe we want to refine the scitype to include the number_type as type parameter

How would that information be used in the MLJ ecosystem? Manifolds.jl is quite good at figuring out types of temporaries and results from types of arguments.

@ablaom
Author

ablaom commented Sep 16, 2020

How would that information be used in the MLJ ecosystem? Manifolds.jl is quite good at figuring out types of temporaries and results from types of arguments.

We wouldn't need it. It would only be necessary if you can imagine an algorithm which would only work for manifolds with a given number_type. We only need to include what is necessary for you to articulate your requirements on the input.

@kellertuer
Member

kellertuer commented Sep 17, 2020

Thanks for your ideas.

Concerning the “a point does not know which manifold it belongs to” issue – I see a small efficiency problem in attaching the manifold to every point (though our manifolds usually hold only a few integers of information/storage). Maybe it would also be a good idea to store the manifold only with a batch of data? If we have a set of points (the training set, for example), they all live on the same manifold. Would that be possible?

Concerning the decorator – it might take a while to carefully understand the approach we follow there; the rough idea is as follows:
For a manifold many things are assumed to exist – for example the metric. When people speak of the sphere they have the round metric in mind, often without specifically thinking about it.
So we implemented the (default) sphere exactly like that.
If someone now wants to “break out” of this default assumption and implement another metric (yielding other geodesics and distances), most things stay the same – the manifold dimension, for example.
So one can decorate or wrap the default sphere in a MetricManifold, which can be used to dispatch the distance to its new implementation. Everything unrelated to a metric is just taken from the default implementation.

Concerning your idea of enforcing a point representation: that should actually be doable with <:MPoint and <:TVector types.
Even better – without storing the manifold explicitly, one could provide, for these specific point and vector types, a get_manifold(::MyRepresentationMPoint). What do you think? This would be more flexible than storing the manifold.
Of course we do not necessarily have to use our point/vector types; our implementations are more flexible. Still, from a type (and maybe the internal array size as a parameter) one can for sure determine the corresponding manifold.
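The get_manifold idea could look roughly like this. All names are hypothetical: a toy sphere point carrying its array size as a type parameter stands in for a real `MPoint` subtype from Manifolds.jl.

```julia
abstract type MPoint end
struct Sphere{N} end                       # toy manifold, stand-in for Manifolds.Sphere

# A typed point representation carrying the ambient array size as a parameter.
struct SpherePoint{N,T} <: MPoint
    value::Vector{T}
end

# The manifold is recovered from the type alone; no per-point storage needed.
get_manifold(::SpherePoint{N}) where {N} = Sphere{N - 1}()

get_manifold(SpherePoint{3,Float64}([0.0, 0.0, 1.0]))  # Sphere{2}()
```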

@mateuszbaran
Member

mateuszbaran commented Sep 17, 2020

Concerning the “a point does not know which manifold it belongs to” issue – I see a small efficiency problem in attaching the manifold to every point (though our manifolds usually hold only a few integers of information/storage). Maybe it would also be a good idea to store the manifold only with a batch of data? If we have a set of points (the training set, for example), they all live on the same manifold. Would that be possible?

Performance wouldn't be affected that much on Julia 1.5+ thanks to the memory layout changes for structs. I usually do care about performance and I don't think it would be a problem for this interface 🙂. I would be perfectly fine with something like

struct PointAndManifold{TP,TM<:Manifold} <: MPoint
    p::TP
    M::TM
end
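For illustration, such a wrapper might be used like the following self-contained sketch; the toy `Sphere` and the accessor names `base_manifold`/`representation` are assumptions, not existing API.

```julia
abstract type Manifold end
abstract type MPoint end
struct Sphere <: Manifold end    # toy manifold, stand-in for Manifolds.Sphere

struct PointAndManifold{TP,TM<:Manifold} <: MPoint
    p::TP
    M::TM
end

# Hypothetical accessors an MLJ layer could use to unwrap points.
base_manifold(x::PointAndManifold) = x.M
representation(x::PointAndManifold) = x.p

q = PointAndManifold([0.0, 0.0, 1.0], Sphere())
representation(q)                # [0.0, 0.0, 1.0]
```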

@kellertuer
Member

Then I am also fine with that variant, for sure.

@ablaom
Author

ablaom commented Sep 21, 2020

So there seem a few ways to move forward here:

  1. MLJ users provide input features as vectors of tuples of the form (p, M), where M is a manifold. We introduce a new scientific type ManifoldPoint{TM} and implement scitype((p,M::TM)) where TM<:Manifold = ManifoldPoint{TM}.

  2. A new struct is introduced as above (in the Manifolds.jl ecosystem), which is the type MLJ users present. I would reverse the order of the type parameters. Then the union type PointAndManifold{TM} could do double duty as a scientific type (no need to add one to ScientificTypes.jl) and we implement scitype(::PointAndManifold{TM}) where TM<:Manifold = TM. (For what it's worth, I would prefer the name ManifoldPoint or PointOnManifold to PointAndManifold.)

  3. We introduce both the new struct (in Manifolds*.jl) and a new scientific type (with different name).

  4. ?

@kellertuer @mateuszbaran Do you have a preference for how you want to proceed?

Side question: Do you have models where tangent vectors would be part of data presented by MLJ users? That is, do we need analogues of the above for tangent vectors?

@kellertuer
Member

I would prefer ManifoldPoint with variant 1,
and we could provide an easy way to use those, i.e. define exp(p::ManifoldPoint, X) = exp(p.M, p.p, X) and such for ease of use.
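Such forwarding methods might look like the following sketch on a toy circle manifold; `my_exp` stands in for `exp` (to avoid touching `Base.exp` here), and the wrapper layout is hypothetical.

```julia
abstract type Manifold end
struct Circle <: Manifold end    # toy manifold: points are angles in [0, 2π)

struct ManifoldPoint{TM<:Manifold,TP}
    M::TM
    p::TP
end

# Toy exponential map on the circle: follow the tangent direction X from p.
my_exp(::Circle, p::Real, X::Real) = mod2pi(p + X)

# Convenience forwarding: the wrapped point supplies its own manifold.
my_exp(p::ManifoldPoint, X) = my_exp(p.M, p.p, X)

my_exp(ManifoldPoint(Circle(), 0.5), 0.25)  # 0.75
```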

Concerning the tangent vectors – we also thought about that, and it's actually easy: a tangent vector X has to “know” its base point, but the tuple (p, X) is already a point on the tangent bundle, so again a point on a manifold. This is already implemented; it's a special case of a vector bundle, see https://juliamanifolds.github.io/Manifolds.jl/stable/manifolds/vector_bundle.html – we can surely highlight the tangent bundle more prominently.

@mateuszbaran
Member

I'm fine with either variant 1 or 2. It would be nice if users could just add tangent vectors wrapped in ManifoldPoint, or multiply them by scalars, but then the implementation has to be aware of our VectorBundle. I'm not quite sure where such methods should be defined in variant 1.

i.e. define exp(p::ManifoldPoint, X) = exp(p.M, p.p, X) and such for ease of use.

That may not be the best example because X in this interface would not be an array. exp could just work on elements of the tangent bundle, right?

@kellertuer
Member

Ah, but to distinguish that correctly we might need a TangentVector indeed? I don't think so: for p being a point on the manifold, X can be an array, that's fine; and for p being a point on the tangent bundle, X would be an array from the “product tangent space”, i.e. (TpM)^2, but could still be a simple array in your ProductRepr style?

@mateuszbaran
Member

OK, after some discussion on Slack the conclusion is that MLJ could just do variant 1 and we will work out details of integration on the Manifolds side.
