Allow graph-based expressions instead of trees #135

Merged: 12 commits, Oct 5, 2022


Owner

@MilesCranmer MilesCranmer commented Sep 25, 2022

This minimal change allows one to define an expression using a graph, rather than a tree. This is to allow for trees to use a component of an expression multiple times, rather than needing to evolve such an expression for each subtree. Basically, this will help make SymbolicRegression.jl much more powerful! It lets it re-use atomic expressions, without having to even change the core data structure.

Why this is useful

For example, consider the expression $$\cos(x - 3.2 y) + \cos(2 (x - 3.2 y))$$. Normally, you would have to represent this expression with the following tree (view this online to see the rendered version). This is clearly redundant: the subexpression $x - 3.2 y$ appears twice, so it costs more to evaluate and is assigned a higher complexity than seems correct.

graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus2[-]
minus2-->x2[x]
minus2-->mult2[*]
mult2-->three2[3.2]
mult2-->y2[y]

However, with this change, you could represent this expression as the following graph, in which both parents point to the same `-` node:

graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus1

The cool thing is that this doesn't really make any change to the core Node type. Since Node stores references to its children, all it needs to do is store a reference to the same child from two different parent nodes. The reason this doesn't work currently is that copy_node duplicates a child anytime it is referenced twice.

To preserve multiple parents referencing the same child, you basically need to do the copy operation with an IdDict that maps each original node to its copy. Whenever a node already exists in that IdDict, you simply return the existing copy, hence not duplicating child nodes (and preserving the topology)!
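The copy described above can be sketched in a few lines. This is a minimal Python illustration, not the package's Julia code: the `Node` class, `copy_tree`, and `copy_graph` are hypothetical names, and a plain `dict` keyed by `id(node)` stands in for Julia's `IdDict`.

```python
class Node:
    """Hypothetical expression node: a label plus up to two children."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def copy_tree(node):
    """Naive copy: a child reached through two parents is duplicated."""
    if node is None:
        return None
    return Node(node.label, copy_tree(node.left), copy_tree(node.right))

def copy_graph(node, memo=None):
    """Sharing-preserving copy: memo maps id(original node) -> its copy."""
    if node is None:
        return None
    if memo is None:
        memo = {}
    key = id(node)
    if key in memo:
        # Already copied via another parent: reuse that copy.
        return memo[key]
    new = Node(node.label)
    memo[key] = new
    new.left = copy_graph(node.left, memo)
    new.right = copy_graph(node.right, memo)
    return new

# cos(x - 3.2*y) + cos(2*(x - 3.2*y)), with the subtraction node shared.
shared = Node("-", Node("x"), Node("*", Node("3.2"), Node("y")))
expr = Node("+", Node("cos", shared), Node("cos", Node("*", Node("2"), shared)))

naive = copy_tree(expr)
kept = copy_graph(expr)

print(naive.left.left is naive.right.left.right)  # sharing lost in the naive copy
print(kept.left.left is kept.right.left.right)    # sharing kept with the memo
```

The extra dictionary lookup and insertion per node is one plausible source of the overhead mentioned under Caveats, although this sketch makes no attempt to measure it.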

What this PR doesn't do

See #113 (comment) for a roadmap.

  • Add any genetic operators that link two nodes in the same tree. So the search won't actually be able to exploit this new copy_node operation yet. But we could add that next.
  • Modify the complexity operation to take into account re-used subtrees.
  • Modify eval_tree_array (and similar) to do some sort of output caching when a subtree is re-used. It seems tricky to do this in an efficient way... Perhaps it would need to check beforehand which nodes are used by multiple parents, and then keep those evaluations in a cache.
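To make the last two items concrete, here is a rough Python sketch of what a sharing-aware complexity count and a cached evaluation could look like. Everything in it is an assumption made for illustration: the `Node` layout, the function names, and the scalar (rather than array) evaluation are not taken from SymbolicRegression.jl, and the cache stores every node instead of first detecting which nodes have multiple parents.

```python
import math

class Node:
    """Hypothetical node: op is "const", "var", or an operator name."""
    def __init__(self, op, left=None, right=None, value=None):
        self.op, self.left, self.right, self.value = op, left, right, value

def children(node):
    return [c for c in (node.left, node.right) if c is not None]

def full_complexity(node):
    """Tree-style count: a shared subexpression is counted once per use."""
    return 1 + sum(full_complexity(c) for c in children(node))

def compressed_complexity(node, seen=None):
    """Graph-style count: each distinct node object is counted once."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(compressed_complexity(c, seen) for c in children(node))

UNARY = {"cos": math.cos}
BINARY = {"+": lambda a, b: a + b,
          "-": lambda a, b: a - b,
          "*": lambda a, b: a * b}

def evaluate(node, env, cache=None):
    """Evaluate at one point; a node shared by several parents is computed once."""
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:
        return cache[key]
    if node.op == "const":
        out = node.value
    elif node.op == "var":
        out = env[node.value]
    elif node.op in UNARY:
        out = UNARY[node.op](evaluate(node.left, env, cache))
    else:
        out = BINARY[node.op](evaluate(node.left, env, cache),
                              evaluate(node.right, env, cache))
    cache[key] = out
    return out

# cos(x - 3.2*y) + cos(2*(x - 3.2*y)), with the subtraction node shared.
shared = Node("-", Node("var", value="x"),
              Node("*", Node("const", value=3.2), Node("var", value="y")))
expr = Node("+", Node("cos", shared),
            Node("cos", Node("*", Node("const", value=2.0), shared)))

print(full_complexity(expr), compressed_complexity(expr))
print(evaluate(expr, {"x": 1.0, "y": 0.5}))
```

For this expression the two counts correspond to the two diagrams above: the tree-style count matches the first diagram and the graph-style count matches the second.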

Caveats

It seems like copy_node with reference preservation is about 2x more expensive. Maybe this warrants a Julia discourse posting?


Of interest to @kazewong @ChrisRackauckas @PatrickKidger @CharFox1 @johanbluecreek @Jgmedina95 - let me know if you have any thoughts.

@johanbluecreek would you be up for a review of this?

@MilesCranmer MilesCranmer marked this pull request as ready for review September 25, 2022 00:59
@Jgmedina95

This is a really simple, yet cool idea. Thanks for tagging me. My only thoughts so far are more related to the effect on the genetic algorithm.
If a child can have two or more parents, a single mutation on that child would count double (or more)(?), completely removing that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing possibly valuable structures. In other words, a tree's structure changes less than a graph's under the same mutation.
If my understanding of the process is correct, something similar might happen during crossover. Not sure if this can affect the search capability of the algorithm.
This is my only thought so far; if anything else pops into my mind I'll post it :)

@MilesCranmer
Owner Author

> If a child can have two or more parents, a single mutation on that child would count double (or more)(?), completely removing that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing possibly valuable structures. In other words, a tree's structure changes less than a graph's under the same mutation.
> If my understanding of the process is correct, something similar might happen during crossover. Not sure if this can affect the search capability of the algorithm.

I think this is actually a feature of re-used subtrees. If one subexpression is used throughout an expression, then you only have to mutate/optimize some part of that subexpression once, rather than having to update every copy (which could take a while for the mutations to converge!). e.g., in the example I posted above - you only need to optimize one copy of the 3.2 coefficient, rather than two separate copies.

However, note that the algorithm wouldn't necessarily have to share subexpressions unless it was beneficial (to the complexity) to do so. Perhaps one genetic operator could be added which randomly breaks one of the shared nodes, by duplicating the shared part (i.e., going from the second diagram to the first diagram). In general this wouldn't impact the expressiveness in any way though, it would just let the search optimize the complexity by re-using common components.
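Both points can be illustrated with a small Python sketch. It is only an illustration under assumed names: `Node`, `to_string`, and the `break_sharing` operator are hypothetical and are not part of this PR. Mutating the shared coefficient updates every use at once, while breaking the link gives one parent a private copy (going from the second diagram back to the first).

```python
import copy

class Node:
    """Hypothetical expression node: a label plus up to two children."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def to_string(node):
    if node.left is None:
        return str(node.label)
    if node.right is None:
        return f"{node.label}({to_string(node.left)})"
    return f"({to_string(node.left)} {node.label} {to_string(node.right)})"

def break_sharing(parent, side):
    """Hypothetical operator: give `parent` a private deep copy of one child."""
    setattr(parent, side, copy.deepcopy(getattr(parent, side)))

coeff = Node(3.2)
shared = Node("-", Node("x"), Node("*", coeff, Node("y")))
mult3 = Node("*", Node(2), shared)
expr = Node("+", Node("cos", shared), Node("cos", mult3))

# One mutation of the shared coefficient changes both uses.
coeff.label = 3.5
after_shared_mutation = to_string(expr)
print(after_shared_mutation)

# After breaking the link, the two copies evolve independently.
break_sharing(mult3, "right")
coeff.label = 1.0
after_break = to_string(expr)
print(after_break)
```

Whether such an operator should fire often or rarely is a search-tuning question that this sketch does not address.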

@MilesCranmer
Owner Author

Roadmap to fully exploiting graph expressions here: #113 (comment)

@CharFox1
Contributor

This looks really useful! I would recommend making both the full and compressed complexity calculations available to the user, but only using the compressed one to guide the search. A related mutation that might be helpful (which you touch on in #113) could be to look for similar parts of an expression (by some metric) and combine them. This could also be useful for subexpressions that are just single nodes (for example, if you find a constant near pi in multiple places in your expression).
Maybe unnecessary, but this could probably be implemented such that you could have nested shared subexpressions (although they'd be increasingly unlikely to happen). Overall, fun idea with a lot of potential!
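As a rough illustration of the "combine them" idea, the Python sketch below merges subexpressions that are exactly identical in structure, which is the simplest possible metric. It does not attempt the fuzzier case of constants that are merely close to each other, and `Node`, `merge_duplicates`, and `count_unique` are hypothetical names rather than anything in the package. Nested sharing falls out naturally, because leaves are merged before the subtrees that contain them.

```python
class Node:
    """Hypothetical expression node: a label plus up to two children."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def merge_duplicates(node, table=None):
    """Rebuild the expression so structurally identical subexpressions
    become a single shared node object."""
    if node is None:
        return None
    if table is None:
        table = {}
    left = merge_duplicates(node.left, table)
    right = merge_duplicates(node.right, table)
    # Children are already canonical, so their identities define the key.
    key = (node.label, id(left), id(right))
    if key not in table:
        table[key] = Node(node.label, left, right)
    return table[key]

def count_unique(node, seen=None):
    """Number of distinct node objects reachable from `node`."""
    if seen is None:
        seen = set()
    if node is None or id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + count_unique(node.left, seen) + count_unique(node.right, seen)

def sub():
    # A fresh, unshared copy of (x - 3.2*y).
    return Node("-", Node("x"), Node("*", Node(3.2), Node("y")))

tree = Node("+", Node("cos", sub()), Node("cos", Node("*", Node(2), sub())))
merged = merge_duplicates(tree)

print(count_unique(tree), count_unique(merged))
```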

@johanbluecreek
Contributor

> @johanbluecreek would you be up for a review of this?

Sure. I can take a shot at reviewing this. Never done one before but I will try and figure it out :-)

@johanbluecreek johanbluecreek left a comment

Very interesting new feature this. A common thing in physics, at least, is to work in strange coordinates/choices of variables, and this feature (once mutation operations are added to allow this in the search) makes it possible for the search to find such a coordinate change only once, instead of every time it appears in an expression. So to me it looks to be a very useful feature.

I did not find any issues with the code; I just have some comments and questions in two places.


@johanbluecreek johanbluecreek left a comment

No further comments from me. I would recommend merging.
