
Conversation

@MilesCranmer (Owner) commented Sep 25, 2022

This minimal change allows one to define an expression as a graph rather than a tree, so that an expression can re-use a component multiple times instead of having to evolve a separate copy for each subtree. Basically, this will make SymbolicRegression.jl much more powerful: it can re-use atomic subexpressions without any change to the core data structure.

Why this is useful

For example, consider the expression $$\cos(x - 3.2 y) + \cos(2 (x - 3.2 y))$$. Normally, you would have to represent this expression with the following tree (view this online to see the rendered version). This is obviously very redundant: the duplicated subtree costs extra evaluation time and inflates the reported complexity beyond what seems correct.

```mermaid
graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus2[-]
minus2-->x2[x]
minus2-->mult2[*]
mult2-->three2[3.2]
mult2-->y2[y]
```

However, with this change, you could represent this tree as follows:

```mermaid
graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus1
```

The cool thing is that this doesn't really require any change to the core Node type. Since Node stores references to its children, all it needs to do is store a reference to the same child from two different parent nodes. The reason this doesn't work currently is that copy_node duplicates a child any time it is referenced twice.

To preserve multiple parents referencing the same child, you basically need to do the copy operation with an IdDict mapping each current node to its copied node. Whenever a node already exists in that IdDict, you simply return the existing copy, hence not duplicating children (and preserving the topology)!
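As a minimal sketch (with a hypothetical stripped-down `SketchNode` type; the real `Node` in the package also carries operator and feature indices), the idea looks like this:

```julia
# Minimal sketch of the idea -- NOT the actual Node type from the package,
# which also stores operator indices, feature indices, etc.
mutable struct SketchNode
    degree::Int                       # 0 = leaf, 1 = unary, 2 = binary
    val::Float64                      # leaf payload (illustrative only)
    l::Union{SketchNode,Nothing}
    r::Union{SketchNode,Nothing}
end
SketchNode(val::Float64) = SketchNode(0, val, nothing, nothing)

# Copy a tree while preserving shared children: the IdDict maps each
# original node (by identity) to its copy, so a child referenced by two
# parents is copied exactly once, preserving the graph topology.
function copy_preserving(node::SketchNode,
                         id_map::IdDict{SketchNode,SketchNode}=IdDict{SketchNode,SketchNode}())
    haskey(id_map, node) && return id_map[node]
    copied = SketchNode(node.degree, node.val, nothing, nothing)
    id_map[node] = copied             # register before recursing into children
    node.l !== nothing && (copied.l = copy_preserving(node.l, id_map))
    node.r !== nothing && (copied.r = copy_preserving(node.r, id_map))
    return copied
end

# A child shared by two parents:
shared = SketchNode(3.2)
root = SketchNode(2, 0.0, SketchNode(1, 0.0, shared, nothing),
                          SketchNode(1, 0.0, shared, nothing))
c = copy_preserving(root)
@assert c.l.l === c.r.l               # sharing survives the copy
@assert c.l.l !== shared              # ...but it is a genuine copy
```

Registering the copy in the IdDict before recursing also keeps the function well-defined if a graph were ever to contain a reference back to an ancestor.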

What this PR doesn't do

See #113 (comment) for a roadmap.

  • Add any genetic operators which link two nodes in the same tree, so the search won't actually be able to exploit the new copy_node behavior yet. We could add that next.
  • Modify the complexity computation to take re-used subtrees into account.
  • Modify eval_tree_array (and similar) to do some sort of output caching when a subtree is re-used. It seems tricky to do this efficiently; perhaps one would need to check beforehand which nodes are used by multiple parents, and then keep those evaluations in a cache.
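The caching idea in the last bullet could look something like this (purely a sketch with assumed placeholder operators and a hypothetical `SketchNode` type, not the package's `eval_tree_array`):

```julia
# Sketch: first count how many parents reference each node, then cache the
# output of any node with more than one parent during evaluation.
mutable struct SketchNode
    degree::Int                       # 0 = leaf, 1 = unary, 2 = binary
    val::Float64
    l::Union{SketchNode,Nothing}
    r::Union{SketchNode,Nothing}
end

function count_uses!(counts::IdDict{SketchNode,Int}, node::SketchNode)
    seen = haskey(counts, node)
    counts[node] = get(counts, node, 0) + 1
    seen && return counts             # only descend the first time
    node.l !== nothing && count_uses!(counts, node.l)
    node.r !== nothing && count_uses!(counts, node.r)
    return counts
end

# Scalar evaluation with a per-call cache. The operators (unary = cos,
# binary = +) are placeholders purely for illustration.
function eval_cached(node, counts, cache=IdDict{SketchNode,Float64}())
    haskey(cache, node) && return cache[node]
    out = node.degree == 0 ? node.val :
          node.degree == 1 ? cos(eval_cached(node.l, counts, cache)) :
          eval_cached(node.l, counts, cache) + eval_cached(node.r, counts, cache)
    counts[node] > 1 && (cache[node] = out)   # only cache shared nodes
    return out
end
```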

Caveats

It seems that copy_node with reference preservation is about 2x more expensive. Maybe this warrants a Julia Discourse post?


Of interest to @kazewong @ChrisRackauckas @PatrickKidger @CharFox1 @johanbluecreek @Jgmedina95 - let me know if you have any thoughts.

@johanbluecreek would you be up for a review of this?

@MilesCranmer MilesCranmer marked this pull request as ready for review September 25, 2022 00:59
@Jgmedina95

This is a really simple yet cool idea, thanks for tagging me. My only thoughts so far relate to the effect on the genetic algorithm.
If a child can have two or more parents, a single mutation on that child would count double (or more?), which could completely remove that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing potentially valuable structures. In other words, a tree changes less than a graph under the same mutation.
If my picture of the process is correct, something similar might happen during crossover. I'm not sure whether this could affect the search capability of the algorithm.
That's my only thought so far; if anything else pops into my mind I'll post it :)

@MilesCranmer (Owner, Author)

> If a child can have two or more parents, a single mutation on that child would count double (or more?), which could completely remove that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing potentially valuable structures. In other words, a tree changes less than a graph under the same mutation.
> If my picture of the process is correct, something similar might happen during crossover. I'm not sure whether this could affect the search capability of the algorithm.

I think this is actually a feature of re-used subtrees. If one subexpression is used throughout an expression, then you only have to mutate/optimize some part of that subexpression once, rather than having to update every copy (which could take a while for the mutations to converge!). e.g., in the example I posted above - you only need to optimize one copy of the 3.2 coefficient, rather than two separate copies.

However, note that the algorithm wouldn't necessarily have to share subexpressions unless it was beneficial (to the complexity) to do so. Perhaps one genetic operator could be added which randomly breaks one of the shared nodes, by duplicating the shared part (i.e., going from the second diagram to the first diagram). In general this wouldn't impact the expressiveness in any way though, it would just let the search optimize the complexity by re-using common components.
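That sharing-breaking operator might be sketched as follows (hypothetical, not part of this PR; `SketchNode` and `copy_plain` are illustrative stand-ins for the package's types):

```julia
# Hypothetical mutation: give one parent its own private copy of a shared
# child, going from the shared-subtree graph back toward a plain tree.
mutable struct SketchNode
    degree::Int
    val::Float64
    l::Union{SketchNode,Nothing}
    r::Union{SketchNode,Nothing}
end

# An ordinary copy that duplicates everything (no IdDict), so the
# duplicated subtree can then evolve independently of the original.
copy_plain(n::SketchNode) = SketchNode(n.degree, n.val,
    n.l === nothing ? nothing : copy_plain(n.l),
    n.r === nothing ? nothing : copy_plain(n.r))

function break_sharing!(parent::SketchNode, side::Symbol)
    if side === :l
        parent.l = copy_plain(parent.l)
    else
        parent.r = copy_plain(parent.r)
    end
    return parent
end
```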

@MilesCranmer MilesCranmer linked an issue Sep 25, 2022 that may be closed by this pull request
@MilesCranmer (Owner, Author)

Roadmap to fully exploiting graph expressions here: #113 (comment)

@CharFox1 (Contributor)

This looks really useful! I would recommend having both the full and compressed complexity calculations available to the user, but only using the compressed one to guide the search. A mutation related to this that might be helpful (which you touch on in #113) could be to look for similar parts of an expression (by some metric) and combine them. This could also be useful for subexpressions that are just single nodes (for example, when you find a constant near pi in multiple places in your expression).
Maybe unnecessary, but this could probably be implemented such that you could have nested shared subexpressions (although they'd be increasingly unlikely to occur). Overall, a fun idea with a lot of potential!
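A sketch of the two complexity measures being discussed (an illustration with a hypothetical `SketchNode` type, not the package's actual complexity function):

```julia
# "Full" complexity revisits shared subtrees, counting every node reached;
# "compressed" complexity counts each distinct node only once, tracked by
# identity in an IdDict.
mutable struct SketchNode
    degree::Int
    val::Float64
    l::Union{SketchNode,Nothing}
    r::Union{SketchNode,Nothing}
end

full_complexity(n::Union{SketchNode,Nothing}) =
    n === nothing ? 0 : 1 + full_complexity(n.l) + full_complexity(n.r)

function compressed_complexity(n::Union{SketchNode,Nothing},
                               seen=IdDict{SketchNode,Bool}())
    (n === nothing || haskey(seen, n)) && return 0
    seen[n] = true
    return 1 + compressed_complexity(n.l, seen) + compressed_complexity(n.r, seen)
end
```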

@johanbluecreek (Contributor)

> @johanbluecreek would you be up for a review of this?

Sure, I can take a shot at reviewing this. I've never done one before, but I'll try to figure it out :-)

@johanbluecreek (Contributor) left a comment

This is a very interesting new feature. A common thing in physics, at least, is to work in strange coordinates or choices of variables, and this feature (once mutation operations are added to allow it in the search) opens up the possibility for the search to find such a coordinate change only once, instead of every time it appears in an expression. So to me this looks to be a very useful feature.

I did not find any issues with the code, I just have some comments and questions in two places.

@johanbluecreek (Contributor) left a comment

No further comments from me. I would recommend merging.

@MilesCranmer MilesCranmer merged commit 7c8fc02 into master Oct 5, 2022
@MilesCranmer MilesCranmer deleted the shared-nodes-2 branch October 5, 2022 17:55