Allow graph-based expressions instead of trees #135

MilesCranmer · 2022-09-25T00:59:05Z

This minimal change allows an expression to be defined as a graph rather than a tree. A graph can use one component of an expression multiple times, rather than needing to evolve that component separately for each subtree. Basically, this will help make SymbolicRegression.jl much more powerful! It lets the search re-use subexpressions without even having to change the core data structure.

Why this is useful

For example, consider the expression $$\cos(x - 3.2 y) + \cos(2 (x - 3.2 y))$$. Normally, you would have to represent this expression with the following tree (view this online to see the rendered version). This is obviously very redundant: the subexpression $x - 3.2 y$ is stored and evaluated twice, which costs more compute and gives a larger complexity than seems correct.

graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus2[-]
minus2-->x2[x]
minus2-->mult2[*]
mult2-->three2[3.2]
mult2-->y2[y]

However, with this change, you could represent this expression as the following graph, in which both parents point to the same minus1 node:

graph TD;
plus[+]-->cos1[cos]
cos1-->minus1[-]
minus1-->x1[x]
minus1-->mult1[*]
mult1-->three1[3.2]
mult1-->y1[y]
plus-->cos2[cos]
cos2-->mult3[*]
mult3-->two[2]
mult3-->minus1

The cool thing is that this doesn't really make any change to the core Node type. Since Node stores references to its children, all that is needed is for two different parent nodes to store a reference to the same child. The reason this doesn't work currently is that copy_node duplicates a child anytime it is referenced twice.

To preserve multiple parents referencing the same child, you basically need to do the copy operation with an IdDict that maps each original node to its copy. Whenever a node already exists in that IdDict, you simply return the existing copy, hence not duplicating children nodes (and preserving the topology)!
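As a rough illustration of the identity-keyed copy described above, here is a minimal Python sketch. It is not the Julia code from this PR: the `Node` class, `copy_tree`, `copy_preserving_sharing`, and `build_example` are hypothetical names, and a plain `dict` keyed by `id(node)` stands in for Julia's `IdDict`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # eq=False: nodes compare by identity, not by value
class Node:
    op: str  # operator name, variable name, or constant literal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def copy_tree(node):
    """Plain recursive copy: a child reachable from two parents is duplicated."""
    if node is None:
        return None
    return Node(node.op, copy_tree(node.left), copy_tree(node.right))


def copy_preserving_sharing(node, seen=None):
    """Copy that maps each original node (by identity) to exactly one copy."""
    if seen is None:
        seen = {}
    if node is None:
        return None
    key = id(node)
    if key in seen:
        # Already copied via another parent: reuse that copy.
        return seen[key]
    new = Node(node.op)
    seen[key] = new
    new.left = copy_preserving_sharing(node.left, seen)
    new.right = copy_preserving_sharing(node.right, seen)
    return new


def build_example():
    """cos(x - 3.2*y) + cos(2*(x - 3.2*y)) with the inner subtraction shared."""
    inner = Node("-", Node("x"), Node("*", Node("3.2"), Node("y")))
    return Node("+", Node("cos", inner), Node("cos", Node("*", Node("2"), inner)))
```

With `build_example()`, the topology-preserving copy keeps a single shared subtraction node under both cosines, while `copy_tree` produces two independent copies of it.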

What this PR doesn't do

See #113 (comment) for a roadmap.

- Add any genetic operators which link two nodes in the same tree. So the search won't actually be able to exploit this new copy_node operation yet. But we could add that next.
- Modify the complexity operation to take into account re-used subtrees.
- Modify eval_tree_array (and similar) to do some sort of output caching when a tree is re-used. It seems tricky to do this in an efficient way... Perhaps would need to check beforehand which nodes are used by multiple parents, and then keep those evaluations in a cache.
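The caching idea in the last item could look roughly like the following Python sketch, which first finds the nodes with more than one parent and then caches only their values. This is an assumption-laden illustration on scalars, not how eval_tree_array works: `Node`, `find_shared`, `evaluate`, and `eval_graph` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    op: str  # operator name, variable name, or constant literal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
UNARY = {"cos": math.cos}


def find_shared(root):
    """Return the ids of nodes that are referenced by more than one parent."""
    incoming = {}
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        for child in (node.left, node.right):
            if child is not None:
                incoming[id(child)] = incoming.get(id(child), 0) + 1
                stack.append(child)
    return {key for key, count in incoming.items() if count > 1}


def evaluate(node, env, shared, cache):
    """Evaluate a node, storing results only for nodes listed in `shared`."""
    key = id(node)
    if key in cache:
        return cache[key]
    if node.op in BINARY:
        value = BINARY[node.op](
            evaluate(node.left, env, shared, cache),
            evaluate(node.right, env, shared, cache),
        )
    elif node.op in UNARY:
        value = UNARY[node.op](evaluate(node.left, env, shared, cache))
    elif node.op in env:
        value = env[node.op]  # variable lookup
    else:
        value = float(node.op)  # constant literal
    if key in shared:
        cache[key] = value
    return value


def eval_graph(root, env):
    """Evaluate a graph expression, computing each shared node only once."""
    return evaluate(root, env, find_shared(root), {})
```

Restricting the cache to shared nodes keeps the bookkeeping small for ordinary trees, at the cost of one extra traversal before evaluation.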

Caveats

It seems like copy_node with reference preservation is about 2x more expensive. Maybe this warrants a Julia discourse posting?

Of interest to @kazewong @ChrisRackauckas @PatrickKidger @CharFox1 @johanbluecreek @Jgmedina95 - let me know if you have any thoughts.

@johanbluecreek would you be up for a review of this?

Jgmedina95 · 2022-09-25T03:06:11Z

This is a really simple, yet cool idea. Thanks for tagging me. My only thoughts so far are more related to the genetic algorithmic effect.
If a child can have two or more parents, a single mutation on that child would count double (or more)(?), completely removing that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing possibly valuable structures. In other words, a tree's structure changes less than a graph's under the same mutation.
If my understanding of the process is correct, something similar might happen during crossover. Not sure if this can affect the search capability of the algorithm.
This is my only thought so far; if anything else pops into my mind I'll post it :)

MilesCranmer · 2022-09-25T17:49:45Z

> If a child can have two or more parents, a single mutation on that child would count double (or more)(?), completely removing that "gene" from the tree. This would make these kinds of trees (now graphs) more susceptible to losing possibly valuable structures. In other words, a tree's structure changes less than a graph's under the same mutation.
> If my understanding of the process is correct, something similar might happen during crossover. Not sure if this can affect the search capability of the algorithm.

I think this is actually a feature of re-used subtrees. If one subexpression is used throughout an expression, then you only have to mutate/optimize some part of that subexpression once, rather than having to update every copy (which could take a while for the mutations to converge!). e.g., in the example I posted above - you only need to optimize one copy of the 3.2 coefficient, rather than two separate copies.

However, note that the algorithm wouldn't necessarily have to share subexpressions unless it was beneficial (to the complexity) to do so. Perhaps one genetic operator could be added which randomly breaks one of the shared nodes, by duplicating the shared part (i.e., going from the second diagram to the first diagram). In general this wouldn't impact the expressiveness in any way though, it would just let the search optimize the complexity by re-using common components.
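The "break a shared node" operator suggested above might be sketched in Python as follows. No such operator exists in this PR: `Node`, `copy_tree`, `shared_edges`, and `break_sharing` are hypothetical names, and the sketch mutates the graph in place for brevity.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    op: str  # operator name, variable name, or constant literal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def copy_tree(node):
    """Plain recursive copy that duplicates everything below `node`."""
    if node is None:
        return None
    return Node(node.op, copy_tree(node.left), copy_tree(node.right))


def shared_edges(root):
    """List (parent, side) edges whose child has more than one incoming edge."""
    incoming = {}
    edges = []
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        for side in ("left", "right"):
            child = getattr(node, side)
            if child is not None:
                incoming[id(child)] = incoming.get(id(child), 0) + 1
                edges.append((node, side))
                stack.append(child)
    return [(p, s) for p, s in edges if incoming[id(getattr(p, s))] > 1]


def break_sharing(root, rng=random):
    """Pick one edge into a shared node and point it at a fresh duplicate.

    Returns True if an edge was rewired, False if nothing is shared.
    """
    candidates = shared_edges(root)
    if not candidates:
        return False
    parent, side = rng.choice(candidates)
    setattr(parent, side, copy_tree(getattr(parent, side)))
    return True
```

Applied to the second diagram, one call rewires one of the two edges into the shared subtraction node, which yields the fully expanded tree of the first diagram.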

MilesCranmer · 2022-09-25T18:27:47Z

Roadmap to fully exploiting graph expressions here: #113 (comment)

CharFox1 · 2022-09-26T13:50:56Z

This looks really useful! I would recommend having both the full and compressed complexity calculations available to the user but only using compressed to guide the search. A mutation related to this that might be helpful (which you touch on in #113) could be to look for similar parts of an expression (by some metric) and combine them. This could also be useful for subexpressions that are just single nodes (for example you find a constant near pi in multiple places in your expression).
Maybe unnecessary but this could probably be implemented such that you could have nested shared subexpressions (although they'd be increasingly unlikely to happen). Overall, fun idea with a lot of potential!
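To make the distinction between the full and compressed complexity concrete, here is a small Python sketch. It assumes the simplest possible measure, one unit per node, which may not match the package's configurable complexity. `Node`, `full_complexity`, and `compressed_complexity` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    op: str  # operator name, variable name, or constant literal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def full_complexity(node):
    """Count nodes as if the graph were expanded into a tree."""
    if node is None:
        return 0
    return 1 + full_complexity(node.left) + full_complexity(node.right)


def compressed_complexity(root):
    """Count each distinct node once, however many parents reference it."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        stack.append(node.left)
        stack.append(node.right)
    return len(seen)
```

For the example expression from the PR description, the expanded tree has 15 nodes while the graph with the shared subtraction has 10, so under this measure the compressed complexity rewards the re-use.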

johanbluecreek · 2022-09-26T15:11:01Z

> @johanbluecreek would you be up for a review of this?

Sure. I can take a shot at reviewing this. Never done one before but I will try and figure it out :-)

johanbluecreek

This is a very interesting new feature. A common thing in physics, at least, is to work in strange coordinates or choices of variables, and this feature (once mutation operations are added to allow this in the search) lets the search find such a coordinate change only once instead of every time it appears in an expression. So to me it looks to be a very useful feature.

I did not find any issues with the code; I just have some comments and questions in two places:

src/Equation.jl

test/unittest.jl

johanbluecreek

No further comments from me. I would recommend merging.

MilesCranmer added 3 commits September 24, 2022 20:03

Create copy for Node which preserves topology

f164f47

Clean up syntax in copy_node

f46b723

Add test for topology-preserving copy

279c7f5

MilesCranmer marked this pull request as ready for review September 25, 2022 00:59

Fix formatting

dd6b1d5

MilesCranmer added 3 commits September 25, 2022 13:29

Clean up comments

f3c482b

Remove return statements from normal copy

de9a9d3

Use get! for topology-preserving copy

aabaefb

MilesCranmer mentioned this pull request Sep 25, 2022

[Feature] Reusing parts of equation #113

Open

Merge branch 'master' into shared-nodes-2

841f806

johanbluecreek reviewed Sep 27, 2022

MilesCranmer added 4 commits September 27, 2022 10:34

Remove unneeded constructor

c7dcd79

Fix convert for shared children nodes

2042a90

Fix formatting

14e65db

Merge branch 'master' into shared-nodes-2

dfbe663

johanbluecreek reviewed Oct 4, 2022

MilesCranmer merged commit 7c8fc02 into master Oct 5, 2022

MilesCranmer deleted the shared-nodes-2 branch October 5, 2022 17:55

MilesCranmer mentioned this pull request Feb 22, 2023

Possible extensions SymbolicML/DynamicExpressions.jl#14

Open
