
Can't interleave multiple edits/computes within an UndoContext #1971

Closed

danieldresser opened this issue Jan 30, 2017 · 13 comments · Fixed by #5362 or #5368
Labels
core Issues with core Gaffer functionality

Comments

@danieldresser
Collaborator

This came up while someone was trying to use the ImageSampler node from a Python script by repeatedly changing the sample position and querying the result. They found that the sample value appeared to be getting cached, and they would just get the same value over and over.

The simplest repro I've found is to create a Sphere, and run this:

script['Sphere']['radius'].setValue( 1 )
print script['Sphere']['out'].object( "/sphere" )["P"].data[0][2]

script['Sphere']['radius'].setValue( 2 )
print script['Sphere']['out'].object( "/sphere" )["P"].data[0][2]

The expected output is "-1" followed by "-2", but running it all at once in the Script Editor produces either "-1 -1" or "-2 -2". If you run it one line at a time, it works.

It looks to me like this is due to the hash cache. When you change the input plug, it should invalidate the hash cache, but according to this comment in the code:

// We can't clear the cache now, because it is most likely

... invalidating the hash cache is potentially deferred to another thread. This appears to mean that immediately querying the output could incorrectly pull the old cached hash.

One workaround: instead of changing the input plug from a script, drive the input plug from a context variable, and change the context variable in the script. This seems to work fine, but it's a bit complicated to explain why the obvious way doesn't work.
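
For illustration, a minimal sketch of that workaround. (My own example: the Expression node, the "radius" variable name, and the `setExpression` usage are assumptions based on Gaffer's scripting API; `script` is the ScriptNode from the repro above.)

import Gaffer

# Drive the input plug from a context variable instead of setting it directly.
expression = Gaffer.Expression()
script.addChild( expression )
expression.setExpression( 'parent["Sphere"]["radius"] = context["radius"]', "python" )

# Each query runs in its own context. Different contexts hash differently,
# so the stale hash cache entry from the previous query is never hit.
for radius in ( 1.0, 2.0 ) :
    context = Gaffer.Context( script.context() )
    context["radius"] = radius
    with context :
        print( script["Sphere"]["out"].object( "/sphere" )["P"].data[0][2] )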

danieldresser added the core label Jan 30, 2017
@johnhaddon
Member

It looks to me like this is due to the hash cache.

That sounds about right.

Invalidating the hash cache is potentially deferred to another thread. This appears to mean that immediately querying the output could incorrectly pull the old cached hash.

I don't think this is the problem. The hash cache is (currently) stored per-thread, so we're not deferring for another thread to clear the one used on this thread, we're just telling all threads "yo, clear yours before you use it next". Because you cannot edit graphs while concurrently running them, we know that no other thread is in the process of a compute while we're saying this.

If you run it one line at a time it works.

I think this is the clue to what's really happening. The ScriptEditor wraps the execution of the script within an UndoContext, so that a single undo event is generated for the whole script. The UndoContext in turn consolidates all the dirty propagation into a final step at the end (so observers of the graph don't recompute until all changes have been made). And it's only during the final dirty propagation that we invalidate the hash cache, which is why we're seeing stale cache values. As an aside, the UndoContext has nothing to do with Context, and should be renamed to UndoScope or some such thing to avoid confusion.
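
To make that concrete, the failure can be reproduced with an explicit scope, something like this (a sketch, using the `UndoContext` name as it was at the time):

import Gaffer

with Gaffer.UndoContext( script ) : # what the ScriptEditor does implicitly
    script["Sphere"]["radius"].setValue( 2 )
    # Dirty propagation - and the Plug::dirty() call that clears the hash
    # cache - is deferred until the scope closes, so this compute can hit
    # a stale cached hash and return the result for the old radius.
    print( script["Sphere"]["out"].object( "/sphere" )["P"].data[0][2] )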

I guess this has gone unnoticed until now because most code either mutates the graph (most likely within an UndoContext), or runs the graph (most likely not within an UndoContext). It seems that this is the first instance of someone trying to do both at the same time.

One workaround is instead of changing the input plug from a script, drive the input plug from a context variable, and change the context variable in a script. This seems to work fine, but it's a bit complicated to explain why the obvious way doesn't work.

Although it's a rare case as I've explained above, I think we should just fix this to work no matter what - as you say, it's confusing otherwise. I'm hoping this can be done by moving the plug->dirty() call from the emission stage to the insertion stage of the dependency tracking. I'm aiming to do a day's work tomorrow, so I can look at it then if you don't get a chance today.

@johnhaddon
Member

johnhaddon commented Jan 31, 2017

This came up while someone was trying to use the ImageSampler node from a Python script by repeatedly changing the sample position and querying it.

Out of interest, and totally tangentially, what was this for? It feels like this sort of loop might be quite slow in Python, so I'm wondering if the ImageSampler should have an additional array in/out pair to do multiple queries at once, or if we can provide some other API feature to improve things? Although actually, maybe they would have been better to use a Sampler directly instead of using a node?

johnhaddon changed the title from "Changing Input Plug Followed By Immediately Querying Output Plug Gives Inconsistent Results" to "Can't interleave multiple edits/computes within an UndoContext" Jan 31, 2017
@johnhaddon
Member

I've had a chance to take a look at this today, and although the basic problem is as I described above, the easy solution I was hoping for doesn't work. Here's what's going on :

  • We use Plug::dirtiedSignal() to notify observers of the graph (the UI basically) when individual plugs have been dirtied, so that the observers know they need to trigger a recompute to get up-to-date information.
  • When we change the value of an input plug, add or remove plugs, or connect plugs, we traverse downstream dirtying all the affected plugs.
  • In the case of making multiple edits, we use the DirtyPropagationScope and UndoScope classes to batch up graph edits so that we don't emit plugDirtiedSignal() until all edits have been completed. We don't want observers to make redundant intermediate computes or be exposed to a graph that is invalid because only a subset of the intended edits have been completed. Just prior to emitting the signal for a given plug, we also call Plug::dirty() for it, which is what the ValuePlug uses to clear the hash cache.
  • Deferring emission like this also allows us to make another important optimisation - when performing downstream traversals over dirty plugs, we can prune the traversal when revisiting a previously visited plug. This is essential for performance in graphs with many paths (see 39317b3).
  • This pruning is what prevents my trivial proposed fix from working - the second time around we prune the traversal immediately and never get the chance to insert a clearing of the hash cache. (See the sketch after this list.)
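
Here's a simplified Python model of that traversal, to illustrate the pruning. (This is my own sketch, not Gaffer's actual C++ implementation; all names are illustrative.)

def propagateDirtiness( plug, visited, deferred ) :
    if plug in visited :
        # Pruned : each plug is visited at most once per scope. Essential
        # for performance in graphs with many paths, but it means a second
        # edit never gets another chance to schedule the dirty() call that
        # would clear the hash cache.
        return
    visited.add( plug )
    deferred.append( plug ) # dirty() and plugDirtiedSignal() run at scope exit
    for output in plug.outputs() :
        propagateDirtiness( output, visited, deferred )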

As far as alternative approaches go, I've only got the following at the moment :

  1. Document that the DirtyPropagationScope/UndoScope classes are for scoping edits, and that all computation should be performed outside them (or at the very least, edits and computation should not be interleaved). This would actually feel perfectly natural to me if it weren't for the usage in the ScriptEditor which led to this bug report in the first place.
  2. Drop the use of virtual void Plug::dirty() for flushing the hash cache. The original intention behind this mechanism was twofold: it allowed any plug subclass to implement caching as it saw fit, and it allowed cache entries to be removed on a plug-by-plug basis as we propagated dirtiness, leaving the cache entries for unaffected plugs intact. In practice though, we've only implemented caching in the ValuePlug, and it clears all cache entries when any plug is dirtied. We could instead do some dirty great flushing of everything on every edit, independent of dirty propagation (sketched below).
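
Roughly, option 2 amounts to something like this (a hypothetical sketch; `hashCache` stands in for ValuePlug's internal cache, and none of these names are existing API):

hashCache = {}

def clearHashCache() :
    # The "dirty great flushing of everything" : no per-plug bookkeeping.
    hashCache.clear()

def setValueAndFlush( plug, value ) :
    plug.setValue( value )
    clearHashCache() # flush on every edit, independent of dirty propagation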

johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jan 31, 2017
johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jan 31, 2017
@johnhaddon
Member

johnhaddon commented Jan 31, 2017

For reference, here's the branch where I've been working on this. It contains a binding for DirtyPropagationScope (we didn't have one before), a test case for this issue, and a hack that fixes that test, though I didn't feel happy enough with it to present it as an option 3).

@danieldresser
Collaborator Author

If we go with 1), is there an easy way to escape the UndoContext in the Gaffer Script Editor if you want to do this? I think I would be OK with this as long as it's documented and there's something quick you could put in the Script Editor to get into a mode where you could do this if you want to.

( As for the current use case, we don't have anything productiony. An artist had a bit of down time and was teaching themselves more about Gaffer by trying some silly things. )

@johnhaddon
Member

johnhaddon commented Jan 31, 2017

If we go with 1), is there an easy way to escape the UndoContext in the Gaffer script editor if you want to do this?

There's nothing at the moment. It would be pretty easy to have some sort of "undoable-or-not" toggle button on the ScriptEditor, but I don't think I could explain that it's really an "i-want-to-interleave-edits-and-computes-in-one-execution" button with a straight face. I don't really want to resort to hacks of any sort for this, but if we do, I'd rather it was a behind-the-scenes one that didn't push the responsibility onto the user.

@danieldresser
Collaborator Author

danieldresser commented Jan 31, 2017

I wasn't thinking of a toggle, more like a script thing:

with ModifyAndComputeContext():
    doStuff()

or

with NotUndoableContext():
    doStuff()

It's not the cleanest, but it's also not the most common case. As long as there is something there for technical users on special occasions, I don't see a problem.

@johnhaddon
Member

Honestly, I'd be embarrassed if we had to push this responsibility on to the user. The branch I referenced above does have a workaround (albeit a bit rubbish), so unless I have a better idea soon, or we refactor the internals to make the fix more natural (and I suspect at the expense of flexibility in the future), I'd be more inclined to go with that.

@ivanimanishi
Member

This came up again for us, while trying to update dynamic boxes that themselves contain dynamic boxes: which box to load is defined by a computation, and that may change during the process.

That means that option 1 would not work for us, unfortunately.

@johnhaddon, how do you feel about option 2 now? It's been a long time since this issue was discussed.

@johnhaddon
Member

Option 2) isn't great. The recent refactor of the hash cache has actually put us in a situation where we could do partial invalidation of the cache (just for dirty plugs rather than clearing everything). I have a test branch with that on, and it's a decent win in terms of performance for interactive work. So I'm reluctant to lose that possibility.

@danieldresser-ie
Contributor

I still like the idea of something explicit. What if there was a ScriptNode.forceDirtyPropagation()?

You said you didn't want to make this the user's responsibility, John - but I'm not sure it's really that bad? It doesn't seem to come up very often, and it doesn't seem that bad to explain "If you need to alternate between changing things and computing them within an UndoScope, then you need to do a forceDirtyPropagation before you compute things". That seems better than hurting performance in any of the common cases.
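
Usage would look something like this (hypothetical: `forceDirtyPropagation()` doesn't exist, it's the API being proposed here):

import Gaffer

with Gaffer.UndoScope( script ) :
    script["Sphere"]["radius"].setValue( 2 )
    script.forceDirtyPropagation() # hypothetical : flush pending dirtiness
    print( script["Sphere"]["out"].object( "/sphere" )["P"].data[0][2] )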

Otherwise, I guess we need to start looking at whatever your option 3) was that you didn't want to talk about?

@johnhaddon
Member

I still feel like something explicit is a failure on my part. I'd need to look into it some more, but I think the only real overhead in an automatic forcing would be the TLS lookup in DirtyPlugs::local(). Since I've already consolidated most of our core TLS into ThreadState, I'm wondering if putting DirtyPlugs there too might be an option. Since Ivan has a workaround, I'm inclined to mull this a bit more rather than jump off the stuff I'm currently working on.

@johnhaddon
Member

This came up again in conversation today. I think perhaps one way of fixing things would be to have a way to request that all pending calls to Plug::dirty() are made now (but not flush DirtyPlugs so they will also be made later). Then ValuePlug could make that request just prior to triggering a compute.
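
A sketch of that idea (my own model; the names mirror the `flushDirtyPropagationScope()` that appears in the eventual fix below):

pendingDirty = [] # plugs dirtied inside the current scope, in order

def flushDirtyPropagationScope() :
    # Make all pending Plug::dirty() calls now, clearing their hash cache
    # entries, but leave the list intact so plugDirtiedSignal() still fires
    # for every plug when the scope finally closes.
    for plug in pendingDirty :
        plug.dirty()

def getValue( plug ) :
    flushDirtyPropagationScope() # ValuePlug would do this before computing
    return plug.compute() # placeholder for the real compute path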

johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 17, 2020
johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 7, 2023
Would be nice to see if we can also fix GafferHQ#1971 though.
ericmehl pushed a commit to ericmehl/gaffer that referenced this issue Jun 7, 2023
ericmehl pushed a commit to ericmehl/gaffer that referenced this issue Jun 19, 2023
ericmehl pushed a commit to ericmehl/gaffer that referenced this issue Jun 22, 2023
johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 28, 2023
…ionScopes

This means calling `ValuePlug::dirty()` before the scope closes, so that the hash cache is invalidated before the compute is performed.

This was first requested way back in 2017 in issue GafferHQ#1971, where the use case was edit/compute loops in the PythonEditor. It has become a high priority more recently though, as it turned out we were actually interleaving edits and computes in the TransformTools, and noticed incorrect results while working on the LightTool in GafferHQ#5218. The problem was as follows, and is demonstrated in `testMultipleSelectionWithEditScope` :

1. We push a dirty propagation scope using UndoScope.
2. We create a transform edit for `/cube1`, which dirties some plugs, incrementing their dirty count.
3. We create a transform edit for `/cube`, but as part of doing that we need to compute the _current_ transform. This "bakes" that transform into the (hash) cache, keyed by the current dirty count.
4. We propagate dirtiness for this new transform edit, but don't increment the dirty count again, because we've already visited those plugs in step 2.
5. After closing the scope, we query the transform for `/cube` and end up with the cached value from step 3. This is incorrect, because it isn't accounting for the edits made subsequently.

The problem is demonstrated even more simply by `ComputeNodeTest.testInterleavedEditsAndComputes`.

There is an overhead to calling `flushDirtyPropagationScope()` for each `getValue()` call, and it is measurable in `ValuePlugTest.testCacheOverhead()`. But taking into account the performance improvements made in GafferHQ#5362, we still come out with an 18% reduction in runtime even with the overhead.

Fixes GafferHQ#1971
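
(Editor's illustration of the dirty-count keying described in the commit message above - a simplified model; the real cache lives in ValuePlug's C++ internals, and all names here are mine.)

hashCache = {}

def computeHash( plug ) :
    ... # placeholder for the real hash computation

def cachedHash( plug ) :
    # The key includes the plug's dirty count. In step 4 the count is not
    # incremented (the plug was already visited), so step 5 hits the entry
    # "baked" in step 3 and misses the newer edit.
    key = ( plug, plug.dirtyCount )
    if key not in hashCache :
        hashCache[key] = computeHash( plug )
    return hashCache[key]
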
johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 28, 2023
johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Aug 15, 2023