Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix signal events throw/catch race conditions in process instance scope #1590

Closed
wants to merge 10 commits into from

Conversation

igdianov
Copy link
Contributor

@igdianov igdianov commented Dec 2, 2017

Currently, throw/catch signal event handling in Activiti engine causes race conditions behavior to occur when throw/catch signals are used to synchronize parallel activities in the same process instance scope.

For example, after a parallel gateway, due to various task execution timings or non-deterministic execution path by Activiti engine, a signal on one path is emitted before subscribers on parallel paths can subscribe to it, causing subscribers to block and wait indefinitely:

image

To fix this problem, the PR introduces throw signal event registration in the process instance session execution context. This will allow subscribers to receive events at any time after the signal has been published, so the execution can proceed if the signal has been registered in the current session or, has already been recorded in the historic activity instance audit log in the previous sessions.

Global events handling stays the same as before, i.e. non-deterministic, because it requires global persistent event registration/cleanup mechanism in the command execution session to properly synchronize between parallel process instances for each tenant.

See test cases in Activiti Engine module for details: SignalEventSessionManagerTest.java

image

image

image

image

image

image

@CLAassistant
Copy link

CLAassistant commented Dec 2, 2017

CLA assistant check
All committers have signed the CLA.

@igdianov igdianov self-assigned this Dec 2, 2017
@igdianov igdianov requested a review from salaboy December 2, 2017 22:29
@igdianov igdianov added the bug label Dec 2, 2017
@igdianov igdianov requested a review from proth1 December 3, 2017 00:36
@salaboy
Copy link
Contributor

salaboy commented Dec 6, 2017

@daisuke-yoshimoto @balsarori can you check this issue? it is targeting 5.x. I'm not sure what the impacts of these changes are. I need to do an in-detailed review of the code in order to gain a deeper understanding.

In my head (without looking at the solution provided in this PR, and only looking at the issue description) the solution should be something like the following:

  1. signal catching events and signalling events should be deal with after closing the transaction, allowing all the paths to be finished before executing the reaction
  2. once the transaction is closed a new TX is created to deal with signalling and catching events and any impact that might cause

For Activiti 7 and cloud all events and signal catching will happen in an async way, so I'm expecting to see the previously described behavior.
If this PR provides an alternate solution to deal with the problem in version 5.x I'm ok, but I just want to state the future direction. @igdianov do you agree with my statements? @erdemedeiros it will be interesting to see your point of view on this matter.

@igdianov
Copy link
Contributor Author

igdianov commented Dec 7, 2017

@salaboy, The scheduling of throw signal events will work only partially. Please, consider the following scenarios:

  1. Currently, signal throwing behavior is executed like a short pulse broadcast, while subscriptions are created in lazy fashion in each transaction scope. This is the cause of unpredictable behavior. Deferring signal throwing execution to the next transaction will only help resolve subscribers in the current transaction scope. Any catch event signals in the future transaction scopes will be left waiting for signal indefinitely.

image

  1. Signal scheduling will introduce additional transaction cycle for each throw signal. This is not optimal for performance.

  2. Another problem of scheduling throw signals is related to interrupting boundary signal events on tasks or sub-processes. By deferring the signal throw execution, any activity will start without a chance to check for already thrown interrupting signal event in the current transaction. The execution will be completed and then cancelled (or missed) in the next transaction.

image

All these problems are solved in my PR by registering throw signals events as immutable facts in process instance scope to enable catch event subscribers query signal event by name as they enter execution. If the signal event is already registered, the catch signal behavior will leave the execution by taking outgoing transition in the same transaction. If there is no signal event yet registered, it will wait for the signal throw triggered by its subscription that may happen in the current or any future transactions.

This makes throw/catch signal behavior always predictable and useful in many applications. Besides, it passes all existing tests and does not require changes to database schema.

Copy link
Contributor

@salaboy salaboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do approve this changes. @erdemedeiros it will be great if you can review as well.

@salaboy
Copy link
Contributor

salaboy commented Dec 7, 2017

@igdianov totally agree with your comments, but for Cloud we will need to review our approach. These fix for 5.x is focused in the idea that the one that throws the event is the same that catches the event:

https://user-images.githubusercontent.com/20428629/33520617-6c719170-d773-11e7-8ad8-f3d578f1f4cb.png

Which I believe that your PR solved correctly and in a performant way (not adding unnecessary tx is the best way to go).

But imagine in that in a distributed setup (the ones that we are targeting in Activiti Cloud) this mechanism will work only for colocated processes (and even process instance scoped). We can have a chat about the future if you want, but I've approved this PR for 5.x. I asked @erdemedeiros to check it as well, so let's wait for his review.

@igdianov
Copy link
Contributor Author

igdianov commented Dec 7, 2017

@salaboy I agree. This PR addresses only processScope signal events. Global scope events are different beast. BTW, I did a quick spike to check scheduling of global signal events with Activiti execution job scheduler. It caused 14 existing tests to fail.... :(

Currently, globally scoped signal events are only useful to start another process....

public void registerThrowSignalEventByExecution(ActivityExecution execution, String eventName) {
String processInstanceId = execution.getProcessInstanceId();

synchronized (firedSignalEventsSesssionRegistry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the object to be locked too wide?
It is better to lock only each process instance.

In the current implementation, all signal throw processing is synchronized.

Copy link
Contributor Author

@igdianov igdianov Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daisuke-yoshimoto Each transaction will lazily create a separate instance of SignalEventSessionManager. Throw signal event registration is synchronized in the working memory of a single process instance session just in case if there are any concurrent thread executions in the current transaction. It is already locked by process instance by design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igdianov
If SignalEventSessionManager's instance is separate in one transaction, you do not have to synchronize in the first place? Only one thread works per transaction, is not it?

Copy link
Contributor Author

@igdianov igdianov Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daisuke-yoshimoto I put the synchronization block in registerThrowSignalEventByExecution as a precaution if Activiti spawns async executions in separate threads for the same transaction which may register throw signal events concurrently in the same SignalEventSessionManager instance. If it is always single thread per transaction, then the synchronization block is redundant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igdianov

If it is always single thread per transaction

Yeah. I recognize that it is such behavior.

activiti:async="true" && activiti:exclusive="true"

  • Each parallel execution executes in serial and separate transactions.
  • Although transactions are separate, control of in-memory Map is valid because the same threads are used.

activiti:async="true" && activiti:exclusive="false"

  • Each of the parallel executions is executed in separate transactions and separate threads.
  • In the case of a cluster environment , since the execution of each execution is not executed by the same Activiti Engine, in-memory map control should not work.

For details, please check this document.
https://www.activiti.org/userguide/#bpmnConcurrencyAndTransactions

@daisuke-yoshimoto
Copy link
Contributor

@salaboy @igdianov

The intention of this PR itself can be understood. By merging this PR, even if signalling events come before subscription registration, if signalling events and signal catch events are handled in the same thread(activiti:exclusive="true"), they will operate normally.

However, the current implementation has performance problems. In the current implementation, it will synchronize all processing of the IntermediateThrowSignalEventActivityBehavior.class. It is better to narrow the range to be synchronized to within the process instance.

@igdianov
Copy link
Contributor Author

igdianov commented Dec 7, 2017

@salaboy @daisuke-yoshimoto See my comment above...

@erdemedeiros
Copy link
Member

Hi @igdianov,
I do understand the problem related to this diagram, but in my opinion that's a problem in the design of the process. From the BPMN specification (page 253):

A BPMN Signal is similar to a signal flare that shot into the sky for anyone who might be interested to notice and then react.

My understanding is that only catch signals that has already subscribed (the execution token reached this flow element) should catch the signal send by the throw signal.
By putting the throw and the catch signal in parallel paths means that the order in which they are executed is not important:

  • if the catch signal is executed first and then it will catch the signal send by the throw signal.
  • if the throw signal is executed before the catch signal, the last one need to wait for the next occurence of this signal (which can comme from another process instance / definition). This seems a normal behaviour for me.

In a similar way, in this diagram I'd not expect Catch Signal 2 to actually catch the "Throw signal 2" as the catch signal was not yet listening when the signal was triggered.

In addition, as you mentioned the signals are handled accros process levels, so in my opinion there should not be any difference between catching a signal sent by the same process instance or sent by another process instance / definition. In other words, if inside a given process instance a catch signal can see the signals thrown in the past, a catch signal from another process instance / definition should also have a visibility about that signal occurred in the past.

Thoughts?

@igdianov
Copy link
Contributor Author

igdianov commented Dec 7, 2017

@erdemedeiros The shot in the sky signal definition is very ambigious. There is no clear definition how long it needs to stay in the sky before Batman takes notice that Gotham city is in trouble. Does it make sense to shoot in the sky and immediately vanish before making it noticeable for interested parties that declare explicit intent to receive the signal when they are ready?

Currently, when using signals in Activiti engine, the users need to care about transaction scopes and variable task execution timings in parallel paths or concurrent activities to handle signals predictably in the process definition.

The other problem is that the signal handling in the process execution depend on the sequence of the user mouse clicks in modelling tool to create correct BPMN elements order in XML. For example, if I prioritize the order of catch signal XML elements in the diagram, the signals will work. There is no way to tell the user that valid BPMN model with signals will not work at run-time because the XML element order in BPMN is wrong.

Please, consider the following process:

image

With pulse signal behavior, if the Variable Task takes longer to complete than other Fast Tasks on the parallel paths, the Finish Task after event gateway will never start. By making throw signal behavior register event in the process context, any subscriber will receive the event regardless of the transaction scope, task timings, or variable XML element order.

IMO, for predictable signal execution, the implementation must take care of handling real world applications and hide complexity from users how to synchronize parallel or concurrent executions using signals with a simple completable future concept.

There is a clear separation between processInstance and global scoped signals. The global signal event registration needs to extend processInstance behavior by scheduling signal registration broadcast to synchronize state with dependent process instances across cluster. Currently, global signals work predictably only for start process instance events.

@salaboy
Copy link
Contributor

salaboy commented Dec 8, 2017

@igdianov @daisuke-yoshimoto @erdemedeiros guys great conversation here.
I've seen very similar problems in my previous life and it all comes to:

  1. The BPMN Spec has a lot of grey areas and we need to make a decision
    a) as far as I remember in my previous project this was solved in the way proposed by @igdianov
    b) @erdemedeiros how was it solved in your previous project?
    c) having a non deterministic behaviour "enabled" by the tool (by a design flaw) is not good, so I consider this PR as a fix to that problem.
  2. How this can bite us in the future?
    a) @daisuke-yoshimoto do you still have performance concerns about this approach? In which situations this will impact the engine performance? I do see the synchronisation happening but in order for this to have a huge impact we should be using signal and catch quite intensively. In such scenario the bottleneck will appear around those synchronised methods. Is there something else that I'm missing on the performance side? Is this going to impact other use cases as well? I just want to make sure that @daisuke-yoshimoto (and all the Activiti 5.x community) and their use cases will be not impacted by a change like this.
    b) For Activiti Cloud, as I mention before, I'm more interested in having different modes for BPMN "messages" and signals, so we can handle process to process interactions even if they are living in different nodes in the cluster. This will push us to implement new Activiti Behaviours.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Dec 8, 2017

@igdianov I think it would help us to know whether you're hitting concrete cases where this is causing is a problem for you or if it's more about general usability concerns.

I think everyone accepts that there are problem areas in the BPM specification and the general advice with signals seems to be to avoid them and to prefer messages if possible. See http://mainthing.ru/item/328/ . Bruce Silver in BPM Method and Style also says to prefer Message, Error or Escalation and has a good example at p.109.

Personally I'm finding it hard to see concrete cases where you would need to use signals for within a process rather than just restructuring the sequence of tasks or using sub-processes with catches.

If the problem is with the specification then perhaps we can consult with Bruce Silver or others that have a more direct connection to the specification before we attempt a unilateral correction for ourselves. But if you have a specific use-case that you can't address right now then that's a different matter - then we would be best to talk about that specific use-case.

Of course we would simply accept an improvement if we were sure there were no negatives in the trade-off. Our concerns (@erdemedeiros and myself) are mostly that it could be inconsistent for signals to be handled within a process scope differently from in the global scope. Making the change just for local scope could encourage people to use signals in the same way in the global scope and they would be surprised to find race-conditions in the global scope. For example with processes like:

screen shot 2017-12-08 at 11 22 49

In the implementation we're also a bit concerned about the use of the historyService as we're not sure what behaviour one would get if one turned the historyService off. We wouldn't want the behaviour to differ depending on whether the history is enabled or not.

@erdemedeiros
Copy link
Member

The shot in the sky signal definition is very ambigious. There is no clear definition how long it needs to stay in the sky before Batman takes notice that Gotham city is in trouble

@igdianov I totally agree, there is no clear definition about how long the signal should stay in the sky. My point is, it doesn't matter how long it stays in the sky, as long it stays it should be visible by Batman not matter on which Gotham's neighbourhood he is (see https://user-images.githubusercontent.com/22754674/33764014-9967c49c-dc0a-11e7-895d-200d9d4b4a6a.png).

@daisuke-yoshimoto
Copy link
Contributor

@salaboy

do you still have performance concerns about this approach?
In which situations this will impact the engine performance?

I have no performance concerns any longer.
According to @igdianov's explanation, the scope of synchronization is a process instance.
So, there is no impact for performance.

However, in that scope, I thought that synchronization is unnecessary because it will not be accessed simultaneously from multiple threads.

@igdianov
Copy link
Contributor Author

igdianov commented Dec 8, 2017

@ALL Thank you for feedback and thought provoking discussion to help understand the problem from different perspectives. I learned something new from all you.

@daisuke-yoshimoto I am grateful for your code review and reference link to Activiti concurrency and transaction handling.

@erdemedeiros Activiti already defines how long the signal should be visible with scope attribute. It would be logical for the signal throwing behavior to resolve its value for any subscribers that declared explicit intent to receive the signal while it is visible in the declared scope when they are ready to receive it.

@ryandawsonuk I am working on a tool that automates BPMN process model generation. I discovered the signal race conditions behavior in our tests and raised the issue to help fix it. I understand and share your concerns about consistency of signal handling between different scopes and using activity history log that could be turned off in the configuration.

@salaboy Perhaps it would be better to reduce the scope of this PR to fix signal race conditions only within single transaction scope of the process instance. It will resolve the immediate issue at hand. Any other improvements to signal handling behavior can be dealt with in separate PRs.

@salaboy
Copy link
Contributor

salaboy commented Dec 10, 2017

@igdianov totally agree with your comments. Let's move forward this PR to make sure that "fix signal race conditions only within single transaction scope of the process instance".
All ok with that (@erdemedeiros @daisuke-yoshimoto @ryandawsonuk )?

@erdemedeiros
Copy link
Member

@ALL that sounds good to me.
That was definitely an interesting discussion!

@igdianov
Copy link
Contributor Author

igdianov commented Dec 11, 2017

@salaboy Sounds good! How do you prefer me to approach this? I can create another PR that fixes race conditions in transaction scope either by scheduling throw signal in the next transaction or within the same transaction.

I think scheduling throw signal behavior will provide consistent catching behavior across all signal catching behaviors, but carries a transaction cycle penalty. I sensed that this would preferred way in Activiti7.

@salaboy
Copy link
Contributor

salaboy commented Dec 11, 2017

@igdianov yes that sounds like a good approach. A new PR will be required, please.
Should I close this one?

@igdianov
Copy link
Contributor Author

@salaboy Yes, please, close this PR

@salaboy salaboy closed this Dec 12, 2017
@mteodori mteodori deleted the igdianov-fix-signal-events branch January 21, 2020 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants