-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix signal events throw/catch race conditions in process instance scope #1590
Conversation
@daisuke-yoshimoto @balsarori can you check this issue? it is targeting 5.x. I'm not sure what the impacts of these changes are. I need to do an in-detailed review of the code in order to gain a deeper understanding. In my head (without looking at the solution provided in this PR, and only looking at the issue description) the solution should be something like the following:
For Activiti 7 and cloud all events and signal catching will happen in an async way, so I'm expecting to see the previously described behavior. |
@salaboy, The scheduling of throw signal events will work only partially. Please, consider the following scenarios:
All these problems are solved in my PR by registering throw signals events as immutable facts in process instance scope to enable catch event subscribers query signal event by name as they enter execution. If the signal event is already registered, the catch signal behavior will leave the execution by taking outgoing transition in the same transaction. If there is no signal event yet registered, it will wait for the signal throw triggered by its subscription that may happen in the current or any future transactions. This makes throw/catch signal behavior always predictable and useful in many applications. Besides, it passes all existing tests and does not require changes to database schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do approve this changes. @erdemedeiros it will be great if you can review as well.
@igdianov totally agree with your comments, but for Cloud we will need to review our approach. These fix for 5.x is focused in the idea that the one that throws the event is the same that catches the event: https://user-images.githubusercontent.com/20428629/33520617-6c719170-d773-11e7-8ad8-f3d578f1f4cb.png Which I believe that your PR solved correctly and in a performant way (not adding unnecessary tx is the best way to go). But imagine in that in a distributed setup (the ones that we are targeting in Activiti Cloud) this mechanism will work only for colocated processes (and even process instance scoped). We can have a chat about the future if you want, but I've approved this PR for 5.x. I asked @erdemedeiros to check it as well, so let's wait for his review. |
@salaboy I agree. This PR addresses only Currently, globally scoped signal events are only useful to start another process.... |
public void registerThrowSignalEventByExecution(ActivityExecution execution, String eventName) { | ||
String processInstanceId = execution.getProcessInstanceId(); | ||
|
||
synchronized (firedSignalEventsSesssionRegistry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the object to be locked too wide?
It is better to lock only each process instance.
In the current implementation, all signal throw processing is synchronized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daisuke-yoshimoto Each transaction will lazily create a separate instance of SignalEventSessionManager. Throw signal event registration is synchronized in the working memory of a single process instance session just in case if there are any concurrent thread executions in the current transaction. It is already locked by process instance by design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@igdianov
If SignalEventSessionManager's instance is separate in one transaction, you do not have to synchronize in the first place? Only one thread works per transaction, is not it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daisuke-yoshimoto I put the synchronization block in registerThrowSignalEventByExecution
as a precaution if Activiti spawns async executions in separate threads for the same transaction which may register throw signal events concurrently in the same SignalEventSessionManager
instance. If it is always single thread per transaction, then the synchronization block is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is always single thread per transaction
Yeah. I recognize that it is such behavior.
activiti:async="true" && activiti:exclusive="true"
- Each parallel execution executes in serial and separate transactions.
- Although transactions are separate, control of in-memory Map is valid because the same threads are used.
activiti:async="true" && activiti:exclusive="false"
- Each of the parallel executions is executed in separate transactions and separate threads.
- In the case of a cluster environment , since the execution of each execution is not executed by the same Activiti Engine, in-memory map control should not work.
For details, please check this document.
https://www.activiti.org/userguide/#bpmnConcurrencyAndTransactions
The intention of this PR itself can be understood. By merging this PR, even if signalling events come before subscription registration, if signalling events and signal catch events are handled in the same thread(activiti:exclusive="true"), they will operate normally. However, the current implementation has performance problems. In the current implementation, it will synchronize all processing of the IntermediateThrowSignalEventActivityBehavior.class. It is better to narrow the range to be synchronized to within the process instance. |
@salaboy @daisuke-yoshimoto See my comment above... |
Hi @igdianov,
My understanding is that only catch signals that has already subscribed (the execution token reached this flow element) should catch the signal send by the throw signal.
In a similar way, in this diagram I'd not expect In addition, as you mentioned the signals are handled accros process levels, so in my opinion there should not be any difference between catching a signal sent by the same process instance or sent by another process instance / definition. In other words, if inside a given process instance a catch signal can see the signals thrown in the past, a catch signal from another process instance / definition should also have a visibility about that signal occurred in the past. Thoughts? |
@erdemedeiros The shot in the sky signal definition is very ambigious. There is no clear definition how long it needs to stay in the sky before Batman takes notice that Gotham city is in trouble. Does it make sense to shoot in the sky and immediately vanish before making it noticeable for interested parties that declare explicit intent to receive the signal when they are ready? Currently, when using signals in Activiti engine, the users need to care about transaction scopes and variable task execution timings in parallel paths or concurrent activities to handle signals predictably in the process definition. The other problem is that the signal handling in the process execution depend on the sequence of the user mouse clicks in modelling tool to create correct BPMN elements order in XML. For example, if I prioritize the order of catch signal XML elements in the diagram, the signals will work. There is no way to tell the user that valid BPMN model with signals will not work at run-time because the XML element order in BPMN is wrong. Please, consider the following process: With pulse signal behavior, if the IMO, for predictable signal execution, the implementation must take care of handling real world applications and hide complexity from users how to synchronize parallel or concurrent executions using signals with a simple completable future concept. There is a clear separation between processInstance and global scoped signals. The global signal event registration needs to extend processInstance behavior by scheduling signal registration broadcast to synchronize state with dependent process instances across cluster. Currently, global signals work predictably only for start process instance events. |
@igdianov @daisuke-yoshimoto @erdemedeiros guys great conversation here.
|
@igdianov I think it would help us to know whether you're hitting concrete cases where this is causing is a problem for you or if it's more about general usability concerns. I think everyone accepts that there are problem areas in the BPM specification and the general advice with signals seems to be to avoid them and to prefer messages if possible. See http://mainthing.ru/item/328/ . Bruce Silver in BPM Method and Style also says to prefer Message, Error or Escalation and has a good example at p.109. Personally I'm finding it hard to see concrete cases where you would need to use signals for within a process rather than just restructuring the sequence of tasks or using sub-processes with catches. If the problem is with the specification then perhaps we can consult with Bruce Silver or others that have a more direct connection to the specification before we attempt a unilateral correction for ourselves. But if you have a specific use-case that you can't address right now then that's a different matter - then we would be best to talk about that specific use-case. Of course we would simply accept an improvement if we were sure there were no negatives in the trade-off. Our concerns (@erdemedeiros and myself) are mostly that it could be inconsistent for signals to be handled within a process scope differently from in the global scope. Making the change just for local scope could encourage people to use signals in the same way in the global scope and they would be surprised to find race-conditions in the global scope. For example with processes like: In the implementation we're also a bit concerned about the use of the historyService as we're not sure what behaviour one would get if one turned the historyService off. We wouldn't want the behaviour to differ depending on whether the history is enabled or not. |
@igdianov I totally agree, there is no clear definition about how long the signal should stay in the sky. My point is, it doesn't matter how long it stays in the sky, as long it stays it should be visible by Batman not matter on which Gotham's neighbourhood he is (see https://user-images.githubusercontent.com/22754674/33764014-9967c49c-dc0a-11e7-895d-200d9d4b4a6a.png). |
I have no performance concerns any longer. However, in that scope, I thought that synchronization is unnecessary because it will not be accessed simultaneously from multiple threads. |
@ALL Thank you for feedback and thought provoking discussion to help understand the problem from different perspectives. I learned something new from all you. @daisuke-yoshimoto I am grateful for your code review and reference link to Activiti concurrency and transaction handling. @erdemedeiros Activiti already defines how long the signal should be visible with scope attribute. It would be logical for the signal throwing behavior to resolve its value for any subscribers that declared explicit intent to receive the signal while it is visible in the declared scope when they are ready to receive it. @ryandawsonuk I am working on a tool that automates BPMN process model generation. I discovered the signal race conditions behavior in our tests and raised the issue to help fix it. I understand and share your concerns about consistency of signal handling between different scopes and using activity history log that could be turned off in the configuration. @salaboy Perhaps it would be better to reduce the scope of this PR to fix signal race conditions only within single transaction scope of the process instance. It will resolve the immediate issue at hand. Any other improvements to signal handling behavior can be dealt with in separate PRs. |
@igdianov totally agree with your comments. Let's move forward this PR to make sure that "fix signal race conditions only within single transaction scope of the process instance". |
@ALL that sounds good to me. |
@salaboy Sounds good! How do you prefer me to approach this? I can create another PR that fixes race conditions in transaction scope either by scheduling throw signal in the next transaction or within the same transaction. I think scheduling throw signal behavior will provide consistent catching behavior across all signal catching behaviors, but carries a transaction cycle penalty. I sensed that this would preferred way in Activiti7. |
@igdianov yes that sounds like a good approach. A new PR will be required, please. |
@salaboy Yes, please, close this PR |
Currently, throw/catch signal event handling in Activiti engine causes race conditions behavior to occur when throw/catch signals are used to synchronize parallel activities in the same process instance scope.
For example, after a parallel gateway, due to various task execution timings or non-deterministic execution path by Activiti engine, a signal on one path is emitted before subscribers on parallel paths can subscribe to it, causing subscribers to block and wait indefinitely:
To fix this problem, the PR introduces throw signal event registration in the process instance session execution context. This will allow subscribers to receive events at any time after the signal has been published, so the execution can proceed if the signal has been registered in the current session or, has already been recorded in the historic activity instance audit log in the previous sessions.
Global events handling stays the same as before, i.e. non-deterministic, because it requires global persistent event registration/cleanup mechanism in the command execution session to properly synchronize between parallel process instances for each tenant.
See test cases in Activiti Engine module for details: SignalEventSessionManagerTest.java