Enable Subscription to invoke callbacks sequentially and synchronously #1493
Conversation
…ssing incoming messages
… workers wait for a proper +1 sequence of messages before dispatching to callbacks
…ssage workers while subscription is active. Unified toggling sequential publishing to also automatically limit active message workers to 1. (Can't have sequential publishing if two threads can call callbacks out of order)
This pull request introduces 1 alert when merging 35bce2c into 4a2f13e - view the new alerts on LGTM.com.
Hi @AvenDonn, thanks for your contribution and the thorough explanations. Since your contribution could be a breaking change, we need to reserve some time to review and test. Please stay tuned...
This pull request introduces 1 alert when merging efc6f34 into 76afe82 - view the new alerts on LGTM.com.
Codecov Report

@@           Coverage Diff           @@
##           master    #1493   +/-   ##
=======================================
  Coverage   51.93%   51.93%
=======================================
  Files         307      307
  Lines       58332    58332
=======================================
  Hits        30292    30292
  Misses      28040    28040

Continue to review the full report at Codecov.
Hello @mregen, I have implemented your requested changes. I think there may be some missing coverage in testing this change for compliance with the protocol. However, I'm not sure it's very relevant to test arbitrary limits on max message workers.
This pull request introduces 1 alert when merging 349a20b into 76afe82 - view the new alerts on LGTM.com.
Hi @AvenDonn, currently our code coverage just tries to execute as much code as possible, and mostly does not exercise special cases. I added a case to …
This pull request introduces 1 alert when merging d30979e into 76afe82 - view the new alerts on LGTM.com.
Hey @mregen, sounds good, and glad you like the code. Glad to hear this is a candidate for the next release!
Hi @AvenDonn, while discussing the fix a question came up. Thanks,
Good question @mregen. The ability to control the number of worker tasks spawned by … Since I am providing the mechanism via a semaphore, it was trivial to provide both features via the same mechanism and the same relatively simple code. So why not?

However, this makes no sense for sequential publishing. I intended sequential publishing mode to be a promise to the client to receive callbacks in order. Thus parallelism makes no sense for it, and the worker count is capped at 1.

So maybe there are clients that wish to only limit the number of workers, without using sequential publishing and all of its caveats. (For example, sequential publishing only makes sense over high-reliability networks where loss of messages is a very rare occurrence, but limiting the number of workers applies to any client that may receive a high load of messages without wishing to dedicate so many tasks to processing them.)

If, however, you or others feel this feature unnecessarily complicates the mechanism, it can be trivially removed by just removing the public property, or removed entirely. Not very difficult to do if deemed necessary. Let me know the verdict and I'll remove it if needed.
Hi @AvenDonn, thanks for your explanation... We had a discussion here, and I think if we can reduce the test coverage for now, that's a good thing, so we can focus on testing only the sequential publishing case. Even better would be if you could provide a test case. At a certain point we would like to be in a state where we only accept contributions with a corresponding test. Way to go...
One more thing @AvenDonn: I was wondering what happens if a user sets …
@mregen I'm working on removing the max message workers feature now. I'm also going to see if I can write a test case. If I can mock a session to always provide messages out of order, I can verify sequential publishing fixes the order.

As for your question: tracking of the sequence happens in … When in sequential publishing mode, the condition to enter this branch is true only if the sequence number is at most +1 of the last seen sequence number. If sequential publishing was enabled mid-way, all messages lower than the always-tracked … This is done by this code:
In sequential publishing mode, both conditions can only be true if the sequence number is +1. |
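The referenced snippet did not survive into this transcript, so here is a hedged reconstruction of the two-condition gate described above. The method and parameter names (`CanDispatch`, `lastProcessed`, `sequentialPublishing`) are illustrative, not the actual identifiers in Subscription:

```csharp
using System;

static class SequenceGate
{
    // Sketch of the dispatch gate described above. The names are
    // hypothetical; only the two-condition logic mirrors the discussion.
    public static bool CanDispatch(
        uint sequenceNumber, uint lastProcessed, bool sequentialPublishing)
    {
        // Existing condition: the message must be newer than the last
        // one already dispatched to callbacks.
        bool isNew = sequenceNumber > lastProcessed;

        if (!sequentialPublishing)
        {
            return isNew;
        }

        // Extra condition in sequential publishing mode: at most one
        // ahead of the last seen sequence number. Both conditions are
        // only true together when the number is exactly lastProcessed + 1.
        return isNew && sequenceNumber <= lastProcessed + 1;
    }
}
```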
Hello @mregen. I have removed the arbitrary message workers control feature, and I have also attempted to include some kind of unit test. I am not satisfied with what I came up with, as it is not deterministic and simply relies on running long enough for an out-of-sequence callback to happen purely because callbacks take a long time. This test case does somewhat cover the multi-threading scenario that causes the issue, but it doesn't illustrate the benefit of the strict +1 sequence number enforcement very well. I'm not sure what the best course of action is here. If you feel the test is too convoluted, please remove it; I've made it an easily revertible dedicated commit.
This pull request introduces 2 alerts when merging db666fd into 2b54136 - view the new alerts on LGTM.com.
I debugged through the new code and I think it's all working as expected, with minimal risk of regression. I may update the tests slightly, as they seem to be of little value when we cannot inject publish requests with a non-consecutive sequence number pattern. Great work @AvenDonn!
Actually the new test is quite handy to validate high load and fast data callbacks... |
This pull request introduces 1 alert when merging 6cd2de8 into 2b54136 - view the new alerts on LGTM.com.
Hi @AvenDonn, there is still some flakiness with the test case: on macOS we sometimes seem to hit the trigger condition, which means there was a non-sequential callback. E.g. here: https://opcfoundation.visualstudio.com/opcua-netstandard/_build/results?buildId=2340&view=results Any idea?
Hey @mregen, one of your changes "broke" a metric I'm relying on: the number of outstanding message workers. It did that because it disabled publishing before measuring it. That's not related to why the test is failing, though. I see you've also removed the Thread.Sleep I had added in the callback code. It was meant to slow down the callback enough that several callback invocations would be running simultaneously, which is what was supposed to cause the failure. I see each subscription tracks its items independently, so this isn't some odd collision of client handles. But I wouldn't put it past some kind of aliasing issue coming from the server. Could you please test more with the original test I submitted?
Thanks @AvenDonn, I will check your original tests again. They were definitely running for too long, and it was hard to tell whether they 'caught' an issue, so I basically repurposed them as a sort of load test. For that, the sleeps are counterproductive.
Thanks @mregen. The server is not really the problem, and adding more subscriptions won't increase the chance of reproducing, since the ordering is on a per-subscription basis anyway. The trouble we encountered came from the client, and most likely stemmed from concurrent execution of callbacks and possibly weird context switching and lock handling.
Pertaining to the discussion started in issue #957, this pull request adds the capability for Subscription to invoke its MonitoredItem and FastDataChange callbacks sequentially, maintaining proper order. The key goal is to avoid out-of-order monitored item notifications. To this effect, the Sequential Publishing capability is added to Subscription.
Resolves #957
SequentialPublishing
Subscription.SequentialPublishing (default: false, as before) is a toggle which forces Subscription to raise callbacks only in sequential order by the SequenceNumber of the incoming message. It limits the number of tasks that can process publish responses to 1, and enforces a strict +1 sequence number requirement before releasing a message.
How to activate: on a Subscription instance, set SequentialPublishing to true.
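A minimal usage sketch, assuming an already connected Opc.Ua.Client session named `session` (the endpoint and session setup is omitted, and the PublishingInterval value is arbitrary):

```csharp
// Usage sketch: SequentialPublishing is the property added by this
// pull request; the rest is the usual client-side subscription setup.
var subscription = new Subscription(session.DefaultSubscription)
{
    PublishingInterval = 1000,
    SequentialPublishing = true, // opt in to strictly ordered callbacks
};
session.AddSubscription(subscription);
subscription.Create();
```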
Technical details
I have written some notes below on my technical considerations. I recommend examining the code before reading this section, as this section serves to answer some of the questions you may have while reviewing the code.
Backwards compatibility
The new properties have been added as DataMembers at the end to maintain backwards compatibility with the DataContract, where such may be used. The relevant copy constructor has also been updated.
Both features are disabled by default, so the existing behavior is the default. Users must explicitly set these properties on the Subscription object to enable these new features.
Leveraging existing "time machine" in SaveMessageInCache
There is already a mechanism for re-ordering messages that arrived in the wrong order. This change makes use of that existing "time machine", which sorts messages into chronological order (by sequence number), to pull off only messages with a proper +1 sequence number each time.
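As a simplified model of that behavior (not the actual SaveMessageInCache code; the container and names here are illustrative), the sorted incoming list is drained only while a strict +1 successor is available:

```csharp
using System;
using System.Collections.Generic;

static class TimeMachineModel
{
    // Simplified model of the re-ordering described above: `incoming`
    // plays the role of the sorted incoming-message list, and messages
    // are released only while the strict +1 successor is present.
    public static List<uint> Drain(SortedSet<uint> incoming, ref uint lastProcessed)
    {
        var released = new List<uint>();
        while (incoming.Remove(lastProcessed + 1))
        {
            lastProcessed++;
            released.Add(lastProcessed);
        }
        return released;
    }
}
```

A message missing from the middle of the sequence (say, number 4 out of 2..5) holds back everything after it until it arrives or is aged out, matching the "delayed messages" behavior described below.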
KeepAlive messages advancing sequence number
KeepAlive messages do not contain notifications, and do not enter the proper branch in the code to be pulled out. They will not interrupt sequential flow. Since the next message in the sequence with data will "re-use" that sequence number, it can be expected to maintain sequence.
Delayed messages
When sequential publishing is enabled, if a message is genuinely missing from the sequence, it will "hold up" the messages until it either arrives or is pulled out of the incoming messages list by being too old. In either case, it is considered "processed" for the sequence purposes at that point and the rest of the messages may proceed.
The automatic republish request mechanism is also leveraged here for this purpose.
SequenceNumber roll-over after max value
As specified in the OPC UA spec, the sequence number eventually rolls over after reaching its maximum value; at a 1 ms publishing interval, this would take almost 50 days of uptime. Naturally, at any "reasonable" publishing interval this time is even longer. Roll-over is also not addressed anywhere in the existing code that handled messages before this change (though some consideration can be seen in places, such as placing messages at the end of the incoming message linked list when no better position was found).
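The ~50 day figure can be sanity-checked with simple arithmetic (this is just an estimate helper, not code from the change): a 32-bit sequence number consumed at one message per publishing interval wraps after 2^32 intervals.

```csharp
using System;

static class RollOverEstimate
{
    // 2^32 sequence numbers consumed at one message per interval.
    public static double DaysUntilWrap(double publishingIntervalMs)
    {
        const double maxMessages = 4294967296.0; // 2^32
        double milliseconds = maxMessages * publishingIntervalMs;
        return milliseconds / 1000.0 / 60.0 / 60.0 / 24.0;
    }
}
```

At 1 ms this yields roughly 49.7 days; at a more typical 1000 ms interval, the wrap is well over a century of uptime away.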
Locking on m_cache to get the semaphore
I've considered making a separate locking object for the semaphore management, but decided it is not needed. The message worker would need to obtain m_cache lock later anyway, and momentarily obtaining it is not harmful for the overall flow.
Only in cases of serious contention between new PublishResponses coming in and workers attempting to work would this cause any problems.
Callbacks that "take a long time"
This change will naturally suffer from callbacks that take a long time to invoke. In our use case, each callback message is passed into an ActionBlock for handling, so our callbacks are fast, but other clients may invoke much more code in these callbacks.
Sequential publishing means this capacity is possibly bottlenecked. Sequential publishing should be used for its intended purpose and with callbacks that do not hold up their calling thread for a long time. To ensure proper sequential callbacks, they cannot just be started in order, they have to be processed fully in order.
Changing this property while the Subscription is active
There is limited support for on-the-fly changes to this property, and that is not the intended use case. It is meant to be set once and ideally never changed while messages are being processed.
If users need certain items properly sequenced and some items they can accept out-of-order or ignore outdated messages, they can do this by defining two separate Subscription objects, same as they would if they needed different Publishing Intervals.
Why semaphore?
At first I had considered using ActionBlock from TPL Dataflow and limiting its concurrency to the number of desired message workers, but I opted not to add a dependency on Dataflow only for this purpose. (Referencing all of Dataflow just for ActionBlock's automatic concurrency limiting is wasteful.)
I had also considered using a Limited Concurrency Task Scheduler implementation and queueing the message workers on that instead of the normal ThreadPool via Task.Run, but that is similarly more complex than it needs to be.
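For illustration, here is a small self-contained sketch of SemaphoreSlim used as a worker-concurrency limiter, the mechanism the change settles on instead of ActionBlock or a custom task scheduler. All names are hypothetical; with a count of 1, it degenerates into the sequential case:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class WorkerLimitDemo
{
    // Runs `jobs` simulated message workers through a SemaphoreSlim
    // sized to `maxWorkers` and reports the peak observed concurrency.
    public static async Task<int> RunAsync(int maxWorkers, int jobs)
    {
        var gate = new SemaphoreSlim(maxWorkers);
        int concurrent = 0, peak = 0;
        var tasks = new Task[jobs];
        for (int i = 0; i < jobs; i++)
        {
            tasks[i] = Task.Run(async () =>
            {
                await gate.WaitAsync();
                try
                {
                    int now = Interlocked.Increment(ref concurrent);
                    InterlockedMax(ref peak, now);
                    await Task.Delay(10); // simulate callback work
                }
                finally
                {
                    Interlocked.Decrement(ref concurrent);
                    gate.Release();
                }
            });
        }
        await Task.WhenAll(tasks);
        return peak;
    }

    // Lock-free "store the maximum" helper for the peak counter.
    static void InterlockedMax(ref int target, int value)
    {
        int current;
        while (value > (current = Volatile.Read(ref target)))
        {
            Interlocked.CompareExchange(ref target, value, current);
        }
    }
}
```

With `maxWorkers` of 1, no two workers ever overlap; with a higher count, the peak never exceeds the limit.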
Why async Task? What was wrong with void?
Although the worker already runs on a new Task from the ThreadPool, it would still occupy a ThreadPool thread for the duration of the wait if it used the synchronous Wait() method, which is a blocking call. By using WaitAsync, the thread goes "back" to the ThreadPool until the semaphore is acquired.
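The difference can be sketched as follows (illustrative worker shapes and names, not the actual publish-response worker):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class WaitStyles
{
    // Blocking variant: the ThreadPool thread is parked inside Wait()
    // for the whole duration of the wait, unavailable for other work.
    public static void WorkerBlocking(SemaphoreSlim gate, Action work)
    {
        gate.Wait();
        try { work(); } finally { gate.Release(); }
    }

    // Async variant: awaiting WaitAsync returns the thread to the pool
    // until the semaphore is actually acquired.
    public static async Task WorkerAsync(SemaphoreSlim gate, Func<Task> work)
    {
        await gate.WaitAsync();
        try { await work(); } finally { gate.Release(); }
    }
}
```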
Disclaimer - I am not an OPC UA expert
Please do not assume I have "done all my homework" with regard to these changes. I have attempted to learn as much as I can in the time I was working on this change, but I fully expect to have missed a few spots and specifics of the OPC UA protocol.
I have done some testing in our application to verify the out-of-order problem is solved, and that no significant delay is introduced by the additional work done to facilitate this.