Deadlock with OmmConsumer #218

Closed
charanyarajagopalan opened this issue Nov 8, 2022 · 7 comments

Comments


charanyarajagopalan commented Nov 8, 2022

To give some background: we create an EMA consumer and then open item streams with it (100 items batched in each ReqMsg). A few minutes after the consumer is established, we simply stop seeing new messages, without any exceptions or errors. This happens while the loop that opens the item streams is still running. Debugging shows a potential deadlock, as illustrated by the stack traces below. Is there some kind of rate limit on how batched requests should be created? (We have a single Machine ID and hence a single EMA consumer.) The version used is 3.6.7.1.

(Attachments: thread-dump screenshot, deadlock-trace.txt)
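
For reference, each batched request follows the standard EMA Java batch pattern, roughly like this (a minimal sketch with placeholder service and item names, not our production code):

// assumes com.refinitiv.ema.access.* and com.refinitiv.ema.rdm.EmaRdm are imported
ElementList batch = EmaFactory.createElementList();
OmmArray items = EmaFactory.createOmmArray();
// 100 item names per ReqMsg in our case; two placeholders shown here
items.add(EmaFactory.createOmmArrayEntry().ascii("ITEM_1.RIC"));
items.add(EmaFactory.createOmmArrayEntry().ascii("ITEM_2.RIC"));
batch.add(EmaFactory.createElementEntry().array(EmaRdm.ENAME_BATCH_ITEM_LIST, items));
consumer.registerClient(EmaFactory.createReqMsg().serviceName("ELEKTRON_DD").payload(batch), appClient);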

@charanyarajagopalan
Author

Update: I increased the batch size to 5000, so our item list now goes through in 3 batched requests. I am still running into the same issue, though not every single time.

@dmykhailishen
Copy link
Contributor

@charanyarajagopalan, thank you for reporting this issue. We will investigate it and fix it in a future release.

@charanyarajagopalan
Author

Update: Switched to the USER_DISPATCH operational model instead of API_DISPATCH. This prevents message dispatch from being triggered while the item streams are being created, and the deadlock no longer occurs.
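
For anyone hitting the same issue, the workaround looks roughly like this (a minimal sketch; appClient, batchedRequests, and the running flag are placeholders from our setup):

// create the consumer in USER_DISPATCH mode so callbacks only run from dispatch()
OmmConsumerConfig config = EmaFactory.createOmmConsumerConfig()
        .operationModel(OmmConsumerConfig.OperationModel.USER_DISPATCH);
OmmConsumer consumer = EmaFactory.createOmmConsumer(config);

// open all item streams first; no callbacks can fire on another thread while this loop runs
for (ReqMsg reqMsg : batchedRequests)
    consumer.registerClient(reqMsg, appClient);

// then pump events on this same thread
while (running)
    consumer.dispatch(1000);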

@L-Karchevska
Contributor

@charanyarajagopalan Thank you for the additional details!
The stack traces do indicate a potential deadlock. The USER_DISPATCH mode eliminates the issue because it makes request submission and callbacks execute on the same thread. However, we have so far not been able to reproduce the deadlock in our environment. Could you send a code example that reproduces the issue?

@charanyarajagopalan
Author

@L-Karchevska thanks, https://github.com/charanyarajagopalan/issue-218-repro
Credentials need to be set in application.yml to run it.

@wasin-waeosri

Hello

I can replicate the same kind of deadlock with EMA Java 3.6.7 L2 by modifying the Consumer ex100_MP_Streaming example to subscribe to 10K invalid RICs as follows:

for (int index = 0; index <= 10000; index++) {
    System.out.println("Subscribing TEST_" + index + ".RIC");
    consumer.registerClient(reqMsg.serviceName("ELEKTRON_DD").name("TEST_" + index + ".RIC"), appClient);
}
System.out.println("Done");

Result: The example prints one "The record could not be found" status and then stops. The "Done" message is never printed. The stack trace shows a pattern similar to the client's above:

Found one Java-level deadlock:
=============================
"main":
  waiting for ownable synchronizer 0x00000006044dda40, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "pool-2-thread-1"
"pool-2-thread-1":
  waiting for ownable synchronizer 0x0000000604b0d398, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "main"

Java stack information for the threads listed above:
===================================================
"main":
	at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
	- parking to wait for  <0x00000006044dda40> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantLock.lock(java.base@11/ReentrantLock.java:267)
	at com.refinitiv.eta.valueadd.reactor.ReactorChannel.submit(ReactorChannel.java:767)
	at com.refinitiv.ema.access.SingleItem.rsslSubmit(ItemCallbackClient.java:3009)
	at com.refinitiv.ema.access.SingleItem.open(ItemCallbackClient.java:2869)
	at com.refinitiv.ema.access.ItemCallbackClient.registerClient(ItemCallbackClient.java:2192)
	at com.refinitiv.ema.access.OmmBaseImpl.registerClient(OmmBaseImpl.java:532)
	at com.refinitiv.ema.access.OmmConsumerImpl.registerClient(OmmConsumerImpl.java:255)
	at com.refinitiv.ema.examples.training.consumer.series100.ex100_MP_Streaming.Consumer.main(Consumer.java:65)
"pool-2-thread-1":
	at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
	- parking to wait for  <0x0000000604b0d398> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantLock.lock(java.base@11/ReentrantLock.java:267)
	at com.refinitiv.ema.access.ItemCallbackClient.removeFromMap(ItemCallbackClient.java:2477)
	at com.refinitiv.ema.access.SingleItem.remove(ItemCallbackClient.java:2932)
	at com.refinitiv.ema.access.ItemCallbackClient.processStatusMsg(ItemCallbackClient.java:1893)
	at com.refinitiv.ema.access.ItemCallbackClient.defaultMsgCallback(ItemCallbackClient.java:1630)
	at com.refinitiv.eta.valueadd.reactor.Reactor.sendDefaultMsgCallback(Reactor.java:2787)
	at com.refinitiv.eta.valueadd.reactor.Reactor.sendAndHandleDefaultMsgCallback(Reactor.java:2896)
	at com.refinitiv.eta.valueadd.reactor.WlItemHandler.callbackUser(WlItemHandler.java:3029)
	at com.refinitiv.eta.valueadd.reactor.WlItemHandler.readMsg(WlItemHandler.java:2002)
	at com.refinitiv.eta.valueadd.reactor.Watchlist.readMsg(Watchlist.java:302)
	at com.refinitiv.eta.valueadd.reactor.Reactor.processRwfMessage(Reactor.java:4662)
	at com.refinitiv.eta.valueadd.reactor.Reactor.performChannelRead(Reactor.java:5001)
	at com.refinitiv.eta.valueadd.reactor.Reactor.dispatchAll(Reactor.java:7674)
	at com.refinitiv.ema.access.OmmBaseImpl.rsslReactorDispatchLoop(OmmBaseImpl.java:1759)
	at com.refinitiv.ema.access.OmmBaseImpl.run(OmmBaseImpl.java:1889)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11/Thread.java:834)

Found 1 deadlock.

I did a quick test with EMA Java 3.6.0 L1 and the same code. It works fine: the example shows 10K "The record could not be found" status messages and then the "Done" message.
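
For illustration only, the two traces boil down to the classic two-lock ordering problem: the registering thread takes one lock and then waits for the other, while the dispatch thread has taken the same two locks in the opposite order. A standalone sketch of that pattern in plain Java (stand-in lock names, not RTSDK code):

import java.util.concurrent.locks.ReentrantLock;

public class TwoLockDeadlock {
    // Stand-ins only: "registerLock" plays the role of the lock held while registering an item,
    // "dispatchLock" the one held by the reactor dispatch loop.
    static final ReentrantLock registerLock = new ReentrantLock();
    static final ReentrantLock dispatchLock = new ReentrantLock();

    public static void main(String[] args) {
        new Thread(() -> lockBoth(registerLock, dispatchLock), "main-like").start();
        new Thread(() -> lockBoth(dispatchLock, registerLock), "dispatch-like").start();
    }

    // Each thread takes its first lock, pauses, then blocks forever waiting for the other thread's lock.
    static void lockBoth(ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            second.lock();
            second.unlock();
        } finally {
            first.unlock();
        }
    }
}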

@vlevendel
Contributor

@charanyarajagopalan This is addressed with tag Real-Time-SDK-2.0.8.L1. Please let us know if there are further concerns.
