trgemu process seems to crash when given start command after ruemu-df process #22
It looks as if the df application starts generating time syncs, which the network layer on the other side receives and buffers.
When the trigemu starts, the network module receives all of the buffered time syncs and tries to push them to the queue. I see three issues:
1) the queues are probably very short
2) there is no try/catch around the push method in the nw adapter module
3) the expected behaviour for messages sent before an application is ready to receive them needs to be discussed; is the present behaviour ok? Should all messages available/queued for reception be dropped at start?
… On 12 Mar 2021, at 21:00, bieryAtFnal wrote:
I ran a test with a two-process system and each of the processes was on a separate host.
Here is the confgen command that I used:
python -m minidaqapp.rudf_trg -n10 -s 10 -t 5 --host-ip-df 192.168.157.13 --host-ip-trigemu 192.168.157.14 -d /scratch/biery/data/frames.bin -o "." minidaq_2proc_noFelix
(I'm using two computers in the mu2edaq pilot cluster at Fermilab)
I used the stdin command handler to start the processes:
daq_application --name mdapp-df -c stdin://./minidaq_2proc_noFelix-ruemu_df.json
daq_application --name mdapp-trigemu -c stdin://./minidaq_2proc_noFelix-trgemu.json
In the test, I specify the init and conf commands to the ruemu-df process, then do the same for the trgemu process.
Then, I give the start command to the ruemu-df process, wait ~10 seconds, and then give the start command to the trgemu process.
At that point, I see a crash of the trgemu process with a trace that looks like the following:
2021-Mar-12 13:50:20,919 ERROR [static void ers::ErrorHandler::terminate_handler() at /scratch/workdir/sourcecode/ers/src/ErrorHandler.cpp:135] Unhandled 'appfwk::QueueTimeoutExpired' exception has been thrown
Parameters = 'name=appfwk::QueueTimeoutExpired'
Qualifiers = 'unknown'
host = mu2edaq14
user = biery (2310)
process id = 234860
thread id = 234920
process wd = /home/biery/dunedaq/12MarB
stack trace of the crashing thread:
#0 /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/ers/v1_1_1/slf7.x86_64.e19.prof/ers/lib64/libers.so(+0x11e6f) [0x7f464faede6f]
#1 /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91db6) [0x7f464f7e9db6]
#2 /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91df1) [0x7f464f7e9df1]
#3 /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x92024) [0x7f464f7ea024]
#4 /home/biery/dunedaq/12MarB/build/trigemu/plugins/libtrigemu_TriggerDecisionEmulator_duneDAQModule.so(dunedaq::appfwk::FollyQueue<dunedaq::dfmessages::TimeSync, folly::DSPSCQueue>::push(dunedaq::dfmessages::TimeSync&&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&)+0x280) [0x7f463e3c6000]
#5 /home/biery/dunedaq/12MarB/build/dfmessages/plugins/libdfmessages_TimeSyncNQ_duneNetworkQueue.so(dunedaq::nwqueueadapters::NetworkToQueueImpl<dunedaq::dfmessages::TimeSync>::push()+0x82) [0x7f461b3b2d32]
#6 /home/biery/dunedaq/12MarB/build/nwqueueadapters/plugins/libnwqueueadapters_NetworkToQueue_duneDAQModule.so(dunedaq::nwqueueadapters::NetworkToQueue::do_work(std::atomic<bool>&)+0x2c) [0x7f463e622bcc]
#7 /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0xbb02f) [0x7f464f81302f]
#8 /lib64/libpthread.so.0(+0x7ea5) [0x7f46501acea5]
#9 /lib64/libc.so.6(clone+0x6d) [0x7f464f2708dd]
was caused by: 2021-Mar-12 13:50:20,918 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)
Signal received: 6
Aborted (core dumped)
If I give the start command to the trgemu process first, then I can get beyond this issue.
Regarding point 1...
Regarding point 3), one must make a distinction between data messages and status messages. The latter are not supposed to be queued by definition, as only the last message is relevant. I remember we discussed this at some point.
It looks like the
The idea of somehow only keeping a most recent status message sounds good to me, but I don't know if we have examples of how to do that (with either Queues or Queues+NWQueueAdapters)...
I agree with all the analysis in the comments above. Not catching the push timeout exception in nwqueueadapters is definitely a bug; it is fixed in this PR, ready for testing. Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout so that it does not reset timestamps at run start. I don't think we have any way in the framework to handle "status" message streams for which only the most recent message is relevant, but at the application-logic level we get somewhere near it: in
My take is that
With the updated code in nwqueueadapters, I see a burst of messages like the following when the ruemu-df process is given the "start" command:
Is this OK/expected? It seems like it might be nicer to have no warnings. The command sequence that I'm using in my tests is:
Ignoring the warning messages, things seem to work. I don't know if this is related to Alessandro's point, but looking at the timestamps in the data files from the two runs, they go back to "2020-07-19 13:05:02" at the start of the second run, but that may just be a feature of the FakeCardReader.
The warnings are expected: they occur when the … The …
Roland and I took a look at readout and it looks like it'll be nontrivial to change readout to have the desired behaviour. I've filed an issue in readout so we don't forget. For the current release, I think we will have to live with the TDE/readout behaviour as it is. We could make it not appear for end users by just making all of the queues much larger (but of course, for testing, small queues are great for helping surface issues like this one).
Is it correct that at the present moment we can start the trigger before the rudf app, since triggers actually start flowing only at resume? That would be a fully valid solution, no? It would be like enforcing an order among the modules.
Yes, starting the trgemu process before the rudf process works fine, and that is a completely reasonable solution.