trgemu process seems to crash when given start command after ruemu-df process #22

Open
bieryAtFnal opened this issue Mar 12, 2021 · 13 comments


@bieryAtFnal
Contributor

I ran a test with a two-process system and each of the processes was on a separate host.

Here is the confgen command that I used:

  • python -m minidaqapp.rudf_trg -n10 -s 10 -t 5 --host-ip-df 192.168.157.13 --host-ip-trigemu 192.168.157.14 -d /scratch/biery/data/frames.bin -o "." minidaq_2proc_noFelix
  • (I'm using two computers in the mu2edaq pilot cluster at Fermilab)

I used the stdin command handler to start the processes:

  • daq_application --name mdapp-df -c stdin://./minidaq_2proc_noFelix-ruemu_df.json
  • daq_application --name mdapp-trigemu -c stdin://./minidaq_2proc_noFelix-trgemu.json

In the test, I specify the init and conf commands to the ruemu-df process, then do the same for the trgemu process.
Then, I give the start command to the ruemu-df process, wait ~10 seconds, and then give the start command to the trgemu process.

At that point, I see a crash of the trgemu process with a trace that looks like the following:

2021-Mar-12 13:50:20,919 ERROR [static void ers::ErrorHandler::terminate_handler() at /scratch/workdir/sourcecode/ers/src/ErrorHandler.cpp:135] Unhandled 'appfwk::QueueTimeoutExpired' exception has been thrown
	Parameters = 'name=appfwk::QueueTimeoutExpired' 
	Qualifiers = 'unknown' 
	host = mu2edaq14
	user = biery (2310)
	process id = 234860
	thread id = 234920
	process wd = /home/biery/dunedaq/12MarB
	stack trace of the crashing thread:
	  #0  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/ers/v1_1_1/slf7.x86_64.e19.prof/ers/lib64/libers.so(+0x11e6f) [0x7f464faede6f]
	  #1  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91db6) [0x7f464f7e9db6]
	  #2  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91df1) [0x7f464f7e9df1]
	  #3  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x92024) [0x7f464f7ea024]
	  #4  /home/biery/dunedaq/12MarB/build/trigemu/plugins/libtrigemu_TriggerDecisionEmulator_duneDAQModule.so(dunedaq::appfwk::FollyQueue<dunedaq::dfmessages::TimeSync, folly::DSPSCQueue>::push(dunedaq::dfmessages::TimeSync&&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&)+0x280) [0x7f463e3c6000]
	  #5  /home/biery/dunedaq/12MarB/build/dfmessages/plugins/libdfmessages_TimeSyncNQ_duneNetworkQueue.so(dunedaq::nwqueueadapters::NetworkToQueueImpl<dunedaq::dfmessages::TimeSync>::push()+0x82) [0x7f461b3b2d32]
	  #6  /home/biery/dunedaq/12MarB/build/nwqueueadapters/plugins/libnwqueueadapters_NetworkToQueue_duneDAQModule.so(dunedaq::nwqueueadapters::NetworkToQueue::do_work(std::atomic<bool>&)+0x2c) [0x7f463e622bcc]
	  #7  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0xbb02f) [0x7f464f81302f]
	  #8  /lib64/libpthread.so.0(+0x7ea5) [0x7f46501acea5]
	  #9  /lib64/libc.so.6(clone+0x6d) [0x7f464f2708dd]
	was caused by: 2021-Mar-12 13:50:20,918 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)
Signal received: 6
Aborted (core dumped)

If I give the start command to the trgemu process first, then I can get beyond this issue.
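
For reference, here is a minimal, self-contained sketch of the failure mode visible in the trace above (it is not the actual appfwk/nwqueueadapters code): a bounded queue whose timed push() throws when the queue stays full, called from a worker thread that does not catch the exception. An exception escaping a thread's initial function calls std::terminate(), which raises SIGABRT ("Signal received: 6").

```cpp
// Minimal sketch of the crash mechanism; names and types are stand-ins,
// not the real appfwk FollyQueue or NetworkToQueue classes.
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>

template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}

  // Block for at most 'timeout'; throw if no space frees up in time.
  void push(T value, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_mutex);
    if (!m_not_full.wait_for(lock, timeout, [this] { return m_items.size() < m_capacity; })) {
      throw std::runtime_error("Unable to push within timeout period");
    }
    m_items.push(std::move(value));
  }

private:
  std::size_t m_capacity;
  std::mutex m_mutex;
  std::condition_variable m_not_full;
  std::queue<T> m_items;
};

int main() {
  BoundedQueue<int> queue(1);
  queue.push(42, std::chrono::milliseconds(10)); // fills the queue; nobody pops

  // The worker thread pushes without a try/catch, like the pre-fix
  // NetworkToQueue::do_work(): the timeout exception escapes the thread
  // function, std::terminate() runs, and the whole process aborts.
  std::thread worker([&queue] { queue.push(43, std::chrono::milliseconds(10)); });
  worker.join();
}
```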

@glehmannmiotto
Collaborator

glehmannmiotto commented Mar 12, 2021 via email

@bieryAtFnal
Contributor Author

Regarding point 1...
Yes, if I increase the capacity of the time_sync_from_netq Queue in the *-trgemu.json file from 100 to 10000, that seems to help...

@alessandrothea

Regarding point 3), one must make a distinction between data messages and status messages. The latter are not supposed to be queued by definition, since only the last message is relevant. I remember we discussed this at some point.
When does trigemu start draining the timesync queues? At configure or start?

@alessandrothea

It looks like the TimestampEstimator object only exists between start and stop, which explains why TimeSyncs are piling up. My suggestion is to create it at conf and reset it at scrap. It would also be good to publish the current estimated timestamp with the monitoring information.
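
A hedged sketch of that lifetime change (hypothetical class and method names, not the actual trigemu code), with the estimator owned for the whole conf-to-scrap span so TimeSyncs arriving between runs are still consumed:

```cpp
// Sketch only: the estimator is created at conf and destroyed at scrap,
// instead of being tied to the start/stop span of a single run.
#include <cstdint>
#include <memory>

struct TimestampEstimator {            // stand-in for the real estimator
  void add_timesync(uint64_t ts) { if (ts > m_latest) m_latest = ts; }
  uint64_t current_estimate() const { return m_latest; }
  uint64_t m_latest = 0;
};

class TriggerDecisionEmulatorSketch {
public:
  void do_conf()  { m_estimator = std::make_unique<TimestampEstimator>(); }
  void do_start() { /* begin issuing trigger decisions; estimator already running */ }
  void do_stop()  { /* stop issuing decisions, but keep draining TimeSyncs */ }
  void do_scrap() { m_estimator.reset(); }

  // Could also be published as monitoring information, as suggested above.
  uint64_t estimated_timestamp() const {
    return m_estimator ? m_estimator->current_estimate() : 0;
  }

private:
  std::unique_ptr<TimestampEstimator> m_estimator;
};

int main() {
  TriggerDecisionEmulatorSketch tde;
  tde.do_conf();
  tde.do_start();
  tde.do_stop();
  tde.do_scrap();
}
```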

@bieryAtFnal
Contributor Author

The idea of somehow keeping only the most recent status message sounds good to me, but I don't know if we have examples of how to do that (with either Queues or Queues+NWQueueAdapters)...
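
As a rough illustration only (this is not an existing appfwk or nwqueueadapters facility), a "latest value only" holder for status-style messages could look something like this: the writer always overwrites a single slot, so it can never block or time out, and the reader just takes whatever is newest.

```cpp
// Sketch of a single-slot, overwrite-on-write holder for status messages.
#include <cstdint>
#include <mutex>
#include <optional>

template <typename T>
class LatestValueHolder {
public:
  void publish(const T& value) {        // never blocks, never fills up
    std::lock_guard<std::mutex> lock(m_mutex);
    m_value = value;
  }
  std::optional<T> take() {             // returns and clears the latest value, if any
    std::lock_guard<std::mutex> lock(m_mutex);
    std::optional<T> out;
    out.swap(m_value);
    return out;
  }

private:
  std::mutex m_mutex;
  std::optional<T> m_value;
};

struct TimeSyncLike { uint64_t daq_time; uint64_t system_time; };

int main() {
  LatestValueHolder<TimeSyncLike> holder;
  holder.publish({1000, 2000});
  holder.publish({1025, 2025});         // the older message is simply overwritten
  auto latest = holder.take();          // sees only {1025, 2025}
  (void)latest;
}
```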

@philiprodrigues
Collaborator

I agree with all the analysis in the comments above. Not catching the push timeout exception in nwqueueadapters is definitely a bug, fixed in this PR, ready for testing:

DUNE-DAQ/nwqueueadapters#8
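
Roughly, the shape of that fix (a sketch with stand-in types, not the PR's actual code) is to catch the push timeout inside the receive loop, report it as a warning, and drop the message, rather than letting the exception escape the worker thread:

```cpp
// Sketch only: catch the queue-full timeout in the network-to-queue worker
// instead of letting it terminate the process.
#include <chrono>
#include <iostream>
#include <stdexcept>

struct QueueTimeoutExpired : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Stand-ins for the real network receiver and output queue.
template <typename T> struct Receiver { T recv() { return T{}; } };
template <typename T> struct Queue {
  void push(T&&, std::chrono::milliseconds) {
    throw QueueTimeoutExpired("time_sync_from_netq: Unable to push within timeout period");
  }
};

struct TimeSyncLike {};

template <typename T>
void do_work_once(Receiver<T>& receiver, Queue<T>& queue) {
  T message = receiver.recv();
  try {
    queue.push(std::move(message), std::chrono::milliseconds(10));
  } catch (const QueueTimeoutExpired& e) {
    // In the real module this would be an ers warning; the message is dropped.
    std::cerr << "Push timed out: " << e.what() << "\n";
  }
}

int main() {
  Receiver<TimeSyncLike> receiver;
  Queue<TimeSyncLike> queue;
  do_work_once(receiver, queue);
}
```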

Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

I don't think we have any way in the framework to handle "status" message streams for which only the most recent message is relevant, but at the application-logic level, we get somewhere near it: in TimestampEstimator, all the TimeSyncs on the input queue are popped, and the largest-timestamp one is used for the estimation.
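
For illustration (not the actual TimestampEstimator source), that drain-and-keep-the-largest behaviour looks roughly like this:

```cpp
// Sketch: pop every pending TimeSync and keep only the one with the
// largest DAQ timestamp for the estimate.
#include <cstdint>
#include <deque>
#include <optional>

struct TimeSyncLike { uint64_t daq_time; uint64_t system_time; };

uint64_t update_estimate(std::deque<TimeSyncLike>& input_queue, uint64_t current_estimate) {
  std::optional<TimeSyncLike> most_recent;
  while (!input_queue.empty()) {               // drain *all* pending TimeSyncs
    TimeSyncLike ts = input_queue.front();
    input_queue.pop_front();
    if (!most_recent || ts.daq_time > most_recent->daq_time) {
      most_recent = ts;                        // keep only the largest timestamp
    }
  }
  return most_recent ? most_recent->daq_time : current_estimate;
}

int main() {
  std::deque<TimeSyncLike> queue{{1000, 1}, {1100, 2}, {1050, 3}};
  uint64_t estimate = update_estimate(queue, 0);   // -> 1100
  (void)estimate;
}
```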

@alessandrothea

Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

My take is that readout should not reset the timestamps at start...

@bieryAtFnal
Contributor Author

With the updated code in nwqueueadapters, I see a burst of messages like the following when the ruemu-df process is given the "start" command:

2021-Mar-15 08:52:35,555 WARNING [dunedaq::nwqueueadapters::NetworkToQueue::do_work(...) at /home/biery/dunedaq/14Mar/sourcecode/nwqueueadapters/plugins/NetworkToQueue.cpp:70] Push timed out: Message of type dunedaq::dfmessages::TimeSync to queue time_sync_from_netq
	was caused by: 2021-Mar-15 08:52:35,555 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)

Is this OK/expected? It seems like it might be nicer to have no warnings.

The command sequence that I'm using in my tests is:

  • ruemu_df init
  • trgemu init
  • ruemu_df conf
  • trgemu conf
  • ruemu_df start

@bieryAtFnal
Contributor Author

Ignoring the warning messages, things seem to work.
I can start the trgemu process, resume triggers, and see them written to disk.
And, it seems like I can stop the first run and start a second one.

I don't know if this is related to Alessandro's point, but looking at the timestamps in the data files from the two runs, they go back to "2020-07-19 13:05:02" at the start of the second run, but that may just be a feature of the FakeCardReader.

@philiprodrigues
Collaborator

The warnings are expected: they occur when the NetworkToQueue module can't push onto its output queue because the queue is full. So the code in NetworkToQueue that produces the warning is correct, but I agree it would be nicer not to have the situation arise in the first place. That will require the changes to TimestampEstimator and readout that Alessandro suggested.

The FakeCardReader reads the first timestamp from its input data file and adds 25 ticks to each subsequent frame. The input data file was taken on 2020-07-19, which is why you see that date in the output file.
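
As a small sketch of that arithmetic (the 25-ticks-per-frame spacing is from the description above; the starting value here is made up), every run replays the same first timestamp from the 2020-07-19 file, so the output timestamps restart from that date at each run:

```cpp
// Sketch of the frame-timestamp arithmetic described above.
#include <cstdint>

uint64_t frame_timestamp(uint64_t first_timestamp_from_file, uint64_t frame_index) {
  return first_timestamp_from_file + 25 * frame_index;   // 25 ticks per frame
}

int main() {
  const uint64_t first_ts = 105170272689346;   // hypothetical value read from frames.bin
  uint64_t ts_of_frame_3 = frame_timestamp(first_ts, 3);
  (void)ts_of_frame_3;
}
```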

@philiprodrigues
Collaborator

Roland and I took a look at readout and it looks like it'll be nontrivial to change readout to have the desired behaviour. I've filed an issue in readout so we don't forget. For the current release, I think we will have to live with the TDE/readout behaviour as it is. We could make it not appear for end users by just making all of the queues much larger (but of course, for testing, small queues are great for helping surface issues like this one).

@glehmannmiotto
Collaborator

glehmannmiotto commented Mar 15, 2021 via email

@bieryAtFnal
Contributor Author

bieryAtFnal commented Mar 15, 2021

Yes, starting the trgemu process before the rudf process works fine, and that is a completely reasonable solution.
Of course, we somehow need to document and/or enforce that ordering of commands so that others know about it, if that becomes the "solution".
