trgemu process seems to crash when given start command after ruemu-df process #22

Open
bieryAtFnal opened this issue Mar 12, 2021 · 13 comments


@bieryAtFnal
Contributor

I ran a test with a two-process system and each of the processes was on a separate host.

Here is the confgen command that I used:

  • python -m minidaqapp.rudf_trg -n10 -s 10 -t 5 --host-ip-df 192.168.157.13 --host-ip-trigemu 192.168.157.14 -d /scratch/biery/data/frames.bin -o "." minidaq_2proc_noFelix
  • (I'm using two computers in the mu2edaq pilot cluster at Fermilab)

I used the stdin command handler to start the processes:

  • daq_application --name mdapp-df -c stdin://./minidaq_2proc_noFelix-ruemu_df.json
  • daq_application --name mdapp-trigemu -c stdin://./minidaq_2proc_noFelix-trgemu.json

In the test, I specify the init and conf commands to the ruemu-df process, then do the same for the trgemu process.
Then, I give the start command to the ruemu-df process, wait ~10 seconds, and then give the start command to the trgemu process.

At that point, I see a crash of the trgemu process with a trace that looks like the following:

2021-Mar-12 13:50:20,919 ERROR [static void ers::ErrorHandler::terminate_handler() at /scratch/workdir/sourcecode/ers/src/ErrorHandler.cpp:135] Unhandled 'appfwk::QueueTimeoutExpired' exception has been thrown
	Parameters = 'name=appfwk::QueueTimeoutExpired' 
	Qualifiers = 'unknown' 
	host = mu2edaq14
	user = biery (2310)
	process id = 234860
	thread id = 234920
	process wd = /home/biery/dunedaq/12MarB
	stack trace of the crashing thread:
	  #0  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/ers/v1_1_1/slf7.x86_64.e19.prof/ers/lib64/libers.so(+0x11e6f) [0x7f464faede6f]
	  #1  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91db6) [0x7f464f7e9db6]
	  #2  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91df1) [0x7f464f7e9df1]
	  #3  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x92024) [0x7f464f7ea024]
	  #4  /home/biery/dunedaq/12MarB/build/trigemu/plugins/libtrigemu_TriggerDecisionEmulator_duneDAQModule.so(dunedaq::appfwk::FollyQueue<dunedaq::dfmessages::TimeSync, folly::DSPSCQueue>::push(dunedaq::dfmessages::TimeSync&&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&)+0x280) [0x7f463e3c6000]
	  #5  /home/biery/dunedaq/12MarB/build/dfmessages/plugins/libdfmessages_TimeSyncNQ_duneNetworkQueue.so(dunedaq::nwqueueadapters::NetworkToQueueImpl<dunedaq::dfmessages::TimeSync>::push()+0x82) [0x7f461b3b2d32]
	  #6  /home/biery/dunedaq/12MarB/build/nwqueueadapters/plugins/libnwqueueadapters_NetworkToQueue_duneDAQModule.so(dunedaq::nwqueueadapters::NetworkToQueue::do_work(std::atomic<bool>&)+0x2c) [0x7f463e622bcc]
	  #7  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0xbb02f) [0x7f464f81302f]
	  #8  /lib64/libpthread.so.0(+0x7ea5) [0x7f46501acea5]
	  #9  /lib64/libc.so.6(clone+0x6d) [0x7f464f2708dd]
	was caused by: 2021-Mar-12 13:50:20,918 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)
Signal received: 6
Aborted (core dumped)

If I give the start command to the trgemu process first, then I can get beyond this issue.
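
For reference, here is a minimal, self-contained sketch of the failure mode visible in the trace above (it is not the actual appfwk/nwqueueadapters code): a bounded queue whose timed push() throws when the queue stays full, called from a worker thread that does not catch the exception. An exception escaping a thread's initial function calls std::terminate(), which raises SIGABRT ("Signal received: 6").

```cpp
// Minimal sketch of the crash mechanism; names and types are stand-ins,
// not the real appfwk FollyQueue or NetworkToQueue classes.
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>

template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}

  // Block for at most 'timeout'; throw if no space frees up in time.
  void push(T value, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_mutex);
    if (!m_not_full.wait_for(lock, timeout, [this] { return m_items.size() < m_capacity; })) {
      throw std::runtime_error("Unable to push within timeout period");
    }
    m_items.push(std::move(value));
  }

private:
  std::size_t m_capacity;
  std::mutex m_mutex;
  std::condition_variable m_not_full;
  std::queue<T> m_items;
};

int main() {
  BoundedQueue<int> queue(1);
  queue.push(42, std::chrono::milliseconds(10)); // fills the queue; nobody pops

  // The worker thread pushes without a try/catch, like the pre-fix
  // NetworkToQueue::do_work(): the timeout exception escapes the thread
  // function, std::terminate() runs, and the whole process aborts.
  std::thread worker([&queue] { queue.push(43, std::chrono::milliseconds(10)); });
  worker.join();
}
```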

@glehmannmiotto
Collaborator

glehmannmiotto commented Mar 12, 2021 via email

@bieryAtFnal
Contributor Author

Regarding point 1...
Yes, if I increase the capacity of the time_sync_from_netq Queue in the *-trgemu.json file from 100 to 10000, that seems to help...

@alessandrothea

Regarding point 3), one must make a distinction between data messages and status messages. The latter are not supposed to be queued by definition, since only the last message is relevant. I remember we discussed this at some point.
When does trigemu start draining the timesync queues? At configure or start?

@alessandrothea

It looks like the TimestampEstimator object only exists between start and stop, which explains why TimeSyncs are piling up. My suggestion is to create it at conf and reset it at scrap. It would also be good to publish the current estimated timestamp with the monitoring information.
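
A hedged sketch of that lifetime change (hypothetical class and method names, not the actual trigemu code), with the estimator owned for the whole conf-to-scrap span so TimeSyncs arriving between runs are still consumed:

```cpp
// Sketch only: the estimator is created at conf and destroyed at scrap,
// instead of being tied to the start/stop span of a single run.
#include <cstdint>
#include <memory>

struct TimestampEstimator {            // stand-in for the real estimator
  void add_timesync(uint64_t ts) { if (ts > m_latest) m_latest = ts; }
  uint64_t current_estimate() const { return m_latest; }
  uint64_t m_latest = 0;
};

class TriggerDecisionEmulatorSketch {
public:
  void do_conf()  { m_estimator = std::make_unique<TimestampEstimator>(); }
  void do_start() { /* begin issuing trigger decisions; estimator already running */ }
  void do_stop()  { /* stop issuing decisions, but keep draining TimeSyncs */ }
  void do_scrap() { m_estimator.reset(); }

  // Could also be published as monitoring information, as suggested above.
  uint64_t estimated_timestamp() const {
    return m_estimator ? m_estimator->current_estimate() : 0;
  }

private:
  std::unique_ptr<TimestampEstimator> m_estimator;
};

int main() {
  TriggerDecisionEmulatorSketch tde;
  tde.do_conf();
  tde.do_start();
  tde.do_stop();
  tde.do_scrap();
}
```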

@bieryAtFnal
Contributor Author

The idea of somehow keeping only the most recent status message sounds good to me, but I don't know if we have examples of how to do that (with either Queues or Queues+NWQueueAdapters)...
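
As a rough illustration only (this is not an existing appfwk or nwqueueadapters facility), a "latest value only" holder for status-style messages could look something like this: the writer always overwrites a single slot, so it can never block or time out, and the reader just takes whatever is newest.

```cpp
// Sketch of a single-slot, overwrite-on-write holder for status messages.
#include <cstdint>
#include <mutex>
#include <optional>

template <typename T>
class LatestValueHolder {
public:
  void publish(const T& value) {        // never blocks, never fills up
    std::lock_guard<std::mutex> lock(m_mutex);
    m_value = value;
  }
  std::optional<T> take() {             // returns and clears the latest value, if any
    std::lock_guard<std::mutex> lock(m_mutex);
    std::optional<T> out;
    out.swap(m_value);
    return out;
  }

private:
  std::mutex m_mutex;
  std::optional<T> m_value;
};

struct TimeSyncLike { uint64_t daq_time; uint64_t system_time; };

int main() {
  LatestValueHolder<TimeSyncLike> holder;
  holder.publish({1000, 2000});
  holder.publish({1025, 2025});         // the older message is simply overwritten
  auto latest = holder.take();          // sees only {1025, 2025}
  (void)latest;
}
```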

@philiprodrigues
Collaborator

I agree with all the analysis in the comments above. Not catching the push timeout exception in nwqueueadapters is definitely a bug, fixed in this PR, ready for testing:

DUNE-DAQ/nwqueueadapters#8
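
Roughly, the shape of that fix (a sketch with stand-in types, not the PR's actual code) is to catch the push timeout inside the receive loop, report it as a warning, and drop the message, rather than letting the exception escape the worker thread:

```cpp
// Sketch only: catch the queue-full timeout in the network-to-queue worker
// instead of letting it terminate the process.
#include <chrono>
#include <iostream>
#include <stdexcept>

struct QueueTimeoutExpired : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Stand-ins for the real network receiver and output queue.
template <typename T> struct Receiver { T recv() { return T{}; } };
template <typename T> struct Queue {
  void push(T&&, std::chrono::milliseconds) {
    throw QueueTimeoutExpired("time_sync_from_netq: Unable to push within timeout period");
  }
};

struct TimeSyncLike {};

template <typename T>
void do_work_once(Receiver<T>& receiver, Queue<T>& queue) {
  T message = receiver.recv();
  try {
    queue.push(std::move(message), std::chrono::milliseconds(10));
  } catch (const QueueTimeoutExpired& e) {
    // In the real module this would be an ers warning; the message is dropped.
    std::cerr << "Push timed out: " << e.what() << "\n";
  }
}

int main() {
  Receiver<TimeSyncLike> receiver;
  Queue<TimeSyncLike> queue;
  do_work_once(receiver, queue);
}
```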

Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

I don't think we have any way in the framework to handle "status" message streams for which only the most recent message is relevant, but at the application-logic level, we get somewhere near it: in TimestampEstimator, all the TimeSyncs on the input queue are popped, and the largest-timestamp one is used for the estimation.
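
For illustration (not the actual TimestampEstimator source), that drain-and-keep-the-largest behaviour looks roughly like this:

```cpp
// Sketch: pop every pending TimeSync and keep only the one with the
// largest DAQ timestamp for the estimate.
#include <cstdint>
#include <deque>
#include <optional>

struct TimeSyncLike { uint64_t daq_time; uint64_t system_time; };

uint64_t update_estimate(std::deque<TimeSyncLike>& input_queue, uint64_t current_estimate) {
  std::optional<TimeSyncLike> most_recent;
  while (!input_queue.empty()) {               // drain *all* pending TimeSyncs
    TimeSyncLike ts = input_queue.front();
    input_queue.pop_front();
    if (!most_recent || ts.daq_time > most_recent->daq_time) {
      most_recent = ts;                        // keep only the largest timestamp
    }
  }
  return most_recent ? most_recent->daq_time : current_estimate;
}

int main() {
  std::deque<TimeSyncLike> queue{{1000, 1}, {1100, 2}, {1050, 3}};
  uint64_t estimate = update_estimate(queue, 0);   // -> 1100
  (void)estimate;
}
```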

@alessandrothea

Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

My take is that readout should not reset the timestamps at start...

@bieryAtFnal
Contributor Author

With the updated code in nwqueueadapters, I see a burst of messages like the following when the ruemu-df process is given the "start" command:

2021-Mar-15 08:52:35,555 WARNING [dunedaq::nwqueueadapters::NetworkToQueue::do_work(...) at /home/biery/dunedaq/14Mar/sourcecode/nwqueueadapters/plugins/NetworkToQueue.cpp:70] Push timed out: Message of type dunedaq::dfmessages::TimeSync to queue time_sync_from_netq
	was caused by: 2021-Mar-15 08:52:35,555 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)

Is this OK/expected? It seems like it might be nicer to have no warnings.

The command sequence that I'm using in my tests is:

  • ruemu_df init
  • trgemu init
  • ruemu_df conf
  • trgemu conf
  • ruemu_df start

@bieryAtFnal
Contributor Author

Ignoring the warning messages, things seem to work.
I can start the trgemu process, resume triggers, and see them written to disk.
And, it seems like I can stop the first run and start a second one.

I don't know if this is related to Alessandro's point, but looking at the timestamps in the data files from the two runs, they go back to "2020-07-19 13:05:02" at the start of the second run, but that may just be a feature of the FakeCardReader.

@philiprodrigues
Collaborator

The warnings are expected: they occur when the NetworkToQueue module can't push onto its output queue because the queue is full. So the code in NetworkToQueue that produces the warning is correct, but I agree it would be nicer not to have the situation arise in the first place. That will require the changes to TimestampEstimator and readout that Alessandro suggested.

The FakeCardReader reads the first timestamp from its input data file and adds 25 ticks to each subsequent frame. The input data file was taken on 2020-07-19, which is why you see that date in the output file.
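
As a small sketch of that arithmetic (the 25-ticks-per-frame spacing is from the description above; the starting value here is made up), every run replays the same first timestamp from the 2020-07-19 file, so the output timestamps restart from that date at each run:

```cpp
// Sketch of the frame-timestamp arithmetic described above.
#include <cstdint>

uint64_t frame_timestamp(uint64_t first_timestamp_from_file, uint64_t frame_index) {
  return first_timestamp_from_file + 25 * frame_index;   // 25 ticks per frame
}

int main() {
  const uint64_t first_ts = 105170272689346;   // hypothetical value read from frames.bin
  uint64_t ts_of_frame_3 = frame_timestamp(first_ts, 3);
  (void)ts_of_frame_3;
}
```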

@philiprodrigues
Collaborator

Roland and I took a look at readout and it looks like it'll be nontrivial to change readout to have the desired behaviour. I've filed an issue in readout so we don't forget. For the current release, I think we will have to live with the TDE/readout behaviour as it is. We could make it not appear for end users by just making all of the queues much larger (but of course, for testing, small queues are great for helping surface issues like this one).

@glehmannmiotto
Collaborator

glehmannmiotto commented Mar 15, 2021 via email

@bieryAtFnal
Contributor Author

bieryAtFnal commented Mar 15, 2021

Yes, starting the trgemu process before the rudf process works fine, and that is a completely reasonable solution.
Of course, we somehow need to document and/or enforce that ordering of commands so that others know about it, if that becomes the "solution".
