latency-test, latency-histogram: warn when rtapi_app lacks RT privileges#4107
latency-test, latency-histogram: warn when rtapi_app lacks RT privileges#4107grandixximo wants to merge 3 commits into
Conversation
Without 'sudo make setuid' (or 'sudo make setcap') rtapi_app runs unprivileged: no SCHED_FIFO, no locked memory, so latency readings are wildly inflated and easy to mistake for a code regression. Warn, for a non-root user, when rtapi_app is neither setuid root nor carries the cap_sys_nice capability. Closes LinuxCNC#4044
77c1e3e to
8274d2d
Compare
|
These tests are moot when you run on a non-RT kernel (like I do in dev). I'm not sure the noise is really necessary in that case. |
Only warn under PREEMPT_RT or RTAI; on a non-RT kernel the privileges do not matter, so the check would be noise.
|
silenced on non-rt |
|
Great. Every little bit helps. Reduces user frustration and more importantly less developer time wasted on spurious issues. |
|
Hmm, in #4044 in the images in the background you clearly see: So the new warning will not help that much. If desired, you can add it in the place where |
|
Connected to this: A general way for GUI's to show "You don't have realtime" warnings that you can not overlook would help the most. When milling, I start linuxcnc with the link. So no console. If I accidentally start the wrong kernel, bad luck. |
|
On a production system, you may want to warn all the time when this is amiss. Maybe something with a background color turning from gray into gray-red tinted and do something similar consistently in all GUIs. For dev builds you don't really want this because you want to see what the operator sees while you are working on stuff. I'd go for an opt-in choice by setting a value in the INI file. Maybe something like a boolean |
|
I was just checking the code. @grandixximo You already added a better warning in the C code in your nonroot patch, so this was probably not even applied in the screenshot. Now there is a double warning in the console. The note from C++ and the Warning from this PR: With latency-histogram there is also a pop-up. With latency-test, this doesn't work. And the most important app, linuxcnc, also just shows noting if you don't start it in a console. Anyone has a good idea to check easily and globally for real time capability? There is already a function rtapi_is_realtime(). This is not 100% reliable, if harden_rt() fails, it will return true, even if it runs in SCHED_OTHER. But this can be fixed. rtapi_is_realtime() is also linked to userspace apps but there it will not work, it checks if the userspace app has realtime... ;-) I could add a halcmd that checks for realtime. Or a pin that is true when all is ok, false otherwise. This could then be used in all gui's for an opt-in or opt-out warning. But I am not that deep into all these various gui's and how they communicate with the RT part. |
Might be an option that is default on when you deploy an default off in rip-mode? But this is annoying to test. Otherwise, I would tend for default on, dev's will manage it better to switch it off than users will fight not knowing that they don't have real time enabled. Instead of in ini, might be an environment variable LINUXCNC_NO_RT_WARN. Dev's can set it on their dev machine in .profile if they are annoyed and it works for all test configs. BTW, just brainstorming options. |
|
Just a POC, if you think a halcmd getrt (or better name) would help I can create a PR. Was easy. make setuid You find it on my fork: |
|
Thanks both, this is more useful than my original per-tool heuristic. I have pivoted the PR to use @hdiethelm's A few things I would like your input on, since they touch the broader direction in #4118:
This now depends on the |
|
Instantiating a HAL memory segment on invocation may be problematic. It will surely confuse because you have to remember to call halrun -U afterwards. That is not a good design. Opt-out policies are generally designed to force you to do a thing, even if you do not want to. That is why they should be avoided. |
Query realtime status with 'halcmd getrt' (hdiethelm's PR) rather than probing the setuid bit. latency-histogram asks inside its own running session, so no stray HAL segment is created. latency-test drops its check and relies on the existing "POSIX non-realtime" note.
The way
You see that in the following test where I added a message when rtapi_app exits: vs I don't see any big downside in doing it like that. But i also do not 100% like starting up rtapi_app just to exit right away. Better would be running it with or after halrun in the script. Or might be an approach using a signal / parameter. Of course, also RTAI / doc and so on needs to be checked / updated before I will call that ready. Better ideas are welcome. But I prefer using a check executing the same code path for realtime checks always instead the variant before where you then have most likely diverging real time checks spread in all possible apps. |
228e6ef to
a3df81c
Compare
|
@hdiethelm I have restructured to avoid the standalone HAL instantiation @BsAtHome flagged:
That keeps the scripts honest, but @BsAtHome's deeper point lands on |
|
@grandixximo Nice. I have to test it. I created a PR, feel free to rebase: @BsAtHome Any hint's what should be done in this case? Before my pr rtapi_app rework pr, even |
|
Crossed posts, @hdiethelm. Good, your "run it with/after halrun in the script" is exactly what I did for latency-histogram (getrt after |
|
Hmm, just an idea: @BsAtHome |
|
A (forced) popup is the equivalent of slapping someone in the face. The error message added to the GUI is actually not an error. Not running RT on a production system may be considered an error. If you look closely in AXIS' status bar you see "Kein Werkzeug". That is also the place where you want to warn the user. Add a status bar field that is obvious (light red background) yet not invasive. |
|
You have a point. I also get annoyed of all this popup's when using the good old microslop... :-D Now on the status bar: Good idea. How to do that? I am already somewhat deep in the C code, so I can add any needed support there but for GUI's, someone else has to take over. @grandixximo Can you do this based on whatever from the hal? halcmd / parameter / pin is easy to add for me. |
|
@hdiethelm thanks, I will rebase this onto #4132 once it settles. @BsAtHome agreed the popup is too much; I will drop the On the cross-GUI status bar, that is the right shape and I am happy to do the AXIS side (a status-bar field with a light-red background, like the existing tool slot) once @hdiethelm's "realtime ok" signal from #4132 exists. The question is where it lives. My preference is to keep #4107 narrow: rebased on getrt, popup gone, scoped to the latency tools. It can ship as soon as #4132 lands. The GUI-wide warning cannot be written until the HAL pin exists and touches AXIS, gmoccapy and qtvcp, so I would do it as a separate PR tracking #4118 rather than make this small change wait on the slowest part. That said, if you would rather have one PR own the whole intent, I am fine rescoping #4107 to the GUI-wide warning and retitling it; it just becomes larger and slower. Either way works for me. Which do you prefer? |
|
@hdiethelm to answer your "halcmd / parameter / pin" question directly: a |
|
I think we first have to agree on the proper conceptual design of how to detect in the different scenarios and what to do with it. |
|
@BsAtHome agreed, let me put a concrete proposal on the table. Detection: one source of truth. @hdiethelm's What to do with it: per UI, not one mechanism. The right surface differs by app, so each owns its own rendering rather than forcing a single widget everywhere:
One thing to rule out: coloring the window title bar / decoration is not reliable. Plenty of setups have no title bar at all (fullscreen/kiosk panels, some Wayland/WM configs), so the signal has to live inside the app window, not in the chrome. Still open (defer): production-vs-dev suppression policy. That can ride with #4118 once the pin and the per-UI rendering exist. Does that match how you see the scenarios? |
|
Sounds like a plan. I will create new PR with a signal. Then we can test how this feels and continue from there. I can do that tomorrow, right now I have other things to do. About the title bar: The idea was to only use this for the two test apps. If this is cumbersome, might be just modify the text that is already displayed in them. @grandixximo Can you mark this PR as a draft until we are done? Sorry about the for- and back. If i dont have a good solution yet, this is often my way of brainstoming. Try things and discard until it is good. Hope this is ok for you. |
That is only partly satisfactory because for this realtime needs to be running. You want to know in advance whether your system will be capable of running RT without starting any of it. Then, when you are running, you want to know from various applications what the actual status is by using generic API call or/and HAL pin. |
No problem, we brain storm it and come up with something that sticks. |
|
Hmm, like so often, more difficult than estimated. So basically two options are needed.
New concept: Check with script in advance / during runtime:
A pin that shows the status:
Some quirks where needed:
Ideas:
Tested on uspace / rtai, it works. Alternatives:
Feedback? |
|
Hmm, the tests really don't like my persistent hal_status component. Options:
|
Yes, that is why we need a plan first :-) There are three levels where you want to be able to query the RT state:
These levels of interaction may use different methods to give you the result. Using HAL while the system is up may not be an entirely bad idea. However, I do not think that most programs would or should use it. Most programs already interact with libraries and the information must be available through the C HAL library and python hal module. If you want a "persistent" component, then it needs to be HAL itself who does the work and it must be created when the HAL library initializes the shared memory segment. For the namespace, I'd suggest either All that said, if the data is available through hal_lib and halmodule, then why bother with a component? |




What
Both latency tools now warn, for a non-root user, when
rtapi_appis neither setuid root nor carrying thecap_sys_nicefile capability. In that state the realtime threads cannot get SCHED_FIFO scheduling or locked memory, so the reported latency is wildly inflated and unrepresentative.Why
In #4044 a run-in-place build was used without
sudo make setuid.rtapi_appran unprivileged, the latency numbers blew up, and this was initially mistaken for a code regression. It was not; the setuid/setcap step had simply been skipped. A plain warning would have saved the back-and-forth. @rodw-au suggested ageteuid()/setuid check in the issue thread.How
A small check runs before the test starts:
rtapi_app; if not found, stay silent.sudo make setuid): silent.cap_sys_nicefile capability present (sudo make setcap): silent.A pure euid check was deliberately avoided: a normal deb install runs the scripts as an unprivileged user yet works fine via setuid-root
rtapi_app, so euid alone would false-warn on every correct install. The real signal is the privilege state ofrtapi_appitself.latency-test(bash) prints to stderr;latency-histogram(tcl) prints to stderr and, in GUI mode, also shows atk_messageBox.Closes #4044