Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linuxcncrsh: fixes & cleanup #2739

Merged
merged 3 commits into from
Dec 14, 2023
Merged

linuxcncrsh: fixes & cleanup #2739

merged 3 commits into from
Dec 14, 2023

Conversation

heeplr
Copy link
Contributor

@heeplr heeplr commented Nov 7, 2023

This fixes the current linuxcncrsh segfault bug and a lot of other issues.
I also cleaned/modernized parts of the code, added "get update" command and re-enabled the test (disabled in eca9685), which should pass now.
The individual commits provide verbose details.

@heeplr heeplr changed the title linuxcncrsh: fixes & partial cleanup linuxcncrsh: fixes & cleanup Nov 7, 2023
@andypugh
Copy link
Collaborator

andypugh commented Dec 11, 2023

Thanks for giving linuxcncrsh some long overdue attention.
I have not gone through the code in detail, but I am a little concerned that you seem to have changed the name of one of the commands (from set_wait to wait_mode). My worry is that this will possibly break existing scripts in user configs.

@heeplr
Copy link
Contributor Author

heeplr commented Dec 11, 2023

Thanks for giving linuxcncrsh some long overdue attention.

I think it's a great way to automate linuxcnc and do testing. I also created a quick, uncleaned proof of concept for a rewrite using libcli that should provide a modern UX and cleaner code.

you seem to have changed the name of one of the commands (from set_wait to wait_mode)

Yes. I couldn't find out what "set" refers to other than "set wait mode". I suspected it's an artefact from times when the set command didn't exist and someone forgot to remove the "set_" from the name. That resulted in the weird set set_mode ... construction.

My worry is that this will possibly break existing scripts in user configs.

Indeed there is a possibility. But since linuxcncrsh was almost completly broken beyond sending mdi commands and set_wait, the chance should be pretty low. I'm quite sure that it should fail gracefully, tho unless users don't check for NAKs/errors. So mentioning it in the ChangeLog/Release notes should probably be fine.

I'd prefer to keep it if my assumption is right and the name is "just wrong". But I have no strong feelings about changing it back since it wouldn't exactly hurt to rename it later (e.g. on a new major release).

@andypugh
Copy link
Collaborator

As an interim measure, could the code respond to both the new and old commands identically? (possibly with a deprecation warning?)
We have a script to auto-update configs, but it only works with HAL and INI files. (Because you can find the HAL files by parsing the INI, and the INI is passed as a parameter to the linuxcnc startup script.

@heeplr
Copy link
Contributor Author

heeplr commented Dec 11, 2023

As an interim measure, could the code respond to both the new and old commands identically? (possibly with a deprecation warning?)

That's a good solution. I'll add that.

We have a script to auto-update configs

Autoupdate would be nice, but I'd argue that it's not a config value but an API change and updating user scripts/code/3rd party applications probably is out of scope of linuxcnc.

…ion parameters)

* rename pch -> s uniformly
* fix double-tokenize e.g. in setDebug()
* lots of cleanup for get/setSpindle(), get/setSpindleOverride() and get/setBrake()
* fixed segfault of "get spindle" without argument
* fixed snprintf() compiler warnings
* support multiple spindles
* make spindlenum the last input argument
* output NML errors after NML failures (untested)
* fix off-by-one in max-spindle test
* replace EMCMOT_MAX_SPINDLES with actual number of spindles
* add missing "zero input items parsed" check for sscanf()
* fix cppcheck warnings
* handle NML errors in setEstop()
* add "get update" command
* replace write() and sockWrite() with dprintf() where applicable
* eliminate sockWrite()
* fix all \n\r -> \r\n (remember "ReturN")
* pass string literals if there's no format string
* prefer using string literals
* use spindle 0 by default, not all spindles
* fix helptext
* handle errors in all send*() functions
* rename *_s -> *Str according to previous style
* fix missing argument documentation
* make getTeleopEnable() output ON/OFF instead of YES/NO according to setTeleopEnable() using checkOnOff(); fix help text
* rename set_wait -> wait_mode
* eliminate initMain()
* rename initSockets -> initSocket
* remove unneeded includes
* replace strncpy() with rtapi_strlcpy(); commenting
* use sizeof() to loop over array instead of last array element; prefer for over while
* introduce OUT() helper macro
* replace snprintf(context->outBuf,...) with OUT() macro; don't use string substitution excessively
* output TOOL_OFFSET as double
* make axisnumber() parse all possible notations
* treat set/getFeedOverride percent as double
* fix commandSet() returns
* rewrite getAbs/Rel*Pos, getJointPos, getPosOffset
* fix commandGet return values
* cleanup parseCommand
* formatting
* more error handling
* more error user feedback
* more input validation
* minor cleanups
@heeplr heeplr force-pushed the linuxcncrsh branch 2 times, most recently from 89da543 to eafefff Compare December 11, 2023 14:45
@heeplr
Copy link
Contributor Author

heeplr commented Dec 11, 2023

I've added the alias and encountered two race conditions when homing and probing in the tests.
The issue is likely unrelated to linuxcncrsh and I disabled the test for now.

@andypugh andypugh merged commit c61d428 into LinuxCNC:master Dec 14, 2023
11 checks passed
@andypugh
Copy link
Collaborator

andypugh commented Jan 2, 2024

@heeplr Can you look at why linuxcnc builds on the buildbot are failing? It's the linuxcncrsh test. But it doesn't always fail, and I have not been able to reproduce it.
Example: https://github.com/LinuxCNC/linuxcnc/actions/runs/7380505624/job/20077951641?pr=2823
(It might be due to a bad merge on my part, but I don't think so)

@heeplr
Copy link
Contributor Author

heeplr commented Jan 2, 2024

It's failing because of a race condition. The test does set teleop_enable on but a subsequent get teleop_enable still returns off.
The initial set wait_mode done should have prevented that if I understand correctly. But it seems that it's not working. I couldn't find a mechanism that would check when a command is actually done executing so the call could block until then.

I encountered multiple of those race conditions with probing and homing aswell. I ended up disabling test commands to prevent the test from failing.
This roots way deeper than linuxcncrsh but I can't tell exactly where the culprit is. It probably exists for a long time but hasn't been tested.

Maybe just comment out failing commands for now?

PS: Is there another integration test that does something like "run a complete real-world job on simulated hardware"? I couldn't find any (but maybe missed it).

@petterreinholdtsen
Copy link
Collaborator

petterreinholdtsen commented Jan 2, 2024 via email

@heeplr
Copy link
Contributor Author

heeplr commented Jan 2, 2024

to loop for a few seconds

Adding a few "sleep" commands came to my mind but with all the other racing commands, that would be a lot of "few seconds" and would probably increase total runtime of the test significantly. From a testing POV, this would certainly be better than disabling the tests completely.

Fixing the race conditions or removing the set_wait (now wait_mode) command - i.e. not giving guarantees to the user - would be best.

Note that the test isn't complete. I'm not familiar enough to create a representative "real world job" test and couldn't find a template to copy. But in theory, it should test a lot more different modes/operations/combinations.

@petterreinholdtsen
Copy link
Collaborator

petterreinholdtsen commented Jan 2, 2024 via email

@heeplr
Copy link
Contributor Author

heeplr commented Jan 2, 2024

@petterreinholdtsen yes, such a loop would help. Feel free to add it. I'd rather spend my time hunting down the actual issue tho (edit: after #2760 is resolved) , since there must be some mechanism for "block-until-cmd-done" already, which is broken.

petterreinholdtsen added a commit to petterreinholdtsen/linuxcnc that referenced this pull request Jan 2, 2024
As can be seen in several test failures on github and discussed in
LinuxCNC#2739, the linuxcncrsh test
is unstable.  It is believed to be caused by a race condition.  Until
the race condition is fixed, I believe it is best to skip the test.
@petterreinholdtsen
Copy link
Collaborator

petterreinholdtsen commented Jan 2, 2024 via email

@heeplr
Copy link
Contributor Author

heeplr commented Jan 2, 2024

to avoid blocking other patches from getting a successful test.

@petterreinholdtsen are we 100% sure that this isn't caused by some recent patch? Since the test did run fine for three weeks until now. Might be worth confirming before merging #2824 to silence the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants