Fix: reduce open files due to dispatcher #2740
bummer, i cannot access jenkins so it's unclear to me what went wrong. I will test locally tomorrow. |
Hi @zack-vii, Thanks for submitting this proposed fix for GA's leaking sockets, Issue #2731. In addition to the files you've changed, I've also found it useful to change some additional files. My conjecture, perhaps wrong, is that the final solution will be a melding of our two proposed fixes. I am now building your PR locally on my dev system and running it through my suite of tests. Will post results here within an hour or two. |
Hi @zack-vii, Testing of this PR demonstrated that this PR fixes the primary case, but fails on an edge case. Pro -- This PR is a server-side fix for the primary cause of leaked sockets (i.e., the situation GA encountered). It is thus a more elegant solution than the client-only fix that I created. I agree that this PR should be used. On Thursday or Friday, I will complete a review of this PR. Con -- If Other -- Looks like the Next Steps -- I appreciate the assistance and collaboration. Here is my suggestion on how we should proceed:
Addendum |
Regarding the Con: I think the expected behavior is that the server holds one connection to each action server. That is, even after the phase or the shot is over and the cycle begins anew, the connection may be reused. My simple but effective tdi script for testing is:
and can be invoked from the development environment
# after checkout; usual setup
./bootstrap
./configure --debug # . . .
# enter development environment with all env vars set
make tests-env
# ch dir to root of repo
cd $MDSPLUS_DIR
# create folder for test tree
mkdir test_path
# update bins
make
# run test
gdb --args tditest dispatch-test.tdi
# update source ; goto update bins
During the Interesting to see would be how this holds if you add an active monitor server or more involved action_servers. |
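When running a test like the one above, one rough way to see whether descriptors accumulate is to count the open files of a process before and after a dispatch cycle. The following is a minimal sketch, assuming a Linux system where `/proc/<pid>/fd` is available; the helper name `count_open_fds` is hypothetical and is not part of MDSplus.

```python
import os
import socket


def count_open_fds(pid="self"):
    """Return the number of open file descriptors of a process (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    before = count_open_fds()
    # Simulate a leak: sockets that are created but never closed.
    leaked = [socket.socket() for _ in range(5)]
    print("descriptors gained:", count_open_fds() - before)
    # Closing the sockets should bring the count back to the starting value.
    for s in leaked:
        s.close()
    print("descriptors after cleanup:", count_open_fds() - before)
```

To watch another process (for example an action server), pass its pid instead of the default `"self"`; reading another user's `/proc/<pid>/fd` may require elevated permissions.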
I may have found the issue with the python |
Hi @zack-vii, Thanks for the additional detail. My test harness is comparable, and I do see one retained connection that you describe. That works well and is not a concern. The edge cases I am investigating will likely not arise often during practical use of The current edge case I have been investigating consists of slow actions (e.g., Thanks for mentioning the |
Hi @zack-vii, I've studied every line of this PR -- it is a nice fix! And the code refactoring also adds clarity. After my lunch break, I will submit a review and approval for this PR. |
Hi @zack-vii, Regarding
In the Note though that on the client-side, I encountered a different race condition. After the client calls We won't be using my client-side fix, thus the specific problem I created with that fix vanishes. However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the |
I will approve this PR after Jenkins is able to build it on all platforms. It presently fails on RHEL7 and Windows. RHEL7
Windows
|
the server should never cancel out of a regular mdsip request unless it times out between the messages of the same request, is terminated, or is interrupted in some out-of-the-ordinary way. hence the result of a request should be independent of the state of the task (scheduled, executing, done). moreover, if a job is scheduled it should return success. a race condition may arise only if the reply is lost. because of tcp we would probably run into a timeout and assume the dispatch was incomplete when it actually was not. that should be very rare and requires an underperforming network considering the traffic. |
@mwinkel-dev thanks for pointing out the issues with the jenkins checks.. looks like they are known issues that i simply paid no attention to. i will see if i can sort them out over the weekend... the power of docker ;) |
Hi @zack-vii, Thanks for answering my questions. And also for coming up with a much better fix than I had on the client-side. Monday is a holiday for us. So we'll resume work on this next week. |
fingers crossed, it went through on my machine but if it fails please hint me the failing platforms. |
@zack-vii @mwinkel-dev It looks like it passed. :-) |
This server-side fix should be viewed as a partial fix of Issue #2731.
- Testing shows that during normal usage scenarios, this PR prevents sockets from leaking.
- However, it does not address some "edge cases" that can cause many sockets to leak when mdstcl and/or the action server are overloaded.
The summary of this code review is in the following post.
#2740 (comment)
The full details consist of this post and all following posts.
#2740 (comment)
Note: -- It is probable that the complete fix of Issue #2731 will require additional PRs.
Hi @zack-vii, I have just approved this PR. It successfully passed the Jenkins build. And also passed much (but not all) of my testing. It can now be merged to the "alpha" branch. Hi @smithsp, This PR is a partial fix of Issue #2731. In my opinion, it is robust enough to handle GA's normal workflows. However, there are some "edge cases" that still leak sockets (albeit at a much slower rate than when GA had to reboot a server on 25-Mar-2024). You should decide if you want this partial fix now, or instead wish to wait until the full fix is available. If you want this partial fix now, it will take us a day or so to cherry-pick it into the GA branch, build the packages, do another round of testing, and distribute the build to GA. |
@mwinkel-dev : can you give me some details about the edge cases that are still leaking. |
Hi @zack-vii, I am testing the "action server" as I would any web server -- normal load (which is A-OK), spike load (fails), heavy continuous load (still to do) and so forth. For the spike load test, I have I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports. Although the spike load condition is unlikely to arise during normal workflows at GA, if it ever does then it might force GA to reboot the physical server. Which disrupts the work of many users and thus is a serious problem. My hunch is that the fix for the spike load test will involve a client-side ( Now that your client-side fix (i.e., this PR) has been merged to alpha, I will continue my investigation / testing. When I find the root cause of the problem, I will report my findings via GitHub (i.e., in a new issue). |
@mwinkel-dev We would like to take you up on your offer to get a cherry-pick version of this partial fix to our GA version with RPM kits. Thank you for your efforts. |
Hi @smithsp, OK, will do. Here are the steps.
Note: -- Unless GA objects, we will also include PR #2735 in the cherry-pick to the GA branch. That is a simple change that can be useful when troubleshooting multi-threaded code. It would be useful to have that feature in the GA branch too. |
Hi @smithsp and @sflanagan, Before this PR was merged to the alpha branch, it passed all automated tests that Jenkins runs (on all platforms). After it was merged to alpha, it also passed the following manual tests (performed on Ubuntu 20.04).
Caveats
Summary |
the 100 ports are given by the default of the MDSIP_REPLY_PORT_RANGE (? or similar) env var, which i think ranges 8800-8899. it was used as one port per action server. I think this is not the case anymore, as there is one listening port that handles all replies. I will try to replicate your setup. Do you dispatch the actions in a single thread or multi-thread? |
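To make the arithmetic behind the "100 ports" concrete, here is a toy model of a fixed port range that is consumed one port at a time. It is only a sketch under stated assumptions: the range string `"8800-8899"` is taken from the comment above, while `parse_port_range` and `PortPool` are hypothetical names and do not correspond to MDSplus code. As the comment notes, the current implementation may use a single listening port for all replies, in which case this model would not apply.

```python
def parse_port_range(text):
    """Parse 'low-high' into an inclusive (low, high) tuple."""
    low, high = (int(part) for part in text.split("-", 1))
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {text!r}")
    return low, high


class PortPool:
    """Toy model of a fixed range of ports (illustrative, not MDSplus code)."""

    def __init__(self, low, high):
        self.free = list(range(low, high + 1))
        self.in_use = set()

    def acquire(self):
        """Hand out the next free port, or fail if none is left."""
        if not self.free:
            raise RuntimeError("port range exhausted")
        port = self.free.pop(0)
        self.in_use.add(port)
        return port

    def release(self, port):
        """Return a port to the pool so it can be handed out again."""
        self.in_use.remove(port)
        self.free.append(port)
```

With the range `8800-8899` the pool holds 100 ports, so 100 acquisitions without a matching `release` exhaust it; this is the kind of failure a burst of leaked connections could trigger if each one held its own port.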
Hi @zack-vii, This post has three topics: PR #2740 behavior (versus prior), client-side cleanup, and my test harness. 1) With PR 2740 versus Without 2) Client-Side Cleanup 3) My Test Harness However, my test harness is also based on the client-side only fix that I created for Issue #2731. I have made numerous changes to my test harness to eliminate features that are now handled by your PR #2740. It is thus possible (probable?) that I am observing a bug in my test harness and not in the Summary Addendum For 2), I experimented with an approach that allowed the "receiver" threads to access the "thread-static information" in the "main" thread. My experimental code worked OK when it was configured to just read the linked list of connections. However, if it was configured to delete connections, then it caused deadlock when under heavy load. My hope is that with this PR #2740, we can ignore the client-side data structures -- however more testing should be done to confirm that doing so is OK. Regarding 3), my test harness presently consists of a single instance of |
Hi @zack-vii -- Likely root cause of the edge case that leaks sockets is simply the difference between re-using an existing connection and creating new connections.
I will do a few more experiments this evening to confirm if this conjecture is indeed true. |
Hi @zack-vii, Major mystery solved. And yes, my test harness was too extreme. The crux of the matter is that the architectures of the Now that I have a better understanding of the services, I can answer my own questions. 1) With PR 2740 and Without 2) Client-side Cleanup 3) My Test Harness Addendum |
Prior to this PR, the sockets were being leaked in the This PR fixes that by changing The other changes made by this PR are also useful (refactoring for clarity, using mutex to protect data structures, and so forth). |
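The general idea of reusing one connection per action server, rather than opening a new one for each dispatch, can be sketched as follows. This is an illustration only and not the PR's C code: it is based on the commit summary item "reuse action_server connection id in ServerConnect; avoid duplicates in list", and `ConnectionRegistry`, `get` and `drop` are hypothetical names. The lock stands in for the mutex protection mentioned above.

```python
import threading


class ConnectionRegistry:
    """Keeps at most one connection per server name (illustrative only)."""

    def __init__(self, connect):
        self._connect = connect      # callable: server name -> connection
        self._by_server = {}
        self._lock = threading.Lock()

    def get(self, server):
        """Return the existing connection for `server`, creating it once."""
        with self._lock:
            conn = self._by_server.get(server)
            if conn is None:
                conn = self._connect(server)
                self._by_server[server] = conn
            return conn

    def drop(self, server):
        """Forget a dead connection so the next get() reconnects."""
        with self._lock:
            return self._by_server.pop(server, None)

    def __len__(self):
        with self._lock:
            return len(self._by_server)
```

Because every lookup and insertion happens under the same lock, two threads asking for the same server cannot both create a connection, so the registry never holds duplicates for one server.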
* Fix: reuse action_server connection id in ServerConnect; avoid duplicates in list
* Fix: set dispatched early; unset if dispatching failed; prevent race on fast actions
* Fix: lock Clients in ServerQAction; cleanup and check before use
* Fix: reconnect dropped connections
* Fix: use correct windows SOCKET print format
* Fix: satisfy rhel7 c standard
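The commit summary item "set dispatched early; unset if dispatching failed; prevent race on fast actions" describes a common ordering pattern: record the state change before starting the operation, and roll it back if the operation fails. A minimal sketch of that pattern follows, assuming a simplified model; `Action`, `dispatch` and `on_done` are hypothetical names and do not mirror the actual MDSplus data structures.

```python
import threading


class Action:
    """Simplified stand-in for a dispatched action."""

    def __init__(self, name):
        self.name = name
        self.dispatched = False
        self.done = False
        self.lock = threading.Lock()


def dispatch(action, send):
    """Mark the action as dispatched *before* sending it.

    If the flag were set only after send() returned, a very fast action
    could report completion while the flag was still unset.
    """
    with action.lock:
        action.dispatched = True
    try:
        send(action)
    except Exception:
        with action.lock:
            action.dispatched = False   # roll back: nothing was dispatched
        return False
    return True


def on_done(action):
    """Completion handler; expects the action to be marked as dispatched."""
    with action.lock:
        if not action.dispatched:
            raise RuntimeError("completion for an action that was never dispatched")
        action.done = True
```

Passing `on_done` itself as the `send` callable models the extreme case of an action that completes before `send` returns; with the flag set early, the completion handler still sees a consistent state.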
* Gm apd java (#2729)
  * Improve APD support for Java interface
  * Improve APD support for Java - forgotten files
  * Commit packages
* When activate debug trace, now compiles without error. (#2735)
  This fixes Issue 2734.
* Fix: reduce open files due to dispatcher (#2740)
  * Fix: reuse action_server connection id in ServerConnect; avoid duplicates in list
  * Fix: set dispatched early; unset if dispatching failed; prevent race on fast actions
  * Fix: lock Clients in ServerQAction; cleanup and check before use
  * Fix: reconnect dropped connections
  * Fix: use correct windows SOCKET print format
  * Fix: satisfy rhel7 c standard
* Gm apd thin cpp (#2742)
  * Added ADP support in C++ thin client
  * Added tdi fun
  * Added TDI FUn
  * Fix commands
* Gm new marte (#2743)
  * more parameters for marte2_simulink_generic
  * Proceed with the new implementation
  * Proceed
  * Proceed
  * Proceed
  * Proceed
  * Proceed
  * proceed
  * Proceed
  * Proceed
  * Partially tested version
  * Added execution times recording
  * Proceed
  * Procced with debugging
  * Proceed
  * Proceed
  * Proceed
  * Fixes for multisampled acquisition
  * Remove quotes from string parameters
  * Minor fixes
  * Procced debugging
  * Debugging
  * More channels
  * Debug Distributed configuration
  * Fix sognal recording for synchronized inputs
  * Further debug
  * Further debug
  * Small fixes
  * Close ti final version
  * Forgotten fix
  * Make port visible, fix parameter name
  * unaligned nids
  * Increase DiscontinuityFactor
  * Discontinuityfactor
  * More channels
  * Proceed with the new implementation
  * Proceed
  * Proceed
  * Proceed
  * Proceed
  * Proceed
  * proceed
  * Proceed
  * Proceed
  * Partially tested version
  * Added execution times recording
  * Proceed
  * Procced with debugging
  * Proceed
  * Proceed
  * Proceed
  * Fixes for multisampled acquisition
  * Remove quotes from string parameters
  * Minor fixes
  * Procced debugging
  * Debugging
  * More channels
  * Debug Distributed configuration
  * Fix sognal recording for synchronized inputs
  * Further debug
  * Further debug
  * Small fixes
  * Close ti final version
  * Forgotten fix
  * Make port visible, fix parameter name
  * unaligned nids
  * Increase DiscontinuityFactor
  * Discontinuityfactor
  * More channels
  * Packages updated
  * Remove print
  * Remove error messages

  ---------

  Co-authored-by: mdsplus <mdsplus@roactive2.rfx.local>
* Docs: Improve documentation for getSegment* python wrappers (#2732)
  Add explanation and rename parameters for:
  * getSegmentLimits
  * getSegmentList
* Fix: Update JAVASOURCE to 8 to support JDK 17 (#2747)
* Fix: improve mdstcl's error handling and add comments (#2746)
  * add comments regarding action service
  * send_reply() now does cleanup_client() on bad socket
  * explain mdstcl's receiver thread cannot access main thread's connection list
  * Improve handling of non-MDSplus error codes
  * add comments regarding action dispatch
  * add comment explaining receiver thread select loop
* Fix: multiple string escape warnings thrown by python 12 (#2748)
  ```
  mdsplus/pydevices/RfxDevices/FAKECAMERA.py:40: SyntaxWarning: invalid escape sequence '\C'
    {'path': ':EXP_NODE', 'type': 'text', 'value': '\CAMERATEST::FLIR:FRAMES'},
  mdsplus/pydevices/RfxDevices/PLFE.py:220: SyntaxWarning: invalid escape sequence '\#'
    '^(\#[0-5][01]([01][0-9][0-9]|2[0-4][0-9]|25[0-5])){6}$', msg)
  mdsplus/pydevices/RfxDevices/CYGNET4K.py:361: SyntaxWarning: invalid escape sequence '\E'
    self.serialIO(b'\x55\x99\x66\x11\x50\EB', None)
  mdsplus/pydevices/RfxDevices/CYGNET4K.py:461: SyntaxWarning: invalid escape sequence '\8'
    return self.setValue(b'\81\x82', min(0xFFF, value), True)
  mdsplus/pydevices/MitDevices/dt100.py:161: SyntaxWarning: invalid escape sequence '\.'
    regstr = '([0-9\.]*) [0-9] ST_(.*)\r\n'
  ```
  The \CAMERATEST became \\CAMERATEST
  The regex strings should be python r-strings `r""`, but to maintain backwards compatibility, we're using \\
  The broken hex-codes now have x in them
* Build: Resolve linker error after updating the windows builder to Fedora 39 (#2749)
  * Build: Resolve linker error after updating the windows builder to Fedora 39
    This appeared after updating the mdsplus/builder:windows docker image to Fedora 39, and Wine to 9.0
    The newer libxml2 tried to link dynamically unless we explicitly set LIBXML_STATIC
  * Hopefully fix the MdsTreeNodeTest
    It turns out that this was failing previously, but we weren't properly catching the error
  * Fix errors in windows build from newer gcc
* Docs: Update sites.csv (#2615)
  add Startorus Fusion in Xi'an, China
* Fix: mdsip now sends proper auth status back to the client (#2752)
  Fixes issues #2750 and #2652
* Fix: mdstcl's `show current` no longer segfaults when no tree paths defined (#2754)
  * Fix: "show current" no longer segfaults when no tree paths defined
  * Fix: corrected typo in error message
  * Use original error message so tests pass
* Fix: Add Debian 12 and Ubuntu 24.04 and support GCC 12+ (#2753)
  * Build: Add Debian 12 and Ubuntu 24.04
  * Add extra flags for GCC 12+ and stub imp for Python 3.12
    GCC 12+ triggers a bunch of false positive warnings (which we treat as errors)
    This adds AX_C_FLAGS to configure those `-Wno-*` flags for GCC 12+
    `cmdExecute.c` now uses snprintf to avoid buffer overflow warnings, also generated by GCC 12+
    `compound.py.in` now supports Python 3.12+
  * compound.py now supports Python 2.7.. again

  ---------

  Co-authored-by: Stephen Lane-Walsh <slwalsh@psfc.mit.edu>
* Fix: Improve error messaging when calling Setup Device in jTraverser (#2744)
  * Improve error messaging when calling Setup Device in jTraverser
    e.getMessage() sometimes returned null, but just e will always print something
    Add a printStackTrace() for InvocationTargetException exceptions to show the encapsulated error
  * Add import for InvocationTargetException
* Build: Fix off-by-one versions produced by Jenkins (#2756)
  This fixes the bug where `--os=bootstrap` wasn't receiving the version from `--version=x.y.z`
  However, confusingly, this also changes the Jenkinsfile to not use that feature, and instead use `git tag` in order to embed the proper git information as well as the proper version information
  The `--os=bootstrap` and `--version` fix is still included just so that it doesn't break if someone else tries to use it
* Build: Increase default test timeout to 1h (#2757)
  When the build server(s) are at capacity, it's not unreasonable for a test to take more than 10 seconds, which was the old default timeout
  This sets the default to 1h, and removes the overrides in various tests
* Gm fix filter (#2755)
  * Allow filtering data from MinMax resampling; remove useless thread in jServer
  * Fix compile error
  * Remove debug message
  * Make Windows Compiler happy
* Build: Fix 'HEAD' in `show version` and tag error (#2758)
  Jenkins builds in a detached HEAD state, which caused bootstrap to use HEAD as the branch name
  We pass --branch= to the bootstrap call in Jenkins, but $BRANCH wasn't being passed into the bootstrap docker container
  Also, attempts to build alpha versions with tags that already existed failed
* Fix: mdstcl show version tag and links (#2760)
  Fixes Issue #2759
* Feature: CompileTree will exit with non-zero status code for error messages. (#2446)
  And error message should go to stderr.
* Build: Add package override for ubuntu and debian (#2761)
  Override sections for Ubuntu 24 and Debian Bookworm were added.
* Fix: Python release version tag (#2764)
* Feature: Add "Date:" to show version output (#2767)
  Implements #2766
  Example:
  ```
  $ mdstcl sho ver
  MDSplus version: 7.140.75
  ----------------------
  Release: alpha_release-7-140-75
  Date: Thu May 16 17:43:14 UTC 2024
  Browse: https://github.com/MDSplus/mdsplus/tree/alpha_release-7-140-75
  Download: https://github.com/MDSplus/mdsplus/releases/tag/alpha_release-7-140-75
  ```
* Fix: remove abort flag from RfxDevices DIO2 initialization (#2769)
  Fixes issue #2768
* Fix: Missing repo metadata signing (#2770)
  This will hopefully fix the lack of signed metadata files that are preventing us from automatically publishing releases

---------

Co-authored-by: GabrieleManduchi <gabriele.manduchi@igi.cnr.it>
Co-authored-by: mwinkel-dev <122583770+mwinkel-dev@users.noreply.github.com>
Co-authored-by: Timo Schroeder <zack-vii@users.noreply.github.com>
Co-authored-by: mdsplus <mdsplus@roactive2.rfx.local>
Co-authored-by: Josh Stillerman <jas@psfc.mit.edu>
Co-authored-by: Fernando Santoro <44955673+santorofer@users.noreply.github.com>
Co-authored-by: Louwrensth <Louwrensth@users.noreply.github.com>
This is related to issue #2731 and may fix some if not all of the related problems.