Fix: reduce open files due to dispatcher #2740
Merged
merged 6 commits into alpha from zck_dispatch, Apr 16, 2024

Conversation

zack-vii
Contributor

This is related to issue #2731 and may fix some, if not all, of the related problems.

  • reuse the action server's connection id (prevents extra sockets if a connection already exists)
  • set the dispatched flag before dispatching to prevent a possible race condition
  • lock the Clients list in ServerQAction and check the socket before reusing it

@zack-vii zack-vii changed the title Fix: reduce open files bue to dispatcher Fix: reduce open files due to dispatcher Apr 10, 2024
@zack-vii
Contributor Author

Bummer, I cannot access Jenkins, so it's unclear to me what went wrong. I will test locally tomorrow.

@mwinkel-dev
Contributor

Hi @zack-vii,

Thanks for submitting this proposed fix for GA's leaking sockets, Issue #2731.

In addition to the files you've changed, I've also found it useful to change some additional files. My conjecture, perhaps wrong, is that the final solution will be a melding of our two proposed fixes.

I am now building your PR locally on my dev system and running it through my suite of tests. Will post results here within an hour or two.

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 11, 2024

Hi @zack-vii,

Testing demonstrated that this PR fixes the primary case but fails on an edge case.

Pro -- This PR is a server-side fix for the primary cause of leaked sockets (i.e., the situation GA encountered). It is thus a more elegant solution than the client-only fix that I created. I agree that this PR should be used. On Thursday or Friday, I will complete a review of this PR.

Con -- If mdstcl dispatches actions so quickly that it overloads the action server, then some actions will be marked as "failed" by the client. Failed actions remain connected to the action server but never receive a SrvJobFINISHED, so they become zombie actions and leak sockets. Even with this PR (#2740), my testing was able to cause ~100 sockets to leak. (I am working on a client-side fix for that edge case. However, you are more familiar with this code than I am. It is A-OK for us both to work on it and then compare our solutions.)

Other -- Looks like the Client_do_message() of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section.

Next Steps -- I appreciate the assistance and collaboration. Here is my suggestion on how we should proceed:

  • I'll do a detailed review of this PR, plus do some more testing.
  • So that after Jenkins builds this PR, I can approve it.
  • Whereupon you can merge it into alpha.
  • We'll then add a new PR for the "failed action" edge case.
  • I'll add a third PR to clean up some other cruft that I've spotted in the dispatcher.
  • After we are satisfied that all is OK, then we'll cherry-pick the associated PRs from alpha to GA's branch.
  • And then GA's branch will be built and the RPMs distributed to GA.

Addendum
Many of the following posts by @mwinkel-dev contain conjectures that later proved wrong. For the summary of the investigation, refer to the post in Issue #2731 referenced by this link.
#2731 (comment)

@mwinkel-dev added the bug (An unexpected problem or unintended behavior), US Priority, and tool/tcl (Relates to the Tree Control Language or mdstcl prompt) labels on Apr 11, 2024
@zack-vii
Contributor Author

Regarding the Con: I think the expected behavior is that the server holds one connection to each action server. Even after the phase, when the shot is over and the cycle begins anew, the connection may be reused.
The issue on the client side was that it did not properly detect whether the socket was disconnected. As far as I can tell from the current implementation, there is only one detached thread that handles the listening socket for incoming action_server replies as well as the established connections.

My simple but effective TDI script for testing is:

_root=getenv('MDSPLUS_DIR')
setenv(CONCAT('test_path=',_root,'/test_path'))
_dispatch=BUILD_DISPATCH(2, 'DUMMY', 'INIT', 50, "")
_task=BUILD_PROGRAM(1, "echo test")
_action=BUILD_ACTION(`_dispatch, `_task)
treeopennew('test', 1)
treeaddnode("act", _n, 'ACTION')
treeputrecord("act", `_action)
FOR (_i=1;_i<=3000;_i++) EXECUTE("treeaddnode($1, _n, 'ACTION');treeputrecord(`$1, `$2);", EXECUTE("DECOMPILE(`$)", _i), _action)
treewrite()
treeclose()
setenv('DUMMY=localhost:30000')
tcl('set tree test/shot=1')
tcl('dispatch act')
tcl('dispatch/build')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WRITE(1, _out)
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
SPAWN('killall mdsip')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
tcl('dispatch/close')
SPAWN('killall mdsip')

and can be invoked from the development environment:

# after checkout; usual setup
./bootstrap
./configure --debug  # . . .
# enter development environment with all env vars set
make tests-env
# ch dir to root of repo
cd $MDSPLUS_DIR
# create folder for test tree
mkdir test_path
# update bins
make
# run test
gdb --args tditest dispatch-test.tdi
# update source ; goto update bins

The WAIT(1) calls are good spots to break in (gdb: Ctrl+C) and inspect p *Clients and p *Clients->next; you will hopefully find that Clients->next becomes NULL soon after the mdsip service is terminated. You can comment out (#) or move around the SPAWN calls, or even fire up an external server (make sure the env vars are set), to observe its behavior when the dispatch server restarts or stays alive in-between dispatches.

It would be interesting to see how this holds up if you add an active monitor server or more involved action_servers.

@zack-vii
Contributor Author

I may have found the issue with the python dcl_dispatcher_test.py. When checking for an existing connection, we need to check if the conid is still valid.
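
As a generic illustration of such a validity check -- plain POSIX sockets, not the actual MDSplus connection API, and with a hypothetical helper name -- one way to test whether a cached socket is still usable before reusing its conid:

```
/* Hypothetical helper, plain POSIX: peek at the socket without blocking.
 * A return of 0 from recv() means the peer closed the connection, so the
 * cached conid should not be reused. */
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

static bool socket_still_valid(int sock)
{
  char byte;
  ssize_t n = recv(sock, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
  if (n > 0)
    return true;  /* data pending; connection is alive */
  if (n == 0)
    return false; /* orderly shutdown by the peer */
  return errno == EAGAIN || errno == EWOULDBLOCK; /* alive but idle */
}
```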

@mwinkel-dev
Contributor

Hi @zack-vii,

Thanks for the additional detail.

My test harness is comparable, and I do see the one retained connection that you describe. That works well and is not a concern.

The edge cases I am investigating will likely not arise often during practical use of mdstcl. However, it is a big deal if leaked sockets force GA to reboot the entire physical server. Thus, I have been stress testing the dispatch feature of mdstcl to characterize its failure modes. (My goal is to confirm that if the edge cases do arise, it won't be necessary to reboot the server every day.)

The current edge case I have been investigating consists of slow actions (e.g., wait(2.0);1;) with mdstcl dispatching at 10 actions per second or faster. That is how I ended up with ~100 leaked sockets.

Thanks for mentioning the action monitor and different dispatch phases. I will also do experiments with those features.

@mwinkel-dev
Contributor

Hi @zack-vii,

I've studied every line of this PR -- it is a nice fix! And the code refactoring also adds clarity.

After my lunch break, I will submit a review and approval for this PR.

@mwinkel-dev
Contributor

Hi @zack-vii,

Regarding ServerQAction.c file, I like the changes you made.

  • The send_reply() function now always calls cleanup_client(), thereby preventing the leaked sockets. (The old code only cleaned up when an error occurred.)
  • Refactoring AttachPort() into two functions makes the code easier to understand.
  • Refactoring RemoveClient() into two functions also adds clarity.
  • Excellent that a mutex now protects the manipulations of the Clients linked list in these functions: add_client(), find_client() and remove_client() (see the sketch below).
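
For reference, a minimal sketch of that locking pattern, with hypothetical type and variable names (the actual ServerQAction.c code differs in its details):

```
/* Minimal sketch of the mutex-protected list pattern (hypothetical names). */
#include <pthread.h>
#include <stdlib.h>

typedef struct client_node
{
  int sock;
  struct client_node *next;
} client_node;

static client_node *clients = NULL;
static pthread_mutex_t clients_lock = PTHREAD_MUTEX_INITIALIZER;

static void add_client(int sock)
{
  client_node *c = malloc(sizeof(*c));
  if (!c)
    return;
  c->sock = sock;
  pthread_mutex_lock(&clients_lock);
  c->next = clients; /* push onto the shared list while holding the lock */
  clients = c;
  pthread_mutex_unlock(&clients_lock);
}

static client_node *find_client(int sock)
{
  client_node *c;
  pthread_mutex_lock(&clients_lock);
  for (c = clients; c && c->sock != sock; c = c->next)
    ;
  pthread_mutex_unlock(&clients_lock);
  return c; /* caller must cope with the node being removed afterwards */
}

static void remove_client(int sock)
{
  client_node **p;
  pthread_mutex_lock(&clients_lock);
  for (p = &clients; *p != NULL; p = &(*p)->next)
  {
    if ((*p)->sock == sock)
    {
      client_node *dead = *p;
      *p = dead->next; /* unlink while holding the lock */
      free(dead);
      break;
    }
  }
  pthread_mutex_unlock(&clients_lock);
}
```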

In the ServerDispatchPhase.c file, moving the actions[i].dispatched = 1 line eliminates a race condition.
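
A minimal sketch of why that ordering matters, with hypothetical names: the action server can report completion before the dispatch call returns, so the flag must already be set when the reply arrives.

```
/* Hypothetical sketch of the ordering (the real code is in
 * ServerDispatchPhase.c).  Marking the action as dispatched before
 * sending means a fast "finished" reply cannot observe an action that
 * does not yet look dispatched. */
typedef struct
{
  int dispatched;
  /* ... other bookkeeping ... */
} ActionInfo;

extern int send_dispatch(ActionInfo *a); /* hypothetical send helper */

static void dispatch_one(ActionInfo *a)
{
  int status;
  a->dispatched = 1;     /* mark first, before the reply can race us */
  status = send_dispatch(a);
  if ((status & 1) == 0) /* MDSplus convention: odd status == success */
    a->dispatched = 0;   /* roll back if dispatching failed */
}
```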

Note though that on the client-side, I encountered a different race condition. After the client calls SendArg(), it then immediately calls GetAnswerInfoTS() to read the status handshake (i.e., the value 1 for success). However, the action was completed by the server before the client had finished processing the GetAnswerInfoTS(). Normally that would not occur. But the stress test of hundreds of actions dispatched at once caused that scenario to arise. With my client-only fix (and actions that evaluate very quickly), occasionally one of the receiver threads was tearing down the connection / socket at the same time that the main thread's GetAnswerInfoTS() was using the connection, which usually resulted in a crash.

We won't be using my client-side fix, thus the specific problem I created with that fix vanishes.

However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.

@mwinkel-dev
Contributor

I will approve this PR after Jenkins is able to build it on all platforms. It presently fails on RHEL7 and Windows.

RHEL7

```
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'find_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
   ^
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: note: use option -std=c99 or -std=gnu99 to compile your code
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'remove_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:924:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
   ^
```
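
The usual remedies are either to compile with -std=c99 / -std=gnu99, as the note suggests, or to hoist the declaration out of the for statement; a sketch of the latter:

```
/* C89-compatible form of the flagged loop: declare the cursor first. */
ClientList **p;
for (p = &Clients; *p != NULL; p = &(*p)->next)
{
  /* ... */
}
```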

Windows

```
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'setup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:14: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |              ^~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:7: note: in expansion of macro 'MDSMSG'
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |       ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:33: note: format string is defined here
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d

/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'cleanup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:12: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |            ^~~~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:5: note: in expansion of macro 'MDSMSG'
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |     ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:33: note: format string is defined here
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d
```
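
A common remedy for this class of -Wformat error is to cast the SOCKET, whose width differs by platform, to a fixed type and use the matching conversion; a sketch only, which may differ from the format the PR finally adopted:

```
/* Sketch only: cast the SOCKET so the conversion matches on 64-bit
 * Windows, where SOCKET is wider than int. */
MDSMSG("setup connection %llu " SVRJOB_PRI,
       (unsigned long long)sock, SVRJOB_VAR(job));
```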

@zack-vii
Contributor Author

However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.

The server should never cancel out of a regular mdsip request unless it times out between the messages of the same request, or is terminated or interrupted in some way out of the ordinary. Hence the result of a request should be independent of the state of the task (scheduled, executing, done). Moreover, if a job is scheduled it should return success. A race condition may arise only if the reply is lost. Due to TCP we would probably run into a timeout, assuming the dispatch was incomplete when it actually was not. That should be very rare and would require an underperforming network, considering the traffic.

@zack-vii
Contributor Author

@mwinkel-dev thanks for pointing out the issues with the Jenkins checks.. looks like they are known issues that I simply paid no attention to. I will see if I can sort them out over the weekend... the power of docker ;)

@mwinkel-dev
Contributor

Hi @zack-vii,

Thanks for answering my questions. And also for coming up with a much better fix than I had on the client-side.

Monday is a holiday for us. So we'll resume work on this next week.

@zack-vii
Contributor Author

Fingers crossed, it went through on my machine, but if it fails please point me to the failing platforms.

@smithsp

smithsp commented Apr 16, 2024

@zack-vii @mwinkel-dev It looks like it passed. :-)

Contributor

@mwinkel-dev mwinkel-dev left a comment


This server-side fix should be viewed as a partial fix of Issue #2731.

  • Testing shows that during normal usage scenarios, this PR prevents sockets from leaking.
  • However, it does not address some "edge cases" that can cause many sockets to leak when mdstcl and/or the action server are overloaded.

The summary of this code review is in the following post.
#2740 (comment)

The full details consist of this post and all following posts.
#2740 (comment)

Note: -- It is probable that the complete fix of Issue #2731 will require additional PRs.

servershr/ServerDispatchPhase.c (review comments resolved)
servershr/ServerQAction.c (review comments resolved)
@mwinkel-dev
Contributor

Hi @zack-vii,

I have just approved this PR. It successfully passed the Jenkins build. And also passed much (but not all) of my testing. It can now be merged to the "alpha" branch.

Hi @smithsp,

This PR is a partial fix of Issue #2731. In my opinion, it is robust enough to handle GA's normal workflows. However, there are some "edge cases" that still leak sockets (albeit at a much slower rate than when GA had to reboot a server on 25-Mar-2024). You should decide if you want this partial fix now, or instead wish to wait until the full fix is available.

If you want this partial fix now, it will take us a day or so to cherry-pick it into the GA branch, build the packages, do another round of testing, and distribute the build to GA.

@zack-vii zack-vii merged commit a194645 into alpha Apr 16, 2024
1 check passed
@zack-vii zack-vii deleted the zck_dispatch branch April 16, 2024 04:53
@zack-vii
Contributor Author

@mwinkel-dev : can you give me some details about the edge cases that are still leaking?

@mwinkel-dev
Contributor

Hi @zack-vii,

I am testing the "action server" as I would any web server -- normal load (which is A-OK), spike load (fails), heavy continuous load (still to do) and so forth.

For the spike load test, I have mdstcl dispatch hundreds of actions faster than the action server can execute them. In that scenario, some of the dispatched actions are reported as failing because they've lost connection to the action server. Each of those failed actions leaks a socket. I have generated ~100 leaked sockets with that test.
#2740 (comment)
#2740 (comment)

I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.

Although the spike load condition is unlikely to arise during normal workflows at GA, if it ever does then it might force GA to reboot the physical server, which disrupts the work of many users and thus is a serious problem.

My hunch is that the fix for the spike load test will involve a client-side (mdstcl) fix. And instead of doing load balancing and dynamically adjusting the number of available receiver threads, a simpler approach would be to add a throttle to mdstcl so it can never dispatch more actions than there are available receiver threads. (I already wrote such a throttle as part of my client-side only approach for solving Issue #2731.)
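
A hypothetical sketch of such a throttle (the names and the pool size are illustrative, not mdstcl's actual code): a counting semaphore that blocks the dispatch loop whenever every receiver slot is busy and releases a slot when a job finishes.

```
/* Hypothetical throttle sketch for the dispatch loop. */
#include <semaphore.h>

#define NUM_RECEIVER_THREADS 8 /* assumed pool size */

static sem_t dispatch_slots;

void throttle_init(void)
{
  sem_init(&dispatch_slots, 0, NUM_RECEIVER_THREADS);
}

void before_dispatch(void) /* called by the dispatch loop per action */
{
  sem_wait(&dispatch_slots); /* blocks when all receivers are busy */
}

void on_job_finished(void) /* called when SrvJobFINISHED arrives */
{
  sem_post(&dispatch_slots); /* free a slot for the next action */
}
```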

Now that your server-side fix (i.e., this PR) has been merged to alpha, I will continue my investigation / testing. When I find the root cause of the problem, I will report my findings via GitHub (i.e., in a new issue).

@smithsp

smithsp commented Apr 16, 2024

@mwinkel-dev We would like to take you up on your offer to get a cherry-pick version of this partial fix to our GA version with RPM kits. Thank you for your efforts.

@mwinkel-dev
Contributor

Hi @smithsp,

OK, will do. Here are the steps.

  • This evening, I will repeat my socket tests to make sure the alpha branch (with this partial-fix PR merged in) behaves the same way it did when I tested it prior to the merge.
  • On Wednesday morning we will start the process (e.g., cherry-pick this PR into the GA branch, etc).
  • After the RPMs are produced, I will repeat the socket testing (plus run the full suite of automated tests).
  • And then we'll distribute the RPMs to GA.

Note: -- Unless GA objects, we will also include PR #2735 in the cherry-pick to the GA branch. That is a simple change that can be useful when troubleshooting multi-threaded code. It would be useful to have that feature in the GA branch too.

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 17, 2024

Hi @smithsp and @sflanagan,

Before this PR was merged to the alpha branch, it passed all automated tests that Jenkins runs (on all platforms).

After it was merged to alpha, it also passed the following manual tests (performed on Ubuntu 20.04).

  • did not leak sockets under a normal workload
  • did not leak sockets under a spike load of 400 actions (see note below)
  • re-ran the IDL tests and they passed
  • passed the MATLAB tests

Caveats

  • This fix has not been tested with the "action monitor".
  • Has only been tested with a single computer running both the mdstcl dispatch and the "action service". Stated another way, it has not been tested with a pair of computers: one running mdstcl, the other running the "action service".
  • Has not been tested while other people were using the same computer.
  • My testing has only been done on Ubuntu 20.04; have not done testing on RHEL8.
  • My manual testing has only used the dispatch /build and dispatch /phase statements of mdstcl. Other dispatch related features have not been tested (e.g., abort server, dispatch /check, dispatch /close, and so forth).
  • The spike load test results are presently a mystery. The log files indicate that this spike load test worked OK. But there are also reasons to question the results. (The inconsistency compared to previous spike load tests might be caused by a slightly different test harness.)

Summary
This PR worked well after being merged to alpha. It likely covers all normal workflows. If it fails, it will likely be when dealing with edge cases.

@zack-vii
Contributor Author

I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.

The 100 ports are given by the default of the MDSIP_REPLY_PORT_RANGE (? or similar) env var, which I think ranges 8800-8899. It used to be one port per action server. I think this is not the case anymore, as there is one listening port that handles all replies.

I will try to replicate your setup. Do you dispatch the actions in a single thread or in multiple threads?

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 17, 2024

Hi @zack-vii,

This post has three topics: PR #2740 behavior (versus prior), client-side cleanup, and my test harness.

1) With PR 2740 versus Without
Regarding last evening's testing of PR #2740 (in the alpha branch), I just realized that all 1,200 dispatched actions shared a single connection ID. My recollection (perhaps wrong) was that prior to the PR, the "action service" created a separate connection ID for each action. Does sharing the same connection ID across all actions pose any problems? (The log file of the test indicates that the 1,200 actions probably did all execute correctly.) I will repeat this test after breakfast today to make sure my recollection is correct.

2) Client-Side Cleanup
The mdstcl client also maintains data structures. But they probably aren't being cleaned up. As per my earlier post -- "Looks like the Client_do_message() of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section."
#2740 (comment)

3) My Test Harness
I use mdstcl to dispatch actions via the dispatch phase command. My test cases typically dispatch in multiples of 400 actions. Although mdstcl is a multi-threaded program, it only uses a single thread to dispatch actions.

However, my test harness is also based on the client-side only fix that I created for Issue #2731. I have made numerous changes to my test harness to eliminate features that are now handled by your PR #2740. It is thus possible (probable?) that I am observing a bug in my test harness and not in the dispatch phase feature of mdstcl. After the build / release of the GA branch has been completed, I will investigate and let you know whether it is a bug in my code or not. And when my test harness is stable, I will also make it available to you.

Summary
For GA, the most important questions to answer are 1) and 2) above. Item 3) can be ignored for now.

Addendum
Regarding item 1), the errors log file for the "action service" only shows a single connection was made. I was instead expecting 400 connections made (even if the connection ID was reused).

For 2), I experimented with an approach that allowed the "receiver" threads to access the "thread-static information" in the "main" thread. My experimental code worked OK when it was configured to just read the linked list of connections. However, if it was configured to delete connections, then it caused deadlock when under heavy load. My hope is that with this PR #2740, we can ignore the client-side data structures -- however more testing should be done to confirm that doing so is OK.

Regarding 3), my test harness presently consists of a single instance of mdstcl. It is thus not a multi-threaded test (i.e., not running multiple instances of mdstcl by using multiple terminal windows).

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 17, 2024

Hi @zack-vii -- Likely root cause of the edge case that leaks sockets is simply the difference between re-using an existing connection and creating new connections.

  • It appears that during normal usage, a single connection is opened between mdstcl and the "action service" -- and that one connection is used for dispatching all actions to the "action service". This is based on observing stable_7.96.9, alpha of 14-Apr-2024 (immediately prior to this PR 2740), and alpha of 16-Apr-2024 (includes PR 2740).
  • My test harness for the edge case is probably deleting the connection and re-creating it for every single action. Which if so, is definitely an abnormal workflow.

I will do a few more experiments this evening to confirm if this conjecture is indeed true.

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 18, 2024

Hi @zack-vii,

Major mystery solved. And yes, my test harness was too extreme.

The crux of the matter is that the architectures of the mdsip and action services are not well documented by the comments in the source code. Thus as a maintenance programmer, one makes guesses based on patterns seen in small regions of the code without understanding the broader context. I guessed wrong. However, as those guesses fail, one learns the architecture of the services by trial and error.

Now that I have a better understanding of the services, I can answer my own questions.

1) With PR 2740 and Without
Yes, the architecture is designed to re-use a single connection to dispatch all actions. Thus stable_7.96.9 (the version GA was running prior to November 2023), alpha of 14-Apr-2024 (prior to PR #2740), and alpha of 16-Apr-2024 (with PR #2740) are all working as designed.

2) Client-side Cleanup
Definitely should not have a Client_remove() call in the SrvJobFINISHED case. The architecture is designed to re-use the connection, and thus the connection should be retained no matter how many actions have completed. The cases that do use Client_remove() are those that are associated with a defective connection. Although connections should not be removed when actions finish, perhaps there are other client-side cleanup tasks that should be done.

3) My Test Harness
My test harness was closing sockets and creating a new connection after each action executed (i.e., whenever an action's "receiver" thread received the SrvJobFINISHED reply). And thus was creating hundreds of connections at the same time that hundreds of actions were being executed. That is an abnormal work flow for mdstcl, and thus a very extreme stress test. Yes, mdstcl and the "action service" should be able to handle that stress test without leaking sockets. But it is an improbable workflow, and thus the release of PR #2740 is low risk.

Addendum
Apparently the "action service" only has one "receiver" thread. Which is a standard design for a web server. (But when I glanced at the high-level routines, I assumed that the service was starting a "receiver" thread for each action.)

@mwinkel-dev
Contributor

mwinkel-dev commented Apr 18, 2024

Prior to this PR, the sockets were being leaked in the server_connect() routine of ServerSendMessage.c. For each action, that function was always calling ConnectToMds() which in turn always creates a new connection.

This PR fixes that by changing server_connect() to reuse connections. It only calls ConnectToMds() in the occasional instance when it is unable to reuse a connection.
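
A rough sketch of that reuse-or-reconnect logic for a single action server (connection_is_alive() is a hypothetical stand-in for whatever validity check server_connect() performs; per the commit messages, the real code keeps a list so multiple servers can each retain a connection):

```
/* Hypothetical single-server sketch of reuse-or-reconnect. */
extern int ConnectToMds(char *host);       /* mdsip client call */
extern int connection_is_alive(int conid); /* hypothetical check */

static int cached_conid = -1;

static int get_server_connection(char *server)
{
  if (cached_conid >= 0 && connection_is_alive(cached_conid))
    return cached_conid;               /* reuse the existing connection */
  cached_conid = ConnectToMds(server); /* otherwise open a fresh one */
  return cached_conid;
}
```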

The other changes made by this PR are also useful (refactoring for clarity, using mutex to protect data structures, and so forth).

mwinkel-dev pushed a commit that referenced this pull request Apr 24, 2024
* Fix: reuse action_server connection id in ServerConnect; avoid duplicates in list

* Fix: set dispatched early; unset if dispatching failed; prevent race on fast actions

* Fix: lock Clients in ServerQAction; cleanup and check before use

* Fix: reconnect dropped connections

* Fix: use correct windows SOCKET print format

* Fix: satisfy rhel7 c standard
WhoBrokeTheBuild pushed a commit that referenced this pull request May 21, 2024
* Fix: reuse action_server connection id in ServerConnect; avoid duplicates in list

* Fix: set dispatched early; unset if dispatching failed; prevent race on fast actions

* Fix: lock Clients in ServerQAction; cleanup and check before use

* Fix: reconnect dropped connections

* Fix: use correct windows SOCKET print format

* Fix: satisfy rhel7 c standard
WhoBrokeTheBuild added a commit that referenced this pull request May 21, 2024
* Gm apd java (#2729)

* Improve APD support for Java interface

* Improve APD support for Java - forgotten files

* Commit packages

* When activate debug trace, now compiles without error. (#2735)

This fixes Issue 2734.

* Fix: reduce open files due to dispatcher (#2740)

* Fix: reuse action_server connection id in ServerConnect; avoid duplicates in list

* Fix: set dispatched early; unset if dispatching failed; prevent race on fast actions

* Fix: lock Clients in ServerQAction; cleanup and check before use

* Fix: reconnect dropped connections

* Fix: use correct windows SOCKET print format

* Fix: satisfy rhel7 c standard

* Gm apd thin cpp (#2742)

* Added ADP support in C++ thin client

* Added tdi fun

* Added TDI FUn

* Fix commands

* Gm new marte (#2743)

* more parameters for marte2_simulink_generic

* Proceed with the new implementation

* Proceed

* Proceed

* Proceed

* Proceed

* Proceed

* proceed

* Proceed

* Proceed

* Partially tested version

* Added execution times recording

* Proceed

* Procced with debugging

* Proceed

* Proceed

* Proceed

* Fixes for multisampled acquisition

* Remove quotes from string parameters

* Minor fixes

* Procced debugging

* Debugging

* More channels

* Debug Distributed configuration

* Fix sognal recording for synchronized inputs

* Further debug

* Further debug

* Small fixes

* Close ti final version

* Forgotten fix

* Make port visible, fix parameter name

* unaligned nids

* Increase DiscontinuityFactor

* Discontinuityfactor

* More channels

* Proceed with the new implementation

* Proceed

* Proceed

* Proceed

* Proceed

* Proceed

* proceed

* Proceed

* Proceed

* Partially tested version

* Added execution times recording

* Proceed

* Procced with debugging

* Proceed

* Proceed

* Proceed

* Fixes for multisampled acquisition

* Remove quotes from string parameters

* Minor fixes

* Procced debugging

* Debugging

* More channels

* Debug Distributed configuration

* Fix sognal recording for synchronized inputs

* Further debug

* Further debug

* Small fixes

* Close ti final version

* Forgotten fix

* Make port visible, fix parameter name

* unaligned nids

* Increase DiscontinuityFactor

* Discontinuityfactor

* More channels

* Packages updated

* Remove print

* Remove error messages

---------

Co-authored-by: mdsplus <mdsplus@roactive2.rfx.local>

* Docs: Improve documentation for getSegment* python wrappers (#2732)

Add explanation and rename parameters for:
* getSegmentLimits
* getSegmentList

* Fix: Update JAVASOURCE to 8 to support JDK 17 (#2747)

* Fix: improve mdstcl's error handling and add comments (#2746)

* add comments regarding action service

* send_reply() now does cleanup_client() on bad socket

* explain mdstcl's receiver thread cannot access main thread's connection list

* Improve handling of non-MDSplus error codes

* add comments regarding action dispatch

* add comment explaining receiver thread select loop

* Fix: multiple string escape warnings thrown by python 12 (#2748)

```
mdsplus/pydevices/RfxDevices/FAKECAMERA.py:40: SyntaxWarning: invalid escape sequence '\C'
  {'path': ':EXP_NODE', 'type': 'text', 'value': '\CAMERATEST::FLIR:FRAMES'},

mdsplus/pydevices/RfxDevices/PLFE.py:220: SyntaxWarning: invalid escape sequence '\#'
  '^(\#[0-5][01]([01][0-9][0-9]|2[0-4][0-9]|25[0-5])){6}$', msg)

mdsplus/pydevices/RfxDevices/CYGNET4K.py:361: SyntaxWarning: invalid escape sequence '\E'
  self.serialIO(b'\x55\x99\x66\x11\x50\EB', None)

mdsplus/pydevices/RfxDevices/CYGNET4K.py:461: SyntaxWarning: invalid escape sequence '\8'
  return self.setValue(b'\81\x82', min(0xFFF, value), True)

mdsplus/pydevices/MitDevices/dt100.py:161: SyntaxWarning: invalid escape sequence '\.'
  regstr = '([0-9\.]*) [0-9] ST_(.*)\r\n'
```

The \CAMERATEST became \\CAMERATEST
The regex strings should be python r-strings `r""`, but to maintain backwards compatibility, we're using \\
The broken hex-codes now have x in them

* Build: Resolve linker error after updating the windows builder to Fedora 39 (#2749)

* Build: Resolve linker error after updating the windows builder to Fedora 39

This appeared after updating the mdsplus/builder:windows docker image to Fedora 39, and Wine to 9.0
The newer libxml2 tried to link dynamically unless we explicitly set LIBXML_STATIC

* Hopefully fix the MdsTreeNodeTest

It turns out that this was failing previously, but we weren't properly catching the error

* Fix errors in windows build from newer gcc

* Docs: Update sites.csv (#2615)

add Startorus Fusion in Xi'an, China

* Fix: mdsip now sends proper auth status back to the client (#2752)

Fixes issues #2750 and #2652

* Fix: mdstcl's `show current` no longer segfaults when no tree paths defined (#2754)

* Fix: "show current" no longer segfaults when no tree paths defined

* Fix: corrected typo in error message

* Use original error message so tests pass

* Fix: Add Debian 12 and Ubuntu 24.04 and support GCC 12+ (#2753)

* Build: Add Debian 12 and Ubuntu 24.04

* Add extra flags for GCC 12+ and stub imp for Python 3.12

GCC 12+ triggers a bunch of false positive warnings (which we treat as errors)
This adds AX_C_FLAGS to configure those `-Wno-*` flags for GCC 12+
`cmdExecute.c` now uses snprintf to avoid buffer overflow warnings, also generated by GCC 12+
`compound.py.in` now supports Python 3.12+

* compound.py now supports Python 2.7.. again

---------

Co-authored-by: Stephen Lane-Walsh <slwalsh@psfc.mit.edu>

* Fix: Improve error messaging when calling Setup Device in jTraverser (#2744)

* Improve error messaging when calling Setup Device in jTraverser

e.getMessage() sometimes returned null, but just e will always print something
Add a printStackTrace() for InvocationTargetException exceptions to show the encapsulated error

* Add import for InvocationTargetException

* Build: Fix off-by-one versions produced by Jenkins (#2756)

This fixes the bug where `--os=bootstrap` wasn't receiving the version from `--version=x.y.z`
However, confusingly, this also changes the Jenkinsfile to not use that feature, and instead use `git tag` in order to embed the proper git information as well as the proper version information
The `--os=bootstrap` and `--version` fix is still included just so that it doesn't break if someone else tries to use it

* Build: Increase default test timeout to 1h (#2757)

When the build server(s) are at capacity, it's not unreasonable for a test to take more than 10 seconds, which was the old default timeout
This sets the default to 1h, and removes the overrides in various tests

* Gm fix filter (#2755)

* Allow filtering data from MinMax resampling; remove useless thread in jServer

* Fix compile error

* Remove debug message

* Make Windows Compiler happy

* Build: Fix 'HEAD' in `show version` and tag error (#2758)

Jenkins builds in a detached HEAD state, which caused bootstrap to use HEAD as the branch name
We pass --branch= to the bootstrap call in Jenkins, but $BRANCH wasn't being passed into the bootstrap docker container
Also, attempts to build alpha versions with tags that already existed failed

* Fix: mdstcl show version tag and links (#2760)

Fixes Issue #2759

* Feature: CompileTree will exit with non-zero status code for error messages. (#2446)

And error message should go to stderr.

* Build: Add package override for ubuntu and debian (#2761)

Override sections for Ubuntu 24 and Debian Bookworm were added.

* Fix: Python release version tag (#2764)

* Feature: Add "Date:" to show version output (#2767)

Implements #2766

Example:
```
$ mdstcl sho ver

MDSplus version: 7.140.75
----------------------
  Release:  alpha_release-7-140-75
  Date:     Thu May 16 17:43:14 UTC 2024
  Browse:   https://github.com/MDSplus/mdsplus/tree/alpha_release-7-140-75
  Download: https://github.com/MDSplus/mdsplus/releases/tag/alpha_release-7-140-75
```

* Fix: remove abort flag from RfxDevices DIO2 initialization (#2769)

Fixes issue #2768

* Fix: Missing repo metadata signing (#2770)

This will hopefully fix the lack of signed metadata files that are preventing us from automatically publishing releases

---------

Co-authored-by: GabrieleManduchi <gabriele.manduchi@igi.cnr.it>
Co-authored-by: mwinkel-dev <122583770+mwinkel-dev@users.noreply.github.com>
Co-authored-by: Timo Schroeder <zack-vii@users.noreply.github.com>
Co-authored-by: mdsplus <mdsplus@roactive2.rfx.local>
Co-authored-by: Josh Stillerman <jas@psfc.mit.edu>
Co-authored-by: Fernando Santoro <44955673+santorofer@users.noreply.github.com>
Co-authored-by: Louwrensth <Louwrensth@users.noreply.github.com>