
Test failures from Radical pilot executor #3029

Open
yadudoc opened this issue Jan 18, 2024 · 12 comments · May be fixed by #3060 or #3286

Comments

@yadudoc
Member

yadudoc commented Jan 18, 2024

Describe the bug

I noticed some instability in the radical pilot executor that caused failures in the CI tests for the mpi_experimental_3 branch. These failures are intermittent, so rerunning the failed tasks gets past them, but it would be good to fix either these tests or the underlying cause of the instability.

Here's a small snippet of the failure from github actions (https://github.com/Parsl/parsl/actions/runs/7479805965/job/20357787240#step:8:2239)

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9/site-packages/typeguard/__init__.py:1033: in wrapper
    retval = func(*args, **kwargs)
parsl/dataflow/dflow.py:1428: in load
    cls._dfk = DataFlowKernel(config)
.venv/lib/python3.9/site-packages/typeguard/__init__.py:1033: in wrapper
    retval = func(*args, **kwargs)
parsl/dataflow/dflow.py:186: in __init__
    self.add_executors(config.executors)
parsl/dataflow/dflow.py:1130: in add_executors
    block_ids = executor.start()
parsl/executors/radical/executor.py:244: in start
    pilot = self.pmgr.submit_pilots(pd)
.venv/lib/python3.9/site-packages/radical/pilot/pilot_manager.py:588: in submit_pilots
    pilot = Pilot(pmgr=self, descr=pd)
.venv/lib/python3.9/site-packages/radical/pilot/pilot.py:120: in __init__
    self._endpoint_fs      = self._session._get_endpoint_fs     (pilot)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1539: in _get_endpoint_fs
    resource_sandbox  = self._get_resource_sandbox(pilot)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1410: in _get_resource_sandbox
    shell = self.get_js_shell(resource, schema)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1456: in get_js_shell
    shell = rsup.PTYShell(js_url, self)
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell.py:236: in __init__
    self.pty_info   = self.factory.initialize (self.url,    self.session,
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:206: in initialize
    self._initialize_pty(info['pty'], info)
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:427: in _initialize_pty
    raise ptye.translate_exception (e) from e
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:269: in _initialize_pty
    pty_shell.write(" export PROMPT_COMMAND='' PS1='$';"
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_process.py:858: in write
    raise ptye.translate_exception(e, "write failed (%s)" % e) \
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <radical.saga.utils.pty_process.PTYProcess object at 0x7f9d58bb34f0>
data = " export PROMPT_COMMAND='' PS1='$'; set prompt='$'\n", nolog = False

    def write (self, data, nolog=False) :
        """
        This method will repeatedly attempt to push the given data into the
        child's stdin pipe, until it succeeds to write all data.
        """
    
        with self.rlock :
    
            if not self.alive (recover=False) :
                raise ptye.translate_exception(
                        se.NoSuccess("cannot write to dead process (%s) [%5d]"
                                     % (self.cache[-256:], self.parent_in)))
    
            try :
                log = self._hide_data (data, nolog)
                log =  log.replace ('\n', '\\n')
                log =  log.replace ('\r', '')
                if  len(log) > _DEBUG_MAX :
                    self.logger.debug ("write: [%5d] [%5d] (%s ... %s)"
                                    % (self.parent_in, len(data),
                                       log[:30], log[-30:]))
                else :
                    self.logger.debug ("write: [%5d] [%5d] (%s)"
                                    % (self.parent_in, len(data), log))
    
                # attempt to write forever -- until we succeeed
                while data :
    
                    # check if the pty pipe is ready for data
>                   _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)
E                                                ValueError: filedescriptor out of range in select()

.venv/lib/python3.9/site-packages/radical/saga/utils/pty_process.py:842: ValueError

To Reproduce
Steps to reproduce the behavior, for example:

  1. Rerun tests in CI until you see this failure; it is intermittent.

Expected behavior
Tests passing without instability

Environment

  • GitHub Actions
  • branch: mpi_experimental_3

It would be nice if someone from the radical team, perhaps @AymenFJA, could take a quick look.

@yadudoc yadudoc added the bug label Jan 18, 2024
@AymenFJA
Contributor

Hey @yadudoc, thanks for raising this. It seems like the error originates in the job submission layer (SAGA/PSI-J).

I am on it and I will try to reproduce it if possible.

@benclifford
Collaborator

since some recent changes that hopefully fixed other CI failures, it looks like this error is now the dominant cause of CI test failures that I am experiencing.

@benclifford
Collaborator

(this has affected every PR I've tried to merge or review today)

@AymenFJA
Contributor

AymenFJA commented Feb 8, 2024

@benclifford I am preparing a PR that fixes this, in addition to several other fixes, by today.

@AymenFJA
Contributor

AymenFJA commented Feb 8, 2024

@yadudoc, @benclifford, a PR to fix this issue and others is now open: #3060.

@benclifford
Collaborator

benclifford commented Feb 15, 2024

@AymenFJA's fix PR #3060 didn't satisfy my "why is this broken / how does this fix it?" curiosity, so I dug in a bit today. I've identified these issues which I think interact here, in no particular order:

  • radical.saga's PTYProcess misbehaves when more than FD_SETSIZE (I think) file descriptors are open: file descriptors returned by fork cannot always be used with select, resulting in the base error message reported in this issue (#3029). ValueError: filedescriptor out of range in select() means the given file descriptor is bigger than FD_SETSIZE despite otherwise being valid (I think) - see the sketch after this list.

  • some part of the pytest stack hangs at exit when the above error occurs during a test, rather than exiting cleanly. I can recreate this hang by putting in a deliberate exception like this:

                     # check if the pty pipe is ready for data
                    raise RuntimeError("BENC: FAIL AT SELECT")
                    _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)

... and then running pytest parsl/tests/test_radical/test_mpi_funcs.py --config local
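
For reference, the select() limitation in the first bullet can be reproduced without radical at all. This is a minimal sketch (assumptions: a Linux system where FD_SETSIZE is 1024, and a hard fd limit high enough to raise the soft limit above that), not Parsl or radical code:

    import os
    import resource
    import select

    # Allow this process to open more than FD_SETSIZE (1024) descriptors.
    # Assumption: the hard limit permits raising the soft limit to 1100.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < 1100:
        resource.setrlimit(resource.RLIMIT_NOFILE, (1100, hard))

    fds = []
    try:
        # Open descriptors until one is numbered >= 1024.
        while True:
            fd = os.open(os.devnull, os.O_RDONLY)
            fds.append(fd)
            if fd >= 1024:
                break
        # select() uses a fixed-size fd_set, so a descriptor >= FD_SETSIZE is
        # rejected even though the descriptor itself is perfectly usable.
        select.select([], [fd], [], 0)
    except ValueError as e:
        print(e)  # "filedescriptor out of range in select()"
    finally:
        for fd in fds:
            os.close(fd)

The usual way around this limit is to use a poll()-based mechanism (for example the selectors module's default selector) rather than select(), since poll() is not bounded by FD_SETSIZE.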

@benclifford
Collaborator

I did some more hacking on the hang described above as "some part of the pytest stack hangs at exit": the stack is waiting for a process launched in radical.pilot/proxy.py to exit before the whole Python process will exit (as that is how multiprocessing processes work) -- however it looks like that process does not exit in this situation:

I modified my Python runtime so that it reports the names of the processes it waits for at exit, and I modified radical.pilot to add a name= on all processes, and I see this at the end of the hanging test, where I'd expect pytest to exit:

        proc = mp.Process(target=self._worker, args=(sid, q, term, self._path), name="BENC: proxy")

and then:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================ 2 warnings, 1 error in 8.06s ================================================
calling join() for process BENC: proxy

If I set daemon=True on that proxy process, pytest exits without hanging, but I have not tried to understand if that is a safe thing for proxy.py to do - it was just to understand if that was the only hang.
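
For anyone following along, the exit behaviour here is standard multiprocessing behaviour rather than anything radical-specific: at interpreter exit, non-daemon children are joined and daemon children are terminated. A simplified stand-in (assumption: this is a sketch, not the radical.pilot proxy code):

    import multiprocessing as mp
    import time

    def worker():
        # Stand-in for a child process that never exits on its own.
        while True:
            time.sleep(1)

    if __name__ == "__main__":
        # With daemon=False (the default) the interpreter joins this child at
        # exit, so the parent hangs; with daemon=True the child is terminated
        # at exit instead, at the cost of skipping any cleanup it would do.
        proc = mp.Process(target=worker, name="proxy stand-in", daemon=False)
        proc.start()
        print("main exiting; the interpreter will now block joining the child")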

@benclifford
Collaborator

none of the above gives any explanation for what works (or does not work) about Aymen's PR #3060 though... so I'll try to understand that.

@benclifford
Collaborator

adding safe-exit tag because this hang is also going to occur for real users in some situations

benclifford added a commit that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029
benclifford added a commit that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029

This PR updates the comment for psi-j-parsl in setup.py to reflect
the difference between psi-j-parsl and psij-python.
@benclifford
Collaborator

hypothesis on why this failure has become worse recently:

this test failure happens when around a thousand file descriptors are open at the start of the radical test.

this test failure happens in pytest --config local because many tests run before it, all of which leave some fds (and other junk) around (see issue #3077).

this test failure happens non-deterministically because the tests are deliberately run in random order to flush out inter-test dependencies (introduced in the massive test rearrangement PR #1210) - if this test runs "early" in the test run, the fd count will be "low" and the test will not fail; if it runs "late", the fd count will be "high" and the test will fail.

recently a bunch more tests have been added to the --config local collection, for example driven by recent htex feature development.

adding more tests makes it more likely that this radical test will run after enough other tests to push the open fd count past the 1000-ish limit.

so while this bug has always existed, it's not a change in the radical stack that has caused it to happen more often, but rather a change in the test environment that makes it more fd-intensive.
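
One way to check this hypothesis (a sketch, assuming a Linux runner with /proc available) is to log the open fd count just before the radical test starts, for example from a conftest fixture:

    import os

    def open_fd_count() -> int:
        # Number of file descriptors currently open in this process
        # (Linux-specific: /proc/self/fd has one entry per open descriptor).
        return len(os.listdir("/proc/self/fd"))

    # If this is close to or above 1024 (FD_SETSIZE) when the radical test
    # runs, the select() call in radical.saga's PTYProcess is expected to fail.
    print("open fds:", open_fd_count())

A count near 1024 when the radical test runs late in a randomised order would support the hypothesis; a low count early in the run would explain why the failure is intermittent.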

@andre-merzky

Hi Ben,

thanks for deep-diving into this!

As Aymen said earlier, SAGA is on its way out of RP and we'll likely not attempt to fix this on the SAGA layer. We still use it in some places as PSI/J does not yet support data staging. The Parsl tests triggering these kinds of errors only add to the motivation to complete the transition though ;-) Let me have a look over the next few days to see if I can do a hackish SAGA replacement for the code path relevant to this.

PS: Setting daemon on the RP processes will avoid the hangs, but at the same time may result in zombie processes. Getting Python to clean up nicely when subprocesses and threads are mixed has been (and continues to be) a challenge, unfortunately.

benclifford added a commit that referenced this issue Feb 20, 2024
These tests are highlighting issues raised in issue #3029 and #3089,
and should be re-instated as part of resolving those issues.

Between them, the hangs of these tests meant that it was becoming
hard, and in some cases apparently impossible, to merge new PRs.
@andre-merzky

Quick note that this should be resolved with the next RPC release: the SAGA removal is complete, and the respective code path simply does not exist anymore. Sorry it took so long, but it turned out that some refactoring and code removal ultimately was simpler (and more sensible) than fixing this.

benclifford added a commit that referenced this issue Mar 24, 2024
The fix to this issue comes from a change in Radical Cybertools, not a
change in the Parsl codebase.
See radical-cybertools/radical.saga#885