
Test failures from Radical pilot executor #3029

Open
yadudoc opened this issue Jan 18, 2024 · 12 comments · May be fixed by #3060 or #3286

Comments

@yadudoc
Member

yadudoc commented Jan 18, 2024

Describe the bug

I noticed some instability in the radical pilot executor that caused failures in the CI tests for the mpi_experimental_3 branch. These failures are intermittent, so rerunning the failed tasks gets past them, but it would be good to fix either these tests or the underlying cause of the instability.

Here's a small snippet of the failure from github actions (https://github.com/Parsl/parsl/actions/runs/7479805965/job/20357787240#step:8:2239)

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9/site-packages/typeguard/__init__.py:1033: in wrapper
    retval = func(*args, **kwargs)
parsl/dataflow/dflow.py:1428: in load
    cls._dfk = DataFlowKernel(config)
.venv/lib/python3.9/site-packages/typeguard/__init__.py:1033: in wrapper
    retval = func(*args, **kwargs)
parsl/dataflow/dflow.py:186: in __init__
    self.add_executors(config.executors)
parsl/dataflow/dflow.py:1130: in add_executors
    block_ids = executor.start()
parsl/executors/radical/executor.py:244: in start
    pilot = self.pmgr.submit_pilots(pd)
.venv/lib/python3.9/site-packages/radical/pilot/pilot_manager.py:588: in submit_pilots
    pilot = Pilot(pmgr=self, descr=pd)
.venv/lib/python3.9/site-packages/radical/pilot/pilot.py:120: in __init__
    self._endpoint_fs      = self._session._get_endpoint_fs     (pilot)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1539: in _get_endpoint_fs
    resource_sandbox  = self._get_resource_sandbox(pilot)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1410: in _get_resource_sandbox
    shell = self.get_js_shell(resource, schema)
.venv/lib/python3.9/site-packages/radical/pilot/session.py:1456: in get_js_shell
    shell = rsup.PTYShell(js_url, self)
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell.py:236: in __init__
    self.pty_info   = self.factory.initialize (self.url,    self.session,
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:206: in initialize
    self._initialize_pty(info['pty'], info)
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:427: in _initialize_pty
    raise ptye.translate_exception (e) from e
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_shell_factory.py:269: in _initialize_pty
    pty_shell.write(" export PROMPT_COMMAND='' PS1='$';"
.venv/lib/python3.9/site-packages/radical/saga/utils/pty_process.py:858: in write
    raise ptye.translate_exception(e, "write failed (%s)" % e) \
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <radical.saga.utils.pty_process.PTYProcess object at 0x7f9d58bb34f0>
data = " export PROMPT_COMMAND='' PS1='$'; set prompt='$'\n", nolog = False

    def write (self, data, nolog=False) :
        """
        This method will repeatedly attempt to push the given data into the
        child's stdin pipe, until it succeeds to write all data.
        """
    
        with self.rlock :
    
            if not self.alive (recover=False) :
                raise ptye.translate_exception(
                        se.NoSuccess("cannot write to dead process (%s) [%5d]"
                                     % (self.cache[-256:], self.parent_in)))
    
            try :
                log = self._hide_data (data, nolog)
                log =  log.replace ('\n', '\\n')
                log =  log.replace ('\r', '')
                if  len(log) > _DEBUG_MAX :
                    self.logger.debug ("write: [%5d] [%5d] (%s ... %s)"
                                    % (self.parent_in, len(data),
                                       log[:30], log[-30:]))
                else :
                    self.logger.debug ("write: [%5d] [%5d] (%s)"
                                    % (self.parent_in, len(data), log))
    
                # attempt to write forever -- until we succeeed
                while data :
    
                    # check if the pty pipe is ready for data
>                   _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)
E                                                ValueError: filedescriptor out of range in select()

.venv/lib/python3.9/site-packages/radical/saga/utils/pty_process.py:842: ValueError

To Reproduce
Steps to reproduce the behavior, for example:

  1. Rerun tests in CI until you see this failure; it is intermittent.

Expected behavior
Tests passing without instability

Environment

  • GitHub Actions
  • branch: mpi_experimental_3

It would be nice if someone from the radical team, perhaps @AymenFJA, could take a quick look.

@yadudoc yadudoc added the bug label Jan 18, 2024
@AymenFJA
Contributor

Hey @yadudoc, thanks for raising this. It seems like the error originates in the job submission layer (SAGA/PSI-J).

I am on it and I will try to reproduce it if possible.

@benclifford
Collaborator

since some recent changes that hopefully fixed other CI failures, it looks like this error is now the dominant cause of CI test failures that I am experiencing.

@benclifford
Collaborator

(this has affected every PR I've tried to merge or review today)

@AymenFJA
Contributor

AymenFJA commented Feb 8, 2024

@benclifford I am preparing a PR that fixes this, in addition to several other fixes, by today.

@AymenFJA
Contributor

AymenFJA commented Feb 8, 2024

@yadudoc, @benclifford, a PR to fix this issue and others is now open: #3060.

@benclifford
Collaborator

benclifford commented Feb 15, 2024

@AymenFJA's fix PR #3060 didn't satisfy my "why is this broken / how does this fix it?" curiosity, so I dug in a bit today. I've identified these issues which I think interact here, in no particular order:

  • radical.saga's PTYProcess misbehaves when more than FD_SETSIZE (I think) file descriptors are open: file descriptors returned by fork cannot always be used with select, resulting in the base error message reported in this issue (#3029). ValueError: filedescriptor out of range in select() means the given file descriptor is bigger than FD_SETSIZE despite otherwise being valid (I think) - see the sketch after this list.

  • some part of the pytest stack hangs at exit when the above error occurs during a test, rather than exiting cleanly. I can recreate this hang by putting in a deliberate exception like this:

                     # check if the pty pipe is ready for data
                    raise RuntimeError("BENC: FAIL AT SELECT")
                    _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)

... and then running pytest parsl/tests/test_radical/test_mpi_funcs.py --config local
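
For reference, the select() limitation in the first bullet can be reproduced without radical at all. This is a minimal sketch (assumptions: a Linux system where FD_SETSIZE is 1024, and a hard fd limit high enough to raise the soft limit above that), not Parsl or radical code:

    import os
    import resource
    import select

    # Allow this process to open more than FD_SETSIZE (1024) descriptors.
    # Assumption: the hard limit permits raising the soft limit to 1100.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < 1100:
        resource.setrlimit(resource.RLIMIT_NOFILE, (1100, hard))

    fds = []
    try:
        # Open descriptors until one is numbered >= 1024.
        while True:
            fd = os.open(os.devnull, os.O_RDONLY)
            fds.append(fd)
            if fd >= 1024:
                break
        # select() uses a fixed-size fd_set, so a descriptor >= FD_SETSIZE is
        # rejected even though the descriptor itself is perfectly usable.
        select.select([], [fd], [], 0)
    except ValueError as e:
        print(e)  # "filedescriptor out of range in select()"
    finally:
        for fd in fds:
            os.close(fd)

The usual way around this limit is to use a poll()-based mechanism (for example the selectors module's default selector) rather than select(), since poll() is not bounded by FD_SETSIZE.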

@benclifford
Collaborator

I did some more hacking on the hang described above as "some part of the pytest stack hangs at exit": the stack is waiting for a process launched in radical.pilot/proxy.py to exit before the whole Python process will exit (as that is how multiprocessing processes work) -- however it looks like that process does not exit in this situation:

I modified my Python runtime so that it reports the names of the processes it waits for at exit, and I modified radical.pilot to add a name= on all processes, and I see this at the end of the hanging test, where I'd expect pytest to exit:

        proc = mp.Process(target=self._worker, args=(sid, q, term, self._path), name="BENC: proxy")

and then:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================ 2 warnings, 1 error in 8.06s ================================================
calling join() for process BENC: proxy

If I set daemon=True on that proxy process, pytest exits without hanging, but I have not tried to understand if that is a safe thing for proxy.py to do - it was just to understand if that was the only hang.
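
For anyone following along, the exit behaviour here is standard multiprocessing behaviour rather than anything radical-specific: at interpreter exit, non-daemon children are joined and daemon children are terminated. A simplified stand-in (assumption: this is a sketch, not the radical.pilot proxy code):

    import multiprocessing as mp
    import time

    def worker():
        # Stand-in for a child process that never exits on its own.
        while True:
            time.sleep(1)

    if __name__ == "__main__":
        # With daemon=False (the default) the interpreter joins this child at
        # exit, so the parent hangs; with daemon=True the child is terminated
        # at exit instead, at the cost of skipping any cleanup it would do.
        proc = mp.Process(target=worker, name="proxy stand-in", daemon=False)
        proc.start()
        print("main exiting; the interpreter will now block joining the child")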

@benclifford
Collaborator

none of the above gives any explanation for what works (or does not work) about Aymen's PR #3060 though... so I'll try to understand that.

@benclifford
Collaborator

adding safe-exit tag because this hang is also going to occur for real users in some situations

benclifford added a commit that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029
benclifford added a commit that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029

This PR updates the comment for psi-j-parsl in setup.py to reflect
the difference between psi-j-parsl and psij-python.
@benclifford
Collaborator

hypothesis on why this failure has become worse recently:

this test failure happens when around a thousand file descriptors are open at the start of the radical test.

this test failure happens in pytest --config local because many tests run before it, all of which leave some fds (and other junk) around (see issue #3077).

this test failure happens non-deterministically because the tests are deliberately run in random order to flush out inter-test dependencies (introduced in the massive test rearrangement PR #1210) - if this test runs "early" in the test run, the fd count will be "low" and the test will not fail; if it runs "late", the fd count will be "high" and the test will fail.

recently a bunch more tests have been added to the --config local collection, for example driven by recent htex feature development.

adding more tests makes it more likely that this radical test will run after enough other tests to push the open fd count past the 1000-ish limit.

so while this bug has always existed, it's not a change in the radical stack that has caused it to happen more often, but rather a change in the test environment that makes it more fd-intensive.
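
One way to check this hypothesis (a sketch, assuming a Linux runner with /proc available) is to log the open fd count just before the radical test starts, for example from a conftest fixture:

    import os

    def open_fd_count() -> int:
        # Number of file descriptors currently open in this process
        # (Linux-specific: /proc/self/fd has one entry per open descriptor).
        return len(os.listdir("/proc/self/fd"))

    # If this is close to or above 1024 (FD_SETSIZE) when the radical test
    # runs, the select() call in radical.saga's PTYProcess is expected to fail.
    print("open fds:", open_fd_count())

A count near 1024 when the radical test runs late in a randomised order would support the hypothesis; a low count early in the run would explain why the failure is intermittent.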

@andre-merzky

Hi Ben,

thanks for deep-diving into this!

As Aymen said earlier, SAGA is on its way out of RP and we'll likely not attempt to fix this on the SAGA layer. We still use it in some places as PSI/J does not yet support data staging. The Parsl tests triggering these kinds of errors only add to the motivation to complete the transition though ;-) Let me have a look over the next few days to see if I can do a hackish SAGA replacement for the code path relevant to this.

PS: Setting daemon on the RP processes will avoid the hangs, but at the same time may result in zombie processes. Getting Python to clean up nicely when subprocesses and threads are mixed has been (and continues to be) a challenge, unfortunately.

benclifford added a commit that referenced this issue Feb 20, 2024
These tests are highlighting issues raised in issue #3029 and #3089,
and should be re-instated as part of resolving those issues.

Between them, the hangs of these tests meant that it was becoming
hard, and in some cases apparently impossible, to merge new PRs.
@andre-merzky

Quick note that this should be resolved with the next RPC release: the SAGA removal is complete, and the respective code path simply does not exist anymore. Sorry it took so long, but it turned out that some refactoring and code removal ultimately was simpler (and more sensible) than fixing this.

benclifford added a commit that referenced this issue Mar 24, 2024
The fix to this issue comes from a change in Radical Cybertools, not a
change in the Parsl codebase.
See radical-cybertools/radical.saga#885