[not for merge] benc poking at CI hangs in flux test #3259

Draft: wants to merge 66 commits into master

Conversation

benclifford (Collaborator)

No description provided.

mercybassey and others added 30 commits March 6, 2024 09:14
@benclifford (Collaborator, Author)

@mercybassey looks like I managed to make the hang reproducible both in CI and by running the equivalent commands in Docker on my laptop - this branch is a fork of your flux testing PR.

@benclifford (Collaborator, Author)

@jameshcorbett @vsoch hi, not sure if you're interested in digging into this at the moment, but this branch runs the flux tests based on @mercybassey's PR #3159, which was hanging occasionally - I fixed the seed of the test order randomisation and added a bit of logging, and it now looks like it hangs every time in parsl/tests/test_flux.py::test_affinity. I also ran this same flux container on my laptop and get the same hang every time there too. So there's something reproducible here.

I put some log statements in to try to trace what's happening, and it seems to hang non-deterministically somewhere around the point that flux.job.executor.FluxExecutor.__init__ is executed. The last log message is usually, but not always, around these lines:

        self._shutdown_lock = threading.Lock()
        self._broken_event = threading.Event()
        self._shutdown_event = threading.Event()

So I think this is something weird happening inside Flux proper rather than in the parsl.executors.flux flux executor, but I don't have enough feeling for what's really meant to be happening here to give a decent diagnosis...
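One thing that might help pin down where it's stuck (just a sketch, not something this branch does): ask the interpreter for a dump of every thread's stack if the constructor hasn't returned after a minute, using the standard library's faulthandler:

    import faulthandler
    import flux.job  # flux python bindings, same module as referenced above

    # Dump all thread stacks if we are still blocked after 60 seconds,
    # without killing the process.
    faulthandler.dump_traceback_later(60, exit=False)
    executor = flux.job.executor.FluxExecutor()  # the constructor that appears to hang
    faulthandler.cancel_dump_traceback_later()   # only reached if it returned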

If you're interested, I think you should be able to recreate the hang using the commands in the parsl+flux.yaml GitHub Actions workflow that this PR adds - it's what I did on my laptop. Let me know if there's anything I can do to get more useful information.

@vsoch (Contributor) commented Mar 18, 2024

My guess would be that, using that function, you don't have any cores you are allowed to run on, so it doesn't run (and hangs). https://stackoverflow.com/questions/64189176/os-sched-getaffinity0-vs-os-cpu-count

At least on Linux I found this to mean that if none of the allowed cores is currently available, the thread of a child-process won't run, even if other, non-allowed cores would be idle. So "affinity" is a bit misleading here.

There might be more information in that issue - I'm not really sure what you are testing there - but submitting a job that asks for a value (the affinity) that will hang / not return when none are available, in a test environment with a bunch of other stuff running (and using up the threads), smells funny.
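To make that distinction concrete, a tiny sketch (Linux-only, since sched_getaffinity is a Linux call):

    import os

    # Total cores the OS reports vs. the cores this process may be scheduled on.
    print("os.cpu_count():         ", os.cpu_count())
    print("os.sched_getaffinity(0):", os.sched_getaffinity(0))
    # If the affinity set is smaller than cpu_count(), a request for more cores
    # than are in that set could plausibly sit waiting forever.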

@benclifford (Collaborator, Author)

That test_affinity test does pass in some situations (it was hard for us to get the hang to happen in CI until I realised it seems to be test-order dependent), and in particular, if I run just that test on its own, it seems to pass just fine.

With test_affinity running at the end of the test sequence, using --random-order-seed=893320 (which is how I got the order that seems to hang for me often), I tried a few variations:

If I replace sched_getaffinity with e.g. eval, I still get hangs:

        future = executor.submit(eval, {"cores_per_task": 2}, "[1,2,3]")

        # future = executor.submit(os.sched_getaffinity, {"cores_per_task": 2}, 0)

If I drop cores_per_task to 1, I still get hangs.

If I remove the core resource specification entirely, I still get hangs:

        future = executor.submit(eval, {}, "[1,2,3]")
        # future = executor.submit(eval, {"cores_per_task": 1}, "[1,2,3]")
        # future = executor.submit(os.sched_getaffinity, {"cores_per_task": 2}, 0)      

... and this very special horror: if I run touch parsl/tests/test_flux.py, then the first time I run:

flux start pytest parsl/tests/test_flux.py --config local  --random-order --random-order-bucket=module --random-order-seed=893320 --full-trace --log-cli-level=DEBUG

the whole set of tests passes the first time, and then does not pass if I re-run inside the same container, until I touch test_flux.py again. I'm not really clear what effect changing the time metadata on that test_flux.py file would have, though...
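One cheap check I haven't tried (pure speculation about a caching angle) would be clearing Python's bytecode caches and pytest's cache before re-running, to see whether a re-run inside the same container then behaves like a fresh one:

    import pathlib
    import shutil

    # Remove compiled bytecode under the parsl tree and pytest's own cache dir,
    # so a re-run starts from roughly the same state as after touching the file.
    for cache_dir in pathlib.Path("parsl").rglob("__pycache__"):
        shutil.rmtree(cache_dir, ignore_errors=True)
    shutil.rmtree(".pytest_cache", ignore_errors=True)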

So I don't think this is directly to do with that os.sched_getaffinity call, and is more to do with ... something else?

This test_flux.py came as part of the PR #2051 contribution of the parsl/flux executor from @jameshcorbett, and I don't think I'd ever tried running it before the last week or so - so I don't have any feel for what could go wrong here.

@vsoch (Contributor) commented Mar 19, 2024

Sorry can’t add more insight here - I don’t really understand this test.
