
Fix `test_midway_ipp.ssh.slurm.config1.py` test #196

Closed · annawoodard opened this issue Apr 5, 2018 · 2 comments

annawoodard (Collaborator) commented Apr 5, 2018

I see the following error when running `python test_sites/test_midway/test_midway_ipp.ssh.slurm.config1.py` with the singleNode config:

```
Traceback (most recent call last):
  File "/Users/awoodard/software/anaconda3/envs/parsl_py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/awoodard/software/anaconda3/envs/parsl_py36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/awoodard/ci/parsl/parsl/dataflow/flow_control.py", line 121, in _wake_up_timer
    self.make_callback(kind='timer')
  File "/Users/awoodard/ci/parsl/parsl/dataflow/flow_control.py", line 141, in make_callback
    self.callback(tasks=self._event_buffer, kind=kind)
  File "/Users/awoodard/ci/parsl/parsl/dataflow/strategy.py", line 162, in _strategy_simple
    status = exc.status()
  File "/Users/awoodard/ci/parsl/parsl/executors/ipp.py", line 253, in status
    status = self.execution_provider.status(self.engines)
  File "/Users/awoodard/ci/libsubmit/libsubmit/providers/cluster_provider.py", line 174, in status
    self._status()
  File "/Users/awoodard/ci/libsubmit/libsubmit/providers/slurm/slurm.py", line 141, in _status
    self.resources[job_id]['status'] = status
KeyError: '41784475_20'
```
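
For context, the failing line sits in the provider's queue-polling path. Below is a minimal sketch of the pattern the traceback implies; the function name `poll_status` and the squeue-parsing details are assumptions for illustration, not the actual libsubmit source:

```python
# Hypothetical sketch of the status-polling pattern implied by the
# traceback above; not the actual libsubmit/slurm.py source.
import subprocess

def poll_status(resources):
    """resources maps job_id -> {'status': ...} for jobs we submitted."""
    job_ids = ','.join(resources)
    out = subprocess.run(['squeue', '--jobs', job_ids],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:         # skip the squeue header row
        fields = line.split()
        if not fields:
            continue
        job_id, state = fields[0], fields[4]  # JOBID and ST columns
        # Raises KeyError when squeue reports a job that is not in
        # `resources`, e.g. '41784475_20' above.
        resources[job_id]['status'] = state
```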

@yadudoc yadudoc self-assigned this May 7, 2018

@yadudoc yadudoc added the bug label May 7, 2018

@yadudoc yadudoc added this to the Parsl-0.6.0 milestone May 7, 2018

djf604 (Collaborator) commented May 8, 2018

Running `test_multi_ipp_None_local.py` spits out the following log:

```
INFO:parsl.dataflow.dflow:Task 18 launched on site Local_IPP_1
DEBUG:parsl.dataflow.dflow:Task 18 launched with AppFut:<AppFuture at 0x7f02d208f5c0 state=pending>
INFO:parsl.dataflow.dflow:Task 19 submitted for App python_app_2, waiting on tasks []
INFO:parsl.dataflow.dflow:Task 19 launched on site Local_IPP_2
DEBUG:parsl.dataflow.dflow:Task 19 launched with AppFut:<AppFuture at 0x7f02d208fda0 state=pending>
Waiting ....
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/cephfs/users/dominic/.conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/cephfs/users/dominic/.conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/parsl/dataflow/flow_control.py", line 121, in _wake_up_timer
    self.make_callback(kind='timer')
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/parsl/dataflow/flow_control.py", line 141, in make_callback
    self.callback(tasks=self._event_buffer, kind=kind)
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/parsl/dataflow/strategy.py", line 162, in _strategy_simple
    status = exc.status()
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/parsl/executors/ipp.py", line 253, in status
    status = self.execution_provider.status(self.engines)
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/libsubmit/providers/cluster_provider.py", line 180, in status
    self._status()
  File "/cephfs/users/dominic/.conda/lib/python3.6/site-packages/libsubmit/providers/slurm/slurm.py", line 146, in _status
    self.resources[job_id]['status'] = status
KeyError: '16083'

INFO:parsl.dataflow.dflow:Task 1 completed
INFO:parsl.dataflow.dflow:Task 3 completed
INFO:parsl.dataflow.dflow:Task 5 completed
INFO:parsl.dataflow.dflow:Task 7 completed
```

While the queue at the time of execution is:

```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
16083  daenerys ICD_Enco   akumar  R 3-20:14:25      1 kg15-20
16264  daenerys dnaseq-a   thomas  R    2:05:47      1 kg15-21
16265  daenerys dnaseq-a   thomas  R    2:05:47      1 kg15-21
```

This only happens when the config has multiple sites defined; with a single site it's fine. I've only tested on Slurm, so it may or may not occur on other clusters. Note that the job ID that produces the KeyError does actually exist in the queue, but it is always the oldest job ID there. Some of the engines do successfully execute jobs, but the script as a whole never terminates.

yadudoc (Member) commented May 8, 2018

The strategy module polls the status of all jobs periodically. Often this poll happens before any jobs have actually been submitted to the site. At that point the resources list is empty, and Slurm behaves counter-intuitively when `squeue --jobs` is called with an empty job list: it returns all jobs instead of none, and that breaks the provider code with a KeyError.

I'm able to reproduce this issue. A fix is in testing.
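
A guard along the following lines would avoid the crash. This is a hypothetical sketch under the same assumptions as the earlier snippet, not the fix that actually landed in libsubmit:

```python
# Hypothetical defensive version of the poll above; the real fix is in
# the Parsl/libsubmit commit referenced below.
import subprocess

def poll_status(resources):
    if not resources:
        # With an empty --jobs list, squeue returns *all* jobs,
        # so skip polling entirely when nothing is in flight.
        return
    job_ids = ','.join(resources)
    out = subprocess.run(['squeue', '--jobs', job_ids],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:
        fields = line.split()
        if not fields:
            continue
        job_id, state = fields[0], fields[4]
        if job_id not in resources:
            # Ignore jobs this provider did not submit instead of crashing.
            continue
        resources[job_id]['status'] = state
```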

yadudoc added a commit to Parsl/libsubmit that referenced this issue May 8, 2018
