New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in task checkpointing #234

Closed
yadudoc opened this Issue Apr 25, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@yadudoc
Contributor

yadudoc commented Apr 25, 2018

With config["globals"]["checkpointMode"] = "task_exit", the DFK attempts to checkpoint every task at task completion. However in the case of fast apps without dependencies, the task returns before the AppFuture is created. This results in a race condition between the checkpoint method and the task update.

If the checkpoint method is called first we get :

Traceback (most recent call last):
  File "test_regression_233.py", line 38, in <module>
    test_checkpoint_availability()
  File "test_regression_233.py", line 27, in test_checkpoint_availability
    original = run_checkpointed([])
  File "test_regression_233.py", line 18, in run_checkpointed
    x = cached_rand(i)
  File "/home/yadu/src/parsl/parsl/app/app_factory.py", line 76, in __call__
    return app_obj(*args, **kwargs)
  File "/home/yadu/src/parsl/parsl/app/python_app.py", line 46, in __call__
    **kwargs)
  File "/home/yadu/src/parsl/parsl/dataflow/dflow.py", line 499, in submit
    task_id, func, *new_args, **kwargs)
  File "/home/yadu/src/parsl/parsl/dataflow/dflow.py", line 322, in launch_task
    exec_fu.add_done_callback(partial(self.handle_update, task_id))
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 376, in add_done_callback
    fn(self)
  File "/home/yadu/src/parsl/parsl/dataflow/dflow.py", line 218, in handle_update
    self.checkpoint(tasks=[task_id])
  File "/home/yadu/src/parsl/parsl/dataflow/dflow.py", line 617, in checkpoint
    if self.tasks[task_id]['app_fu'].done() and \
AttributeError: 'NoneType' object has no attribute 'done'

@yadudoc yadudoc added the bug label Apr 25, 2018

@yadudoc yadudoc added this to the Parsl-0.6.0 milestone Apr 25, 2018

@yadudoc yadudoc self-assigned this Apr 25, 2018

@benclifford

This comment has been minimized.

Contributor

benclifford commented Apr 25, 2018

I also see this error - but in my case, as far as I can tell, it is from a bash app that is slow-ish (a 5 second sleep) - with no dependencies.

@yadudoc

This comment has been minimized.

Contributor

yadudoc commented Apr 25, 2018

This might be a different bug. I'm able to run 1s python apps with no issue.

@benclifford

This comment has been minimized.

Contributor

benclifford commented Apr 25, 2018

Yes, I have a different backtrace further up the stack more closely, now I look. It looks like it is something happening at the start of a call, not the end. I'll try to make a smaller test case.

yadudoc added a commit that referenced this issue Apr 25, 2018

Adding tests for regression for #234
Changing ec2 config type

yadudoc added a commit that referenced this issue Apr 25, 2018

Safer calls to checkpoint that does not rely on futures. fixes #234
Checkpointing now takes list of tasks to limit looping.
We rely on internal tasks status to avoid getting state from futures.

yadudoc added a commit that referenced this issue Apr 25, 2018

Adding tests for regression for #234
Changing ec2 config type

yadudoc added a commit that referenced this issue Apr 25, 2018

Safer calls to checkpoint that does not rely on futures. fixes #234
Checkpointing now takes list of tasks to limit looping.
We rely on internal tasks status to avoid getting state from futures.
@yadudoc

This comment has been minimized.

Contributor

yadudoc commented Apr 25, 2018

@benclifford Can you test with this branch please ?

@benclifford

This comment has been minimized.

Contributor

benclifford commented Apr 26, 2018

It looks like my exceptions weren't appearing against master, but against my hacked dev version: I don't get exceptions on master or with your #237 PR...

yadudoc added a commit that referenced this issue May 1, 2018

Adding tests for regression for #234
Changing ec2 config type

yadudoc added a commit that referenced this issue May 1, 2018

Safer calls to checkpoint that does not rely on futures. fixes #234
Checkpointing now takes list of tasks to limit looping.
We rely on internal tasks status to avoid getting state from futures.

@yadudoc yadudoc closed this in #237 May 3, 2018

yadudoc added a commit that referenced this issue May 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment