Checkpoints will not reload from a run that was Ctrl-C'ed #232

Closed
benclifford opened this Issue Apr 25, 2018 · 5 comments

@benclifford
Contributor

benclifford commented Apr 25, 2018

If I use ctrl-c to kill a run, the checkpoint file will not load into a subsequent run:

2018-04-25 10:31:39 parsl.dataflow.dflow:692 [ERROR]  Failed to load Checkpoint: /home/benc/parsl/desc-skeleton/runinfo/001/checkpoint/tasks.pkl
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 672, in _load_checkpoints
    data = pickle.load(f)
TypeError: __init__() missing 2 required positional arguments: 'reason' and 'exitcode'

My gut feeling is that it is having trouble loading a serialised exception, but I haven't probed.
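For readers who want to see how a serialised exception can produce this kind of TypeError, here is a minimal sketch. The class names are hypothetical and are not Parsl's actual exception classes; the assumption is only that some exception takes `reason` and `exitcode` as required constructor arguments and does not forward them to `Exception.__init__`. In that case pickling succeeds, but unpickling calls the constructor with no arguments and fails.

```python
import pickle


class BrokenAppFailure(Exception):
    """Hypothetical exception, NOT Parsl's actual class."""

    def __init__(self, reason, exitcode):
        # Nothing is forwarded to Exception.__init__, so self.args == ().
        super().__init__()
        self.reason = reason
        self.exitcode = exitcode


class PicklableAppFailure(Exception):
    """Same idea, but the constructor arguments are kept in self.args."""

    def __init__(self, reason, exitcode):
        super().__init__(reason, exitcode)
        self.reason = reason
        self.exitcode = exitcode


def roundtrip(obj):
    """Pickle and immediately unpickle an object."""
    return pickle.loads(pickle.dumps(obj))


try:
    roundtrip(BrokenAppFailure("killed", 130))
    load_error = None
except TypeError as err:
    # Pickling worked; the failure happens on load, when pickle
    # re-creates the exception by calling the class with self.args.
    load_error = err

# Forwarding the arguments lets the exception survive a round trip.
restored = roundtrip(PicklableAppFailure("killed", 130))
```

Exceptions are rebuilt on load by calling the class with the contents of `self.args`, so any required argument that is missing from `self.args` shows up as a "missing required positional arguments" error at load time rather than at dump time.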

@yadudoc yadudoc added the bug label Apr 27, 2018

@yadudoc yadudoc self-assigned this Apr 27, 2018

@yadudoc

Contributor

yadudoc commented Apr 27, 2018

We need to have this investigated for 0.5.1.

@yadudoc yadudoc added this to the Parsl-0.5.1 milestone Apr 27, 2018

@yadudoc

Contributor

yadudoc commented May 1, 2018

Looks like we need a signal handler in the DFK. I suspect that we weren't writing a checkpoint at all in your case. Can you check if the checkpoint files were empty?
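As a rough illustration of the signal-handler idea, the sketch below writes a checkpoint when Ctrl-C arrives and then interrupts as usual. It is not the DFK's implementation: `write_checkpoint`, `make_sigint_handler`, `install_sigint_checkpoint`, and the `get_state` callback are all hypothetical names, and the checkpoint is assumed to be a plain picklable object.

```python
import os
import pickle
import signal
import tempfile


def write_checkpoint(state, path):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace swaps the finished file into place in one step, so a
    # reader should never see a half-written or empty checkpoint.
    os.replace(tmp_path, path)


def make_sigint_handler(get_state, path):
    """Build a handler that checkpoints and then interrupts as usual."""
    def handler(signum, frame):
        write_checkpoint(get_state(), path)
        raise KeyboardInterrupt
    return handler


def install_sigint_checkpoint(get_state, path):
    """Register the handler for Ctrl-C; call this from the main thread."""
    signal.signal(signal.SIGINT, make_sigint_handler(get_state, path))
```

Writing to a temporary file and renaming it is one way to avoid leaving an empty or truncated checkpoint behind if the process is killed partway through the write.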

@yadudoc

Contributor

yadudoc commented May 3, 2018

@benclifford Can you check if this test adequately captures your situation? -> https://github.com/Parsl/parsl/blob/sigint_restart_%23232/parsl/tests/test_checkpointing/test_regression_232.py

I believe this issue might have disappeared now that we don't checkpoint failed tasks at all.
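A minimal sketch of what "not checkpointing failed tasks" could look like, assuming results are held in a dict keyed by task id and that a failed task stores its exception as the value. The `checkpointable` helper and the sample data are hypothetical and are not taken from Parsl's code.

```python
import pickle


def checkpointable(results):
    """Keep only entries whose value is not an exception."""
    return {
        key: value
        for key, value in results.items()
        if not isinstance(value, BaseException)
    }


# Hypothetical task table: task-2 failed, for example because of Ctrl-C.
results = {
    "task-1": 42,
    "task-2": RuntimeError("interrupted"),
    "task-3": "done",
}

# Only successful results reach the checkpoint, so no exception object
# has to be reconstructed when the checkpoint is loaded later.
blob = pickle.dumps(checkpointable(results))
restored = pickle.loads(blob)
```

With failed tasks filtered out, a later run would simply re-execute them, which avoids the unpickling problem altogether.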

@benclifford

Contributor

benclifford commented May 4, 2018

Those tests look broadly fine, though I'm having trouble replicating what I think is the expected behaviour on my laptop when I kill things other than by pressing control-C.

In any case, whatever is left behind when I kill stuff with a Ctrl-C does not upset checkpoint restore, so recent changes seem to have fixed this bug, yes.

@yadudoc

Contributor

yadudoc commented May 4, 2018

Okay, I'm closing this then.

@yadudoc yadudoc closed this May 4, 2018

yadudoc added a commit that referenced this issue May 14, 2018