Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zdtm: Distinguish between fail and crash of dump #2376

Open
wants to merge 2 commits into
base: criu-dev
Choose a base branch
from

Conversation

bsach64
Copy link

@bsach64 bsach64 commented Mar 28, 2024

This PR distinguishes criu dump crash from criu dump fail in the zdtm.py script. Now, zdtm reports criu dump crash as a test failure.

Very similiar to the work done in this PR (#2140)

But, to detect a rpc crash we see that if the response from the RPC is invalid, then we assume
the RPC crashed.

For detecting the crash for the CLI we can check the Popen.returncode
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.returncode

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

Fixes: #350

@codecov-commenter
Copy link

codecov-commenter commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.57%. Comparing base (00d7cdc) to head (6298e80).

❗ Current head 6298e80 differs from pull request most recent head 4bfaf6e. Consider uploading reports for the commit 4bfaf6e to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2376      +/-   ##
============================================
- Coverage     70.73%   70.57%   -0.16%     
============================================
  Files           136      135       -1     
  Lines         32940    33449     +509     
============================================
+ Hits          23299    23606     +307     
- Misses         9641     9843     +202     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/zdtm.py Outdated
def exit_signal(ret):
if ret == -2:
return True
return False
Copy link
Member

@rst0git rst0git Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified to:

def exit_signal(ret):
    return ret == -2

Why do we use -2 here?

Copy link
Author

@bsach64 bsach64 Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should've written that in the first place. I will fix it. Thank You :)

If I understand your question correctly,
The run static method returns -1 if criu dump fails.
I changed the run method to return -2 if rpc sends an invalid response (CRIU crashed?).

The exit_signal function checks for a crash. So, if ret == -2 then it means CRIU crashed.

test/zdtm.py Outdated Show resolved Hide resolved
@Snorch
Copy link
Member

Snorch commented Apr 16, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Adds a exit_signal static method to criu_cli, criu_config and criu_rpc
used to detect a crash.

Fixes: checkpoint-restore#350

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
@bsach64
Copy link
Author

bsach64 commented Apr 19, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch,
I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.
On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh.
There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

@Snorch
Copy link
Member

Snorch commented Apr 22, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

@bsach64
Copy link
Author

bsach64 commented Apr 24, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want.
If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail. I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :)
Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

@Snorch
Copy link
Member

Snorch commented Apr 26, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

test/jenkins/criu-fault.sh Outdated Show resolved Hide resolved
Copy link
Member

@Snorch Snorch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@bsach64
Copy link
Author

bsach64 commented Apr 26, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Sorry, the misunderstanding was on my part. You are correct; the tests previously passed as they were re-run without the fault injection. Thank you for all your help @Snorch :)

@Snorch
Copy link
Member

Snorch commented Apr 26, 2024

Hm, test still does not pass with && fail as though we don't call fail function we still return non-zero exit code and with set -e it fails the script.

This seem to do the right trick:

if ./test/zdtm.py run -t zdtm/static/vfork00 --fault 136 --report report -f h ; then
        fail
fi

Also maybe someone else has better ideas?

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distinguish criu dump fail from criu dump crash in zdtm
4 participants