CRETE hangs indefinitely if QEMU terminates before first trace is dumped #10

Closed
moralismercatus opened this issue Dec 3, 2016 · 5 comments

@moralismercatus
Collaborator

moralismercatus commented Dec 3, 2016

Consider the following scenario:

  1. vm-node starts QEMU.
  2. vm-node transmits the seed to QEMU.
  3. QEMU terminates/crashes before a trace is dumped.
  4. vm-node, via fault tolerance, logs the crash, restarts QEMU and proceeds as normal.

At this point, there are no more test cases and no traces from which to generate new test cases. This is the logical point at which dispatch should recognize that there is nothing more to do and terminate gracefully; however, dispatch instead hangs at this point indefinitely.

The reason for this is that the guard DispatchFSM_::is_target_expired returns false if the first trace has not yet been received.
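The hang can be illustrated with a minimal sketch of such a guard. This is not CRETE's actual code: `DispatchState`, its fields, and the free function `is_target_expired` are hypothetical stand-ins, and only the guard's name comes from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified view of the dispatch state. The field names are
// assumptions made for illustration and are not taken from CRETE.
struct DispatchState {
    bool first_trace_received = false;  // set when the first trace arrives
    std::size_t pending_tests = 0;      // test cases waiting to be run
    std::size_t pending_traces = 0;     // traces waiting to be processed
};

// Sketch of a guard that refuses to report expiry before the first trace.
bool is_target_expired(const DispatchState& s) {
    if (!s.first_trace_received) {
        return false;  // never expires if QEMU dies before dumping a trace
    }
    return s.pending_tests == 0 && s.pending_traces == 0;
}

// Walks through the scenario: QEMU crashed, nothing is pending, and yet the
// guard still reports "not expired".
void demo_scenario() {
    DispatchState s;                 // no trace was ever received
    assert(!is_target_expired(s));   // dispatch keeps waiting
}
```

With no pending work and no trace ever arriving, nothing in this sketch can flip the guard to true, which matches the indefinite hang described above.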

A simple fix may be to use the first test case, instead of the first trace, as the condition for starting to check whether the target is expired. If the first test case is not a reliable indicator that a VM instance has started (the first test may originate from a seed), then a consistent indicator should be the reception of guest data by dispatch. We can be sure of this because a VM instance must have been started in order for vm-node to get the data from the guest OS.
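A rough sketch of the second variant, gating the expiry check on the reception of guest data rather than on the first trace, might look as follows. The names `DispatchProgress`, `guest_data_received`, and `is_target_expired_on_guest_data` are hypothetical and chosen only for illustration; they do not come from the CRETE code base.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dispatch bookkeeping; all names here are assumptions.
struct DispatchProgress {
    bool guest_data_received = false;  // set once vm-node forwards guest data
    std::size_t pending_tests = 0;     // test cases waiting to be run
    std::size_t pending_traces = 0;    // traces waiting to be processed
};

// Expiry checks begin as soon as guest data has arrived, which implies that
// a VM instance was started, even if no trace is ever dumped.
bool is_target_expired_on_guest_data(const DispatchProgress& p) {
    if (!p.guest_data_received) {
        return false;  // no VM instance has started yet
    }
    return p.pending_tests == 0 && p.pending_traces == 0;
}

// The crash scenario: guest data arrived, QEMU died before the first trace,
// and nothing is left to run.
void demo_crash_before_first_trace() {
    DispatchProgress p;
    p.guest_data_received = true;
    assert(is_target_expired_on_guest_data(p));  // dispatch can terminate
}
```

Under these assumptions, the crash-before-first-trace case now leads to termination, while dispatch still waits during the period before any VM instance has started.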

A more thorough fix would be to re-evaluate how CRETE determines when testing has completed.

@likebreath
Collaborator

The timer of crete-dispatch is now started once the first node connects to dispatch, so dispatch no longer hangs indefinitely; instead, CRETE terminates the current test once the timeout is reached. Please refer to the relevant commit 4763919.
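As an illustration only, and not the code from the commit, the behavior could be sketched like this. `TestTimer`, `on_node_connected`, and `timed_out` are hypothetical names, and time points are passed in explicitly so the logic can be checked without waiting.

```cpp
#include <cassert>
#include <chrono>

using SteadyClock = std::chrono::steady_clock;

// Hypothetical timer that starts on the first node connection.
class TestTimer {
public:
    explicit TestTimer(std::chrono::seconds timeout) : timeout_(timeout) {
        assert(timeout.count() > 0);  // a non-positive timeout makes no sense
    }

    // Called for every node connection; only the first call starts the timer.
    void on_node_connected(SteadyClock::time_point now) {
        if (!started_) {
            started_ = true;
            start_ = now;
        }
    }

    // True once the timeout has elapsed since the first connection.
    // Before any node connects, the timer has not started and never fires.
    bool timed_out(SteadyClock::time_point now) const {
        return started_ && (now - start_) >= timeout_;
    }

private:
    std::chrono::seconds timeout_;
    bool started_ = false;
    SteadyClock::time_point start_{};
};
```

Because the start no longer depends on a trace arriving, a QEMU crash before the first trace cannot prevent the timeout from firing in this sketch.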

@moralismercatus
Collaborator Author

@likebreath I originally changed the timer to start when the first trace was received because copying over the VM image in distributed mode can be time-consuming. Have you resolved this issue, or is it no longer a concern?

@likebreath
Collaborator

@moralismercatus Yes, you are right. You introduced the change when you worked on the regression framework. I just undid that change, because the indefinite hang is more serious than the image-copying issue. Certainly, a better solution that resolves both the indefinite hang and the image-copying issue is needed. This is why I am not closing this issue. Let me know your thoughts.

@moralismercatus
Collaborator Author

From my perspective, indefinite hanging is more important. Just wanted to make sure you were aware of the implications.

Since this change has been merged to master, maybe we should close this issue and open another, e.g., "Timer runs while VM image is being copied".

@likebreath
Collaborator

My motivation to comment on this issue was also to make sure you are aware of this change. Let's open another issue.
