User Request: Support for duration #350

nikos912000 · 2021-06-23T10:11:40Z

Is your feature request related to a problem? Please describe.
Usually chaos engineering experiments run over a predetermined period of time. This has many benefits:

Users don't need to terminate the experiments manually.
It acts as a safety net; it is fairly easy to forget to terminate experiments.
Users of chaos engineering frameworks often don't have access to tools like kubectl, so they cannot run a kubectl delete themselves. This is the case for most of our users internally.
CI/CD platforms like Spinnaker can be used to run experiments. The entire lifecycle of an experiment needs to be handled in that case. Examples of integrations with CI/CD platforms include Chaos Monkey, Litmus, and more.

Describe the solution you'd like
The duration of the experiment (in seconds) will be defined in the CRD. The controller sleeps for duration seconds and sends an exit signal (SIGINT/SIGTERM) to the injector pods when this is exceeded.
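To make the proposal concrete, here is a minimal sketch of the expiry logic, not the controller's actual implementation. The names `schedule_expiry` and `stop_injectors` are hypothetical; `stop_injectors` stands in for whatever would deliver SIGINT/SIGTERM to the injector pods. The sketch uses a timer rather than a blocking sleep, on the assumption that the controller should stay free to handle other events while it waits.

```python
import threading
from typing import Callable


def schedule_expiry(
    duration_seconds: float,
    stop_injectors: Callable[[], None],
) -> threading.Timer:
    """Call stop_injectors once duration_seconds have elapsed.

    stop_injectors is a hypothetical callback; in a real controller it
    would send the exit signal (SIGINT/SIGTERM) to the injector pods.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")

    timer = threading.Timer(duration_seconds, stop_injectors)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

Returning the timer means a manual deletion before the deadline can call `timer.cancel()`, so the stop callback is not fired a second time.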

Describe alternatives you've considered
The alternative is to handle the lifecycle of an experiment manually, but this is not always possible or desirable, as mentioned earlier.

Devatoria · 2021-06-23T11:02:19Z

Hello @nikos912000 and thanks for the feedback. It is a feature that we are planning to implement soon, for most of the reasons you mentioned. The general idea would be to have a default timeout on disruptions (customizable for long experiments) so that a disruption expires by itself.

We monitor long running disruptions on our side and we indeed see a lot of people forgetting about an applied disruption.

No ETA yet for the feature but definitely planned as a high priority feature in our Q3 OKRs (starting July) so you can expect it to be done soon.

nikos912000 · 2021-06-23T11:24:08Z

Awesome, thanks @Devatoria.

We have this feature in a similar controller internally so happy to provide feedback. The main difference is the architecture (we don't have an injector pod) but I think it'll work in a similar way as described in my message above.

If that helps:

We also set a default duration in the CRD
We have validation both in the CRD and on the server side
For certain disruptions which are high-risk (e.g. AZ failures) we set a maximum duration which is sensible in the context of the disruption.
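The three points above could be combined in a server-side check along these lines. This is only a sketch under stated assumptions: the default of 300 seconds, the `az-failure` kind name, its 600-second cap, and the function `resolve_duration` are all hypothetical values chosen for illustration, not anything taken from either controller.

```python
from typing import Optional

# Assumed values for illustration only.
DEFAULT_DURATION_SECONDS = 300
MAX_DURATION_SECONDS = {
    "az-failure": 600,  # hypothetical cap for a high-risk disruption
}


def resolve_duration(kind: str, requested: Optional[int] = None) -> int:
    """Return the duration to apply, or raise ValueError if it is invalid."""
    # Fall back to the default when the spec omits a duration.
    duration = DEFAULT_DURATION_SECONDS if requested is None else requested

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(duration, bool) or not isinstance(duration, int):
        raise ValueError("duration must be an integer number of seconds")
    if duration <= 0:
        raise ValueError("duration must be positive")

    # Enforce the per-kind maximum, if one is configured.
    limit = MAX_DURATION_SECONDS.get(kind)
    if limit is not None and duration > limit:
        raise ValueError(
            f"duration {duration}s exceeds the {limit}s maximum for {kind}"
        )
    return duration
```

Running the same check on the server side as well as in the CRD schema means a request that bypasses schema validation is still rejected.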

ptnapoleon added the enhancement New feature or request label Jun 23, 2021

nikos912000 mentioned this issue Jul 28, 2021

User Request: Dead man's switch #375

Closed

DataDog locked and limited conversation to collaborators Jul 29, 2021

Devatoria closed this as completed Jul 29, 2021

This issue was moved to a discussion.