Add a new Chaos transport that can simulate network failure and add it to the kubelet #6729
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
@timothysc this will be of relevance for our reliability testing
6609de9 to 2b928a0
/cc @fabioy (who was also working on simulating failures, albeit in other ways).
/cc @jayunit100 @satnam6502 fyi.
Adding this to the client and controller manager will come in a separate pull. I want to get the basics sorted and agreed on here.
I looked through the PR, and in general I'm a +1, but I would still like a daemon-killing chaos-monkey re: #4548, primarily b/c it's pretty difficult to simulate a start-up storm, or net-split, etc. That being said, I think both would be good to have.
Yeah, different problem set. Although this could trigger random panics, I don't think it's the right place for external chaos.
@@ -95,6 +96,7 @@ type KubeletServer struct {
TLSPrivateKeyFile string
CertDirectory string
NodeStatusUpdateFrequency time.Duration
ChaosChance float64
Not a fan of having oddball testing options in this struct ("ReallyCrashForTesting"?). I'd prefer if these settings were coalesced into a separate struct and referenced either by a pointer from KubeletServer or perhaps just as a floating global config (presumably there won't be more than 1 per process...).
These are internal configs, so it's appropriate to put them under the struct. Globals are far worse and break encapsulation so I don't think we should ever add those (ReallyCrash is probably the exception).
I don't see a ton of value in separating them out, although I'd be happy to add a better description of when you would use those and separate them visually.
My concern is with "sprinkling of test code" throughout the codebase. I'd prefer to have them separated out in an easily discernible way, in case we wish to rip them out or refactor them in the future. At the least, please comment.
I think there is a small set of folks that would run Chaos in production either as canaries or continuous test, but I agree with the sentiment and will add comments to them.
LGTM. Just update with the changes and I'll merge. Thanks.
2b928a0 to ca335d7
Updated
Add a new Chaos transport that can simulate network failure and add it to the kubelet
A new package, pkg/client/chaosclient, has a framework for simulating random HTTP client failures as well as returning arbitrary responses. The client.Config is extended to support wrapping the transport to inject those errors, and the kubelet now takes an argument, --chaos_chance=<p>, which sets the probability that a request to the master will be rejected with a "connection reset by peer" error.
To try this locally, pass --chaos_chance=0.1 to the kubelet on start, or try this
with the local cluster:
The Chaos transport will log when it replaces a request. Since the default error
is a fairly generic one, most parts of the code should immediately log that to
glog at V(2) or higher.
Future enhancements will include more error scenarios, the ability to
simulate network latency, and possibly panics.