
Add a new Chaos transport that can simulate network failure and add it to the kubelet #6729

Merged
merged 3 commits into from Apr 13, 2015

Conversation

smarterclayton
Contributor

A new package pkg/client/chaosclient has a framework for simulating random HTTP
client failures as well as returning arbitrary responses. The client.Config is
extended to support the ability to wrap the transport to inject those errors, and
the kubelet now takes an argument --chaos_chance=<p> which reflects the probability
a request to the master will be rejected with a "connection reset by peer" error.

To try this locally, pass --chaos_chance=0.1 to the kubelet on start, or try this
with the local cluster:

$ CHAOS_CHANCE=0.1 hack/local-up-cluster.sh

The Chaos transport logs each request it replaces. Because the default error is
a fairly generic one, most parts of the code should immediately log it to glog
at a verbosity of at least V(2).

Future enhancements will include more error scenarios, the ability to
simulate network latency, and possibly panics.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@smarterclayton
Contributor Author

@timothysc this will be of relevance for our reliability testing

@roberthbailey
Contributor

/cc @fabioy (who was also working on simulating failures, albeit in other ways).

@timothysc
Member

/cc @jayunit100 @satnam6502 fyi.

@smarterclayton
Contributor Author

Adding this to the client and controller manager will come in a separate pull. I want to get the basics sorted and agreed on here.

@timothysc
Member

I looked through the PR, and in general I'm a +1, but I would still like a daemon-killing chaos-monkey re: #4548 . Primarily b/c it's pretty difficult to simulate a start-up storm, or net-split, etc.

That being said, I think both would be good to have.

@smarterclayton
Contributor Author

Yeah, different problem set. Although this could trigger random panics, I don't think it's the right place for external chaos.

@@ -95,6 +96,7 @@ type KubeletServer struct {
 	TLSPrivateKeyFile         string
 	CertDirectory             string
 	NodeStatusUpdateFrequency time.Duration
+	ChaosChance               float64
Contributor

Not a fan of having oddball testing options in this struct ("ReallyCrashForTesting"?). I'd prefer if these settings were coalesced into a separate struct and referenced either by a pointer from KubeletServer or perhaps just as a floating global config (presumably there won't be more than 1 per process...).

Contributor Author

These are internal configs, so it's appropriate to put them under the struct. Globals are far worse and break encapsulation so I don't think we should ever add those (ReallyCrash is probably the exception).

I don't see a ton of value in separating them out, although I'd be happy to add a better description of when you would use those and separate them visually.

Contributor

My concern is with "sprinkling of test code" throughout the codebase. I'd prefer to have them separated out in an easily discernible way, in case we wish to rip them out or refactor them in the future. At the least, please comment.

Contributor Author

I think there is a small set of folks that would run Chaos in production either as canaries or continuous test, but I agree with the sentiment and will add comments to them.

@fabioy
Contributor

fabioy commented Apr 13, 2015

LGTM. Just update with the changes and I'll merge.

Thanks.

@smarterclayton
Contributor Author

Updated

fabioy added a commit that referenced this pull request Apr 13, 2015
Add a new Chaos transport that can simulate network failure and add it to the kubelet
@fabioy fabioy merged commit e99141d into kubernetes:master Apr 13, 2015
5 participants