Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split publish/subscribe behavior to better-leverage context #476

Closed
tgross opened this issue Aug 15, 2017 · 7 comments
Closed

split publish/subscribe behavior to better-leverage context #476

tgross opened this issue Aug 15, 2017 · 7 comments

Comments

@tgross
Copy link
Contributor

tgross commented Aug 15, 2017

Cancelling jobs, watches, servers, etc. has been a bit of a bug farm. We don't really leverage context as much as we probably could, but any use of this needs to make sure we're sending messages back like stopping/stopped and deregistering from the event bus.

I've worked up a preliminary design that splits the EventHandler into separate Publisher and Subscriber types and uses context.CancelFunc as a means of signaling shutdown instead of publishing an event. This seems to work ok and I think will be subtly simpler in the long run.

Here's the worked demo: https://gist.github.com/tgross/6b0e32a85157c045718ee2ef5d3a8da6
If you go run main.go 1000 this will run for 1000ms and then exit cleanly. It has the pub/sub types, an implementation of watch (pub-only), metric (sub-only), and job (pub+sub).

cc @cheapRoc

@tgross
Copy link
Contributor Author

tgross commented Aug 21, 2017

Updated the worked demo example here: https://gist.github.com/tgross/b167a2b1d4a5b727643d6e25f9962553

This one handles signals (SIGTERM and SIGINT, plus SIGHUP for reloads) rather than having a fixed time interval.

@jwreagor
Copy link
Contributor

jwreagor commented Sep 8, 2017

There's a lot of work sitting behind this and I definitely feel the urge to complete what I've put in so far.

At the moment I'm stuck attempting to uncover the reason why CP is not exiting when all jobs have completed cleanly. The primary integration test where this was uncovered is running the following (test_envvars)...

{
  consul: "127.0.0.1:8500",
  jobs: [
    {
      name: "testenvvar",
      port: 8500,
      exec: "env",
      health: {
        exec: "true",
        interval: 10,
        ttl: 25
      }
    }
  ]
}

This will halt execution and exits cleanly on master so I know the call stack the bug is in but not the specific code that's causing it. When this fires on my branch, it's still blocking on the WaitGroup, which is puzzling since de-registration should have occurred for the Job after the Command exits.

Still in flux.

@jwreagor
Copy link
Contributor

jwreagor commented Sep 8, 2017

Ok, just as a note, it looks like the control server is not being shut down when a job exits cleanly/successfully. This makes sense because the existing implementation doesn't Unregister itself from the EventBus. Now we're getting somewhere...

@jwreagor
Copy link
Contributor

Another note, I'm on a good path now and down to a single failing integration test, test_coprocess.

jwreagor pushed a commit to jwreagor/containerpilot that referenced this issue Oct 5, 2017
…taCenter#476)

This commit refactors away EventHandler to replace it with separate
Publisher/Subscriber interfaces. The side effect is properly using the `context`
package throughout ContainerPilot as the basis for it's shutdown process. This
helps provide a more structured shutdown and no longer relies on the event bus
and event processing which was bug prone.

* Introduce more `context` handling throughout our shutdown processes
* Introduce Subscriber and Publisher interfaces
* Turn `Job` into a Publisher and Subscriber
* Turn `Watch` into a Publisher
* Turn `Metric` into a Subscriber
* Refactor Control and Telemetry server shutdown/reload
* Feed a context.CancelFunc up stack into the reload control endpoint
* Telemetry server reload uses an appropriate context.CancelFunc
* Fix all broken unit tests for pub/sub refactoring
* Patch tests around pub/sub to use QuitByClose event for testing only
* Reuse QuitByTest event for shutting down through internal control channel
* Shut down CP after all Jobs have completed
@jwreagor
Copy link
Contributor

jwreagor commented Oct 5, 2017

I cleaned up where I left off a few weeks ago and all tests are passing. Going to start rigorous testing in order to make sure this thing is truly stable.

jwreagor pushed a commit to jwreagor/containerpilot that referenced this issue Oct 5, 2017
…taCenter#476)

This commit refactors away EventHandler to replace it with separate
Publisher/Subscriber interfaces. The side effect is properly using the `context`
package throughout ContainerPilot as the basis for it's shutdown process. This
helps provide a more structured shutdown and no longer relies on the event bus
and event processing which was bug prone.

* Introduce more `context` handling throughout our shutdown processes
* Introduce Subscriber and Publisher interfaces
* Turn `Job` into a Publisher and Subscriber
* Turn `Watch` into a Publisher
* Turn `Metric` into a Subscriber
* Refactor Control and Telemetry server shutdown/reload
* Feed a context.CancelFunc up stack into the reload control endpoint
* Telemetry server reload uses an appropriate context.CancelFunc
* Fix all broken unit tests for pub/sub refactoring
* Patch tests around pub/sub to use QuitByClose event for testing only
* Reuse QuitByTest event for shutting down through internal control channel
* Shut down CP after all Jobs have completed
jwreagor pushed a commit that referenced this issue Oct 10, 2017
This commit refactors away EventHandler to replace it with separate
Publisher/Subscriber interfaces. The side effect is properly using the `context`
package throughout ContainerPilot as the basis for it's shutdown process. This
helps provide a more structured shutdown and no longer relies on the event bus
and event processing which was bug prone.

* Introduce more `context` handling throughout our shutdown processes
* Introduce Subscriber and Publisher interfaces
* Turn `Job` into a Publisher and Subscriber
* Turn `Watch` into a Publisher
* Turn `Metric` into a Subscriber
* Refactor Control and Telemetry server shutdown/reload
* Feed a context.CancelFunc up stack into the reload control endpoint
* Telemetry server reload uses an appropriate context.CancelFunc
* Fix all broken unit tests for pub/sub refactoring
* Patch tests around pub/sub to use QuitByClose event for testing only
* Reuse QuitByTest event for shutting down through internal control channel
* Shut down CP after all Jobs have completed
@jwreagor
Copy link
Contributor

This has just been merged. Thanks for the proposal @tgross!

@jwreagor
Copy link
Contributor

No idea why I closed this without getting it released first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants