split publish/subscribe behavior to better-leverage context #476

tgross · 2017-08-15T17:17:19Z

Cancelling jobs, watches, servers, etc. has been a bit of a bug farm. We don't really leverage context as much as we probably could, but any use of this needs to make sure we're sending messages back like stopping/stopped and deregistering from the event bus.

I've worked up a preliminary design that splits the EventHandler into separate Publisher and Subscriber types and uses context.CancelFunc as a means of signaling shutdown instead of publishing an event. This seems to work ok and I think will be subtly simpler in the long run.

Here's the worked demo: https://gist.github.com/tgross/6b0e32a85157c045718ee2ef5d3a8da6
If you go run main.go 1000 this will run for 1000ms and then exit cleanly. It has the pub/sub types, an implementation of watch (pub-only), metric (sub-only), and job (pub+sub).

cc @cheapRoc

The text was updated successfully, but these errors were encountered:

tgross · 2017-08-21T18:15:48Z

Updated the worked demo example here: https://gist.github.com/tgross/b167a2b1d4a5b727643d6e25f9962553

This one handles signals (SIGTERM and SIGINT, plus SIGHUP for reloads) rather than having a fixed time interval.

jwreagor · 2017-09-08T13:49:48Z

There's a lot of work sitting behind this and I definitely feel the urge to complete what I've put in so far.

At the moment I'm stuck attempting to uncover the reason why CP is not exiting when all jobs have completed cleanly. The primary integration test where this was uncovered is running the following (test_envvars)...

{
  consul: "127.0.0.1:8500",
  jobs: [
    {
      name: "testenvvar",
      port: 8500,
      exec: "env",
      health: {
        exec: "true",
        interval: 10,
        ttl: 25
      }
    }
  ]
}

This will halt execution and exits cleanly on master so I know the call stack the bug is in but not the specific code that's causing it. When this fires on my branch, it's still blocking on the WaitGroup, which is puzzling since de-registration should have occurred for the Job after the Command exits.

Still in flux.

jwreagor · 2017-09-08T17:33:37Z

Ok, just as a note, it looks like the control server is not being shut down when a job exits cleanly/successfully. This makes sense because the existing implementation doesn't Unregister itself from the EventBus. Now we're getting somewhere...

jwreagor · 2017-09-11T14:23:44Z

Another note, I'm on a good path now and down to a single failing integration test, test_coprocess.

…taCenter#476) This commit refactors away EventHandler to replace it with separate Publisher/Subscriber interfaces. The side effect is properly using the `context` package throughout ContainerPilot as the basis for it's shutdown process. This helps provide a more structured shutdown and no longer relies on the event bus and event processing which was bug prone. * Introduce more `context` handling throughout our shutdown processes * Introduce Subscriber and Publisher interfaces * Turn `Job` into a Publisher and Subscriber * Turn `Watch` into a Publisher * Turn `Metric` into a Subscriber * Refactor Control and Telemetry server shutdown/reload * Feed a context.CancelFunc up stack into the reload control endpoint * Telemetry server reload uses an appropriate context.CancelFunc * Fix all broken unit tests for pub/sub refactoring * Patch tests around pub/sub to use QuitByClose event for testing only * Reuse QuitByTest event for shutting down through internal control channel * Shut down CP after all Jobs have completed

jwreagor · 2017-10-05T03:16:40Z

I cleaned up where I left off a few weeks ago and all tests are passing. Going to start rigorous testing in order to make sure this thing is truly stable.

…taCenter#476) This commit refactors away EventHandler to replace it with separate Publisher/Subscriber interfaces. The side effect is properly using the `context` package throughout ContainerPilot as the basis for it's shutdown process. This helps provide a more structured shutdown and no longer relies on the event bus and event processing which was bug prone. * Introduce more `context` handling throughout our shutdown processes * Introduce Subscriber and Publisher interfaces * Turn `Job` into a Publisher and Subscriber * Turn `Watch` into a Publisher * Turn `Metric` into a Subscriber * Refactor Control and Telemetry server shutdown/reload * Feed a context.CancelFunc up stack into the reload control endpoint * Telemetry server reload uses an appropriate context.CancelFunc * Fix all broken unit tests for pub/sub refactoring * Patch tests around pub/sub to use QuitByClose event for testing only * Reuse QuitByTest event for shutting down through internal control channel * Shut down CP after all Jobs have completed

This commit refactors away EventHandler to replace it with separate Publisher/Subscriber interfaces. The side effect is properly using the `context` package throughout ContainerPilot as the basis for it's shutdown process. This helps provide a more structured shutdown and no longer relies on the event bus and event processing which was bug prone. * Introduce more `context` handling throughout our shutdown processes * Introduce Subscriber and Publisher interfaces * Turn `Job` into a Publisher and Subscriber * Turn `Watch` into a Publisher * Turn `Metric` into a Subscriber * Refactor Control and Telemetry server shutdown/reload * Feed a context.CancelFunc up stack into the reload control endpoint * Telemetry server reload uses an appropriate context.CancelFunc * Fix all broken unit tests for pub/sub refactoring * Patch tests around pub/sub to use QuitByClose event for testing only * Reuse QuitByTest event for shutting down through internal control channel * Shut down CP after all Jobs have completed

jwreagor · 2017-10-10T19:21:07Z

This has just been merged. Thanks for the proposal @tgross!

jwreagor · 2017-10-13T04:01:35Z

No idea why I closed this without getting it released first.

tgross added proposal refactor labels Aug 15, 2017

tgross mentioned this issue Aug 21, 2017

documentation on stopTimeout field is confusing #479

Open

This was referenced Aug 24, 2017

Refactor shutdown with new pub/sub interfaces ~ WIP #496

Closed

Potential race during reload removes control socket file on filesystem #500

Closed

Tag logger field with the PID of a Command's exec process #498

Closed

jwreagor mentioned this issue Sep 6, 2017

Reload worker process and it's jobs via supervisor hitting control socket #505

Closed

jwreagor added open PR v3.x and removed proposal labels Oct 5, 2017

jwreagor closed this as completed Oct 10, 2017

jwreagor reopened this Oct 13, 2017

jwreagor added awaiting release released and removed open PR awaiting release labels Oct 13, 2017

jwreagor closed this as completed Oct 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

split publish/subscribe behavior to better-leverage context #476

split publish/subscribe behavior to better-leverage context #476

tgross commented Aug 15, 2017

tgross commented Aug 21, 2017

jwreagor commented Sep 8, 2017

jwreagor commented Sep 8, 2017

jwreagor commented Sep 11, 2017

jwreagor commented Oct 5, 2017

jwreagor commented Oct 10, 2017

jwreagor commented Oct 13, 2017

split publish/subscribe behavior to better-leverage context #476

split publish/subscribe behavior to better-leverage context #476

Comments

tgross commented Aug 15, 2017

tgross commented Aug 21, 2017

jwreagor commented Sep 8, 2017

jwreagor commented Sep 8, 2017

jwreagor commented Sep 11, 2017

jwreagor commented Oct 5, 2017

jwreagor commented Oct 10, 2017

jwreagor commented Oct 13, 2017