
implement trainingjob controller without autoscaler #18

Conversation

m3ngyang
Collaborator

This pull request implements a training job controller without an autoscaler. It registers a custom resource called TrainingJob and then watches TrainingJob events to manage the lifecycle of each TrainingJob instance.
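For context, registering such a custom resource typically looks like the sketch below, using the v1beta1 apiextensions client of that era; the group, version, and names here are illustrative and not necessarily what this PR uses.

package example

import (
	apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
	apiextensionscli "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// createTrainingJobCRD registers the TrainingJob custom resource so the
// controller can watch instances of it. Group and names are illustrative.
func createTrainingJobCRD(cli apiextensionscli.Interface) error {
	crd := &apiextensionsv1beta1.CustomResourceDefinition{
		ObjectMeta: metav1.ObjectMeta{Name: "trainingjobs.paddlepaddle.org"},
		Spec: apiextensionsv1beta1.CustomResourceDefinitionSpec{
			Group:   "paddlepaddle.org",
			Version: "v1",
			Scope:   apiextensionsv1beta1.NamespaceScoped,
			Names: apiextensionsv1beta1.CustomResourceDefinitionNames{
				Plural: "trainingjobs",
				Kind:   "TrainingJob",
			},
		},
	}
	_, err := cli.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
	return err
}

Once the definition exists, the controller can watch TrainingJob events the same way it would watch any built-in resource.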

"github.com/paddlepaddle/edl/pkg/updater"
)

type TrainingJobController struct {
Collaborator

Exported types must have comments. I'm curious why this wasn't caught by the linter.

Collaborator Author

Right, I will add some comments.
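For example, a short doc comment on the exported type would satisfy the linter; the wording below is only illustrative.

// TrainingJobController watches TrainingJob resources and manages the
// lifecycle of each TrainingJob instance.
type TrainingJobController struct {
	// fields omitted
}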

"sync"
"time"

"github.com/golang/glog"
Collaborator

We have recently preferred to use log15.

Collaborator Author

Is there any difference between those two log packages?
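For illustration, the main practical difference (a sketch, not from this PR): glog is printf-style, leveled logging configured through command-line flags, while log15 is structured key/value logging with contextual child loggers.

package main

import (
	"github.com/golang/glog"
	log "github.com/inconshreveable/log15"
)

func main() {
	// glog: printf-style messages, verbosity controlled by the -v flag.
	glog.Infof("starting controller, workers=%d", 4)

	// log15: structured key/value pairs and a per-component child logger.
	logger := log.New("component", "TrainingJobController")
	logger.Info("starting controller", "workers", 4)
}

Either works; log15 mainly makes the output easier to parse and filter.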

eventBroadcaster.StartLogging(glog.Infof)
eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: kubeCli.CoreV1().Events("")})
workqueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "TrainingJob")
recorder := eventBroadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "TrainingJobController"})
Collaborator

Maybe we need some documentation or a sample controller to explain these concepts in a controller:

  • informer
  • event broadcaster
  • workqueue
  • recorder
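As a starting point, here is a minimal sketch of how those four pieces typically fit together in a client-go controller; the Pod informer and the names are illustrative, not this PR's code.

package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/record"
	"k8s.io/client-go/util/workqueue"
)

func wire(kubeCli kubernetes.Interface) {
	// informer: watches the API server and keeps a local cache of objects.
	factory := informers.NewSharedInformerFactory(kubeCli, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer()

	// event broadcaster + recorder: publish Events describing what the controller did.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: kubeCli.CoreV1().Events("")})
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "TrainingJobController"})
	_ = recorder

	// workqueue: rate-limited queue of object keys waiting to be reconciled.
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "TrainingJob")

	// The informer's handlers enqueue keys; workers pop keys, fetch the object
	// from the informer's cache, reconcile it, and emit Events via the recorder.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}

A sample-controller style document could walk through this flow step by step.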

defer c.workqueue.ShutDown()

glog.Info("Starting trainingjob controller")
glog.Info("Starting to create custom resource definition")
Collaborator

Maybe we can reduce some of the info logs and only output one when the operation succeeds or fails.

Collaborator Author
@m3ngyang Mar 30, 2018

Maybe we can use different log levels for them; verbose logs are useful when debugging.
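For illustration, glog's verbosity levels already support this split (a small sketch reusing the existing glog calls):

glog.Info("Starting trainingjob controller")                    // shown by default
glog.V(4).Info("Starting to create custom resource definition") // only shown with -v=4 or higher

That keeps the detailed messages available for debugging without cluttering normal runs.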

}
}

func (c *TrainingJobController) processNestWorkItem() bool {
Collaborator

Looks like the name should be processNextWorkItem

Collaborator Author

Yes, it's a typo here.

@m3ngyang force-pushed the trainingjob-controller branch 2 times, most recently from f358323 to 8bada91 on April 8, 2018 05:40
)

// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned
// which is closed on one of these signals. If a second signal is caught, the program
Contributor

"If a second signal is caught, the program is terminated with exit code 1." --- this is a hardcoded rule that people don't expect in general. Do we really need it? (if the user really want to terminate it, he can send SIGKILL using command kill -9).

For graceful shutdown, I think it's typically done in the following way:

  1. set up a signal handler
  2. start a goroutine that does the real work, and block the main function
  3. when a signal is caught, trigger graceful shutdown (e.g., by closing a channel or cancelling a context)
  4. the main function returns normally when graceful shutdown is complete

An example: https://gist.github.com/peterhellberg/38117e546c217960747aacf689af3dc2
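A minimal sketch of that pattern (illustrative, not this PR's code):

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// 1. set up the signal handler
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	// 2. start a goroutine that does the real work
	stopCh := make(chan struct{})
	done := make(chan struct{})
	go func() {
		<-stopCh // real work would run here until stopCh is closed
		close(done)
	}()

	// 3. when a signal is caught, trigger graceful shutdown
	<-sigCh
	close(stopCh)

	// 4. main returns normally once shutdown is complete
	<-done
}

The process then exits cleanly without any hardcoded "second signal kills it" rule.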

// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, stopCh <-chan struct{}) error {
// TODO add a lock to ensure there is only one controller in the cluster
Contributor

Please put your name after TODO, e.g., // TODO(helin): ...

Collaborator Author

This comment is redundant, and I'll delete it. Leader election has been implemented to ensure that there is only one controller managing TrainingJob.
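For reference, a sketch of how that can be done with client-go's leaderelection package; the lock name, namespace, and timings are illustrative, and the API shown is the more recent client-go one, which may differ from what this PR actually uses.

package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection runs the controller only while this instance holds the lease.
func runWithLeaderElection(ctx context.Context, kubeCli kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "trainingjob-controller", Namespace: "kube-system"},
		Client:     kubeCli.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,       // only the elected leader runs the controller
			OnStoppedLeading: func() {}, // step down if the lease is lost
		},
	})
}

Any replica that is not the leader simply waits for the lease, so only one controller manages TrainingJob at a time.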

// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, stopCh <-chan struct{}) error {
// TODO add a lock to ensure there is only one controller in the cluster
Contributor

Do we need this TODO? Maybe it's the job of the binary (e.g., cmd/paddle_controller/paddle_controller.go) to ensure only one instance is running, rather than the library.


log.Info("Starting workers")
for i := 0; i < threadiness; i++ {
go wait.Until(c.runWorker, time.Second, stopCh)
Contributor
@helinwang Apr 9, 2018

Why do we need a worker pool? Goroutines are already multiplexed onto a thread pool by the Go runtime; perhaps we can have a single loop and start a new goroutine for each iteration.

Collaborator Author

Sorry, I don't understand the difference between the worker pool and the single loop. Here, wait.Until(f func(), period time.Duration, stopCh <-chan struct{}) loops until the stop channel is closed, running f every period.

Contributor
@helinwang Apr 12, 2018

No worries, let me explain.

Here is the current code:

for i := 0; i < threadiness; i++ {
	go wait.Until(c.runWorker, time.Second, stopCh)
}

func (c *TrainingJobController) runWorker() {
	for c.processNextWorkItem() {
	}
}

And there are threadiness goroutines running all of the work items (assume there are N of them). The max concurrency is threadiness.

Another implementation is:

go wait.Until(c.runWorker, time.Second, stopCh) // only a single wait.Until

func (c *TrainingJobController) runWorker() {
	for item := range c.itemCh {
		go c.process(item)
	}
}

In this way the max concurrency is N rather than threadiness. This eliminates the unnecessary configuration value threadiness and at the same time provides better concurrency.

In C++ developers usually use a thread pool, but in Go the runtime already runs goroutines on a thread pool very efficiently, based on a good scheduling algorithm. We don't have to worry about it.

Does it make sense to you?

Collaborator Author

Thanks for the detailed explanation. I consulted the implementations of some built-in controllers, such as the deployment controller and the job controller. These controllers use a configured number to bound how many objects are allowed to sync concurrently, rather than syncing all of them at once. Moreover, that sync concurrency can be managed through the controller-manager configuration, for example --concurrent-deployment-syncs int32.
To sum up, I think it's more reasonable to use threadiness as the max concurrency.

Contributor

I see, I did not realize that there was a need to limit the concurrency; now it makes sense. Thanks!

@m3ngyang
Collaborator Author

m3ngyang commented May 6, 2018

This PR is included in #24, so I'm closing it here.

@m3ngyang closed this May 6, 2018