
implement trainingjob controller without autoscaler #18

Conversation

m3ngyang
Collaborator

This pull request implements a training job controller without an autoscaler. It registers a custom resource called TrainingJob and then watches TrainingJob events to manage the lifecycle of each TrainingJob instance.
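For context, registering such a custom resource typically looks like the sketch below, using the v1beta1 apiextensions client of that era; the group, version, and names here are illustrative and not necessarily what this PR uses.

package example

import (
	apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
	apiextensionscli "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// createTrainingJobCRD registers the TrainingJob custom resource so the
// controller can watch instances of it. Group and names are illustrative.
func createTrainingJobCRD(cli apiextensionscli.Interface) error {
	crd := &apiextensionsv1beta1.CustomResourceDefinition{
		ObjectMeta: metav1.ObjectMeta{Name: "trainingjobs.paddlepaddle.org"},
		Spec: apiextensionsv1beta1.CustomResourceDefinitionSpec{
			Group:   "paddlepaddle.org",
			Version: "v1",
			Scope:   apiextensionsv1beta1.NamespaceScoped,
			Names: apiextensionsv1beta1.CustomResourceDefinitionNames{
				Plural: "trainingjobs",
				Kind:   "TrainingJob",
			},
		},
	}
	_, err := cli.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
	return err
}

Once the definition exists, the controller can watch TrainingJob events the same way it would watch any built-in resource.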

"github.com/paddlepaddle/edl/pkg/updater"
)

type TrainingJobController struct {
Collaborator

Exported types must have comments. I'm curious why this wasn't caught by the linter.

Collaborator Author

Right, I will add some comments.
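For example, a short doc comment on the exported type would satisfy the linter; the wording below is only illustrative.

// TrainingJobController watches TrainingJob resources and manages the
// lifecycle of each TrainingJob instance.
type TrainingJobController struct {
	// fields omitted
}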

"sync"
"time"

"github.com/golang/glog"
Collaborator

We have recently preferred to use log15.

Collaborator Author

Is there any difference between those two log packages?
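For illustration, the main practical difference (a sketch, not from this PR): glog is printf-style, leveled logging configured through command-line flags, while log15 is structured key/value logging with contextual child loggers.

package main

import (
	"github.com/golang/glog"
	log "github.com/inconshreveable/log15"
)

func main() {
	// glog: printf-style messages, verbosity controlled by the -v flag.
	glog.Infof("starting controller, workers=%d", 4)

	// log15: structured key/value pairs and a per-component child logger.
	logger := log.New("component", "TrainingJobController")
	logger.Info("starting controller", "workers", 4)
}

Either works; log15 mainly makes the output easier to parse and filter.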

eventBroadcaster.StartLogging(glog.Infof)
eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: kubeCli.CoreV1().Events("")})
workqueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "TrainingJob")
recorder := eventBroadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "TrainingJobController"})
Collaborator

Maybe we need some documentation or a sample controller to explain these concepts in a controller:

  • informer
  • event broadcaster
  • workqueue
  • recorder
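As a starting point, here is a minimal sketch of how those four pieces typically fit together in a client-go controller; the Pod informer and the names are illustrative, not this PR's code.

package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/record"
	"k8s.io/client-go/util/workqueue"
)

func wire(kubeCli kubernetes.Interface) {
	// informer: watches the API server and keeps a local cache of objects.
	factory := informers.NewSharedInformerFactory(kubeCli, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer()

	// event broadcaster + recorder: publish Events describing what the controller did.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: kubeCli.CoreV1().Events("")})
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "TrainingJobController"})
	_ = recorder

	// workqueue: rate-limited queue of object keys waiting to be reconciled.
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "TrainingJob")

	// The informer's handlers enqueue keys; workers pop keys, fetch the object
	// from the informer's cache, reconcile it, and emit Events via the recorder.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}

A sample-controller style document could walk through this flow step by step.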

defer c.workqueue.ShutDown()

glog.Info("Starting trainingjob controller")
glog.Info("Starting to create custom resource definition")
Collaborator

Maybe we can reduce some of the info logs and only output one when the operation succeeds or fails.

Collaborator Author
@m3ngyang Mar 30, 2018

Maybe we can use different log levels for them; verbose logs are useful when debugging.
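For illustration, glog's verbosity levels already support this split (a small sketch reusing the existing glog calls):

glog.Info("Starting trainingjob controller")                    // shown by default
glog.V(4).Info("Starting to create custom resource definition") // only shown with -v=4 or higher

That keeps the detailed messages available for debugging without cluttering normal runs.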

}
}

func (c *TrainingJobController) processNestWorkItem() bool {
Collaborator

Looks like the name should be processNextWorkItem

Collaborator Author

Yes, it's a typo here.

@m3ngyang force-pushed the trainingjob-controller branch 2 times, most recently from f358323 to 8bada91 on April 8, 2018 05:40
)

// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned
// which is closed on one of these signals. If a second signal is caught, the program
Contributor

"If a second signal is caught, the program is terminated with exit code 1." --- this is a hardcoded rule that people don't expect in general. Do we really need it? (if the user really want to terminate it, he can send SIGKILL using command kill -9).

For graceful shutdown, I think it's typically done in the following way:

  1. set up a signal handler
  2. start a goroutine that does the real work, and block the main function
  3. when a signal is caught, trigger graceful shutdown (e.g., by closing a channel or cancelling a context)
  4. the main function returns normally when graceful shutdown is complete

An example: https://gist.github.com/peterhellberg/38117e546c217960747aacf689af3dc2
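A minimal sketch of that pattern (illustrative, not this PR's code):

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// 1. set up the signal handler
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	// 2. start a goroutine that does the real work
	stopCh := make(chan struct{})
	done := make(chan struct{})
	go func() {
		<-stopCh // real work would run here until stopCh is closed
		close(done)
	}()

	// 3. when a signal is caught, trigger graceful shutdown
	<-sigCh
	close(stopCh)

	// 4. main returns normally once shutdown is complete
	<-done
}

The process then exits cleanly without any hardcoded "second signal kills it" rule.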

// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, stopCh <-chan struct{}) error {
// TODO add a lock to ensure there is only one controller in the cluster
Contributor

Please put your name after TODO, e.g., // TODO(helin): ...

Collaborator Author

This comment is redundant, and I'll delete it. Leader election has been implemented to ensure that there is only one controller managing TrainingJob.
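For reference, a sketch of how that can be done with client-go's leaderelection package; the lock name, namespace, and timings are illustrative, and the API shown is the more recent client-go one, which may differ from what this PR actually uses.

package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection runs the controller only while this instance holds the lease.
func runWithLeaderElection(ctx context.Context, kubeCli kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "trainingjob-controller", Namespace: "kube-system"},
		Client:     kubeCli.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,       // only the elected leader runs the controller
			OnStoppedLeading: func() {}, // step down if the lease is lost
		},
	})
}

Any replica that is not the leader simply waits for the lease, so only one controller manages TrainingJob at a time.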

// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, stopCh <-chan struct{}) error {
// TODO add a lock to ensure there is only one controller in the cluster
Contributor

Do we need this TODO? Maybe it's the job of the binary (e.g., cmd/paddle_controller/paddle_controller.go) to ensure only one instance is running, rather than the library.


log.Info("Starting workers")
for i := 0; i < threadiness; i++ {
go wait.Until(c.runWorker, time.Second, stopCh)
Contributor
@helinwang Apr 9, 2018

Why do we need a worker pool? Goroutines are already multiplexed onto a thread pool by the Go runtime; perhaps we can have a single loop and start a new goroutine for each iteration.

Collaborator Author

Sorry, I don't understand the difference between the worker pool and the single loop. Here, wait.Until(f func(), period time.Duration, stopCh <-chan struct{}) loops until the stop channel is closed, running f every period.

Contributor
@helinwang Apr 12, 2018

No worries, let me explain.

Here is the current code:

for i := 0; i < threadiness; i++ {
	go wait.Until(c.runWorker, time.Second, stopCh)
}

func (c *TrainingJobController) runWorker() {
	for c.processNextWorkItem() {
	}
}

And there are threadiness goroutines running all of the work items (assume there are N of them). The max concurrency is threadiness.

Another implementation is:

go wait.Until(c.runWorker, time.Second, stopCh) // only a single wait.Until

func (c *TrainingJobController) runWorker() {
	for item := range c.itemCh {
		go c.process(item)
	}
}

In this way the max concurrency is N rather than threadiness. This eliminates the unnecessary configuration value threadiness and at the same time provides better concurrency.

In C++ developers usually use a thread pool, but in Go the runtime already runs goroutines on a thread pool very efficiently, based on a good scheduling algorithm. We don't have to worry about it.

Does it make sense to you?

Collaborator Author

Thanks for the detailed explanation. I consulted the implementations of some built-in controllers, such as the deployment controller and the job controller. These controllers use a configured number to bound how many objects are allowed to sync concurrently, rather than syncing all of them at once. Moreover, that sync concurrency can be managed through the controller-manager configuration, for example --concurrent-deployment-syncs int32.
To sum up, I think it's more reasonable to use threadiness as the max concurrency.

Contributor

I see, I did not realize that there was a need to limit the concurrency; now it makes sense. Thanks!

@m3ngyang
Collaborator Author

m3ngyang commented May 6, 2018

This PR is included in #24, so I'm closing it here.

@m3ngyang closed this May 6, 2018