implement trainingjob controller without autoscaler #18
Conversation
"github.com/paddlepaddle/edl/pkg/updater" | ||
) | ||
|
||
type TrainingJobController struct { |
Exported types must have comments. Curiously, this should have been caught by the linter.
Right, I will add some comments.
"sync" | ||
"time" | ||
|
||
"github.com/golang/glog" |
We have recently been preferring log15.
Is there any difference between those two log packages?
eventBroadcaster.StartLogging(glog.Infof)
eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: kubeCli.CoreV1().Events("")})
workqueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "TrainingJob")
recorder := eventBroadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "TrainingJobController"})
Maybe we need some document or a sample controller to inform these concepts in a controller:
- informer
- event broadcaster
- workqueue
- recorder
defer c.workqueue.ShutDown()

glog.Info("Starting trainingjob controller")
glog.Info("Starting to create custom resource definition")
Maybe we can reduce some of the info logs and output an info log only when the operation succeeds or fails.
Maybe we can use different log levels for them; abundant logs are useful when debugging.
}
}

func (c *TrainingJobController) processNestWorkItem() bool {
Looks like the name should be processNextWorkItem
Yes, it's a typo here.
)

// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned
// which is closed on one of these signals. If a second signal is caught, the program
"If a second signal is caught, the program is terminated with exit code 1." --- this is a hardcoded rule that people don't expect in general. Do we really need it? (if the user really want to terminate it, he can send SIGKILL using command kill -9
).
For graceful shutdown, I think typically it's done in the following way:
- setup signal handler
- start a goroutine that does the real work, block main function
- when signal is caught, trigger graceful shutdown (e.g., closing a channel, cancel a context).
- the main function will return normally when graceful shutdown is complete.
An example: https://gist.github.com/peterhellberg/38117e546c217960747aacf689af3dc2
// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, stopCh <-chan struct{}) error {
	// TODO add a lock to ensure there is only one controller in the cluster
Please put your name after TODO, e.g., // TODO(helin): ...
This comment is redundant, and I'll delete it. Leader election has been implemented to ensure that there is only one controller managing TrainingJob.
Do we need this TODO? Maybe it's the job of the binary (e.g., cmd/paddle_controller/paddle_controller.go) to ensure only one instance is running, rather than the library.
log.Info("Starting workers")
for i := 0; i < threadiness; i++ {
	go wait.Until(c.runWorker, time.Second, stopCh)
Why do we need a worker pool? Goroutines are already multiplexed onto a thread pool by the Go runtime; perhaps we can have a single loop and start a new goroutine for each iteration.
Sorry, I don't understand the difference between the worker pool and the single loop. Here, wait.Until(f func(), period time.Duration, stopCh <-chan struct{}) loops until the stop channel is closed, running f every period.
No worries, let me explain.
Here is the current code:
for i := 0; i < threadiness; i++ {
	go wait.Until(c.runWorker, time.Second, stopCh)
}

func (c *TrainingJobController) runWorker() {
	for c.processNextWorkItem() {
	}
}
And there are threadiness goroutines running all of the work items (assume there are N of them). The max concurrency is threadiness.
Another implementation is:
go wait.Until(c.runWorker, time.Second, stopCh) // only a single wait.Until
func (c *TrainingJobController) runWorker() {
	for item := range c.itemCh {
		go c.process(item)
	}
}
In this way the max concurrency is N rather than threadiness. This eliminates the unnecessary configuration value threadiness and at the same time provides better concurrency.
In C++ developers usually use a thread pool, but in Go the runtime already uses a thread pool to run goroutines very efficiently, based on a very good scheduling algorithm. We don't have to worry about it.
Does it make sense to you?
Thanks for your elaborate explanation. I consulted the implementations of some built-in controllers, such as the deployment controller and the job controller. These controllers usually use a configured number to indicate the max number of objects allowed to sync concurrently, rather than syncing everything at once. Besides, the sync number can be managed through the controller-manager configuration, for example --concurrent-deployment-syncs int32.
To sum up, I think it's more reasonable to use threadiness as the max concurrency.
I see, I did not realize that there is a need for limiting the concurrency, now it makes sense. Thanks!
This PR is included in #24.
This pull request implements a training job controller without autoscaler, which registers a self-defined resource called TrainingJob and then watches TrainingJob events to manage the lifecycle of TrainingJob instances.