add TaskFail interface #2719
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,27 +34,28 @@ type Chunk struct { | |
// Task is the basic unit of data instances assigned to trainers. | ||
type Task struct { | ||
ID int | ||
Epoch int | ||
Review comment: A Task is a unit of work dispatched to a Trainer; the Trainer does not seem to need to know which Epoch this is.
Reply: I opened an issue for this: #2752
Reply: I think a globally unique ID is enough to identify a Task. In the end we can count how many times the task was executed and which runs succeeded or failed. The Epoch-based timeout counting can be done when the task finishes, at which point the task's successes and failures can be tallied.
Reply: Let's discuss this in the issue; I have already replied there.
Reply: Maybe a change like this is more consistent with
type TaskMeta struct {
ID int
Epoch int
}
type Task struct {
Meta TaskMeta
Chunks []Chunk
}
func (s *Service) TaskFailed(meta TaskMeta, dummy *int) error {
}
Reply: Done. |
||
Chunks []Chunk | ||
} | ||
|
||
type taskEntry struct { | ||
Epoch int | ||
NumTimeout int | ||
Task Task | ||
Task Task | ||
// A task fails if it times out or the trainer reports that it exited abnormally. | ||
NumFailure int | ||
} | ||
|
||
type taskQueues struct { | ||
Todo []taskEntry | ||
Pending map[int]taskEntry // map from task ID to task entry | ||
Done []taskEntry | ||
Failed []Task | ||
Failed []taskEntry | ||
Review comment: 把 |
||
} | ||
|
||
// Service is the master server service. | ||
type Service struct { | ||
chunksPerTask int | ||
timeoutDur time.Duration | ||
timeoutMax int | ||
failureMax int | ||
ready chan struct{} | ||
store Store | ||
|
||
|
@@ -91,11 +92,11 @@ func partition(chunks []Chunk, chunksPerTask int) []taskEntry { | |
} | ||
|
||
// NewService creates a new service. | ||
func NewService(store Store, chunksPerTask int, timeoutDur time.Duration, timeoutMax int) (*Service, error) { | ||
func NewService(store Store, chunksPerTask int, timeoutDur time.Duration, failureMax int) (*Service, error) { | ||
s := &Service{} | ||
s.chunksPerTask = chunksPerTask | ||
s.timeoutDur = timeoutDur | ||
s.timeoutMax = timeoutMax | ||
s.failureMax = failureMax | ||
s.taskQueues = taskQueues{} | ||
s.taskQueues.Pending = make(map[int]taskEntry) | ||
s.ready = make(chan struct{}) | ||
|
@@ -257,6 +258,34 @@ func (s *Service) SetDataset(globPaths []string, dummy *int) error { | |
return nil | ||
} | ||
|
||
func (s *Service) procFailedTask(t taskEntry, epoch int) { | ||
Review comment: What does proc mean?
Reply: Done. |
||
if t.Task.Epoch != epoch { | ||
// new epoch, task launched after the | ||
// schedule of this timeout check or failed status report. | ||
return | ||
} | ||
|
||
defer func() { | ||
err := s.snapshot() | ||
if err != nil { | ||
log.Errorln(err) | ||
} | ||
}() | ||
|
||
delete(s.taskQueues.Pending, t.Task.ID) | ||
|
||
t.NumFailure++ | ||
if t.NumFailure > s.failureMax { | ||
log.Warningf("Task %v failed %d times, discard.", t.Task, t.NumFailure) | ||
s.taskQueues.Failed = append(s.taskQueues.Failed, t) | ||
return | ||
} | ||
|
||
log.Warningf("Task %v failed %d times, retry.", t.Task, t.NumFailure) | ||
s.taskQueues.Todo = append(s.taskQueues.Todo, t) | ||
return | ||
} | ||
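The retry-or-discard decision in procFailedTask can be tried in isolation. The following is a minimal sketch, not the PR's implementation: the `queues` type and the `processFailedTask` function are hypothetical stand-ins, and locking and snapshotting are left out on purpose. It assumes the same rule as the diff: a report carrying a stale epoch is ignored, and a task is discarded once its failure count exceeds `failureMax`.

```go
package main

import "fmt"

// Task and taskEntry mirror the shapes in this diff, trimmed to the
// fields the retry logic needs.
type Task struct {
	ID    int
	Epoch int
}

type taskEntry struct {
	Task       Task
	NumFailure int
}

// queues is a hypothetical stand-in for taskQueues.
type queues struct {
	Todo    []taskEntry
	Pending map[int]taskEntry
	Failed  []taskEntry
}

// processFailedTask is a hypothetical version of procFailedTask without
// locking or snapshots. It reports whether the task was re-queued.
func processFailedTask(q *queues, e taskEntry, epoch, failureMax int) (requeued bool) {
	if e.Task.Epoch != epoch {
		// Stale report: the task was dispatched again after this
		// timeout check or failure report was scheduled.
		return false
	}
	delete(q.Pending, e.Task.ID)
	e.NumFailure++
	if e.NumFailure > failureMax {
		// Too many failures: discard the task.
		q.Failed = append(q.Failed, e)
		return false
	}
	// Still within budget: put the task back for another attempt.
	q.Todo = append(q.Todo, e)
	return true
}

func main() {
	q := &queues{Pending: map[int]taskEntry{}}
	e := taskEntry{Task: Task{ID: 1, Epoch: 1}}
	q.Pending[e.Task.ID] = e

	fmt.Println("re-queued:", processFailedTask(q, e, 1, 1))
	fmt.Println("todo:", len(q.Todo), "failed:", len(q.Failed))
}
```

Because the entry is passed by value, the incremented counter only survives through the copy appended to `Todo` or `Failed`, which matches how the diff treats `taskEntry`.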
|
||
func (s *Service) checkTimeoutFunc(taskID int, epoch int) func() { | ||
return func() { | ||
s.mu.Lock() | ||
|
@@ -267,30 +296,7 @@ func (s *Service) checkTimeoutFunc(taskID int, epoch int) func() { | |
return | ||
} | ||
|
||
if t.Epoch != epoch { | ||
// new epoch, task launched after the | ||
// schedule of this timeout check. | ||
return | ||
} | ||
|
||
defer func() { | ||
err := s.snapshot() | ||
if err != nil { | ||
log.Errorln(err) | ||
} | ||
}() | ||
|
||
delete(s.taskQueues.Pending, t.Task.ID) | ||
|
||
t.NumTimeout++ | ||
if t.NumTimeout > s.timeoutMax { | ||
log.Warningf("Task %v timed out %d times, discard.", t.Task, t.NumTimeout) | ||
Review comment: This code may also be reached from a failure report, so it is not necessarily a timeout; a more general description would be better.
Reply: Done. |
||
s.taskQueues.Failed = append(s.taskQueues.Failed, t.Task) | ||
return | ||
} | ||
|
||
log.Warningf("Task %v timed out %d times, retry.", t.Task, t.NumTimeout) | ||
s.taskQueues.Todo = append(s.taskQueues.Todo, t) | ||
s.procFailedTask(t, epoch) | ||
} | ||
} | ||
|
||
|
@@ -339,7 +345,7 @@ func (s *Service) GetTask(dummy int, task *Task) error { | |
} | ||
|
||
t := s.taskQueues.Todo[0] | ||
t.Epoch++ | ||
t.Task.Epoch++ | ||
s.taskQueues.Todo = s.taskQueues.Todo[1:] | ||
s.taskQueues.Pending[t.Task.ID] = t | ||
err := s.snapshot() | ||
|
@@ -348,9 +354,9 @@ func (s *Service) GetTask(dummy int, task *Task) error { | |
} | ||
|
||
*task = t.Task | ||
Review comment: Delete unused
Reply: This line is the return value, so it is needed. :) |
||
log.WithFields(s.logFields()).Infof("Task #%d dispatched.", task.ID) | ||
log.WithFields(s.logFields()).Infof("Task #%v dispatched.", t) | ||
Review comment: A task contains
Reply: Done. |
||
|
||
time.AfterFunc(s.timeoutDur, s.checkTimeoutFunc(t.Task.ID, t.Epoch)) | ||
time.AfterFunc(s.timeoutDur, s.checkTimeoutFunc(t.Task.ID, t.Task.Epoch)) | ||
return nil | ||
} | ||
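The epoch passed to checkTimeoutFunc is what keeps an old timer from penalizing a task that has since been dispatched again. A minimal sketch of that guard follows, assuming a hypothetical `tracker` type in place of the real Service; only `time.AfterFunc` and the epoch comparison are taken from the diff.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker is a hypothetical, minimal stand-in for the master Service.
type tracker struct {
	mu       sync.Mutex
	epochs   map[int]int // task ID -> epoch of the current dispatch
	timedOut []int       // IDs whose current dispatch timed out
}

// dispatch bumps the task's epoch and returns the timeout check bound
// to that epoch, similar in spirit to GetTask.
func (tr *tracker) dispatch(taskID int) func() {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.epochs[taskID]++
	return tr.checkTimeoutFunc(taskID, tr.epochs[taskID])
}

// checkTimeoutFunc returns a callback that only acts if the task is
// still on the epoch it was scheduled for.
func (tr *tracker) checkTimeoutFunc(taskID, epoch int) func() {
	return func() {
		tr.mu.Lock()
		defer tr.mu.Unlock()
		cur, ok := tr.epochs[taskID]
		if !ok || cur != epoch {
			return // unknown task, or dispatched again since scheduling
		}
		tr.timedOut = append(tr.timedOut, taskID)
	}
}

func main() {
	tr := &tracker{epochs: map[int]int{}}
	stale := tr.dispatch(1) // first dispatch, epoch 1
	fresh := tr.dispatch(1) // second dispatch, epoch 2

	time.AfterFunc(10*time.Millisecond, stale)
	time.AfterFunc(10*time.Millisecond, fresh)
	time.Sleep(100 * time.Millisecond)

	tr.mu.Lock()
	fmt.Println("timeouts recorded:", len(tr.timedOut))
	tr.mu.Unlock()
}
```

Only the callback bound to the latest epoch records a timeout; the earlier one returns without touching any state.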
|
||
|
@@ -371,7 +377,7 @@ func (s *Service) TaskFinished(taskID int, dummy *int) error { | |
} | ||
|
||
// task finished, reset timeout | ||
t.NumTimeout = 0 | ||
t.NumFailure = 0 | ||
s.taskQueues.Done = append(s.taskQueues.Done, t) | ||
delete(s.taskQueues.Pending, taskID) | ||
|
||
|
@@ -389,3 +395,29 @@ func (s *Service) TaskFinished(taskID int, dummy *int) error { | |
} | ||
return err | ||
} | ||
|
||
// TaskID is the struct a client uses to report a failure. | ||
type TaskID struct { | ||
ID int | ||
Epoch int | ||
} | ||
|
||
// TaskFailed tells the service that a task has failed. | ||
func (s *Service) TaskFailed(taskID TaskID, dummy *int) error { | ||
select { | ||
case <-s.ready: | ||
} | ||
|
||
s.mu.Lock() | ||
defer s.mu.Unlock() | ||
|
||
t, ok := s.taskQueues.Pending[taskID.ID] | ||
if !ok { | ||
err := errors.New("pending task not found") | ||
log.WithFields(s.logFields()).Warningf("TaskFailed: pending task #%v not found.", taskID) | ||
return err | ||
Review comment: Should we return an error here? I think it's normal if that task is no longer pending (completed by other workers).
Reply: Done. |
||
} | ||
|
||
s.procFailedTask(t, taskID.Epoch) | ||
return nil | ||
} |
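The TaskFailed signature follows the method shape that Go's standard net/rpc package expects, so a trainer-side call would address it as "Service.TaskFailed". The sketch below assumes net/rpc is the transport; the `Service` here is a hypothetical stub that only records reports, and `reportFailure` is an illustrative helper that wires client and server together over an in-memory pipe.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"sync"
)

// TaskID matches the struct added in this diff.
type TaskID struct {
	ID    int
	Epoch int
}

// Service is a hypothetical stub, not the master service from the PR.
type Service struct {
	mu      sync.Mutex
	reports []TaskID
}

// TaskFailed records the report. The real method would look up the
// pending task and hand it to the failure handling logic.
func (s *Service) TaskFailed(taskID TaskID, dummy *int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reports = append(s.reports, taskID)
	return nil
}

// count returns how many reports the stub has received.
func (s *Service) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.reports)
}

// reportFailure serves svc on one end of an in-memory pipe and calls
// "Service.TaskFailed" from the other end.
func reportFailure(svc *Service, id TaskID) error {
	server := rpc.NewServer()
	if err := server.Register(svc); err != nil {
		return err
	}
	clientConn, serverConn := net.Pipe()
	go server.ServeConn(serverConn)

	client := rpc.NewClient(clientConn)
	defer client.Close()

	var dummy int
	return client.Call("Service.TaskFailed", id, &dummy)
}

func main() {
	svc := &Service{}
	if err := reportFailure(svc, TaskID{ID: 1, Epoch: 1}); err != nil {
		fmt.Println("report failed:", err)
		return
	}
	fmt.Println("reports received:", svc.count())
}
```

The service name in the call string comes from the registered type's name, which is why the method must be addressed with the "Service." prefix.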
Review comment: Should be "Service.TaskFailed" :)
Reply: Embarrassing. Done.
Reply: Added a test case.
Reply: Sorry, I missed so many mistakes when reviewing.
Reply: @typhoonzero No worries, that's why we have multiple developers reviewing; we all make mistakes :)
Reply: How embarrassing. I should have added a unit test. Blind confidence is unacceptable, and it is not how we do things. My apologies.