Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add logs and fix a bug #5074

Merged
merged 9 commits into from
Oct 27, 2017
Merged

Conversation

gongweibao
Copy link
Contributor

@gongweibao gongweibao commented Oct 25, 2017

if err.Error() == ErrPassAfter.Error() {
// to prevent too many logs
if i%60 == 0 {
log.Debug(fmt.Sprintf("getTask passID:%d error.", passID), log.Ctx{"error": err})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.Debugf or just log.Debug(a, b) will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if i%60 == 0 {
log.Debug("getTask of passID error.",
log.Ctx{"error": err, "passID": passID})
i = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here is not i = 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

确实。Done。

@@ -132,21 +132,19 @@ func (c *Client) getRecords(passID int) {
break
}

i := 0
if err.Error() == ErrPassAfter.Error() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is still needed, or else the other unknown error may also cause sleep and wait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// if err.Error() == ErrPassAfter.Error()
//      wait util last pass finishes
// if other error such as network error
//      wait to reconnect or task time out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrPassAfter我理解应该是明确要等待。如果是其他不可恢复的错误,这里应该记录日志并且panic之类的,不然可能导致job hang住?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以从几个方面来考虑:

  1. 从接口来说,pass_num是一个全局量,如果出现错误就退出的话,client会去尝试下一个pass_num. https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/creator.py#L113
  2. 从流程上来说,task应该得到这个pass_num的数据,如果出错,而错误不是ErrPassBefore ErrNoMoreAvailable ErrAllTaskFailed,他应该继续等待。如果不hang住,客户端的动作应该是?我觉得还是重试。

time.Sleep(time.Second * 3)
continue

if i%60 == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not use i and the magic value 60, just log error every time it happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那样日志会很多。满屏都是一个日志。

Copy link
Contributor

@helinwang helinwang Oct 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

原来如此,好的。

@@ -117,6 +117,9 @@ func TestNextRecord(t *testing.T) {
if e != nil {
panic(e)
}

c.StartGetRecords(100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只是为了做什么的,能否在文件里comment说明一下?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个当时是想通过调用c.StartGetRecords(100)来观察日志打印了,而且没有打印太多,但现在发现这个放到unittest中也不能捕获不符合要求的错误,只能观察,所以还是去掉吧。

@@ -107,7 +107,12 @@ def cloud_reader(paths, etcd_endpoints, timeout_sec=5, buf_size=64):
import cPickle as pickle
import paddle.v2.master as master
c = master.client(etcd_endpoints, timeout_sec, buf_size)
c.set_dataset(paths)

if isinstance(paths, basestring):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^_^

helinwang
helinwang previously approved these changes Oct 26, 2017
Copy link
Contributor

@helinwang helinwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -34,5 +34,6 @@ def test_check_grad(self):
self.check_grad(['X', 'C_prev'], ['C', 'H'])


if __name__ == "__main__":
unittest.main()
# TODO(gongwb):fix CI error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should not exist in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gongweibao gongweibao merged commit 8c9119a into PaddlePaddle:develop Oct 27, 2017
@gongweibao gongweibao deleted the fixdisbugs2 branch October 27, 2017 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants