How to improve training log storage/view #335

Open
Yancey1989 opened this issue Aug 22, 2017 · 1 comment

@Yancey1989
Collaborator

Yancey1989 commented Aug 22, 2017

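Thanks to @xinghai-sun for the feedback. The current way of storing and viewing logs is inconvenient in several respects. I also discussed this offline with @typhoonzero; the notes are below: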

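Current pain points:

  1. When a job is killed, its logs are lost as well, so the failure cannot be investigated afterwards.
  2. paddlecloud logs shows only part of the log content.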

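Causes:

When a job is killed, its logs are lost as well, so the failure cannot be investigated afterwards

Training job logs currently use the Docker container's native log storage, so when a container is killed its logs are removed along with it.

paddlecloud logs shows only part of the log content

On investigation, the cluster's Docker log driver is configured to journald by default, and journald rotates its logs periodically (after a very short time).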

Proposed solution

Storage

  1. Redirect training logs to each job's directory in PFS, for example /pfs/dlnel/home/<user>/jobs/<job-name>/logs
  2. Store each pod's training log separately, for example /pfs/dlnel/home/<user>/jobs/<job-name>/logs/<pod-id>.log
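The per-pod layout described in the storage list can be sketched as below. This is only an illustration, not the PaddleCloud implementation: the helper names `pod_log_path` and `run_with_log` are hypothetical, and the sketch assumes the PFS home directory is mounted as an ordinary filesystem path inside the pod.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def pod_log_path(pfs_home: str, user: str, job_name: str, pod_id: str) -> Path:
    """Build <pfs_home>/<user>/jobs/<job-name>/logs/<pod-id>.log (hypothetical helper)."""
    return Path(pfs_home) / user / "jobs" / job_name / "logs" / f"{pod_id}.log"


def run_with_log(cmd: Sequence[str], log_path: Path) -> int:
    """Run cmd with stdout and stderr appended to log_path; return the exit code."""
    # Create the job's logs directory on first use.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode, so a restarted pod does not overwrite its earlier output.
    with open(log_path, "ab") as log_file:
        result = subprocess.run(list(cmd), stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode
```

Because the file lives outside the container, it would remain available after the container is killed, which is the point of the proposal.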

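Viewing

  1. PFS supports head/tail on files, and also supports downloading files to a local machine for viewing.
  2. The existing pcloud logs command will call the PFS tail interface to view the file.
  3. Rate limiting: for data security, apply appropriate bandwidth/traffic limits to file downloads and log viewing. Related issue: How to prevent Data leaks #332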
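As a rough illustration of a tail-style view with a traffic cap, the sketch below reads at most a fixed number of bytes from the end of a log file and returns the last few lines. It is not the PFS tail interface: `tail_lines` and its `max_bytes` parameter are hypothetical names, and a byte cap per request is only one simple way to bound how much data a single view can pull.

```python
import os
from typing import List


def tail_lines(path: str, n: int = 10, max_bytes: int = 64 * 1024) -> List[str]:
    """Return up to the last n lines of path, reading at most max_bytes from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Never read more than max_bytes, however large the log has grown.
        start = max(0, size - max_bytes)
        f.seek(start)
        data = f.read()
    lines = data.splitlines()
    if start > 0 and lines:
        # The read began partway through the file, so the first line may be
        # cut mid-line; drop it rather than return a fragment.
        lines = lines[1:]
    return [line.decode("utf-8", errors="replace") for line in lines[-n:]]
```

A call such as `tail_lines(log_file, n=100)` would then back a command like pcloud logs, while downloads of the full file could be limited separately.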

@xinghai-sun @wanghaoshuang, could you also check whether this approach meets your training needs?

@xinghai-sun

Great, I think this is much more convenient! No more saving logs manually!


6 participants