
rm log dir default #9552

Merged

merged 5 commits into from Dec 8, 2022

Conversation

shangguanshiyuan
Contributor

@shangguanshiyuan shangguanshiyuan commented Dec 6, 2022

When a Python script is executed directly, no log folder is created by default; glog output goes to stderr, with a default level of WARN.
When the environment variable ONEFLOW_DEBUG_MODE=1 is set, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.

When multiple processes are launched via distributed.launch, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.
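The rules above can be sketched in Python roughly as follows. This is an illustrative mock built on the standard `logging` module, not the actual OneFlow/glog implementation; the helper name `configure_logging` and the `launched_by_launcher` flag are invented for the example.

```python
import logging
import os


def configure_logging(launched_by_launcher=False):
    """Illustrative sketch of the logging policy described above.

    NOT the real OneFlow implementation (which configures glog); this
    mock only demonstrates when a log directory would be created and
    which levels apply.
    """
    # The description above uses ONEFLOW_DEBUG_MODE=1; here we simply
    # compare against "1" as an assumption.
    debug_mode = os.getenv("ONEFLOW_DEBUG_MODE") == "1"

    # Screen (stderr) output defaults to WARN in every mode.
    stderr_handler = logging.StreamHandler()
    stderr_handler.setLevel(logging.WARNING)
    handlers = [stderr_handler]

    # A log directory with INFO-level file logging is only set up in
    # debug mode, or when a multi-process launcher started this process.
    if debug_mode or launched_by_launcher:
        os.makedirs("log", exist_ok=True)
        file_handler = logging.FileHandler(os.path.join("log", "oneflow.log"))
        file_handler.setLevel(logging.INFO)
        handlers.append(file_handler)

    logging.basicConfig(level=logging.INFO, handlers=handlers, force=True)
    return handlers
```

With the environment variable unset and no launcher, only the stderr handler exists and no `log` directory is touched; setting ONEFLOW_DEBUG_MODE=1 adds the INFO-level file handler.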

@github-actions
Contributor

github-actions bot commented Dec 6, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.7ms (= 13969.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.0ms (= 16104.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.0ms / 139.7ms)

OneFlow resnet50 time: 85.0ms (= 8502.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.3ms (= 10129.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 101.3ms / 85.0ms)

OneFlow resnet50 time: 57.4ms (= 11487.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.1ms (= 15829.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 79.1ms / 57.4ms)

OneFlow resnet50 time: 44.8ms (= 8956.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.1ms (= 14022.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.57 (= 70.1ms / 44.8ms)

OneFlow resnet50 time: 41.2ms (= 8234.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.7ms (= 13541.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 67.7ms / 41.2ms)

@github-actions github-actions bot removed the automerge label Dec 6, 2022
@github-actions
Contributor

github-actions bot commented Dec 6, 2022

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions
Contributor

github-actions bot commented Dec 7, 2022

Speed stats:

@github-actions
Contributor

github-actions bot commented Dec 7, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.6ms (= 13956.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.3ms (= 16032.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.3ms / 139.6ms)

OneFlow resnet50 time: 85.2ms (= 8516.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.9ms (= 10191.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.9ms / 85.2ms)

OneFlow resnet50 time: 58.1ms (= 11616.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.1ms (= 15615.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 78.1ms / 58.1ms)

OneFlow resnet50 time: 44.6ms (= 8911.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.9ms (= 14175.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 70.9ms / 44.6ms)

OneFlow resnet50 time: 40.0ms (= 8000.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15015.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.88 (= 75.1ms / 40.0ms)

@shangguanshiyuan shangguanshiyuan requested review from oneflow-ci-bot and removed request for oneflow-ci-bot December 8, 2022 04:50
@github-actions
Contributor

github-actions bot commented Dec 8, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.9ms (= 13985.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.3ms (= 16828.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 168.3ms / 139.9ms)

OneFlow resnet50 time: 85.3ms (= 8525.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.0ms (= 10199.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.0ms / 85.3ms)

OneFlow resnet50 time: 57.8ms (= 11563.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.6ms (= 15718.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.6ms / 57.8ms)

OneFlow resnet50 time: 45.1ms (= 9012.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 14400.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 72.0ms / 45.1ms)

OneFlow resnet50 time: 41.5ms (= 8293.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.5ms (= 13305.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 66.5ms / 41.5ms)

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9552/

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080

❌ OneFlow resnet50 time: 151.1ms (= 15114.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 175.5ms (= 17552.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 175.5ms / 151.1ms)

OneFlow resnet50 time: 96.3ms (= 9626.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.1ms (= 11312.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 113.1ms / 96.3ms)

OneFlow resnet50 time: 69.9ms (= 13977.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.8ms (= 17569.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 87.8ms / 69.9ms)

OneFlow resnet50 time: 59.9ms (= 11982.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.3ms (= 14863.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 74.3ms / 59.9ms)

OneFlow resnet50 time: 54.7ms (= 10946.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14053.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 70.3ms / 54.7ms)

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9552/

@jackalcooper jackalcooper merged commit 03ece9b into master Dec 8, 2022
@jackalcooper jackalcooper deleted the rm_log_dir_default branch December 8, 2022 06:28
@strint
Contributor

strint commented Dec 8, 2022

When multiple processes are launched via distributed.launch, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.

In the multi-GPU case, the log dir is no longer created now either, right?

@jackalcooper
Collaborator

@shangguanshiyuan Avoiding the log directory is actually the more common need in the multi-GPU case: when multi-node training runs on top of a distributed file system, writes to the log directory trigger file-system synchronization, which eats network bandwidth or causes contention and blocking (different processes often have to write their logs to different paths just to avoid the contention). A user running on a supercomputer hit exactly this problem before.

@shangguanshiyuan
Contributor Author

In the multi-GPU case, the log dir is no longer created now either, right?

If the multi-GPU job is launched via distributed.launch, the log dir is still created.

@strint
Contributor

strint commented Dec 8, 2022

In the multi-GPU case, the log dir is no longer created now either, right?

If the multi-GPU job is launched via distributed.launch, the log dir is still created.

Hmm, where is the logic that decides this?

@shangguanshiyuan
Contributor Author

Avoiding the log directory is actually the more common need in the multi-GPU case: when multi-node training runs on top of a distributed file system, writes to the log directory trigger file-system synchronization, which eats network bandwidth or causes contention and blocking (different processes often have to write their logs to different paths just to avoid the contention). A user running on a supercomputer hit exactly this problem before.

Then how about making distributed.launch consistent with direct execution: never create the directory by default, and only create it in debug mode?
The directory was created under distributed.launch because screen output used to default to INFO, and having every process print that to the screen was too noisy; now that the default is WARN, that is no longer a concern.

@jackalcooper
Collaborator

Then how about making distributed.launch consistent with direct execution?

That works well.

@strint
Contributor

strint commented Dec 8, 2022

W20221208 15:29:38.473420 2558583 rpc_client.cpp:190] LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresse
LOG(WARNING) << "LoadServer " << request.addr() << " Failed at " << retry_idx << " times"
                   << " error_code " << st.error_code() << " error_message " << st.error_message();

In the multi-GPU case, this message could also be improved: only emit the warning once retry_idx > 5, say?
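The suggested throttling could look roughly like this. The real code is C++ in rpc_client.cpp; this Python mock (the function name and the threshold constant are invented for illustration) only demonstrates the intended behavior: early retries, which are expected while peer processes start up, are logged quietly, and only persistent failures are surfaced as warnings.

```python
import logging

# Suggested cutoff from the comment above; an illustrative constant,
# not taken from the codebase.
RETRY_WARN_THRESHOLD = 5


def report_load_server_retry(addr, retry_idx, error_code, error_message):
    """Sketch of the throttled LoadServer retry message.

    Returns which level was used so the behavior is easy to check.
    """
    msg = ("LoadServer %s Failed at %d times error_code %s error_message %s"
           % (addr, retry_idx, error_code, error_message))
    if retry_idx > RETRY_WARN_THRESHOLD:
        # Persistent failure: worth warning the user about.
        logging.warning(msg)
        return "warn"
    # Early retries are normal during multi-process startup; keep quiet.
    logging.info(msg)
    return "info"
```

With this scheme the `Failed at 0 times` line from the log excerpt above would go to INFO instead of WARNING, so a default-WARN screen stays clean during normal startup.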
