
rm log dir default #9552

Merged

merged 5 commits into from Dec 8, 2022

Conversation

shangguanshiyuan
Contributor

@shangguanshiyuan shangguanshiyuan commented Dec 6, 2022

When a Python script is executed directly, no log folder is created by default; glog output goes to stderr, with a default level of WARN.
When the environment variable ONEFLOW_DEBUG_MODE=1 is set, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.

When multiple processes are launched via distributed.launch, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.
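The rules above can be sketched in Python roughly as follows. This is an illustrative mock built on the standard `logging` module, not the actual OneFlow/glog implementation; the helper name `configure_logging` and the `launched_by_launcher` flag are invented for the example.

```python
import logging
import os


def configure_logging(launched_by_launcher=False):
    """Illustrative sketch of the logging policy described above.

    NOT the real OneFlow implementation (which configures glog); this
    mock only demonstrates when a log directory would be created and
    which levels apply.
    """
    # The description above uses ONEFLOW_DEBUG_MODE=1; here we simply
    # compare against "1" as an assumption.
    debug_mode = os.getenv("ONEFLOW_DEBUG_MODE") == "1"

    # Screen (stderr) output defaults to WARN in every mode.
    stderr_handler = logging.StreamHandler()
    stderr_handler.setLevel(logging.WARNING)
    handlers = [stderr_handler]

    # A log directory with INFO-level file logging is only set up in
    # debug mode, or when a multi-process launcher started this process.
    if debug_mode or launched_by_launcher:
        os.makedirs("log", exist_ok=True)
        file_handler = logging.FileHandler(os.path.join("log", "oneflow.log"))
        file_handler.setLevel(logging.INFO)
        handlers.append(file_handler)

    logging.basicConfig(level=logging.INFO, handlers=handlers, force=True)
    return handlers
```

With the environment variable unset and no launcher, only the stderr handler exists and no `log` directory is touched; setting ONEFLOW_DEBUG_MODE=1 adds the INFO-level file handler.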

@github-actions
Contributor

github-actions bot commented Dec 6, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.7ms (= 13969.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.0ms (= 16104.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.0ms / 139.7ms)

OneFlow resnet50 time: 85.0ms (= 8502.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.3ms (= 10129.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 101.3ms / 85.0ms)

OneFlow resnet50 time: 57.4ms (= 11487.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.1ms (= 15829.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 79.1ms / 57.4ms)

OneFlow resnet50 time: 44.8ms (= 8956.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.1ms (= 14022.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.57 (= 70.1ms / 44.8ms)

OneFlow resnet50 time: 41.2ms (= 8234.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.7ms (= 13541.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 67.7ms / 41.2ms)

@github-actions github-actions bot removed the automerge label Dec 6, 2022
@github-actions
Contributor

github-actions bot commented Dec 6, 2022

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions
Contributor

github-actions bot commented Dec 7, 2022

Speed stats:

@github-actions
Contributor

github-actions bot commented Dec 7, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.6ms (= 13956.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.3ms (= 16032.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.3ms / 139.6ms)

OneFlow resnet50 time: 85.2ms (= 8516.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.9ms (= 10191.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.9ms / 85.2ms)

OneFlow resnet50 time: 58.1ms (= 11616.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.1ms (= 15615.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 78.1ms / 58.1ms)

OneFlow resnet50 time: 44.6ms (= 8911.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.9ms (= 14175.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 70.9ms / 44.6ms)

OneFlow resnet50 time: 40.0ms (= 8000.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15015.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.88 (= 75.1ms / 40.0ms)

@shangguanshiyuan shangguanshiyuan requested review from oneflow-ci-bot and removed request for oneflow-ci-bot December 8, 2022 04:50
@github-actions
Contributor

github-actions bot commented Dec 8, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.9ms (= 13985.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.3ms (= 16828.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 168.3ms / 139.9ms)

OneFlow resnet50 time: 85.3ms (= 8525.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.0ms (= 10199.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.0ms / 85.3ms)

OneFlow resnet50 time: 57.8ms (= 11563.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.6ms (= 15718.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.6ms / 57.8ms)

OneFlow resnet50 time: 45.1ms (= 9012.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 14400.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 72.0ms / 45.1ms)

OneFlow resnet50 time: 41.5ms (= 8293.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.5ms (= 13305.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 66.5ms / 41.5ms)

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9552/

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080

❌ OneFlow resnet50 time: 151.1ms (= 15114.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 175.5ms (= 17552.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 175.5ms / 151.1ms)

OneFlow resnet50 time: 96.3ms (= 9626.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.1ms (= 11312.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 113.1ms / 96.3ms)

OneFlow resnet50 time: 69.9ms (= 13977.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.8ms (= 17569.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 87.8ms / 69.9ms)

OneFlow resnet50 time: 59.9ms (= 11982.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.3ms (= 14863.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 74.3ms / 59.9ms)

OneFlow resnet50 time: 54.7ms (= 10946.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14053.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 70.3ms / 54.7ms)

@github-actions
Contributor

github-actions bot commented Dec 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9552/

@jackalcooper jackalcooper merged commit 03ece9b into master Dec 8, 2022
@jackalcooper jackalcooper deleted the rm_log_dir_default branch December 8, 2022 06:28
@strint
Contributor

strint commented Dec 8, 2022

When multiple processes are launched via distributed.launch, the log folder is created; logs written to files default to level INFO, while screen output defaults to WARN.

In the multi-GPU case, the log dir is no longer created now either, right?

@jackalcooper
Collaborator

@shangguanshiyuan Avoiding the log directory is actually the more common need in the multi-GPU case: when multi-node training runs on top of a distributed file system, writes to the log directory trigger file-system synchronization, which eats network bandwidth or causes contention and blocking (different processes often have to write their logs to different paths just to avoid the contention). A user running on a supercomputer hit exactly this problem before.

@shangguanshiyuan
Contributor Author

In the multi-GPU case, the log dir is no longer created now either, right?

If the multi-GPU job is launched via distributed.launch, the log dir is still created.

@strint
Contributor

strint commented Dec 8, 2022

In the multi-GPU case, the log dir is no longer created now either, right?

If the multi-GPU job is launched via distributed.launch, the log dir is still created.

Hmm, where is the logic that decides this?

@shangguanshiyuan
Contributor Author

Avoiding the log directory is actually the more common need in the multi-GPU case: when multi-node training runs on top of a distributed file system, writes to the log directory trigger file-system synchronization, which eats network bandwidth or causes contention and blocking (different processes often have to write their logs to different paths just to avoid the contention). A user running on a supercomputer hit exactly this problem before.

Then how about making distributed.launch consistent with direct execution: never create the directory by default, and only create it in debug mode?
The directory was created under distributed.launch because screen output used to default to INFO, and having every process print that to the screen was too noisy; now that the default is WARN, that is no longer a concern.

@jackalcooper
Collaborator

Then how about making distributed.launch consistent with direct execution?

That works well.

@strint
Contributor

strint commented Dec 8, 2022

W20221208 15:29:38.473420 2558583 rpc_client.cpp:190] LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresse
LOG(WARNING) << "LoadServer " << request.addr() << " Failed at " << retry_idx << " times"
                   << " error_code " << st.error_code() << " error_message " << st.error_message();

In the multi-GPU case, this message could also be improved: only emit the warning once retry_idx > 5, say?
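The suggested throttling could look roughly like this. The real code is C++ in rpc_client.cpp; this Python mock (the function name and the threshold constant are invented for illustration) only demonstrates the intended behavior: early retries, which are expected while peer processes start up, are logged quietly, and only persistent failures are surfaced as warnings.

```python
import logging

# Suggested cutoff from the comment above; an illustrative constant,
# not taken from the codebase.
RETRY_WARN_THRESHOLD = 5


def report_load_server_retry(addr, retry_idx, error_code, error_message):
    """Sketch of the throttled LoadServer retry message.

    Returns which level was used so the behavior is easy to check.
    """
    msg = ("LoadServer %s Failed at %d times error_code %s error_message %s"
           % (addr, retry_idx, error_code, error_message))
    if retry_idx > RETRY_WARN_THRESHOLD:
        # Persistent failure: worth warning the user about.
        logging.warning(msg)
        return "warn"
    # Early retries are normal during multi-process startup; keep quiet.
    logging.info(msg)
    return "info"
```

With this scheme the `Failed at 0 times` line from the log excerpt above would go to INFO instead of WARNING, so a default-WARN screen stays clean during normal startup.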
