Questions about the training efficiency #57

Closed
XiaoqiangZhou opened this issue Dec 10, 2021 · 9 comments

@XiaoqiangZhou

Thanks for releasing the code of SwinIR, which is really great work for low-level vision tasks.

However, when I train the SwinIR model following the guidance provided in the repo, I find the training efficiency is relatively low.

Specifically, the GPU utilization drops to 0 for a while from time to time (roughly 14 seconds of computation followed by 14 seconds of idling). When the GPU utilization is 0, the CPU utilization is also 0. It's worth noting that I use DDP training on 8 TITAN RTX GPU cards with the default batch_size. I train the classic SR task on the DIV2K dataset at the X2 scale. After half a day of training, the epoch, iteration, and PSNR on Set5 are about 1500, 42000, and 35.73 dB, respectively. So it will take about 5 days to finish the 500k iterations, far exceeding the 2 days reported in the README.

Could you please help me figure out the reason for the low training efficiency?

@JingyunLiang
Owner

It's strange. When I use DDP, the GPU utilization fluctuates but is always high (70%-100%). Can you try to train the model using one GPU and check the GPU utilization?

@XiaoqiangZhou
Author

XiaoqiangZhou commented Dec 10, 2021

@JingyunLiang Thanks for your quick reply~

Following your suggestion, I tried training the model with one GPU to check the GPU utilization.
After changing gpu_ids in the config file from [0,1,2,3,4,5,6,7] to [0] and changing the corresponding dataloader_batch_size from 32 to 4, I tried two ways of training the model, i.e., DDP and DP, with one TITAN RTX GPU card.

When I use DP, by running python main_train_psnr.py --opt options/swinir/the_config_file.json, the GPU utilization stays around 50%. Maybe I can try increasing the batch_size to fully utilize the GPU capacity under DP mode?

When I use DDP, by running python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 main_train_psnr.py --opt options/swinir/the_config_file.json --dist True, the low-efficiency phenomenon still exists. So the problem may be caused by the DDP training process. I may try adjusting some other configurations, such as dataloader_num_workers, if I insist on using DDP mode. Do you have any other suggestions?
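For reference, here is a rough sketch of the fields I am touching in the options JSON (illustrative values only; the exact nesting and defaults in options/swinir/*.json may differ):

```json
{
  "gpu_ids": [0],
  "datasets": {
    "train": {
      "dataloader_batch_size": 4,
      "dataloader_num_workers": 8
    }
  }
}
```

If the DDP workers are starving the GPU on data loading, raising dataloader_num_workers is probably the first knob I would try.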

I will try to solve this problem in the coming days and update my progress here. If there is no progress on the GPU utilization, I will close this issue.

Thanks.

@XiaoqiangZhou
Author

By the way, I'm using torch==1.7.0.

@Priahi

Priahi commented Dec 18, 2021

@XiaoqiangZhou any update on this? I am also facing similarly slow training. With batch size 16 and 1000 iterations per epoch, it takes about 1000 seconds to run a single epoch. Any insights on this, @JingyunLiang?

@blackcow

I have the same problem; GPU utilization is very low.

@lucky-zwx

Has anyone solved this problem?

@songwg188

You can try cropping the large images in the training set into small sub-images, because a lot of I/O time is spent reading the high-resolution images.

After using this method, the GPU utilization no longer dropped to 0 during my training. If the problem persists, the CPU performance of the server is probably insufficient.
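For anyone who wants to try this, here is a minimal sketch of such a cropping script (this is not a script from the repo; the folder names, patch size, and step are placeholder values to adjust, and the same cropping would typically be applied to both the HR folder and, with the size divided by the scale factor, the LR folder):

```python
# Minimal sketch: crop large training images into fixed-size sub-images so that
# each training sample only needs a small file read. Paths and sizes below are
# placeholder assumptions, not values taken from the repo.
import os
from glob import glob
from PIL import Image

SRC_DIR = "trainsets/DIV2K_train_HR"      # placeholder: folder with large images
DST_DIR = "trainsets/DIV2K_train_HR_sub"  # placeholder: output folder for sub-images
PATCH = 480  # sub-image size in pixels (use PATCH // scale for the LR folder)
STEP = 240   # sliding-window step; overlapping crops keep border content

os.makedirs(DST_DIR, exist_ok=True)

for path in sorted(glob(os.path.join(SRC_DIR, "*.png"))):
    img = Image.open(path)
    w, h = img.size
    if w < PATCH or h < PATCH:
        continue  # skip images smaller than one patch
    name = os.path.splitext(os.path.basename(path))[0]
    idx = 0
    # Slide a PATCH x PATCH window over the image and save each crop.
    for top in range(0, h - PATCH + 1, STEP):
        for left in range(0, w - PATCH + 1, STEP):
            idx += 1
            crop = img.crop((left, top, left + PATCH, top + PATCH))
            crop.save(os.path.join(DST_DIR, f"{name}_s{idx:03d}.png"))
    print(f"{name}: {idx} sub-images")
```

After cropping, point the training dataroot entries in the options file at the new sub-image folders.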

@JingyunLiang
Owner

Thank you @songwg188

@BeaverInGreenland

> You can try cropping the large images in the training set into small sub-images, because a lot of I/O time is spent reading the high-resolution images.

Should you do it for both High Res and Low Res images?
