Questions about the training efficiency #57

Closed
XiaoqiangZhou opened this issue Dec 10, 2021 · 9 comments

@XiaoqiangZhou

Thanks for releasing the code of SwinIR, which is really great work for low-level vision tasks.

However, when I train the SwinIR model following the guidance provided in the repo, I find the training efficiency is relatively low.

Specifically, the GPU utilization drops to 0 for a while from time to time (roughly 14 seconds of computation followed by 14 seconds of idling). When the GPU utilization is 0, the CPU utilization is also 0. It's worth noting that I use DDP training on 8 TITAN RTX GPU cards with the default batch_size. I train the classic SR task on the DIV2K dataset at the X2 scale. After half a day of training, the epoch, iteration, and PSNR on Set5 are about 1500, 42000, and 35.73 dB, respectively. So it will take about 5 days to finish the 500k iterations, far exceeding the 2 days reported in the README.

Could you please help me figure out the reason for the low training efficiency?

@JingyunLiang
Owner

It's strange. When I use DDP, the GPU utilization fluctuates but is always high (70%-100%). Can you try to train the model using one GPU and check the GPU utilization?

@XiaoqiangZhou
Author

XiaoqiangZhou commented Dec 10, 2021

@JingyunLiang Thanks for your quick reply~

Following your suggestion, I tried training the model with one GPU to check the GPU utilization.
After changing gpu_ids in the config file from [0,1,2,3,4,5,6,7] to [0] and changing the corresponding dataloader_batch_size from 32 to 4, I tried two ways of training the model, i.e., DDP and DP, with one TITAN RTX GPU card.

When I use DP, by running python main_train_psnr.py --opt options/swinir/the_config_file.json, the GPU utilization stays around 50%. Maybe I can try increasing the batch_size to fully utilize the GPU capacity under DP mode?

When I use DDP, by running python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 main_train_psnr.py --opt options/swinir/the_config_file.json --dist True, the low-efficiency phenomenon still exists. So the problem may be caused by the DDP training process. I may try adjusting some other configurations, such as dataloader_num_workers, if I insist on using DDP mode. Do you have any other suggestions?
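For reference, here is a rough sketch of the fields I am touching in the options JSON (illustrative values only; the exact nesting and defaults in options/swinir/*.json may differ):

```json
{
  "gpu_ids": [0],
  "datasets": {
    "train": {
      "dataloader_batch_size": 4,
      "dataloader_num_workers": 8
    }
  }
}
```

If the DDP workers are starving the GPU on data loading, raising dataloader_num_workers is probably the first knob I would try.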

I will try to solve this problem in the coming days and update my progress here. If there is no progress on the GPU utilization, I will close this issue.

Thanks.

@XiaoqiangZhou
Author

By the way, I'm using torch==1.7.0.

@Priahi

Priahi commented Dec 18, 2021

@XiaoqiangZhou any update on this? I am also facing similarly slow training. With batch size 16 and 1000 iterations per epoch, it takes about 1000 seconds to run a single epoch. Any insights on this, @JingyunLiang?

@blackcow

I have the same problem; GPU utilization is very low.

@lucky-zwx

Has anyone solved this problem?

@songwg188

You can try cropping the large images in the training set into small sub-images, because a lot of I/O time is spent reading the high-resolution images.

After using this method, the GPU utilization no longer dropped to 0 during my training. If the problem persists, the CPU performance of the server is probably insufficient.
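For anyone who wants to try this, here is a minimal sketch of such a cropping script (this is not a script from the repo; the folder names, patch size, and step are placeholder values to adjust, and the same cropping would typically be applied to both the HR folder and, with the size divided by the scale factor, the LR folder):

```python
# Minimal sketch: crop large training images into fixed-size sub-images so that
# each training sample only needs a small file read. Paths and sizes below are
# placeholder assumptions, not values taken from the repo.
import os
from glob import glob
from PIL import Image

SRC_DIR = "trainsets/DIV2K_train_HR"      # placeholder: folder with large images
DST_DIR = "trainsets/DIV2K_train_HR_sub"  # placeholder: output folder for sub-images
PATCH = 480  # sub-image size in pixels (use PATCH // scale for the LR folder)
STEP = 240   # sliding-window step; overlapping crops keep border content

os.makedirs(DST_DIR, exist_ok=True)

for path in sorted(glob(os.path.join(SRC_DIR, "*.png"))):
    img = Image.open(path)
    w, h = img.size
    if w < PATCH or h < PATCH:
        continue  # skip images smaller than one patch
    name = os.path.splitext(os.path.basename(path))[0]
    idx = 0
    # Slide a PATCH x PATCH window over the image and save each crop.
    for top in range(0, h - PATCH + 1, STEP):
        for left in range(0, w - PATCH + 1, STEP):
            idx += 1
            crop = img.crop((left, top, left + PATCH, top + PATCH))
            crop.save(os.path.join(DST_DIR, f"{name}_s{idx:03d}.png"))
    print(f"{name}: {idx} sub-images")
```

After cropping, point the training dataroot entries in the options file at the new sub-image folders.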

@JingyunLiang
Owner

Thank you @songwg188

@BeaverInGreenland

> You can try cropping the large images in the training set into small sub-images, because a lot of I/O time is spent reading the high-resolution images.

Should you do it for both High Res and Low Res images?
