How to solve the problem of loss NAN? #15

Open
xiaoxiAries opened this issue Oct 6, 2022 · 3 comments

Comments

@xiaoxiAries

Hi,

I am reproducing this code on the SSv2 dataset. I follow the blr of 0.1 and use 2 GPUs with a batch size of 7 (an effective total batch size of 14). But the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~

@ShoufaChen
Owner

Hi,

Which configuration are you using: the full-tuning baseline or AdaptFormer?

@xiaoxiAries
Author

Hi,
I follow this configuration:

OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt

@ShoufaChen
Owner

I am sorry, I haven't experimented with your specific configuration. Try reducing the learning rate.
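
As an illustration of that suggestion (the value below is only an assumption for the sketch, not a setting verified in this thread), the same command can be rerun with a smaller base learning rate, e.g. --blr 0.01 instead of 0.1:

OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.01 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt

If the loss still diverges early, intermediate values between 0.01 and 0.1 could be tried.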
