
Train fp16 interrupt #61

Closed
5Yesterday opened this issue Oct 8, 2020 · 3 comments


@5Yesterday
```
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
1500
265
1500
265
Every epoch need 188 iterations
Note that dataloader may hang with too much nworkers.
DLoss: 6.0000 Reg: 0.0000

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
```
I installed apex and trained with the fp16 config, and got the output above.

@5Yesterday (Author)

When I run it in a terminal, I get "Segmentation fault (core dumped)". Maybe I should narrow it down step by step?

@layumi (Contributor) commented Oct 9, 2020

Hi @5Yesterday,
Please check your system CUDA version and the CUDA version your PyTorch build uses. Do they match?
Also, please check the apex installation. Did you successfully compile apex with gcc 5+?
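One way to sanity-check the first point is to compare the system CUDA toolkit version against the CUDA version PyTorch was compiled with (`torch.version.cuda`). A minimal sketch; the `nvcc` call and the `torch` import are assumptions about the poster's environment, so the demo below only exercises the string comparison:

```python
import re
import subprocess

def cuda_versions_match(system_cuda: str, torch_cuda: str) -> bool:
    """Compare the major.minor components of two CUDA version strings,
    e.g. '10.1.243' (from nvcc) vs '10.1' (from torch.version.cuda)."""
    major_minor = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return major_minor(system_cuda) == major_minor(torch_cuda)

def system_cuda_version() -> str:
    """Parse the toolkit version out of `nvcc --version` (assumes nvcc is on PATH)."""
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    return re.search(r"release (\d+\.\d+)", out).group(1)

if __name__ == "__main__":
    # In the actual training environment you would run:
    #   import torch
    #   print(cuda_versions_match(system_cuda_version(), torch.version.cuda))
    print(cuda_versions_match("10.1.243", "10.1"))  # same major.minor
```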

@5Yesterday (Author)

@layumi Thanks, it works. Training had already succeeded without fp16, so it was the gcc problem; compiling apex with gcc-5 works.
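For anyone hitting the same crash, a quick check that the compiler apex will pick up meets the gcc 5+ requirement (a sketch; it assumes `gcc` on PATH is the one the build will use, and the sample version string is illustrative):

```python
import re
import subprocess

def gcc_major(version_output: str) -> int:
    """Extract the major version from `gcc --version` output.
    The first line typically looks like: 'gcc (Ubuntu 5.4.0-6ubuntu1) 5.4.0'."""
    return int(re.search(r"(\d+)\.\d+\.\d+", version_output.splitlines()[0]).group(1))

def check_gcc() -> None:
    out = subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout
    major = gcc_major(out)
    print(f"gcc major version: {major} ({'OK' if major >= 5 else 'too old for apex'})")

if __name__ == "__main__":
    print(gcc_major("gcc (Ubuntu 5.4.0-6ubuntu1) 5.4.0"))  # → 5
```

If the default gcc is too old, the usual workaround is to point the build at a newer one via the standard `CC`/`CXX` environment variables when reinstalling apex.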
