
Train fp16 interrupt #61

Closed
5Yesterday opened this issue Oct 8, 2020 · 3 comments


@5Yesterday
```
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
1500
265
1500
265
Every epoch need 188 iterations
Note that dataloader may hang with too much nworkers.
DLoss: 6.0000 Reg: 0.0000

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
```
I installed apex and trained with the fp16 config, and got the output above.

@5Yesterday (Author)

When I run it in a terminal, I get "Segmentation fault (core dumped)". Maybe I should narrow it down step by step?

@layumi (Contributor) commented Oct 9, 2020

Hi @5Yesterday,
Please check your system CUDA version and the CUDA version your PyTorch build uses. Do they match?
Also, please check the apex installation. Did you successfully compile apex with gcc 5+?
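One way to sanity-check the first point is to compare the system CUDA toolkit version against the CUDA version PyTorch was compiled with (`torch.version.cuda`). A minimal sketch; the `nvcc` call and the `torch` import are assumptions about the poster's environment, so the demo below only exercises the string comparison:

```python
import re
import subprocess

def cuda_versions_match(system_cuda: str, torch_cuda: str) -> bool:
    """Compare the major.minor components of two CUDA version strings,
    e.g. '10.1.243' (from nvcc) vs '10.1' (from torch.version.cuda)."""
    major_minor = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return major_minor(system_cuda) == major_minor(torch_cuda)

def system_cuda_version() -> str:
    """Parse the toolkit version out of `nvcc --version` (assumes nvcc is on PATH)."""
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    return re.search(r"release (\d+\.\d+)", out).group(1)

if __name__ == "__main__":
    # In the actual training environment you would run:
    #   import torch
    #   print(cuda_versions_match(system_cuda_version(), torch.version.cuda))
    print(cuda_versions_match("10.1.243", "10.1"))  # same major.minor
```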

@5Yesterday (Author)

@layumi Thanks, it works. Training had already succeeded without fp16, so it was the gcc problem; compiling apex with gcc-5 works.
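For anyone hitting the same crash, a quick check that the compiler apex will pick up meets the gcc 5+ requirement (a sketch; it assumes `gcc` on PATH is the one the build will use, and the sample version string is illustrative):

```python
import re
import subprocess

def gcc_major(version_output: str) -> int:
    """Extract the major version from `gcc --version` output.
    The first line typically looks like: 'gcc (Ubuntu 5.4.0-6ubuntu1) 5.4.0'."""
    return int(re.search(r"(\d+)\.\d+\.\d+", version_output.splitlines()[0]).group(1))

def check_gcc() -> None:
    out = subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout
    major = gcc_major(out)
    print(f"gcc major version: {major} ({'OK' if major >= 5 else 'too old for apex'})")

if __name__ == "__main__":
    print(gcc_major("gcc (Ubuntu 5.4.0-6ubuntu1) 5.4.0"))  # → 5
```

If the default gcc is too old, the usual workaround is to point the build at a newer one via the standard `CC`/`CXX` environment variables when reinstalling apex.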
