As I worked through this example, I soon encountered the following error message:
This fp16_optimizer is designed to only work with apex.contrib.optimizers.* To update, use updated optimizers with AMP.
I figured out that this makes sense, as I am using AMD hardware with a ROCm build of PyTorch. Still, the training times I get using one node with 8 GPUs are nowhere near the 10 hours reported in the configuration for 50k steps, with the same yaml file (without fp16).
This raises both broad and specific questions. To begin with the latter:
How can I use mixed precision on ROCm? (And, what kind of speedup should I expect?)
Broadly speaking, are there special performance considerations when using ROCm that affect the choice of optimizer, batch size, or parallelization strategy?
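For what it's worth, the error message itself points away from apex's `fp16_optimizer` and toward PyTorch's native AMP, which the ROCm build also exposes (the `torch.cuda.*` namespace maps to HIP on ROCm). A minimal sketch of a native-AMP training step, assuming a placeholder model and optimizer rather than this repo's actual training loop:

```python
import torch

# Placeholder model/optimizer; substitute the ones from your yaml config.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = torch.cuda.is_available()  # AMP autocast needs a GPU (CUDA or ROCm/HIP)

model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# GradScaler scales the loss to avoid fp16 gradient underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(3):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 4, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):  # forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
```

On a GPU with fast fp16 throughput this typically gives a meaningful speedup over fp32, but how much you see on a given AMD card depends on the hardware and the ROCm version, so I can't promise a specific number.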
I never had the chance to test with an AMD GPU, so I'm afraid I can't answer those questions.
You may get some answers on AMD/Radeon communities or forums.