extremely slow training with multiple GPUs #2065
-
I am training a model with lightning where I am attempting to use all the GPUs on my system (4 in total). My trainer is run as:
My model is defined as follows:
When I try to run it, the beginning of the epoch seems to hang for about 10 minutes while data is fed into the model, and after that progress is very sluggish. I also get these messages at the beginning. I am not sure whether they are a concern:
It basically hangs with this:
During this time, looking at GPU utilisation it seems:
So it seems that getting the data onto the GPU is quite slow even though everything looks maxed out. And when it does eventually start the epoch after ~30 minutes, it gives performance similar to the CPU on my MacBook Pro. I am really not sure if I am doing something very wrong here in how I am using PL.
Replies: 4 comments
-
Update to master. Will reopen if not fixed.
-
Unfortunately, this seems to be even worse with the current master. First it gives the error:
which is incorrect. This did not happen with the previous version. And then it gets stuck after this line (at least no console output)
-
@pamparana34 mind checking the recent 0.9 from master?
-
This looks very familiar, and I am sure I fixed this problem in #2997. Please try again with the latest version. Regarding the relative import error, you probably just launched the script from the wrong directory, but in any case I recommend using absolute imports. Please let me know if the upgrade fixes your problem, thanks.
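To illustrate the relative import point, here is a minimal sketch. It assumes the error was Python's usual "attempted relative import with no known parent package", since the exact message is not shown in this thread. The package name `myproject` and the files `model.py` and `train.py` are hypothetical and only exist for the demo. It builds a throwaway package, then launches the same file once directly and once as a module from the project root:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo():
    """Create a throwaway package and launch it two ways; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "myproject"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "model.py").write_text("NAME = 'model'\n")
        # train.py uses a relative import, which needs a known parent package.
        (pkg / "train.py").write_text("from .model import NAME\nprint(NAME)\n")

        # 1) Launch the file directly: there is no parent package,
        #    so the relative import raises ImportError.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "train.py")],
            capture_output=True, text=True,
        )
        # 2) Launch as a module from the project root: the import resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "myproject.train"],
            cwd=tmp, capture_output=True, text=True,
        )
        return as_script, as_module


if __name__ == "__main__":
    as_script, as_module = run_demo()
    print("direct launch exit code:", as_script.returncode)
    print("module launch exit code:", as_module.returncode)
```

Changing `train.py` to the absolute form `from myproject.model import NAME` still requires the project root to be importable, but it no longer depends on how the interpreter infers the parent package, which is why absolute imports tend to be more robust across launch methods.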