Training seems to pause every N steps #13375

I analyzed the problem a bit more. I have 48 CPUs and 48 workers, and the training process pauses every 48 steps. If I use 12 workers, the pause happens every 12 steps.
I'd like to increase the number of workers, but the RAM usage is extremely high: with 48 workers I am using almost all of the 180 GB of RAM available. Is this normal for simply loading images of a few kilobytes each?
Any suggestion on how to speed this up?
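For context, a pause that recurs exactly every `num_workers` steps is the usual signature of the DataLoader prefetch cycle: the workers hand batches over in round-robin order, so if producing a batch takes longer than consuming `num_workers` of them, the main process stalls once per round. A minimal sketch of the DataLoader arguments that control this cycle follows; `TinyImageDataset` is a hypothetical stand-in and the values are illustrative, not tuned:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TinyImageDataset(Dataset):
    """Hypothetical stand-in for the real image dataset."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # Placeholder sample; the real dataset would decode an image here.
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    TinyImageDataset(),
    batch_size=64,
    num_workers=12,           # workers hand batches over in round-robin order
    prefetch_factor=4,        # batches each worker keeps queued (default: 2)
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,          # page-locked buffers for faster GPU transfer
)
```

Raising `prefetch_factor` trades RAM for a deeper queue, which can smooth out the periodic stall without adding more workers.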

EDIT: I think I am facing this issue pytorch/pytorch#13246 (comment), even though I am not entirely sure. My memory consumption is about 100-150 GB right after training starts. I tried to use a numpy array to store the huge list of integers containing t…
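The workaround discussed in that issue is to keep per-sample metadata in numpy arrays instead of Python lists: merely reading a list element in a forked worker bumps that object's refcount, which dirties the copy-on-write pages the workers share with the parent, so memory usage grows toward `num_workers` copies of the list. A minimal sketch, where `ArrayBackedDataset` and the placeholder image loading are hypothetical:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayBackedDataset(Dataset):
    """Hypothetical dataset keeping metadata in numpy arrays, not lists."""

    def __init__(self, paths, labels):
        # A Python list of ints/strings is N refcounted objects; a numpy
        # array is one flat buffer, so forked workers keep sharing its pages.
        self.paths = np.array(paths)                      # fixed-width unicode array
        self.labels = np.asarray(labels, dtype=np.int64)  # flat int64 buffer

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert back to Python scalars only for the one sample requested.
        path, label = str(self.paths[idx]), int(self.labels[idx])
        image = torch.randn(3, 224, 224)  # placeholder for actual image loading
        return image, label
```

For very long path lists, a byte-string array (`np.array(paths, dtype=np.bytes_)`) is more compact than the unicode dtype.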
