Caffe memory increases with time (iterations?) #1377
So I have been trying to train the ImageNet example with Caffe, using CUDA 6.5 with cuDNN. But when I run the trainer as shown in http://caffe.berkeleyvision.org/gathered/examples/imagenet.html along with GLOG_log_dir, I see that the memory usage of caffe (seen using top) keeps increasing with time (iterations?). top shows something like:
SIZE RES SHR STATE TIME CPU COMMAND
This causes the machine to slow down to a crawl (I cannot type anything on the console). Any idea what might be causing this? I see this with both leveldb and lmdb.
Note that the GPU memory usage remains fixed.
This lmdb fork solves the problem: https://github.com/raingo/lmdb-fork/tree/mdb.master (not fully tested, and it won't be kept in sync with upstream).
Try setting /proc/sys/vm/swappiness to zero. LMDB will still use a lot of memory as page cache, but it will be as efficient as intended and the machine won't slow down to a crawl. In our case, the slowdown happens because swap is used too aggressively, which causes thrashing, and everything on the system ends up waiting on hard-drive IO. After setting swappiness to zero, we see no more freezes or slowdowns of the whole system.
Based on this answer from the main author of LMDB, Howard Chu: http://www.openldap.org/lists/openldap-technical/201503/msg00077.html
If you don't want to change the swap behavior for the whole system, look into cgroups, which let you tune the memory usage and caching behavior of individual processes. Hope this helps!
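For reference, swappiness is usually changed with sysctl vm.swappiness=0, or by writing 0 to /proc/sys/vm/swappiness as root (add an entry to /etc/sysctl.conf to make it persistent across reboots). The small C++ sketch below performs the same proc-file write; it is only an illustration of the setting, not something Caffe itself needs to do:

```cpp
// Minimal sketch: set vm.swappiness to 0 by writing the proc file directly.
// Equivalent to `sysctl vm.swappiness=0`; requires root, and the value does
// not survive a reboot unless it is also added to /etc/sysctl.conf.
#include <fstream>
#include <iostream>

int main() {
  std::ofstream swappiness("/proc/sys/vm/swappiness");
  if (!swappiness) {
    std::cerr << "could not open /proc/sys/vm/swappiness (are you root?)\n";
    return 1;
  }
  swappiness << 0 << std::endl;
  return swappiness.good() ? 0 : 1;
}
```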
I don't think the lmdb-fork degrades efficiency. The only thing it does is notify the OS that it can release the page cache read in the last iteration. We have already used it to train many models. I guess the lmdb author's concern about MAP_PRIVATE is irrelevant in the sequential-read case.
Have you tried the swappiness=0 solution with multiple training processes on a multi-GPU machine? Our first try was to set swappiness, but that failed again with multiple training processes, so we ended up with the hacky solution.
Hope that helps.
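To make the idea behind that fork concrete: the fix is to advise the kernel that pages already consumed by the sequential pass over the database can be dropped from the page cache. The fork patches LMDB itself; the standalone sketch below only shows the same technique on a plain file using posix_fadvise with POSIX_FADV_DONTNEED, and is not the fork's actual code:

```cpp
// Illustration only: read a file sequentially and tell the kernel it may drop
// cached pages for data we have already consumed, so a large training database
// does not accumulate in RAM. This is not the lmdb-fork code, just the general
// page-cache-release technique it relies on.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <file>\n", argv[0]);
    return 1;
  }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    std::perror("open");
    return 1;
  }

  // Hint that access will be sequential so kernel read-ahead stays effective.
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

  const size_t kChunk = 1 << 20;  // read 1 MiB at a time
  std::vector<char> buf(kChunk);
  off_t consumed = 0;
  ssize_t n;
  while ((n = read(fd, buf.data(), buf.size())) > 0) {
    // ... hand the bytes to the data layer here ...
    consumed += n;
    // Ask the kernel to release the cached pages we are finished with.
    posix_fadvise(fd, 0, consumed, POSIX_FADV_DONTNEED);
  }
  close(fd);
  return 0;
}
```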
@sguada in my test, memory usage does grow when using leveldb, even in
If caffe does not need random access, which seems to be true now, I think it would be better to just use a plain file instead of these dbs.
IMHO it's a poor use of RAM for a sequential read, no matter
In #2193 I tried training googlenet for 12 hours and RAM usage did not grow above 1 GB.