Caffe memory increases with time (iterations?) #1377

Closed
lavania opened this Issue Oct 29, 2014 · 10 comments

lavania commented Oct 29, 2014

So I have been trying to train the ImageNet example with Caffe. I am using CUDA 6.5 with cuDNN. But when I run the trainer as shown in http://caffe.berkeleyvision.org/gathered/examples/imagenet.html (along with GLOG_log_dir), the memory usage of caffe (as reported by top) keeps increasing with time (iterations?). top shows something like:

SIZE RES SHR STATE TIME CPU COMMAND
2285G 58G 57G run 28:57 140% caffe

This causes the machine to slow down to a crawl (I cannot type anything on the console). Any idea what might be causing this? I see this with both leveldb and lmdb.

Note that the GPU memory usage remains fixed.

Regards

beniz commented Oct 31, 2014

I believe this is the mmap call from lmdb and leveldb, which will use as much cache as possible, but I might be wrong.

amiralush commented Oct 31, 2014

I have also been experiencing this phenomenon when using the dev branch. I'm not sure it's related to LMDB's memory consumption, since in earlier versions I didn't encounter this when using LMDB.
I don't yet fully understand it. @sguada any suggestions?

sguada commented Oct 31, 2014

@amiralush we need to redo #1238 to simplify its complexity and fix any memory overhead.
#1238 (comment)
Currently we are in deadline mode and cannot do it, but PRs are welcome.

sguada commented Dec 12, 2014

@lavania the memory usage is due to LMDB, since it tries to map the file into memory for faster access. That shouldn't happen with LevelDB, so you can try that, but you will need to generate the leveldb from the data again.

raingo commented Jan 17, 2015

This lmdb fork: https://github.com/raingo/lmdb-fork/tree/mdb.master solves the problem (not fully tested, and it won't be kept up to date with upstream).

bwang0 commented Mar 20, 2015

Try setting /proc/sys/vm/swappiness to zero. LMDB will still use up lots of memory as page cache, but it will be efficient as intended and the machine won't slow down to a crawl. In our case, the slowdown happens because the kernel uses swap too aggressively, resulting in thrashing, and everything on the system waits on hard-drive I/O to complete. After swappiness is set to zero, we see no more freezes or slowdowns of the whole system.

This is based on an answer from the main author of LMDB, Howard Chu: http://www.openldap.org/lists/openldap-technical/201503/msg00077.html

If you don't want to change the swap behavior for the whole system, look into cgroups, which let you tune the memory usage and caching behavior of individual processes. Hope this helps!
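
A minimal sketch of applying that setting programmatically (not part of Caffe, just an illustration; it needs root and has the same effect as echoing 0 into the knob by hand):

// Sketch only: set vm.swappiness to 0 by writing the procfs knob directly.
// The change does not persist across reboots.
#include <fstream>
#include <iostream>

int main() {
  std::ofstream knob("/proc/sys/vm/swappiness");
  if (!knob) {
    std::cerr << "cannot open /proc/sys/vm/swappiness (are you root?)\n";
    return 1;
  }
  // 0 asks the kernel to avoid swapping out process pages in favor of
  // growing the page cache.
  knob << 0 << std::endl;
  return knob.good() ? 0 : 1;
}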

raingo commented Mar 20, 2015

I don't think the lmdb-fork degrades efficiency. The only thing it does is notify the OS to release the page cache read in the last iteration. We have already used it to train many models. I guess the LMDB author's concern about MAP_PRIVATE is irrelevant in the sequential-read case.

Have you tried the swappiness=0 solution with multiple training processes on a multi-GPU machine? Our first try was to set swappiness, but that failed again with multiple training processes, so we ended up with the hacky solution.

Hope that helps.
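
To illustrate the general idea (how the fork does this internally is my assumption; it patches LMDB itself, whereas the sketch below uses posix_fadvise on an ordinary file descriptor), this is the pattern of releasing the page cache behind a purely sequential read:

// Sketch only: read a file once, sequentially, and tell the kernel it may
// drop the pages consumed in each iteration so they do not pile up in the
// page cache.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

void read_once_and_release(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  std::vector<char> buf(1 << 20);  // 1 MiB chunks
  off_t done = 0;
  ssize_t n;
  while ((n = read(fd, buf.data(), buf.size())) > 0) {
    // Clean pages are simply evicted; they would be re-read from disk if
    // ever touched again, so nothing is lost.
    posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
    done += n;
  }
  close(fd);
}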

immars commented Mar 26, 2015

@sguada in my test, memory usage does grow when using leveldb, even during the rand_skip phase of DataLayer.

If caffe does not need random access, which seems to be true now, I think it's better to use just a plain file instead of these DBs.

IMHO it's a poor use of RAM to cache a sequential read, whether by mmapping or by caching programmatically, unless you have enough memory to fit the entire dataset.

In #2193 I tried training GoogLeNet for 12 hours and RAM usage did not grow above 1 GB.
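
As a rough illustration of the plain-file idea (this is not what #2193 implements, just one possible layout), records can be written back to back with a length prefix and streamed with nothing cached beyond the read buffer:

// Sketch only: stream length-prefixed records from a plain binary file.
// Each record could hold, e.g., a serialized caffe::Datum.
#include <cstdint>
#include <fstream>
#include <string>

bool next_record(std::ifstream& in, std::string* out) {
  uint32_t len = 0;
  if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
  out->assign(len, '\0');
  if (len == 0) return true;
  return static_cast<bool>(in.read(&(*out)[0], len));
}

// Usage (filename is hypothetical):
//   std::ifstream in("train.bin", std::ios::binary);
//   std::string rec;
//   while (next_record(in, &rec)) { /* decode and feed the net */ }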

woozzu commented Jul 16, 2015

On Windows, the LMDB problem can be solved by simply adding the following lines at the top of the LMDBCursor::Seek method. They release memory-mapped pages that are no longer needed after the current seek.

if (op != MDB_FIRST)
    VirtualUnlock(mdb_value_.mv_data, mdb_value_.mv_size);
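
On Linux, the closest counterpart I know of is an madvise(MADV_DONTNEED) hint on the pages backing the previous value. This is not in upstream Caffe; it is only a sketch of what such a patch could look like, assuming the same mdb_value_ member:

// Sketch only: evict the page-cache pages behind a value we are done with.
// For LMDB's shared read-only mapping this drops clean pages, which are
// re-read from the database file if they are ever touched again.
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

static void DropValuePages(const void* data, size_t size) {
  if (data == NULL || size == 0) return;
  const uintptr_t page = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
  const uintptr_t addr = reinterpret_cast<uintptr_t>(data);
  const uintptr_t begin = addr & ~(page - 1);  // madvise requires a page-aligned start
  madvise(reinterpret_cast<void*>(begin), addr + size - begin, MADV_DONTNEED);
}

// Called the same way as VirtualUnlock above:
//   if (op != MDB_FIRST)
//     DropValuePages(mdb_value_.mv_data, mdb_value_.mv_size);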

woozzu referenced this issue Sep 16, 2015: memory leak #28 (Closed)

shelhamer closed this Apr 13, 2017

hanchaow commented Apr 18, 2018

@woozzu How about the Linux version of Caffe?
