Memory issue? #103
Comments
I seem to have made some progress with (1) above by modifying the Node class, but this doesn't fix problem (2).
Thanks for the excellent bug report, and for the helpful tip I learned from your earlier email.
I simply call the garbage collector after each call.
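A sketch of that workaround (`compute_gradient` here is a hypothetical stand-in for whatever autograd gradient call the loop makes):

```python
import gc

def compute_gradient(x):
    # placeholder for the real call, e.g. grad(loss)(params)
    return 2.0 * x

for step in range(10):
    g = compute_gradient(float(step))
    # Force a full collection so that cyclic garbage left over from
    # the backward pass is reclaimed before the next iteration.
    gc.collect()
```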
Wow, this is very timely for me; issue (1) was causing me problems earlier today. The solution I was experimenting with was to adjust the backward pass to pop items off the tape, but this breaks the efficient-jacobian support. I think @hughsalimbeni's solution is better (it doesn't break that).
I've had a look at @jackkamm's solution for issue (2) and also thrown in a few extra weakrefs here and there, but it doesn't appear to make much of a difference. I should add that neither guppy nor gc seems to notice the problem.
The loopy references (which make reference-counting garbage collection not work, and which require some kind of mark-and-sweep strategy) are probably due to the fact that nodes refer to tapes and tapes refer to nodes. The tapes aren't really a necessary data structure, in the sense that they amount to a topological sorting of the computation graph (which is already encoded by the nodes pointing to each other), plus maybe a bit of extra bookkeeping.

A possible quick fix to try (that I don't see mentioned above) is to make the tapes keep weak references to nodes (as in a weak-reference list).
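A sketch of that quick fix (standalone toy classes, not autograd's actual ones): storing `weakref.ref` objects in the tape means the tape no longer keeps nodes alive, so reference counting can reclaim them as soon as the rest of the program lets go.

```python
import weakref

class Node:
    def __init__(self, name):
        self.name = name

tape = []
node = Node("x")
tape.append(weakref.ref(node))   # weak, not strong, reference

assert tape[0]() is node         # dereference while the node is alive
del node                         # the tape no longer keeps it alive
assert tape[0]() is None         # the weakref now resolves to None
```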
We just did some quick experiments, and it seems that calling `gc.collect()` reclaims the leaked memory. Also, we think we identified the problem: originally, these lines in the backward pass

```python
while tape:
    node = tape.pop()
    ...
```

ensured that there were no more circular dependencies when the backward pass finished, by popping things off the tape, so that reference counting would be sufficient to clean up garbage. However, to support efficient jacobians I added this line:

```python
tape = copy.copy(tape)  # <-- problem!
while tape:
    node = tape.pop()
    ...
```

... which had the unintended result of preserving circular dependencies and hence preventing reference-counting cleanup. Still, a full `gc.collect()` can reclaim those cycles.

We're going to implement a fix (and probably eliminate circular references with tapes in the long term), but for now, if you're having memory problems, try calling `gc.collect()` after each gradient call.
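The leftover-cycle effect described above can be reproduced in isolation (toy `Node`, not autograd's): reference counting alone never frees a node↔tape cycle, but a mark-and-sweep `gc.collect()` does.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.tape = None

gc.disable()                 # make the experiment deterministic
node = Node()
tape = [node]                # tape -> node
node.tape = tape             # node -> tape: a reference cycle
probe = weakref.ref(node)

del node, tape
# Reference counting alone cannot reclaim the cycle...
assert probe() is not None
gc.collect()                 # ...but mark-and-sweep can
assert probe() is None
gc.enable()
```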
As for (2) (which I actually only just read the code for): reverse-mode autodiff generally requires keeping the forward values around in memory, so the memory usage should scale as Θ(k) there. If you can rewrite the code to generate fewer temporaries on the forward pass then of course memory usage would be reduced, but those temporaries can't be garbage collected until the backward pass is done. An optimizing compiler could in principle rewrite that particular code for you (because some temporaries are superfluous and don't need to be kept around), but autograd doesn't do any rewriting; it just follows the forward pass, keeping references to any temporaries generated so they can't be garbage collected.
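As a toy illustration of why memory is Θ(k) (a hand-rolled scalar reverse mode, not autograd's implementation): every value produced on the forward pass stays referenced through the graph until the backward pass consumes it.

```python
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input Var, local gradient)
        self.grad = 0.0

    def __mul__(self, other):
        o = other if isinstance(other, Var) else Var(other)
        return Var(self.value * o.value,
                   parents=((self, o.value), (o, self.value)))

def backward(out):
    # Walk the graph, accumulating gradients into every forward value.
    # A plain stack suffices here because the graph is a simple chain.
    out.grad = 1.0
    stack = [out]
    while stack:
        v = stack.pop()
        for parent, local_grad in v.parents:
            parent.grad += v.grad * local_grad
            stack.append(parent)

x = Var(2.0)
y = x
for _ in range(5):
    # Each iteration creates a temporary Var; all k of them stay
    # reachable through y.parents until backward(y) has run.
    y = y * 3.0
backward(y)
assert x.grad == 243.0   # d/dx of x * 3**5
```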
Thanks @mattjj for your previous comment. I've poked around for some more memory savings for my application and done a couple of things. Firstly, I made a small change to the backward pass. Secondly, and more importantly, following @mattjj's advice of creating fewer temporaries, I rewrote some of my expressions so that the forward pass generates fewer intermediate arrays.
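The commenter's actual rewrites aren't shown above; as a generic illustration in plain NumPy, factoring an expression reduces the number of array temporaries the forward pass creates, and hence what reverse-mode AD has to keep alive:

```python
import numpy as np

a = np.random.randn(1000)
b = np.random.randn(1000)
c = np.random.randn(1000)

# Two temporary arrays (a*b and a*c) are created before the sum;
# under reverse-mode AD each intermediate stays referenced until
# the backward pass finishes.
out1 = a * b + a * c

# Algebraically identical, but only one temporary (b + c) is created.
out2 = a * (b + c)

assert np.allclose(out1, out2)
```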
I'm likewise facing a memory issue. On my MacBook Air, in my Activity Monitor, I'm observing that "Memory" minus "Compressed Memory" stays roughly at about 1.5GB, but the absolute values of "Memory" and "Compressed Memory" continue to climb beyond the 8GB on my local machine. Is there a way out of this?
@hughsalimbeni: Implementing your changes to autograd helped fix my memory leakage issue as well.
@mattjj: would you be willing to incorporate @hughsalimbeni's changes into master? It's only 2 lines of code, makes a huge memory difference, and doesn't break anything (I have been using it for some months now). I can submit a PR if that would be helpful (or maybe @hughsalimbeni should submit the PR if he'd like the credit :).

My application involves computing a memory-intensive M-estimator (a sum of functions). For memory reasons I compute the gradient on minibatches instead of the final result. Having to call the garbage collector after every minibatch is a nuisance, and I would much prefer to just list an unmodified autograd as a dependency rather than maintain a patched copy.
Thanks for bringing this up again. It's very useful feedback and we want to fix it. As far as I can tell, you're talking about two changes. Does that sound right to you?

I pushed a commit that might address these issues, and it also should mean you don't need to run `gc.collect()` manually. I have a basic memory test, copied below, that runs successfully on the new commit but blew up virtual memory on the previous master. However, it might not cover your use case, so please try it out and let us know how things look.

```python
import autograd.numpy as np
from autograd import grad

na = np.newaxis

a = 10000
b = 100
c = 50

A = np.random.randn(a)
B = np.random.randn(b)
C = np.random.randn(c)

def fn(x):
    return np.sum(A[:, na, na] * B[na, :, na] * x[na, na, :])

g = grad(fn)

for i in range(50):
    g(C)
```
Thanks for the commit, I tested it and it fixes the memory issues in my use case! For the record, the change I was referencing and previously using was slightly different; in particular, I was using the WeakKeyDictionary approach mentioned earlier in this thread. Anyways, thanks again for fixing this, you guys rock!
Making the nodes only have a weak reference to the tapes would be a more slick solution to the circular-dependency problem, but as you say I am a bit wary of it (as in #17). Have you been using that solution without issue for a while? The commit I added last night is also pretty short, and it arguably follows the "explicit is better than implicit" Zen of Python. Since it seems to solve things, I'm going to close this issue, but let me know if the WeakKeyDictionary has been working well for you and maybe I'll look into that solution.
I've been using the WeakKeyDict without issue for a few months now, ever since it was first proposed earlier in this thread. But I am happy with your solution -- the WeakKeyDict is a bit too mysterious/magical for my tastes, and explicitly doing the dereferencing seems less likely to cause errors in the future.
I've run into an issue with large matrices and memory. There seem to be two problems:

(1) Repeatedly calling the gradient in a loop is ramping up memory on each iteration.

(2) A single call seems to scale in memory as O(k), which I don't think is the desired behaviour. For b=150000 this effect does not happen, however.