
Adds constant memory backprop #2

Open · wants to merge 15 commits into base: master
Conversation

@anshuln (Collaborator) commented Jul 18, 2019

The architecture is changed a bit: a new abstract class is added for layers to inherit from, and some changes are also made to the Generator class. Unit tests for the gradient computations are added, along with a function to build a data generator from images.
However, there seems to be a small bug in the gradient computations involving the AffineCoupling layer, which requires some math to be figured out.

@AlexanderMath (Owner)

Great job, the code looks very promising. Unfortunately I'm a bit sleep deprived today (so I'll focus on smaller refactoring issues), and I'm busy tomorrow, but I'll probably have all of Sunday to merge the code. I'll try to summarize the changes below to make sure I understand them. Please read my summary and highlight any misunderstandings on my behalf.

Overall. There are changes in five files. The main merging work will be in the files containing Generator and the layers, since the code for the data loader and unit tests doesn't really interact with the previous code.

Generator class: Previously, Generator inherited the functions train_on_batch, fit and fit_generator from Sequential, which inherited them from Model. The new fit loops through the data and calls train_on_batch, which has the O(1) memory backprop.

Merging these changes seems fairly straightforward, since the previous Generator class had no functionality. A rough sketch of the training step, as I understand it, is below.
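(The names call, call_inv and compute_gradients follow the PR; the exact signatures and the latent-space loss here are illustrative assumptions, not the actual code.)

```python
import tensorflow as tf

def train_on_batch_constant_memory(model, X, optimizer):
    # Forward pass: activations are not kept around.
    Z = X
    for layer in model.layers:
        Z = layer.call(Z)

    # Gradient of the latent-space loss w.r.t. the final output
    # (the log-det term is omitted here for brevity).
    with tf.GradientTape() as tape:
        tape.watch(Z)
        loss = 0.5 * tf.reduce_sum(tf.square(Z))
    dy = tape.gradient(loss, Z)

    # Backward pass: each layer's input is reconstructed with call_inv,
    # so only one layer's activations are alive at any time.
    for layer in reversed(model.layers):
        dy, grads = layer.compute_gradients(Z, dy)
        if grads:
            optimizer.apply_gradients(zip(grads, layer.trainable_variables))
        Z = layer.call_inv(Z)
    return loss
```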

I understood all TODOs with one exception. On line 232, at dy, grads = layer.compute_gradients(...), there is a TODO stating 'implementing scaling here'. What does this mean? Line 32 similarly says "TODO: imlpement scaling here -- DONE". Is this TODO already handled?

I really appreciated all the other TODOs; they were easy to read and understand, and I believe I'll be able to implement all of them.

Layers: There is a new virtual class called LayerWithGrads which all layers should inherit from. It implements compute_gradients, which most layers inherit, except AdditiveCoupling and AffineCoupling. It forces all children to implement call and call_inv. I believe it makes sense to also force the children to implement log_det, what do you think? A rough sketch of the class as I understand it is below.
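(This is only a sketch under my assumptions about the signatures; the default compute_gradients below recomputes the layer input from its output with call_inv instead of storing activations.)

```python
import tensorflow as tf

class LayerWithGrads(tf.keras.layers.Layer):
    """Sketch of the abstract invertible layer; not the exact code in the PR."""

    def call(self, X):          # forward pass, must be implemented by children
        raise NotImplementedError

    def call_inv(self, Z):      # inverse pass, must be implemented by children
        raise NotImplementedError

    def log_det(self):          # could also be forced on children, as discussed
        return tf.constant(0.0)

    def compute_gradients(self, Z, dy):
        # Default implementation most layers inherit: reconstruct the input,
        # then backprop through a fresh tape using dy as output gradients.
        X = self.call_inv(Z)
        with tf.GradientTape() as tape:
            tape.watch(X)
            Z_recomputed = self.call(X)
        grads = tape.gradient(Z_recomputed, [X] + self.trainable_variables,
                              output_gradients=dy)
        return grads[0], grads[1:]   # (dy w.r.t. the layer input, weight gradients)
```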

I imagine the biggest difficulty will be with the coupling layers, especially getting them to work with the multi-scale architecture simultaneously. I will probably be able to get everything except that working on Sunday.

@AlexanderMath (Owner)

Out of curiosity, did you find any way of benchmarking memory usage on the GPU? The O(sqrt(L)) gradient checkpointing was implemented by OpenAI, and they have a very nice graph showing memory consumption as the number of residual blocks increases. When I finish all the merging, I'll try to figure out how they measured memory consumption and make a similar plot which also shows the increase in computation time.

[image: memory consumption plot from the gradient-checkpointing repository]

https://github.com/cybertronai/gradient-checkpointing

@AlexanderMath self-assigned this Jul 19, 2019
@AlexanderMath added the enhancement label Jul 19, 2019
@anshuln (Collaborator, Author) commented Jul 19, 2019

Couldn't have summarized it better myself!
I just have one comment regarding the Generator class: the loss functions have been changed to take 0 or 1 arguments, as opposed to the 2 arguments in the original code, so some refactoring will be required.
Regarding the TODOs in the layer class: they referred to the fact that log_det is divided by a scaling factor before being added to the loss function. This is handled in the layer.compute_gradients function.
Also, log_det should probably be part of the abstract class, since flow models definitely use it. The only issue is that log_det needs to return a tensor for backprop to work, and we need to somehow enforce that condition. Something along these lines:
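(A hypothetical ActNorm-style layer, just to illustrate; names and shapes are assumptions.)

```python
import tensorflow as tf

class ActNormSketch(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.scale = self.add_weight("scale", shape=input_shape[1:],
                                     initializer="ones")

    def call(self, X):
        return X * self.scale

    def call_inv(self, Z):
        return Z / self.scale

    def log_det(self):
        # Returning a Python float here would break backprop; this has to stay
        # a tensor. The scaling factor mentioned above is applied where this
        # term is added to the loss (in compute_gradients).
        return tf.reduce_sum(tf.math.log(tf.abs(self.scale)))
```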

@AlexanderMath (Owner)

The code is now included on the main branch. I'll leave this open for discussion of the code, and close it when we're happy with the constant memory backprop code. Below are a few initial thoughts; we should probably discuss these issues over Skype some day. I'm leaving them here both to open up discussion and as notes for myself on things that need to be taken care of.

(1) In train_on_batch there is a forward pass self.call(X). After measuring memory consumption before and after with the tool pip install nvidia-ml-py3, it seems it stores activations; at the very least, the memory consumption afterwards was 1 GB larger than before the call (using a model with 100 ActNorm layers with just 2400 weights). There are a few things I need to understand about @tf.function, eager execution and how this all relates to the graph in TensorFlow 2.0 before I decide how to fix this.
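(A minimal sketch of this kind of measurement; the exact snippet isn't in the PR, GPU index 0 is an assumption, and TensorFlow may pre-allocate GPU memory unless memory growth is enabled, which can distort such readings.)

```python
import pynvml   # provided by the nvidia-ml-py3 package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mb():
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024 ** 2

before = used_mb()
# z = model.call(X)   # the forward pass under test
after = used_mb()
print("memory increase: %.1f MB" % (after - before))
```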

(2) Should we remove compile and use fit to specify the optimizer?
I agree this makes sense from our perspective, but I think it violates the Keras API, and thus believe it is a bad idea. That said, I agree it is a bit messy: the framework will only support loss='maximum_likelihood', so compile doesn't have any responsibility besides the optimizer (metrics are by default set to what one needs anyway). Because of this it seems nicer to pass the optimizer to fit and remove compile, which is what was done in the pull request.

That said, I currently believe it is better to follow the Keras API so people familiar with it won't be confused when they find there is no compile function.

There is a similar issue with metrics. The Keras API specifies that a metric takes two arguments, y_true and y_pred; however, the likelihood computations only use y_pred. In this sense specifying y_true is redundant, but adding this redundancy makes the code follow the Keras API. Performance-wise there is no loss since we are only passing along a pointer, so I think (even though it is annoying from our perspective) it will be nicer from the perspective of users who are used to Keras.
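(A sketch of what I mean, keeping the two-argument signature and just ignoring y_true; the latent-space term below stands in for the full likelihood and is an assumption.)

```python
import tensorflow as tf

def maximum_likelihood_loss(y_true, y_pred):
    # y_true is unused; only the model output enters the likelihood.
    # The log-determinant contribution is handled elsewhere, as discussed above.
    return 0.5 * tf.reduce_sum(tf.square(y_pred), axis=-1)

# Plugs into the usual Keras workflow:
# model.compile(optimizer="adam", loss=maximum_likelihood_loss)
```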

(3) I changed the original fit function to take a parameter memory. If it is set to 'linear' it uses the fit method inherited from Model; if set to 'constant' it uses our new fit method. At some point I think we also want the O(sqrt(L)) strategy for architectures where the inverse takes a lot of time to compute, e.g. iResNet or Residual Flow.
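(Roughly, the dispatch looks like this; fit_constant_memory is a hypothetical name used here for illustration, not the actual method.)

```python
import tensorflow as tf

class Generator(tf.keras.Sequential):
    def fit(self, X, memory="linear", **kwargs):
        if memory == "linear":
            return super().fit(X, **kwargs)               # standard Keras fit, stores activations
        if memory == "constant":
            return self.fit_constant_memory(X, **kwargs)  # O(1)-memory loop over train_on_batch
        raise ValueError("memory must be 'linear' or 'constant'")
```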

(4) On line 233 of invtf/generator_const_backprop.py there was a line referring to a variable gradientsrads, which is not defined anywhere; I searched the entire file. I assumed this was a typo and changed it to the gradients computed on the line above. Subsequent code runs with no errors; however, if I train the model with fit and change memory from "linear" to "constant", the model obtains a different negative log likelihood (e.g. 8.3 for constant and 8.1 for linear; constant sometimes gives a very bad 20-40). After debugging it seems (when both losses are around 8) that one approach gets a larger determinant loss and a lower latent-space loss, so there might be an issue with scaling, but I have no clear picture of this bug.

(5) TQDM is one of my favorite libraries, but it introduces an unnecessary dependency that users of the library would have to install. Furthermore, Keras has a nice progress bar we can use instead, see the picture below:

[image: Keras progress bar output]

For now I removed TQDM from the fit function and plan to remove it from fit_generator as well.
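(The replacement is roughly like this; model and batches are assumed to exist, and the metric name is just a placeholder.)

```python
import tensorflow as tf

progbar = tf.keras.utils.Progbar(target=len(batches))
for i, batch in enumerate(batches):
    loss = model.train_on_batch(batch)
    progbar.update(i + 1, values=[("nll", float(loss))])
```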

(6) I renamed the LayerWithGrads class to InvLayer; I think this will be easier for users to understand. I also added the log_det function. When testing I got an issue with compute_gradients if I had a layer with no trainable weights, for example Squeeze or Normalize.
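(One possible way to guard against that, sketched here; this is not the code in the PR.)

```python
import tensorflow as tf

def compute_gradients(self, Z, dy):
    X = self.call_inv(Z)                 # reconstruct the layer input
    with tf.GradientTape() as tape:
        tape.watch(X)
        Z_recomputed = self.call(X)
    if self.trainable_variables:
        grads = tape.gradient(Z_recomputed, [X] + self.trainable_variables,
                              output_gradients=dy)
        return grads[0], grads[1:]
    # No weights (e.g. Squeeze): only propagate dy to the layer's input.
    dy_prev = tape.gradient(Z_recomputed, X, output_gradients=dy)
    return dy_prev, []
```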

I'll probably dedicate most of tomorrow (Sunday) to working on this. Next week I'll be out of office kayaking for a week. We also get NeurIPS reviews back this week, so when I'm back the following week I'll have to answer those ASAP; after that I'll get back to working on this.

@anshuln (Collaborator, Author) commented Jul 20, 2019

> if I train the model with fit and change memory from "linear" to "constant" the model obtains different negative log likelihood (e.g. 8.3 for constant and 8.1 for linear, however, constant sometimes gives very bad 20-40).

Does this model have AffineCoupling layers in it? Could you try the gradient unit test with your architecture to check whether the computations are correct?

> When testing I got an issue with the compute_gradients if I had a layer with no trainable weights, for example Squeeze or Normalize.

What is the error exactly? The unit tests involving these layers were passing on my system.

> After testing memory consumption before and after using the tool pip install nvidia-ml-py3, it seems it stores activations; at the very least the memory consumption afterwards was 1 GB larger than before the call (used a model with 100 ActNorm layers with just 2400 weights)

I am not sure what is causing this; it may be that the call function is not doing proper resource management when variables are overwritten.

We should probably have a discussion about these issues tomorrow.
