
EWC fails with RNNs using CUDA #736

Closed
iacobo opened this issue Sep 6, 2021 · 5 comments
Labels
bug (Something isn't working) · Training (Related to the Training module)

Comments

@iacobo
Contributor

iacobo commented Sep 6, 2021

Describe the bug

When training a model that contains an RNN/LSTM/GRU layer with the EWC strategy on a CUDA device, an error is raised because PyTorch's cuDNN-backed RNN layers do not support backward calls while in eval mode.

Trace:

  File "/.../avalanche/training/strategies/base_strategy.py", line 248, in train
    self.train_exp(self.experience, eval_streams, **kwargs)
  File "/.../avalanche/training/strategies/base_strategy.py", line 304, in train_exp
    self.after_training_exp(**kwargs)
  File "/.../avalanche/training/strategies/base_strategy.py", line 536, in after_training_exp
    p.after_training_exp(self, **kwargs)
  File "/.../avalanche/training/plugins/ewc.py", line 92, in after_training_exp
    importances = self.compute_importances(strategy.model,
  File "/.../avalanche/training/plugins/ewc.py", line 123, in compute_importances
    loss.backward()
  File "/.../torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/.../torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cudnn RNN backward can only be called in training mode

Source of error:

https://avalanche-api.continualai.org/_modules/avalanche/training/plugins/ewc/#EWCPlugin

    def compute_importances(self, model, criterion, optimizer,
                            dataset, device, batch_size):
        """
        Compute EWC importance matrix for each parameter
        """

        model.eval()                                       # <----------- Here eval mode is called

        # list of list
        importances = zerolike_params_dict(model)
        dataloader = DataLoader(dataset, batch_size=batch_size)
        for i, (x, y, task_labels) in enumerate(dataloader):
            x, y = x.to(device), y.to(device)

            optimizer.zero_grad()
            out = avalanche_forward(model, x, task_labels)
            loss = criterion(out, y)
            loss.backward()                                 # <----------- Given above, here causes the error
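
The failure can be reproduced outside Avalanche with a few lines of plain PyTorch (a minimal sketch, assuming a CUDA-capable machine; the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    lstm = nn.LSTM(input_size=8, hidden_size=16).to(device)
    lstm.eval()  # same state that compute_importances puts the model in

    x = torch.randn(5, 3, 8, device=device)  # (seq_len, batch, input_size)
    out, _ = lstm(x)
    out.sum().backward()  # RuntimeError: cudnn RNN backward can only be called in training mode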

To Reproduce

  • Use a machine with a CUDA capable GPU.
  • Create a model containing an RNN-like layer (RNN/LSTM/GRU).
  • Wrap it in the EWC strategy.
  • Set the device to the GPU.
  • Train the model (on any dataset); a rough sketch of these steps follows below.
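
For reference, a sketch of those steps against the Avalanche API current at the time of this issue (the EWC signature, import paths, and the toy LSTM classifier are assumptions based on the then-current documentation, not an exact reproduction script):

    import torch
    import torch.nn as nn
    from torch.optim import SGD
    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.training.strategies import EWC

    class LSTMClassifier(nn.Module):
        """Tiny LSTM that reads each 28x28 MNIST image as a 28-step sequence."""
        def __init__(self, hidden=64, n_classes=10):
            super().__init__()
            self.lstm = nn.LSTM(28, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):
            x = x.view(x.size(0), 28, 28)  # (batch, seq_len, features)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1])     # classify from the last time step

    device = torch.device("cuda")
    model = LSTMClassifier()
    strategy = EWC(model, SGD(model.parameters(), lr=0.01),
                   nn.CrossEntropyLoss(), ewc_lambda=0.4,
                   train_mb_size=128, train_epochs=1, device=device)

    benchmark = SplitMNIST(n_experiences=5)
    for experience in benchmark.train_stream:
        strategy.train(experience)  # raises the RuntimeError above while computing importances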

Expected behavior
RNN models should train with EWC on GPU without error.

馃 Additional context

This appears to be the opposite of the behaviour of EWCPlugin as documented at:

https://avalanche-api.continualai.org/_modules/avalanche/training/plugins/#EWCPlugin

    def compute_importances(self, model, criterion, optimizer,
                            dataset, device, batch_size):
        """
        Compute EWC importance matrix for each parameter
        """

        model.train()   

Potential fix

If this is just a typo, the above would be fixed by changing model.eval() to model.train().

@iacobo added the bug (Something isn't working) label on Sep 6, 2021
@AntonioCarta
Collaborator

If this is just a typo, the above would be fixed by changing model.eval() to model.train().

It is probably better to keep the eval mode for all the modules and add an exception for RNNs. We don't want the train mode for modules such as dropout or batch normalization.
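
A minimal sketch of that idea (the helper name is an assumption, not the actual patch): call model.eval() as before, then walk the submodules and switch only the recurrent ones back to train mode.

    import torch.nn as nn

    def set_rnn_layers_to_train(model: nn.Module) -> None:
        """Keep the model in eval mode overall, but put RNN-like layers
        (RNN/LSTM/GRU, i.e. subclasses of nn.RNNBase) into train mode,
        since cuDNN only supports their backward pass in training mode."""
        model.eval()
        for module in model.modules():
            if isinstance(module, nn.RNNBase):
                module.train()

Dropout and batch normalization layers stay in eval mode, which preserves the behaviour asked for above; one caveat is that a multi-layer nn.LSTM configured with dropout > 0 would re-enable its internal dropout once switched back to train mode.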

@AntonioCarta added the Training (Related to the Training module) label on Sep 6, 2021
@iacobo
Contributor Author

iacobo commented Sep 7, 2021

@AntonioCarta in that case should the line in plugins.EWCPlugin's compute_importances be changed to also say model.eval() (like in plugins.ewc.EWCPlugin)?

@AntonioCarta
Collaborator

@AntonioCarta in that case should the line in plugins.EWCPlugin's compute_importances be changed to also say model.eval() (like in plugins.ewc.EWCPlugin)?

I agree.

@AndreaCossu
Collaborator

@iacobo thanks for reporting! Are you willing to submit a PR with the solution? In my opinion, we could raise a warning in case of RNN + CUDA, explaining the problem to the user, and then use the train mode instead of failing.
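
A rough sketch of that warn-and-switch approach (the function name and wording are assumptions; the merged fix may differ):

    import warnings
    import torch
    import torch.nn as nn

    def prepare_model_for_importances(model: nn.Module, device) -> None:
        """Default to eval mode; if the model contains RNN-like layers and
        runs on a CUDA device, warn the user and fall back to train mode so
        that loss.backward() in compute_importances does not raise."""
        model.eval()
        has_rnn = any(isinstance(m, nn.RNNBase) for m in model.modules())
        if has_rnn and torch.device(device).type == "cuda":
            warnings.warn(
                "RNN-like modules do not support backward calls while in "
                "eval mode on CUDA devices; switching to train mode while "
                "computing EWC importances."
            )
            model.train()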

@iacobo
Contributor Author

iacobo commented Sep 8, 2021

@AndreaCossu Yep sure thing.

AntonioCarta added a commit that referenced this issue Nov 2, 2021
Fix #736 (CUDA bug with RNN's using EWC)