
EWC fails with RNNs using CUDA #736

Closed
iacobo opened this issue Sep 6, 2021 · 5 comments
Labels
bug (Something isn't working) · Training (Related to the Training module)

Comments

@iacobo
Contributor

iacobo commented Sep 6, 2021

Describe the bug

When training a model that contains an RNN/LSTM/GRU layer with the EWC strategy on a CUDA device, an error is raised because PyTorch's cuDNN-backed RNN layers do not support backward calls while in eval mode.

Trace:

  File "/.../avalanche/training/strategies/base_strategy.py", line 248, in train
    self.train_exp(self.experience, eval_streams, **kwargs)
  File "/.../avalanche/training/strategies/base_strategy.py", line 304, in train_exp
    self.after_training_exp(**kwargs)
  File "/.../avalanche/training/strategies/base_strategy.py", line 536, in after_training_exp
    p.after_training_exp(self, **kwargs)
  File "/.../avalanche/training/plugins/ewc.py", line 92, in after_training_exp
    importances = self.compute_importances(strategy.model,
  File "/.../avalanche/training/plugins/ewc.py", line 123, in compute_importances
    loss.backward()
  File "/.../torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/.../torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cudnn RNN backward can only be called in training mode

Source of error:

https://avalanche-api.continualai.org/_modules/avalanche/training/plugins/ewc/#EWCPlugin

    def compute_importances(self, model, criterion, optimizer,
                            dataset, device, batch_size):
        """
        Compute EWC importance matrix for each parameter
        """

        model.eval()                                       # <----------- Here eval mode is called

        # list of list
        importances = zerolike_params_dict(model)
        dataloader = DataLoader(dataset, batch_size=batch_size)
        for i, (x, y, task_labels) in enumerate(dataloader):
            x, y = x.to(device), y.to(device)

            optimizer.zero_grad()
            out = avalanche_forward(model, x, task_labels)
            loss = criterion(out, y)
            loss.backward()                                 # <----------- Given above, here causes the error
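
The failure can be reproduced outside Avalanche with a few lines of plain PyTorch (a minimal sketch, assuming a CUDA-capable machine; the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    lstm = nn.LSTM(input_size=8, hidden_size=16).to(device)
    lstm.eval()  # same state that compute_importances puts the model in

    x = torch.randn(5, 3, 8, device=device)  # (seq_len, batch, input_size)
    out, _ = lstm(x)
    out.sum().backward()  # RuntimeError: cudnn RNN backward can only be called in training mode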

To Reproduce

  • Use a machine with a CUDA capable GPU.
  • Create a model containing an RNN-like layer (RNN/LSTM/GRU).
  • Wrap it in the EWC strategy.
  • Set the device to the GPU.
  • Train the model (on any dataset); a rough sketch of these steps follows below.
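
For reference, a sketch of those steps against the Avalanche API current at the time of this issue (the EWC signature, import paths, and the toy LSTM classifier are assumptions based on the then-current documentation, not an exact reproduction script):

    import torch
    import torch.nn as nn
    from torch.optim import SGD
    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.training.strategies import EWC

    class LSTMClassifier(nn.Module):
        """Tiny LSTM that reads each 28x28 MNIST image as a 28-step sequence."""
        def __init__(self, hidden=64, n_classes=10):
            super().__init__()
            self.lstm = nn.LSTM(28, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):
            x = x.view(x.size(0), 28, 28)  # (batch, seq_len, features)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1])     # classify from the last time step

    device = torch.device("cuda")
    model = LSTMClassifier()
    strategy = EWC(model, SGD(model.parameters(), lr=0.01),
                   nn.CrossEntropyLoss(), ewc_lambda=0.4,
                   train_mb_size=128, train_epochs=1, device=device)

    benchmark = SplitMNIST(n_experiences=5)
    for experience in benchmark.train_stream:
        strategy.train(experience)  # raises the RuntimeError above while computing importances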

Expected behavior
RNN models should train with EWC on GPU without error.

馃 Additional context

This appears to be the opposite of the behaviour of EWCPlugin as documented at:

https://avalanche-api.continualai.org/_modules/avalanche/training/plugins/#EWCPlugin

    def compute_importances(self, model, criterion, optimizer,
                            dataset, device, batch_size):
        """
        Compute EWC importance matrix for each parameter
        """

        model.train()   

Potential fix

If this is just a typo, the above would be fixed by changing model.eval() to model.train().

@iacobo added the bug (Something isn't working) label on Sep 6, 2021
@AntonioCarta
Collaborator

If this is just a typo, the above would be fixed by changing model.eval() to model.train().

It is probably better to keep the eval mode for all the modules and add an exception for RNNs. We don't want the train mode for modules such as dropout or batch normalization.
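
A minimal sketch of that idea (the helper name is an assumption, not the actual patch): call model.eval() as before, then walk the submodules and switch only the recurrent ones back to train mode.

    import torch.nn as nn

    def set_rnn_layers_to_train(model: nn.Module) -> None:
        """Keep the model in eval mode overall, but put RNN-like layers
        (RNN/LSTM/GRU, i.e. subclasses of nn.RNNBase) into train mode,
        since cuDNN only supports their backward pass in training mode."""
        model.eval()
        for module in model.modules():
            if isinstance(module, nn.RNNBase):
                module.train()

Dropout and batch normalization layers stay in eval mode, which preserves the behaviour asked for above; one caveat is that a multi-layer nn.LSTM configured with dropout > 0 would re-enable its internal dropout once switched back to train mode.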

@AntonioCarta added the Training (Related to the Training module) label on Sep 6, 2021
@iacobo
Contributor Author

iacobo commented Sep 7, 2021

@AntonioCarta in that case should the line in plugins.EWCPlugin's compute_importances be changed to also say model.eval() (like in plugins.ewc.EWCPlugin)?

@AntonioCarta
Collaborator

@AntonioCarta in that case should the line in plugins.EWCPlugin's compute_importances be changed to also say model.eval() (like in plugins.ewc.EWCPlugin)?

I agree.

@AndreaCossu
Collaborator

@iacobo thanks for reporting! Are you willing to submit a PR with the solution? In my opinion, we could raise a warning in case of RNN + CUDA, explaining the problem to the user, and then use the train mode instead of failing.
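
A rough sketch of that warn-and-switch approach (the function name and wording are assumptions; the merged fix may differ):

    import warnings
    import torch
    import torch.nn as nn

    def prepare_model_for_importances(model: nn.Module, device) -> None:
        """Default to eval mode; if the model contains RNN-like layers and
        runs on a CUDA device, warn the user and fall back to train mode so
        that loss.backward() in compute_importances does not raise."""
        model.eval()
        has_rnn = any(isinstance(m, nn.RNNBase) for m in model.modules())
        if has_rnn and torch.device(device).type == "cuda":
            warnings.warn(
                "RNN-like modules do not support backward calls while in "
                "eval mode on CUDA devices; switching to train mode while "
                "computing EWC importances."
            )
            model.train()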

@iacobo
Contributor Author

iacobo commented Sep 8, 2021

@AndreaCossu Yep sure thing.

AntonioCarta added a commit that referenced this issue Nov 2, 2021
Fix #736 (CUDA bug with RNN's using EWC)