
LearningRateMonitor KeyError with multiple parameter groups and no LR scheduler #10024

Closed
eladsegal opened this issue Oct 19, 2021 · 5 comments · Fixed by #10044

@eladsegal
Contributor

🐛 Bug

This bug is due to the change in #9786:
When an optimizer has multiple parameter groups and no learning rate scheduler, _find_names_from_optimizers returns names that already carry the /pg{n} suffix, but _get_lr_momentum_stat then appends the suffix a second time, resulting in a KeyError.
It only happens with multiple parameter groups, since _add_suffix does not add a suffix for a single group.
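
To make the double-suffixing concrete, here is a minimal sketch of how the mismatched keys come about. This is not the actual lr_monitor.py code; add_suffix below is a hypothetical stand-in for the callback's suffixing helper:

def add_suffix(name, n_groups, group_idx):
    # Hypothetical helper mirroring the behavior described above: with
    # multiple parameter groups, append "/pg{n}"; with a single group, don't.
    return f"{name}/pg{group_idx + 1}" if n_groups > 1 else name

n_groups = 2

# Setup time: stat names are registered with the suffix applied once.
lrs = {add_suffix("lr-Adam", n_groups, i): [] for i in range(n_groups)}
# lrs == {'lr-Adam/pg1': [], 'lr-Adam/pg2': []}

# Logging time: the already-suffixed name is suffixed again, so the lookup
# key no longer matches any registered name.
key = add_suffix("lr-Adam/pg1", n_groups, 0)  # 'lr-Adam/pg1/pg1'
lrs[key].append(2e-4)  # KeyError: 'lr-Adam/pg1/pg1'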

eladsegal added bug and help wanted labels Oct 19, 2021
rohitgr7 self-assigned this Oct 19, 2021
@kandluis
Contributor

We have a few workflows that are facing this issue as well. What's the expected ETA on a fix?

@tangbinh
Contributor

We also have an internal workflow failing for the same reason and expect more to follow. Here's a script that reproduces the problem described by @eladsegal:

import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 16), nn.Linear(16, 2))

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        # Multiple parameter groups with no LR scheduler -- the combination
        # that triggers the KeyError in LearningRateMonitor.
        return torch.optim.Adam([
            {'params': self.net[0].parameters(), 'lr': 2e-4},
            {'params': self.net[1].parameters(), 'lr': 1e-3}
        ])


train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

model = BoringModel()
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1,
    limit_val_batches=1,
    num_sanity_val_steps=0,
    max_epochs=1,
    callbacks=[LearningRateMonitor()]
)
trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
trainer.test(model, dataloaders=test_data)

Here's the stack trace we got after running the script on master:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-788ac4a26c1a> in <module>()
     62     callbacks=[LearningRateMonitor()]
     63 )
---> 64 trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
     65 trainer.test(model, dataloaders=test_data)

17 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/lr_monitor.py in _extract_lr(self, param_group, name)
    210     def _extract_lr(self, param_group: Dict[str, Any], name: str) -> Dict[str, Any]:
    211         lr = param_group.get("lr")
--> 212         self.lrs[name].append(lr)
    213         return {name: lr}
    214 

KeyError: 'lr-Adam/pg1/pg1'
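
Note the doubled /pg1 in the failing key: the name registered in self.lrs is lr-Adam/pg1, but the lookup suffixes it again, matching the double-suffixing described above.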

@rohitgr7
Contributor

thank you guys for raising this..
ETA: today (most probably) :)

@rohitgr7
Contributor

hey!
created a fix here: #10044
can anyone confirm if it's working for them now?
thanks :)

@eladsegal
Contributor Author

Hey, I can confirm the fix works.
Thanks!
