This repository has been archived by the owner on Nov 21, 2022. It is now read-only.

Question answering example throws an exception even if sanity check is skipped #233

Closed
Pointy-Hat opened this issue Feb 22, 2022 · 10 comments
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)

Comments

@Pointy-Hat

🐛 Bug

Running the SQuAD example python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0 throws an exception while finalizing training. This is not a duplicate of #218.

To Reproduce

Steps to reproduce the behavior:

  1. Run python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
  2. See error
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████▉| 12442/12445 [44:35<00:00,  4.65it/s, loss=0.957
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']
Traceback (most recent call last):
  File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
    main(cfg)
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
    run(
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
    trainer.fit(model, datamodule=data_module)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
    self.trainer.call_hook(hook_name)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
    metric_dict = self.metric.compute()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
    value = compute(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
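
For context, line 23 of lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py inverts a mapping from example-id strings to integer indices and then resolves each stored index back to its string id. The minimal sketch below mirrors that pattern; the surrounding names (example_id_strings, the single stored index) are assumptions inferred from the traceback and the discussion below, not the exact module code. It shows why an empty mapping fails immediately with KeyError: 0:

```python
import torch

# Hypothetical stand-ins for the metric state referenced in the traceback:
example_id_strings = {}             # maps example-id string -> integer index; empty in the bug
example_ids = [torch.tensor(0)]     # integer indices accumulated as metric state during validation

# Invert the mapping so stored integer indices can be resolved back to id strings
reverse_lookup = {v: k for k, v in example_id_strings.items()}

# With an empty mapping, the very first lookup fails exactly as in the traceback
example_ids = [reverse_lookup[i.item()] for i in example_ids]   # raises KeyError: 0
```

In other words, the KeyError is a symptom: the lookup table that should have been filled during validation is empty by the time compute() runs.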

Environment

  • PyTorch Version: 1.6.0
  • OS: Ubuntu 18.04.6 LTS
  • How you installed PyTorch: conda
  • Python version: 3.9.7
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
  • Any other relevant information: The same error occurs during the sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184 (AssertionError when running QA example command).
@Pointy-Hat added the bug / fix and help wanted labels on Feb 22, 2022
@mariomeissner
Contributor

Strangely, I got the KeyError: 0 at some point earlier today without using trainer.num_sanity_val_steps=0, but I haven't been able to reproduce it, nor do I get it when adding trainer.num_sanity_val_steps=0 as you say. Could caching be involved?

@mariomeissner
Contributor

Ah, never mind, this happens at the evaluation step, so we have to let it finish training the epoch first. I can confirm I see this error too.

@mariomeissner
Contributor

mariomeissner commented Mar 19, 2022

self.example_id_strings seems to be empty at the time we use it to create reverse_lookup, which will also be empty.
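
If that diagnosis is right, one way to surface the problem earlier than a bare KeyError would be an explicit check when building the lookup table. This is only an illustrative sketch under that assumption; the helper name is hypothetical and it is not claimed to match what PR #235 actually does:

```python
def resolve_example_ids(example_id_strings, example_ids):
    """Hypothetical helper mirroring the failing lines of squad/metric.py,
    with an explicit error instead of a bare KeyError."""
    reverse_lookup = {v: k for k, v in example_id_strings.items()}
    if not reverse_lookup:
        raise RuntimeError(
            "example_id_strings is empty at compute() time; "
            "the metric state was never populated during validation."
        )
    return [reverse_lookup[i.item()] for i in example_ids]
```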

@mariomeissner
Contributor

I've attempted to fix this issue in PR #235.

@Borda
Member

Borda commented Apr 10, 2022

@SeanNaren ^^

The stale bot added the wontfix (This will not be worked on) label on Jun 12, 2022
@mariomeissner
Contributor

Bad bot.

Strangely, I can't close this issue myself?

@SeanNaren removed the wontfix label on Jun 23, 2022
@SeanNaren
Contributor

The QA task is really broken... I don't have time to debug it, but if anyone can help, it would be appreciated!

@Borda
Member

Borda commented Sep 14, 2022

@mariomeissner, would you be interested in diving in and debugging this issue?

@mariomeissner
Contributor

I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

@Borda
Member

Borda commented Nov 7, 2022

> I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

I'd say the best would be to just check it out :)

@Lightning-Universe deleted a comment from the stale bot on Nov 21, 2022
@Borda closed this as completed on Nov 21, 2022