This repository has been archived by the owner on Nov 21, 2022. It is now read-only.

Question answering example throws an exception even if sanity check is skipped #233

Closed
Pointy-Hat opened this issue Feb 22, 2022 · 10 comments
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)

Comments

@Pointy-Hat

🐛 Bug

Running the SQuAD example python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0 throws an exception while finalizing training. This is not a duplicate of #218.

To Reproduce

Steps to reproduce the behavior:

  1. Run python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
  2. See error
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████▉| 12442/12445 [44:35<00:00,  4.65it/s, loss=0.957
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']
Traceback (most recent call last):
  File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
    main(cfg)
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
    run(
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
    trainer.fit(model, datamodule=data_module)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
    self.trainer.call_hook(hook_name)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
    metric_dict = self.metric.compute()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
    value = compute(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
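
For context, line 23 of lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py inverts a mapping from example-id strings to integer indices and then resolves each stored index back to its string id. The minimal sketch below mirrors that pattern; the surrounding names (example_id_strings, the single stored index) are assumptions inferred from the traceback and the discussion below, not the exact module code. It shows why an empty mapping fails immediately with KeyError: 0:

```python
import torch

# Hypothetical stand-ins for the metric state referenced in the traceback:
example_id_strings = {}             # maps example-id string -> integer index; empty in the bug
example_ids = [torch.tensor(0)]     # integer indices accumulated as metric state during validation

# Invert the mapping so stored integer indices can be resolved back to id strings
reverse_lookup = {v: k for k, v in example_id_strings.items()}

# With an empty mapping, the very first lookup fails exactly as in the traceback
example_ids = [reverse_lookup[i.item()] for i in example_ids]   # raises KeyError: 0
```

In other words, the KeyError is a symptom: the lookup table that should have been filled during validation is empty by the time compute() runs.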

Environment

  • PyTorch Version: 1.6.0
  • OS: Ubuntu 18.04.6 LTS
  • How you installed PyTorch: conda
  • Python version: 3.9.7
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
  • Any other relevant information: The same error occurs during the sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184 (AssertionError when running QA example command).
@Pointy-Hat added the bug / fix and help wanted labels on Feb 22, 2022
@mariomeissner
Contributor

Strangely, I got the KeyError: 0 at some point earlier today without using trainer.num_sanity_val_steps=0, but I haven't been able to reproduce it, nor do I get it when adding trainer.num_sanity_val_steps=0 as you say. Could caching be involved?

@mariomeissner
Contributor

Ah, never mind, this happens at the evaluation step, so we have to let it finish training the epoch first. I can confirm I see this error too.

@mariomeissner
Contributor

mariomeissner commented Mar 19, 2022

self.example_id_strings seems to be empty at the time we use it to create reverse_lookup, which will also be empty.
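
If that diagnosis is right, one way to surface the problem earlier than a bare KeyError would be an explicit check when building the lookup table. This is only an illustrative sketch under that assumption; the helper name is hypothetical and it is not claimed to match what PR #235 actually does:

```python
def resolve_example_ids(example_id_strings, example_ids):
    """Hypothetical helper mirroring the failing lines of squad/metric.py,
    with an explicit error instead of a bare KeyError."""
    reverse_lookup = {v: k for k, v in example_id_strings.items()}
    if not reverse_lookup:
        raise RuntimeError(
            "example_id_strings is empty at compute() time; "
            "the metric state was never populated during validation."
        )
    return [reverse_lookup[i.item()] for i in example_ids]
```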

@mariomeissner
Contributor

I've attempted to fix this issue in PR #235.

@Borda
Member

Borda commented Apr 10, 2022

@SeanNaren ^^

The stale bot added the wontfix (This will not be worked on) label on Jun 12, 2022
@mariomeissner
Contributor

Bad bot.

Strangely, I can't close this issue myself?

@SeanNaren removed the wontfix label on Jun 23, 2022
@SeanNaren
Contributor

The QA task is really broken... I don't have time to debug it, but if anyone can help, it would be appreciated!

@Borda
Member

Borda commented Sep 14, 2022

@mariomeissner, would you be interested in diving in and debugging this issue?

@mariomeissner
Contributor

I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

@Borda
Member

Borda commented Nov 7, 2022

> I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction 😄

I'd say the best would be to just check it out :)

@Lightning-Universe deleted a comment from the stale bot on Nov 21, 2022
@Borda closed this as completed on Nov 21, 2022