
wmt14-en-fr deadlock issue #1485

Open
ayulockin opened this issue Feb 27, 2024 · 5 comments
Labels
bug Something isn't working.

Comments

@ayulockin
Contributor

While running evaluation on this task, the program gets stuck indefinitely during the ter metric computation.

The command:

lm_eval --model hf --model_args pretrained=microsoft/phi-2,trust_remote_code=True --tasks wmt14-en-fr --device cuda:0 --batch_size 32 --output_path output/phi-2-mmlu-arc --wandb_args project=lm-eval-harness-integration --log_samples

The stdout on a VM with A100:

2024-02-27 14:43:36.170433: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-27 14:43:37.148896: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.3
wandb: Run data is saved locally in /home/ubuntu/ayusht-a100/lm-eval/llm-leaderboard-fr-de/wandb/run-20240227_144346-ptzgnbdv
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run lucky-fire-22
wandb: ⭐️ View project at https://wandb.ai/ayut/lm-eval-harness-integration
wandb: 🚀 View run at https://wandb.ai/ayut/lm-eval-harness-integration/runs/ptzgnbdv
2024-02-27:14:43:54,135 INFO     [__main__.py:209] Verbosity set to INFO
2024-02-27:14:43:54,136 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-27:14:44:06,445 WARNING  [__main__.py:271] File already exists at output/phi-2-mmlu-arc. Results will be overwritten.
2024-02-27:14:44:06,445 INFO     [__main__.py:285] Selected Tasks: ['wmt14-en-fr']
2024-02-27:14:44:06,445 INFO     [__main__.py:286] Loading selected tasks...
2024-02-27:14:44:06,445 INFO     [evaluator.py:95] Setting random seed to 0
2024-02-27:14:44:06,446 INFO     [evaluator.py:99] Setting numpy seed to 1234
2024-02-27:14:44:06,446 INFO     [evaluator.py:103] Setting torch manual seed to 1234
2024-02-27:14:44:06,476 INFO     [huggingface.py:161] Using device 'cuda:0'
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.12s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-27:14:44:10,103 INFO     [evaluator.py:150] get_task_dict has been updated to accept an optional argument, `task_manager`Read more here:https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric bleu is defined, but aggregation is not. using default aggregation=bleu
2024-02-27:14:44:10,111 WARNING  [task.py:676] [Task: wmt14-en-fr] metric bleu is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric ter is defined, but aggregation is not. using default aggregation=ter
2024-02-27:14:44:10,111 WARNING  [task.py:676] [Task: wmt14-en-fr] metric ter is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric chrf is defined, but aggregation is not. using default aggregation=chrf
2024-02-27:14:44:10,112 WARNING  [task.py:676] [Task: wmt14-en-fr] metric chrf is defined, but higher_is_better is not. using default higher_is_better=True
/home/ubuntu/.local/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for wmt14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wmt14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 43.98it/s]
2024-02-27:14:44:12,698 INFO     [task.py:361] Building contexts for wmt14-en-fr on rank 0...
2024-02-27:14:44:14,516 INFO     [evaluator.py:369] Running generate_until requests
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3003/3003 [04:32<00:00, 11.01it/s]
/home/ubuntu/miniconda3/envs/fr-de-lb/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
bootstrapping for stddev: bleu
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:04<00:00, 64.08s/it]
bootstrapping for stddev: ter
  0%|                                                                                                                                                                                                                          | 0/1 [00:00<?, ?it/s]

Stdout on a VM with V100:

wandb: Currently logged in as: ayut. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.3
wandb: Run data is saved locally in /home/ayushthakur/lm-eval/llm-leaderboard-fr-de/wandb/run-20240227_143821-90dkhfc5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run morning-thunder-20
wandb: ⭐️ View project at https://wandb.ai/ayut/lm-eval-harness-integration
wandb: 🚀 View run at https://wandb.ai/ayut/lm-eval-harness-integration/runs/90dkhfc5
2024-02-27:14:38:22,311 INFO     [__main__.py:209] Verbosity set to INFO
2024-02-27:14:38:22,311 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-27:14:38:27,663 WARNING  [__main__.py:271] File already exists at output/phi-2-mmlu-arc. Results will be overwritten.
2024-02-27:14:38:27,663 INFO     [__main__.py:285] Selected Tasks: ['wmt14-en-fr']
2024-02-27:14:38:27,663 INFO     [__main__.py:286] Loading selected tasks...
2024-02-27:14:38:27,664 INFO     [evaluator.py:95] Setting random seed to 0
2024-02-27:14:38:27,664 INFO     [evaluator.py:99] Setting numpy seed to 1234
2024-02-27:14:38:27,664 INFO     [evaluator.py:103] Setting torch manual seed to 1234
2024-02-27:14:38:27,691 WARNING  [logging.py:61] Detected kernel version 4.19.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-02-27:14:38:27,692 INFO     [huggingface.py:161] Using device 'cuda:0'
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:02<00:00,  1.13s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-27:14:38:31,105 INFO     [evaluator.py:150] get_task_dict has been updated to accept an optional argument, `task_manager`Read more here:https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-02-27:14:38:31,109 WARNING  [task.py:664] [Task: wmt14-en-fr] metric bleu is defined, but aggregation is not. using default aggregation=bleu
2024-02-27:14:38:31,109 WARNING  [task.py:676] [Task: wmt14-en-fr] metric bleu is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:38:31,109 WARNING  [task.py:664] [Task: wmt14-en-fr] metric ter is defined, but aggregation is not. using default aggregation=ter
2024-02-27:14:38:31,109 WARNING  [task.py:676] [Task: wmt14-en-fr] metric ter is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:38:31,110 WARNING  [task.py:664] [Task: wmt14-en-fr] metric chrf is defined, but aggregation is not. using default aggregation=chrf
2024-02-27:14:38:31,110 WARNING  [task.py:676] [Task: wmt14-en-fr] metric chrf is defined, but higher_is_better is not. using default higher_is_better=True
/opt/conda/envs/fr-de-lb/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for wmt14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wmt14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Loading dataset shards: 100%|████████████████████████████████| 30/30 [00:00<00:00, 34.35it/s]
2024-02-27:14:38:33,979 INFO     [task.py:361] Building contexts for wmt14-en-fr on rank 0...
2024-02-27:14:38:36,089 INFO     [evaluator.py:369] Running generate_until requests
100%|████████████████████████████████████████████████████| 3003/3003 [16:42<00:00,  3.00it/s]
bootstrapping for stddev: bleu
100%|██████████████████████████████████████████████████████████| 1/1 [01:10<00:00, 70.65s/it]
bootstrapping for stddev: ter
  0%|                                                                  | 0/1 [00:00<?, ?it/s]

cc: @haileyschoelkopf

@afcruzs

afcruzs commented Mar 17, 2024

Did you find a workaround yet @ayulockin ?

@haileyschoelkopf
Contributor

A temporary workaround would be to disable bootstrapping fully in the case of the translation tasks' metrics!
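For readers hitting this, a minimal sketch of what "disable bootstrapping" amounts to (all names here are hypothetical stand-ins, not the harness's actual helpers): the stderr estimate comes from a bootstrap resampling loop, and skipping that loop when the iteration count is non-positive avoids the slow path while still reporting the metric itself.

```python
import random
from typing import Callable, Optional, Sequence

def bootstrap_stderr(metric: Callable[[Sequence[float]], float],
                     samples: Sequence[float],
                     iters: int) -> Optional[float]:
    """Estimate the stderr of `metric` by bootstrap resampling.

    Returning None when iters <= 0 skips the (potentially very slow)
    resampling loop entirely -- the workaround discussed in this thread.
    """
    if iters <= 0:
        return None
    rng = random.Random(1234)
    estimates = [
        metric([rng.choice(samples) for _ in samples])
        for _ in range(iters)
    ]
    mean = sum(estimates) / iters
    var = sum((e - mean) ** 2 for e in estimates) / iters
    return var ** 0.5

# With iters <= 0 the metric is still computed once elsewhere,
# but no stderr is reported.
scores = [0.1, 0.4, 0.35, 0.8]
assert bootstrap_stderr(lambda xs: sum(xs) / len(xs), scores, 0) is None
```

The metric value itself is unaffected; only the stderr column disappears from the results table.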

@haileyschoelkopf added the bug label Mar 18, 2024
@afcruzs

afcruzs commented Mar 18, 2024

Indeed, that worked for me :) I'm on a somewhat old version of the codebase, so I had to comment out the calls to stderr_for_metric and skip logging the stderr. On the current code, I think passing bootstrap_iters as None here would do it, although it's not obvious how to do that from the CLI.

@afcruzs

afcruzs commented Mar 18, 2024

Also, I looked a bit into my own (rather old) version of the code and found this commit: 82ec4f5, so it looks like this may have been a problem already.

I also noticed that some of my evals took many hours over the weekend to complete the bootstrap step. So I suspect it's just bleu/ter taking a very long time (likely more in some models than others, depending on how long the completions are), and it might not even be a hang.
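That reading is consistent with how bootstrap cost behaves: total time is roughly iterations × per-resample metric cost, and a corpus-level metric re-scores the whole resampled corpus on every iteration. A toy sketch (the metric here is a cheap stand-in, not real TER):

```python
import random

def toy_corpus_metric(samples):
    # Stand-in for a corpus-level metric whose cost grows with corpus size,
    # the way TER's per-pair edit-distance alignment does.
    return sum(len(s) for s in samples) / len(samples)

def bootstrap(metric, samples, iters, seed=1234):
    # Each iteration re-scores a full resampled corpus, so the loop costs
    # iters x (one metric evaluation over len(samples) items).
    rng = random.Random(seed)
    return [metric(rng.choices(samples, k=len(samples))) for _ in range(iters)]

# 3003 translations (the wmt14-en-fr test set size) times many iterations
# adds up, and longer completions make each metric call slower still.
estimates = bootstrap(toy_corpus_metric, ["le chat est noir"] * 3003, iters=10)
assert len(estimates) == 10
```

With an edit-distance metric in place of `toy_corpus_metric`, each resample is quadratic in sentence length, which would explain hour-long "bootstrapping for stddev: ter" steps without any actual deadlock.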

@haileyschoelkopf
Contributor

Ah, good point. Yes, they already have their bootstrap iterations lowered compared to other metrics, I believe for this reason. Will make it a todo to see why they take this long...

Re: disabling bootstrapping from the CLI, we should add a CLI flag --bootstrap_iters which disables stderrs if <=0.
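A sketch of how such a flag could look (an illustration with argparse, not the harness's actual parser; the 100000 default is an assumption):

```python
import argparse

def build_parser():
    # Hypothetical parser fragment for the proposed flag.
    parser = argparse.ArgumentParser(description="sketch of the proposed flag")
    parser.add_argument(
        "--bootstrap_iters",
        type=int,
        default=100000,
        help="Bootstrap iterations for stderr estimation; <= 0 disables stderrs.",
    )
    return parser

args = build_parser().parse_args(["--bootstrap_iters", "0"])
# Downstream code would skip the stderr computation entirely:
stderrs_enabled = args.bootstrap_iters > 0
assert stderrs_enabled is False
```

Keeping it a single integer flag (rather than a separate boolean) lets users both disable stderrs and tune the iteration count for expensive metrics.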

Projects
Status: Backlog
Development

No branches or pull requests

3 participants