
wmt14-en-fr deadlock issue #1485

Open
ayulockin opened this issue Feb 27, 2024 · 5 comments
Labels
bug Something isn't working.

Comments

@ayulockin
Contributor

While running evaluation on this task, the program gets stuck indefinitely during the ter metric computation.

The command:

lm_eval --model hf --model_args pretrained=microsoft/phi-2,trust_remote_code=True --tasks wmt14-en-fr --device cuda:0 --batch_size 32 --output_path output/phi-2-mmlu-arc --wandb_args project=lm-eval-harness-integration --log_samples

The stdout on a VM with A100:

2024-02-27 14:43:36.170433: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-27 14:43:37.148896: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.3
wandb: Run data is saved locally in /home/ubuntu/ayusht-a100/lm-eval/llm-leaderboard-fr-de/wandb/run-20240227_144346-ptzgnbdv
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run lucky-fire-22
wandb: ⭐️ View project at https://wandb.ai/ayut/lm-eval-harness-integration
wandb: 🚀 View run at https://wandb.ai/ayut/lm-eval-harness-integration/runs/ptzgnbdv
2024-02-27:14:43:54,135 INFO     [__main__.py:209] Verbosity set to INFO
2024-02-27:14:43:54,136 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-27:14:44:06,445 WARNING  [__main__.py:271] File already exists at output/phi-2-mmlu-arc. Results will be overwritten.
2024-02-27:14:44:06,445 INFO     [__main__.py:285] Selected Tasks: ['wmt14-en-fr']
2024-02-27:14:44:06,445 INFO     [__main__.py:286] Loading selected tasks...
2024-02-27:14:44:06,445 INFO     [evaluator.py:95] Setting random seed to 0
2024-02-27:14:44:06,446 INFO     [evaluator.py:99] Setting numpy seed to 1234
2024-02-27:14:44:06,446 INFO     [evaluator.py:103] Setting torch manual seed to 1234
2024-02-27:14:44:06,476 INFO     [huggingface.py:161] Using device 'cuda:0'
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.12s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-27:14:44:10,103 INFO     [evaluator.py:150] get_task_dict has been updated to accept an optional argument, `task_manager`Read more here:https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric bleu is defined, but aggregation is not. using default aggregation=bleu
2024-02-27:14:44:10,111 WARNING  [task.py:676] [Task: wmt14-en-fr] metric bleu is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric ter is defined, but aggregation is not. using default aggregation=ter
2024-02-27:14:44:10,111 WARNING  [task.py:676] [Task: wmt14-en-fr] metric ter is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:44:10,111 WARNING  [task.py:664] [Task: wmt14-en-fr] metric chrf is defined, but aggregation is not. using default aggregation=chrf
2024-02-27:14:44:10,112 WARNING  [task.py:676] [Task: wmt14-en-fr] metric chrf is defined, but higher_is_better is not. using default higher_is_better=True
/home/ubuntu/.local/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for wmt14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wmt14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 43.98it/s]
2024-02-27:14:44:12,698 INFO     [task.py:361] Building contexts for wmt14-en-fr on rank 0...
2024-02-27:14:44:14,516 INFO     [evaluator.py:369] Running generate_until requests
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3003/3003 [04:32<00:00, 11.01it/s]
/home/ubuntu/miniconda3/envs/fr-de-lb/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
bootstrapping for stddev: bleu
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:04<00:00, 64.08s/it]
bootstrapping for stddev: ter
  0%|                                                                                                                                                                                                                          | 0/1 [00:00<?, ?it/s]

Stdout on a VM with V100:

wandb: Currently logged in as: ayut. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.3
wandb: Run data is saved locally in /home/ayushthakur/lm-eval/llm-leaderboard-fr-de/wandb/run-20240227_143821-90dkhfc5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run morning-thunder-20
wandb: ⭐️ View project at https://wandb.ai/ayut/lm-eval-harness-integration
wandb: 🚀 View run at https://wandb.ai/ayut/lm-eval-harness-integration/runs/90dkhfc5
2024-02-27:14:38:22,311 INFO     [__main__.py:209] Verbosity set to INFO
2024-02-27:14:38:22,311 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-27:14:38:27,663 WARNING  [__main__.py:271] File already exists at output/phi-2-mmlu-arc. Results will be overwritten.
2024-02-27:14:38:27,663 INFO     [__main__.py:285] Selected Tasks: ['wmt14-en-fr']
2024-02-27:14:38:27,663 INFO     [__main__.py:286] Loading selected tasks...
2024-02-27:14:38:27,664 INFO     [evaluator.py:95] Setting random seed to 0
2024-02-27:14:38:27,664 INFO     [evaluator.py:99] Setting numpy seed to 1234
2024-02-27:14:38:27,664 INFO     [evaluator.py:103] Setting torch manual seed to 1234
2024-02-27:14:38:27,691 WARNING  [logging.py:61] Detected kernel version 4.19.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-02-27:14:38:27,692 INFO     [huggingface.py:161] Using device 'cuda:0'
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:02<00:00,  1.13s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-27:14:38:31,105 INFO     [evaluator.py:150] get_task_dict has been updated to accept an optional argument, `task_manager`Read more here:https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-02-27:14:38:31,109 WARNING  [task.py:664] [Task: wmt14-en-fr] metric bleu is defined, but aggregation is not. using default aggregation=bleu
2024-02-27:14:38:31,109 WARNING  [task.py:676] [Task: wmt14-en-fr] metric bleu is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:38:31,109 WARNING  [task.py:664] [Task: wmt14-en-fr] metric ter is defined, but aggregation is not. using default aggregation=ter
2024-02-27:14:38:31,109 WARNING  [task.py:676] [Task: wmt14-en-fr] metric ter is defined, but higher_is_better is not. using default higher_is_better=True
2024-02-27:14:38:31,110 WARNING  [task.py:664] [Task: wmt14-en-fr] metric chrf is defined, but aggregation is not. using default aggregation=chrf
2024-02-27:14:38:31,110 WARNING  [task.py:676] [Task: wmt14-en-fr] metric chrf is defined, but higher_is_better is not. using default higher_is_better=True
/opt/conda/envs/fr-de-lb/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for wmt14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wmt14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Loading dataset shards: 100%|████████████████████████████████| 30/30 [00:00<00:00, 34.35it/s]
2024-02-27:14:38:33,979 INFO     [task.py:361] Building contexts for wmt14-en-fr on rank 0...
2024-02-27:14:38:36,089 INFO     [evaluator.py:369] Running generate_until requests
100%|████████████████████████████████████████████████████| 3003/3003 [16:42<00:00,  3.00it/s]
bootstrapping for stddev: bleu
100%|██████████████████████████████████████████████████████████| 1/1 [01:10<00:00, 70.65s/it]
bootstrapping for stddev: ter
  0%|                                                                  | 0/1 [00:00<?, ?it/s]

cc: @haileyschoelkopf

@afcruzs

afcruzs commented Mar 17, 2024

Did you find a workaround yet @ayulockin ?

@haileyschoelkopf
Contributor

A temporary workaround would be to disable bootstrapping fully in the case of the translation tasks' metrics!
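For readers hitting this, a minimal sketch of what "disable bootstrapping" amounts to (all names here are hypothetical stand-ins, not the harness's actual helpers): the stderr estimate comes from a bootstrap resampling loop, and skipping that loop when the iteration count is non-positive avoids the slow path while still reporting the metric itself.

```python
import random
from typing import Callable, Optional, Sequence

def bootstrap_stderr(metric: Callable[[Sequence[float]], float],
                     samples: Sequence[float],
                     iters: int) -> Optional[float]:
    """Estimate the stderr of `metric` by bootstrap resampling.

    Returning None when iters <= 0 skips the (potentially very slow)
    resampling loop entirely -- the workaround discussed in this thread.
    """
    if iters <= 0:
        return None
    rng = random.Random(1234)
    estimates = [
        metric([rng.choice(samples) for _ in samples])
        for _ in range(iters)
    ]
    mean = sum(estimates) / iters
    var = sum((e - mean) ** 2 for e in estimates) / iters
    return var ** 0.5

# With iters <= 0 the metric is still computed once elsewhere,
# but no stderr is reported.
scores = [0.1, 0.4, 0.35, 0.8]
assert bootstrap_stderr(lambda xs: sum(xs) / len(xs), scores, 0) is None
```

The metric value itself is unaffected; only the stderr column disappears from the results table.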

@haileyschoelkopf added the bug label Mar 18, 2024
@afcruzs

afcruzs commented Mar 18, 2024

Indeed, that worked for me :) I'm on a somewhat old version of the codebase, so I had to comment out the calls to stderr_for_metric and skip logging the stderr. On the current code, I think passing bootstrap_iters as None here would do it, although it's not obvious how to do that from the CLI.

@afcruzs

afcruzs commented Mar 18, 2024

Also, I looked a bit into my own (rather old) version of the code and found this commit: 82ec4f5, so it looks like this may have been a problem already.

I also noticed that some of my evals took many hours over the weekend to complete the bootstrap step. So I suspect it's just bleu/ter taking a very long time (likely more in some models than others, depending on how long the completions are), and it might not even be a hang.
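That reading is consistent with how bootstrap cost behaves: total time is roughly iterations × per-resample metric cost, and a corpus-level metric re-scores the whole resampled corpus on every iteration. A toy sketch (the metric here is a cheap stand-in, not real TER):

```python
import random

def toy_corpus_metric(samples):
    # Stand-in for a corpus-level metric whose cost grows with corpus size,
    # the way TER's per-pair edit-distance alignment does.
    return sum(len(s) for s in samples) / len(samples)

def bootstrap(metric, samples, iters, seed=1234):
    # Each iteration re-scores a full resampled corpus, so the loop costs
    # iters x (one metric evaluation over len(samples) items).
    rng = random.Random(seed)
    return [metric(rng.choices(samples, k=len(samples))) for _ in range(iters)]

# 3003 translations (the wmt14-en-fr test set size) times many iterations
# adds up, and longer completions make each metric call slower still.
estimates = bootstrap(toy_corpus_metric, ["le chat est noir"] * 3003, iters=10)
assert len(estimates) == 10
```

With an edit-distance metric in place of `toy_corpus_metric`, each resample is quadratic in sentence length, which would explain hour-long "bootstrapping for stddev: ter" steps without any actual deadlock.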

@haileyschoelkopf
Contributor

Ah, good point. Yes, they already have their bootstrap iterations lowered compared to other metrics, I believe for this reason. Will make it a todo to see why they take this long...

Re: disabling bootstrapping from the CLI, we should add a CLI flag --bootstrap_iters which disables stderrs if <=0.
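A sketch of how such a flag could look (an illustration with argparse, not the harness's actual parser; the 100000 default is an assumption):

```python
import argparse

def build_parser():
    # Hypothetical parser fragment for the proposed flag.
    parser = argparse.ArgumentParser(description="sketch of the proposed flag")
    parser.add_argument(
        "--bootstrap_iters",
        type=int,
        default=100000,
        help="Bootstrap iterations for stderr estimation; <= 0 disables stderrs.",
    )
    return parser

args = build_parser().parse_args(["--bootstrap_iters", "0"])
# Downstream code would skip the stderr computation entirely:
stderrs_enabled = args.bootstrap_iters > 0
assert stderrs_enabled is False
```

Keeping it a single integer flag (rather than a separate boolean) lets users both disable stderrs and tune the iteration count for expensive metrics.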

Projects
Status: Backlog
Development

No branches or pull requests

3 participants