Cache problem while running on multiple nodes with GPU #30859
Comments
@yuane4 can you please share your entire script?
yes of course, here is the script. I made a small adjustment to the original script, because the server I use for my training provides its own library to manage parallelisation, called idr_torch; you can find the added lines in the main function:
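For reference, the kind of adjustment mentioned above usually amounts to reading the process layout from SLURM's environment variables (idr_torch reportedly exposes equivalent `rank`/`local_rank`/`size` attributes). A minimal sketch, assuming the script is launched with srun; the function name `slurm_dist_env` is mine, not from the original script:

```python
import os

def slurm_dist_env():
    """Read the distributed-training layout from SLURM environment
    variables, similarly to what idr_torch exposes on Jean Zay."""
    rank = int(os.environ.get("SLURM_PROCID", 0))        # global rank
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))  # rank within the node
    world_size = int(os.environ.get("SLURM_NTASKS", 1))   # total number of tasks
    return rank, local_rank, world_size
```

These values can then feed `torch.distributed.init_process_group` and the `cuda:{local_rank}` device selection seen in the log below.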
System Info
Hi,
I am currently trying to use the script run_mlm_wwm.py to perform continual pretraining on the whole word masking task with a BERT model. The problem occurs when I try to use multiple GPUs: as soon as the number of GPUs forces my server to spread the job over several nodes, I get an error.
I think there is a locking problem linked to the parallel file system when I go over several nodes. I would have to set a different cache for each process (for example by adding the global rank), or for each process on the same node, but I have not succeeded in doing this so far.
Here is the error message I get:
`Loading pytorch-gpu/py3/2.1.1
Loading requirement: cuda/11.8.0 nccl/2.18.5-1-cuda cudnn/8.7.0.84-cuda
gcc/8.5.0 openmpi/4.1.5-cuda intel-mkl/2020.4 magma/2.7.1-cuda sox/14.4.2
sparsehash/2.0.3 libjpeg-turbo/2.1.3 ffmpeg/4.4.4
srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 18 with the number of requested nodes 3. Ignoring --ntasks-per-node.
0: comet_ml is installed but COMET_API_KEY is not set.
6: comet_ml is installed but COMET_API_KEY is not set.
7: comet_ml is installed but COMET_API_KEY is not set.
8: comet_ml is installed but COMET_API_KEY is not set.
9: comet_ml is installed but COMET_API_KEY is not set.
10: comet_ml is installed but COMET_API_KEY is not set.
11: comet_ml is installed but COMET_API_KEY is not set.
12: comet_ml is installed but COMET_API_KEY is not set.
13: comet_ml is installed but COMET_API_KEY is not set.
14: comet_ml is installed but COMET_API_KEY is not set.
15: comet_ml is installed but COMET_API_KEY is not set.
16: comet_ml is installed but COMET_API_KEY is not set.
17: comet_ml is installed but COMET_API_KEY is not set.
3: comet_ml is installed but COMET_API_KEY is not set.
4: comet_ml is installed but COMET_API_KEY is not set.
5: comet_ml is installed but COMET_API_KEY is not set.
1: comet_ml is installed but COMET_API_KEY is not set.
2: comet_ml is installed but COMET_API_KEY is not set.
2: 05/04/2024 06:14:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
1: 05/04/2024 06:14:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
3: 05/04/2024 06:14:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
4: 05/04/2024 06:14:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
5: 05/04/2024 06:14:12 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
14: 05/04/2024 06:14:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
15: 05/04/2024 06:14:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
16: 05/04/2024 06:14:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
12: 05/04/2024 06:14:12 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
13: 05/04/2024 06:14:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
17: 05/04/2024 06:14:12 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
12: 05/04/2024 06:14:12 - INFO - main - Training/evaluation parameters TrainingArguments(
12: _n_gpu=1,
12: adafactor=False,
12: adam_beta1=0.9,
12: adam_beta2=0.999,
12: adam_epsilon=1e-08,
12: auto_find_batch_size=False,
12: bf16=False,
12: bf16_full_eval=False,
12: data_seed=None,
12: dataloader_drop_last=False,
12: dataloader_num_workers=0,
12: dataloader_pin_memory=True,
12: ddp_backend=None,
12: ddp_broadcast_buffers=None,
12: ddp_bucket_cap_mb=None,
12: ddp_find_unused_parameters=False,
12: ddp_timeout=600,
12: debug=[],
12: deepspeed=None,
12: disable_tqdm=False,
12: dispatch_batches=None,
12: do_eval=True,
12: do_predict=False,
12: do_train=True,
12: eval_accumulation_steps=None,
12: eval_delay=0,
12: eval_steps=None,
12: evaluation_strategy=IntervalStrategy.NO,
12: fp16=True,
12: fp16_backend=auto,
12: fp16_full_eval=False,
12: fp16_opt_level=O1,
12: fsdp=[],
12: fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
12: fsdp_min_num_params=0,
12: fsdp_transformer_layer_cls_to_wrap=None,
12: full_determinism=False,
12: gradient_accumulation_steps=1,
12: gradient_checkpointing=False,
12: gradient_checkpointing_kwargs=None,
12: greater_is_better=None,
12: group_by_length=False,
12: half_precision_backend=auto,
12: hub_always_push=False,
12: hub_model_id=None,
12: hub_private_repo=False,
12: hub_strategy=HubStrategy.EVERY_SAVE,
12: hub_token=<HUB_TOKEN>,
12: ignore_data_skip=False,
12: include_inputs_for_metrics=False,
12: include_tokens_per_second=False,
12: jit_mode_eval=False,
12: label_names=None,
12: label_smoothing_factor=0.0,
12: learning_rate=0.0001,
12: length_column_name=length,
12: load_best_model_at_end=False,
12: local_rank=0,
12: log_level=info,
12: log_level_replica=warning,
12: log_on_each_node=True,
12: logging_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm/runs/May04_06-14-12_jean-zay-iam07,
12: logging_first_step=True,
12: logging_nan_inf_filter=True,
12: logging_steps=500,
12: logging_strategy=IntervalStrategy.STEPS,
12: lr_scheduler_type=SchedulerType.LINEAR,
12: max_grad_norm=1.0,
12: max_steps=-1,
12: metric_for_best_model=None,
12: mp_parameters=,
12: neftune_noise_alpha=None,
12: no_cuda=False,
12: num_train_epochs=6.0,
12: optim=OptimizerNames.ADAMW_TORCH,
12: optim_args=None,
12: output_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
12: overwrite_output_dir=True,
12: past_index=-1,
12: per_device_eval_batch_size=8,
12: per_device_train_batch_size=96,
12: prediction_loss_only=False,
12: push_to_hub=False,
12: push_to_hub_model_id=None,
12: push_to_hub_organization=None,
12: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
12: ray_scope=last,
12: remove_unused_columns=True,
12: report_to=['tensorboard'],
12: resume_from_checkpoint=None,
12: run_name=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
12: save_on_each_node=False,
12: save_safetensors=True,
12: save_steps=500,
12: save_strategy=IntervalStrategy.EPOCH,
12: save_total_limit=None,
12: seed=42,
12: skip_memory_metrics=False,
12: split_batches=False,
12: tf32=None,
12: torch_compile=False,
12: torch_compile_backend=None,
12: torch_compile_mode=None,
12: torchdynamo=None,
12: tpu_metrics_debug=False,
12: tpu_num_cores=None,
12: use_cpu=False,
12: use_ipex=False,
12: use_legacy_prediction_loop=False,
12: use_mps_device=False,
12: warmup_ratio=0.0,
12: warmup_steps=10000,
12: weight_decay=0.0,
12: )
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 21454.24it/s]
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 11259.88it/s]
0: 05/04/2024 06:14:13 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
7: 05/04/2024 06:14:13 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
10: 05/04/2024 06:14:13 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
9: 05/04/2024 06:14:13 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
11: 05/04/2024 06:14:13 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
6: 05/04/2024 06:14:13 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
8: 05/04/2024 06:14:13 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 21183.35it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 7.98it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 7.37it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 26.94it/s]
Generating train split: 735691 examples [00:02, 368808.23 examples/s]
Generating train split: 674152 examples [00:02, 277129.32 examples/s]
Generating train split: 674152 examples [00:03, 247846.23 examples/s]
Generating train split: 1593705 examples [00:04, 420185.94 examples/s]
Generating train split: 1501884 examples [00:05, 283821.86 examples/s]
Generating train split: 1410002 examples [00:05, 375326.74 examples/s]
Generating train split: 2513358 examples [00:06, 437244.14 examples/s]
Generating train split: 2421559 examples [00:07, 378112.29 examples/s]
Generating train split: 2329470 examples [00:08, 326257.49 examples/s]
Generating train split: 3370011 examples [00:08, 448849.21 examples/s]
Generating train split: 3278022 examples [00:10, 358206.71 examples/s]
Generating train split: 3155413 examples [00:10, 341364.95 examples/s]
Generating train split: 4290261 examples [00:11, 447049.42 examples/s]
Generating train split: 4198175 examples [00:12, 373207.24 examples/s]
Generating train split: 4074904 examples [00:12, 353087.99 examples/s]
Generating train split: 5147893 examples [00:13, 417013.34 examples/s]
Generating train split: 5056374 examples [00:14, 376059.59 examples/s]
Generating train split: 6067697 examples [00:15, 448228.75 examples/s]
Generating train split: 4933690 examples [00:15, 421004.81 examples/s]
Generating train split: 6926173 examples [00:17, 450815.07 examples/s]
Generating train split: 5975531 examples [00:17, 383399.07 examples/s]
Generating train split: 5852750 examples [00:17, 365422.88 examples/s]
Generating train split: 7783637 examples [00:19, 467585.48 examples/s]
Generating train split: 6712504 examples [00:19, 422087.08 examples/s]
Generating train split: 6835032 examples [00:19, 295981.16 examples/s]
Generating train split: 8703429 examples [00:21, 438801.32 examples/s]
Generating train split: 7569889 examples [00:22, 405311.56 examples/s]
Generating train split: 7691972 examples [00:22, 384627.19 examples/s]
Generating train split: 9561456 examples [00:23, 430769.79 examples/s]
Generating train split: 8611880 examples [00:24, 368744.37 examples/s]
Generating train split: 8488889 examples [00:24, 309105.36 examples/s]
Generating train split: 10481374 examples [00:25, 434503.57 examples/s]
Generating train split: 9469794 examples [00:26, 366982.35 examples/s]
Generating train split: 9347416 examples [00:26, 439195.31 examples/s]
Generating train split: 11340042 examples [00:27, 446102.44 examples/s]
Generating train split: 12198580 examples [00:29, 434616.45 examples/s]
Generating train split: 10389743 examples [00:29, 398051.80 examples/s]
Generating train split: 10267411 examples [00:29, 368192.47 examples/s]
Generating train split: 13056612 examples [00:31, 439929.51 examples/s]
Generating train split: 11247710 examples [00:31, 398261.78 examples/s]
Generating train split: 11125153 examples [00:31, 419988.52 examples/s]
Generating train split: 13913907 examples [00:33, 458949.37 examples/s]
Generating train split: 12106665 examples [00:33, 380881.35 examples/s]
Generating train split: 11983873 examples [00:33, 358349.17 examples/s]
Generating train split: 14833307 examples [00:35, 423504.66 examples/s]
Generating train split: 12964439 examples [00:35, 400086.35 examples/s]
Generating train split: 12841924 examples [00:36, 400414.04 examples/s]
Generating train split: 15690649 examples [00:37, 438872.86 examples/s]
Generating train split: 13822319 examples [00:37, 398832.34 examples/s]
Generating train split: 13769754 examples [00:38, 425513.63 examples/s]
Generating train split: 16548238 examples [00:39, 450448.68 examples/s]
Generating train split: 14741800 examples [00:40, 390752.76 examples/s]
Generating train split: 14619130 examples [00:40, 395537.82 examples/s]
Generating train split: 17405436 examples [00:41, 442940.32 examples/s]
Generating train split: 15598726 examples [00:42, 356492.27 examples/s]
Generating train split: 15476523 examples [00:42, 369237.01 examples/s]
Generating train split: 18323721 examples [00:43, 459883.94 examples/s]
Generating train split: 16456225 examples [00:44, 403418.79 examples/s]
Generating train split: 19181845 examples [00:45, 444280.63 examples/s]
Generating train split: 16334014 examples [00:44, 431093.34 examples/s]
Generating train split: 20038448 examples [00:47, 438563.69 examples/s]
Generating train split: 17312794 examples [00:47, 402221.80 examples/s]
Generating train split: 20130863 examples [00:47, 425298.18 examples/s]
Generating train split: 17190731 examples [00:47, 352356.97 examples/s]
Generating train split: 18231895 examples [00:49, 438039.86 examples/s]
Generating validation split: 795560 examples [00:01, 450074.53 examples/s]
Generating train split: 18109337 examples [00:49, 387274.11 examples/s]
Generating train split: 19090119 examples [00:51, 431144.93 examples/s]
Generating validation split: 1592633 examples [00:03, 447538.35 examples/s]
Generating train split: 18967698 examples [00:51, 400959.93 examples/s]
Generating train split: 19946306 examples [00:53, 444335.85 examples/s]
Generating validation split: 2450319 examples [00:05, 437344.97 examples/s]
Generating train split: 20130863 examples [00:53, 376271.19 examples/s]
14: Traceback (most recent call last):
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 902, in incomplete_dir
Generating train split: 19823979 examples [00:53, 427450.54 examples/s]
14: yield tmp_dir
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 948, in download_and_prepare
14: self._download_and_prepare(
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 1045, in _download_and_prepare
14: raise OSError(
14: OSError: Cannot find data file.
14: Original error:
14: [Errno 2] No such file or directory: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete/text-train-00000-00000-of-NNNNN.arrow'
14:
14: During handling of the above exception, another exception occurred:
14:
14: Traceback (most recent call last):
14: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 450, in <module>
14: main()
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
14: return f(*args, **kwargs)
14: ^^^^^^^^^^^^^^^^^^
14: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 294, in main
14: datasets = load_dataset(extension, data_files=data_files, cache_dir="/scrip_continual_pretraining/Cache_mlm")
14: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/load.py", line 2152, in load_dataset
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 18117.94it/s]
14: builder_instance.download_and_prepare(
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 928, in download_and_prepare
14: with incomplete_dir(self._output_dir) as tmp_output_dir:
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/contextlib.py", line 155, in __exit__
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 72.83it/s]
14: self.gen.throw(typ, value, traceback)
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 909, in incomplete_dir
14: shutil.rmtree(tmp_dir)
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/shutil.py", line 738, in rmtree
14: onerror(os.rmdir, path, sys.exc_info())
14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/shutil.py", line 736, in rmtree
14: os.rmdir(path, dir_fd=dir_fd)
14: OSError: [Errno 39] Directory not empty: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete'
Generating train split: 20130863 examples [00:54, 368921.36 examples/s]
9: Traceback (most recent call last):
9: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 450, in <module>
9: main()
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
9: return f(*args, **kwargs)
9: ^^^^^^^^^^^^^^^^^^
9: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 294, in main
9: datasets = load_dataset(extension, data_files=data_files, cache_dir="/scrip_continual_pretraining/Cache_mlm")
9: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/load.py", line 2152, in load_dataset
9: builder_instance.download_and_prepare(
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 948, in download_and_prepare
9: self._download_and_prepare(
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 1045, in _download_and_prepare
9: raise OSError(
9: OSError: Cannot find data file.
9: Original error:
9: [Errno 2] No such file or directory: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete/text-train-00000-00001-of-NNNNN.arrow'
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 17962.76it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 59.23it/s]
Generating validation split: 3246757 examples [00:07, 410442.68 examples/s]
srun: error: jean-zay-iam07: task 14: Exited with exit code 1
srun: Terminating StepId=1741717.0
0: slurmstepd: error: *** STEP 1741717.0 ON jean-zay-iam05 CANCELLED AT 2024-05-04T06:15:08 ***
Generating train split: 122206 examples [00:00, 396783.29 examples/s]
2: split: 3308013 examples [00:07, 424316.76 examples/s]
srun: error: jean-zay-iam07: tasks 12-13,15-16: Terminated
srun: error: jean-zay-iam05: tasks 0-2,4-5: Terminated
srun: error: jean-zay-iam06: tasks 7-11: Terminated
Generating train split: 489938 examples [00:01, 440569.16 examples/s]
srun: error: jean-zay-iam07: task 17: Terminated
srun: error: jean-zay-iam05: task 3: Terminated
srun: error: jean-zay-iam06: task 6: Terminated
srun: Force Terminated StepId=1741717.0`
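The `Directory not empty` / `Cannot find data file` pair in the log matches the fix proposed above: ranks on different nodes all build the dataset into the same `.incomplete` cache directory on the shared GPFS filesystem and race each other. One workaround is to derive a cache directory that is unique per node from SLURM's environment, so builders never share an `.incomplete` directory. A minimal sketch (the function name `per_node_cache_dir` and the variable names are mine, not from run_mlm_wwm.py):

```python
import os

def per_node_cache_dir(base_cache: str) -> str:
    """Return a datasets cache directory unique to this node, so that
    dataset builders on different nodes never write into the same
    .incomplete directory on the shared filesystem."""
    node_id = os.environ.get("SLURM_NODEID", "0")  # set by srun per node
    cache_dir = os.path.join(base_cache, f"node_{node_id}")
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir

# In main(), the load_dataset call from the traceback would then become
# (sketch only, using the cache path from the report):
# datasets = load_dataset(
#     extension,
#     data_files=data_files,
#     cache_dir=per_node_cache_dir("/scrip_continual_pretraining/Cache_mlm"),
# )
```

Using `SLURM_PROCID` instead of `SLURM_NODEID` would give one cache per process, at the cost of building the dataset once per rank rather than once per node.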
Who can help?
No response
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
I expect my training to succeed in running at least one epoch.