Description
Hi,
I briefly described the problem in the following issue comment, and I am submitting this as a new issue for a more complete description.
#431 (comment)
While initiating the SFT, I see a connection error, which shows that it fails to fetch the models from Hugging Face.
I have tried a few things, such as using a VPN and not using one, but it still fails.
It is quite weird, because I managed to fetch models from Hugging Face a few days ago in another project, which uses a similar approach, shown below:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name_or_path,
    cache_dir=output_dir,
    model_max_length=per_device_train_batch_size,
    padding_side="right",
    use_fast=False,
)
I believe this problem could be temporary, since local internet access may have been blocked for a while. However, as you mentioned in the issue above (#431),
if we can manually download the model and place it in the right format, the error could be bypassed.
I am wondering what the proper way is to place the files?
For example, I am trying to fine-tune bloom-560m:
https://huggingface.co/bigscience/bloom-560m/tree/main
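For reference, here is a minimal sketch of what I would try, assuming the standard Hugging Face tooling (huggingface_hub's snapshot_download plus the TRANSFORMERS_OFFLINE / HF_HUB_OFFLINE environment variables) and a hypothetical local directory ./models/bloom-560m; I am not sure whether LMFlow expects a different layout:

# On a machine that can reach huggingface.co, download the full repo once.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="bigscience/bloom-560m",
    local_dir="./models/bloom-560m",   # hypothetical target directory
    local_dir_use_symlinks=False,      # copy real files so the folder is self-contained
)

# Then point the training script at the local path instead of the hub name, e.g.
#   --model_name_or_path ./models/bloom-560m
# and optionally force offline mode so transformers never tries to connect:
#   export TRANSFORMERS_OFFLINE=1
#   export HF_HUB_OFFLINE=1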
Traceback (most recent call last):
File "/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "/venv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/LMFlow/examples/finetune.py", line 70, in <module>
main()
File "/LMFlow/examples/finetune.py", line 55, in main
model = AutoModel.get_model(model_args)
File "/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 113, in __init__
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
File "/venv/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 944, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 629, in _get_config_dict
resolved_config_file = cached_file(
File "/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like bigscience/bloom-560m is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[2023-06-20 03:42:23,283] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606506
[2023-06-20 03:42:23,417] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606507
[2023-06-20 03:42:23,419] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606508
[2023-06-20 03:42:23,421] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606509
[2023-06-20 03:42:23,423] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606573
[2023-06-20 03:42:23,424] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 606574
[2023-06-20 03:42:23,426] [ERROR] [launch.py:324:sigkill_handler] ['/venv/bin/python3.9', '-u', 'examples/finetune.py', '--local_rank=5', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune_with_lora', '--model_name_or_path', 'bigscience/bloom-560m', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--dataset_path', '/LMFlow/data/alpaca/train', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--validation_split_percentage', '0', '--logging_steps', '20', '--block_size', '512', '--do_train', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1