[Bug]: Ernie-ctm model parameters missing from cloud storage #10439

@hanlintang

Software environment

- paddlepaddle: -
- paddlepaddle-gpu: 3.0.0
- paddlenlp: 3.0.0b4

Duplicate issues

  • I have searched the existing issues

Error description

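While working on https://github.com/PaddlePaddle/PaddleNLP/issues/9763 [No. 17], I tried to run NPTag model training under /examples/text_to_knowledge/nptag. The run failed with an error saying the online model file could not be found.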

The source file /transformers/ernie_ctm/configuration.py lists the available models:

ERNIE_CTM_PRETRAINED_RESOURCE_FILES_MAP = {
    "model_state": {
        "ernie-ctm": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/ernie_ctm_v3.pdparams",
        "wordtag": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/wordtag_v3.pdparams",
        "nptag": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams",
    }
}

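Opening each of these URLs directly returns a response like the one below for all three; none of the model files can be found.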

{"code":"NoSuchKey","message":"The specified key does not exist.","requestId":"10410205-3348-437c-80ee-d775719c8dfd"}
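
The manual check above can also be scripted. The following is only a minimal sketch using the Python standard library: the helper names `http_status` and `check_urls` are hypothetical (they are not part of PaddleNLP), and it assumes the storage service answers a HEAD request with the same status code as a GET.

```python
import urllib.error
import urllib.request

# URLs copied from ERNIE_CTM_PRETRAINED_RESOURCE_FILES_MAP in the issue.
BASE = "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/"
MODEL_URLS = {
    "ernie-ctm": BASE + "ernie_ctm_v3.pdparams",
    "wordtag": BASE + "wordtag_v3.pdparams",
    "nptag": BASE + "nptag_v3.pdparams",
}


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable.

    Hypothetical helper; assumes HEAD is answered like GET by the server.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404 for a missing key).
        return err.code
    except OSError:
        # No answer at all: DNS failure, timeout, no network access, ...
        return None


def check_urls(urls, fetch_status=http_status):
    """Map each model name to the status code obtained for its URL."""
    return {name: fetch_status(url) for name, url in urls.items()}
```

Calling `check_urls(MODEL_URLS)` and printing the result gives one status code per model name. A 404 for every entry would match the `NoSuchKey` responses reported here, while `None` would point to a network problem on the client side instead.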

Steps to reproduce & code

  1. Change to the example directory
cd ~/PaddleNLP/slm/examples/text_to_knowledge/nptag
  2. Run the model training command from the README
python -m paddle.distributed.launch --gpus "0" train.py \
    --batch_size 64 \
    --learning_rate 1e-6 \
    --num_train_epochs 3 \
    --logging_steps 10 \
    --save_steps 100 \
    --output_dir ./output \
    --device "gpu"
  3. The program fails, reporting that the online model file does not exist
aistudio@jupyter-227232-8957468:~/PaddleNLP/slm/examples/text_to_knowledge/nptag$ python -m paddle.distributed.launch --gpus "0" train.py \
>     --batch_size 64 \
>     --learning_rate 1e-6 \
>     --num_train_epochs 3 \
>     --logging_steps 10 \
>     --save_steps 100 \
>     --output_dir ./output \
>     --device "gpu"
LAUNCH INFO 2025-04-17 13:48:06,465 -----------  Configuration  ----------------------
LAUNCH INFO 2025-04-17 13:48:06,465 auto_cluster_config: 0
LAUNCH INFO 2025-04-17 13:48:06,465 auto_parallel_config: None
LAUNCH INFO 2025-04-17 13:48:06,465 auto_tuner_json: None
LAUNCH INFO 2025-04-17 13:48:06,465 devices: 0
LAUNCH INFO 2025-04-17 13:48:06,465 elastic_level: -1
LAUNCH INFO 2025-04-17 13:48:06,465 elastic_timeout: 30
LAUNCH INFO 2025-04-17 13:48:06,465 enable_gpu_log: True
LAUNCH INFO 2025-04-17 13:48:06,465 gloo_port: 6767
LAUNCH INFO 2025-04-17 13:48:06,465 host: None
LAUNCH INFO 2025-04-17 13:48:06,465 ips: None
LAUNCH INFO 2025-04-17 13:48:06,465 job_id: default
LAUNCH INFO 2025-04-17 13:48:06,465 legacy: False
LAUNCH INFO 2025-04-17 13:48:06,465 log_dir: log
LAUNCH INFO 2025-04-17 13:48:06,465 log_level: INFO
LAUNCH INFO 2025-04-17 13:48:06,465 log_overwrite: False
LAUNCH INFO 2025-04-17 13:48:06,465 master: None
LAUNCH INFO 2025-04-17 13:48:06,465 max_restart: 3
LAUNCH INFO 2025-04-17 13:48:06,465 nnodes: 1
LAUNCH INFO 2025-04-17 13:48:06,465 nproc_per_node: None
LAUNCH INFO 2025-04-17 13:48:06,465 rank: -1
LAUNCH INFO 2025-04-17 13:48:06,465 run_mode: collective
LAUNCH INFO 2025-04-17 13:48:06,465 server_num: None
LAUNCH INFO 2025-04-17 13:48:06,465 servers: 
LAUNCH INFO 2025-04-17 13:48:06,466 sort_ip: False
LAUNCH INFO 2025-04-17 13:48:06,466 start_port: 6070
LAUNCH INFO 2025-04-17 13:48:06,466 trainer_num: None
LAUNCH INFO 2025-04-17 13:48:06,466 trainers: 
LAUNCH INFO 2025-04-17 13:48:06,466 training_script: train.py
LAUNCH INFO 2025-04-17 13:48:06,466 training_script_args: ['--batch_size', '64', '--learning_rate', '1e-6', '--num_train_epochs', '3', '--logging_steps', '10', '--save_steps', '100', '--output_dir', './output', '--device', 'gpu']
LAUNCH INFO 2025-04-17 13:48:06,466 with_gloo: 1
LAUNCH INFO 2025-04-17 13:48:06,466 --------------------------------------------------
LAUNCH INFO 2025-04-17 13:48:06,467 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2025-04-17 13:48:06,494 Run Pod: lrmnfn, replicas 1, status ready
LAUNCH INFO 2025-04-17 13:48:06,551 Watching Pod: lrmnfn, replicas 1, status running
/home/aistudio/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
-----------  Configuration Arguments -----------
adam_epsilon: 1e-08
batch_size: 64
data_dir: ./data
device: gpu
init_from_ckpt: None
learning_rate: 1e-06
logging_steps: 10
max_seq_len: 64
num_train_epochs: 3
output_dir: ./output
save_steps: 100
seed: 1000
warmup_proportion: 0.0
weight_decay: 0.0
------------------------------------------------

(…)/models/transformers/ernie_ctm/vocab.txt:   0%|          | 0.00/91.7k [00:00<?, ?B/s]
(…)/models/transformers/ernie_ctm/vocab.txt: 100%|██████████| 91.7k/91.7k [00:00<00:00, 2.52MB/s]
[2025-04-17 13:48:17,942] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/nptag/tokenizer_config.json
[2025-04-17 13:48:17,943] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/nptag/special_tokens_map.json
Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 597, in raise_for_status
    response.raise_for_status()
  File "/home/aistudio/.local/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/__init__.py", line 169, in resolve_file_path
    cached_file = bos_download(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/bos_download.py", line 241, in bos_download
    http_get(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 138, in http_get
    r = _request_wrapper(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 369, in _request_wrapper
    raise_for_status(response)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 601, in raise_for_status
    raise EntryNotFoundError(message, None) from e
huggingface_hub.errors.EntryNotFoundError: 404 Client Error.

Entry Not Found for url: https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 191, in <module>
    do_train(args)
  File "train.py", line 103, in do_train
    model = ErnieCtmNptagModel.from_pretrained("nptag")
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/transformers/model_utils.py", line 2462, in from_pretrained
    resolved_archive_file, resolved_sharded_files, sharded_metadata, is_sharded = cls._resolve_model_file_path(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/transformers/model_utils.py", line 1827, in _resolve_model_file_path
    resolved_archive_file = resolve_file_path(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/__init__.py", line 275, in resolve_file_path
    raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
OSError: Does not appear one of the ['https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams'] in nptag.
LAUNCH INFO 2025-04-17 13:48:19,566 Pod failed
LAUNCH ERROR 2025-04-17 13:48:19,566 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python', '-u', 'train.py', '--batch_size', '64', '--learning_rate', '1e-6', '--num_train_epochs', '3', '--logging_steps', '10', '--save_steps', '100', '--output_dir', './output', '--device', 'gpu'] code 1 log log/workerlog.0
LAUNCH INFO 2025-04-17 13:48:19,566 ------------------------- ERROR LOG DETAIL -------------------------
:00<?, ?B/s]
(…)/models/transformers/ernie_ctm/vocab.txt: 100%|██████████| 91.7k/91.7k [00:00<00:00, 2.52MB/s]
[2025-04-17 13:48:17,942] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/nptag/tokenizer_config.json
[2025-04-17 13:48:17,943] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/nptag/special_tokens_map.json
Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 597, in raise_for_status
    response.raise_for_status()
  File "/home/aistudio/.local/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/__init__.py", line 169, in resolve_file_path
    cached_file = bos_download(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/bos_download.py", line 241, in bos_download
    http_get(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 138, in http_get
    r = _request_wrapper(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 369, in _request_wrapper
    raise_for_status(response)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/common.py", line 601, in raise_for_status
    raise EntryNotFoundError(message, None) from e
huggingface_hub.errors.EntryNotFoundError: 404 Client Error.

Entry Not Found for url: https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 191, in <module>
    do_train(args)
  File "train.py", line 103, in do_train
    model = ErnieCtmNptagModel.from_pretrained("nptag")
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/transformers/model_utils.py", line 2462, in from_pretrained
    resolved_archive_file, resolved_sharded_files, sharded_metadata, is_sharded = cls._resolve_model_file_path(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/transformers/model_utils.py", line 1827, in _resolve_model_file_path
    resolved_archive_file = resolve_file_path(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddlenlp/utils/download/__init__.py", line 275, in resolve_file_path
    raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
OSError: Does not appear one of the ['https://bj.bcebos.com/paddlenlp/models/transformers/ernie_ctm/nptag_v3.pdparams'] in nptag.
LAUNCH INFO 2025-04-17 13:48:19,567 Exit code 1
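
For readers unfamiliar with chained tracebacks, the three stacked errors in the log are one failure reported at three levels: the 404 from the storage server, the `EntryNotFoundError` raised `from` it, and the final `OSError` raised while that error was being handled. The sketch below reproduces only this structure. The `Fake*` classes and the simplified function bodies are stand-ins written for illustration; they are not the actual PaddleNLP code.

```python
class FakeHTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError (illustration only)."""


class FakeEntryNotFoundError(Exception):
    """Stand-in for huggingface_hub.errors.EntryNotFoundError (illustration only)."""


def raise_for_status(url):
    try:
        raise FakeHTTPError(f"404 Client Error: Not Found for url: {url}")
    except FakeHTTPError as e:
        # "raise ... from e" sets __cause__; the traceback prints this link as
        # "The above exception was the direct cause of the following exception".
        raise FakeEntryNotFoundError("404 Client Error.") from e


def resolve_file_path(url):
    try:
        raise_for_status(url)
    except FakeEntryNotFoundError:
        # A plain raise inside an except block sets only __context__; printed as
        # "During handling of the above exception, another exception occurred".
        raise OSError(f"Does not appear one of the ['{url}'] in nptag.")


def describe_chain(exc):
    """List exception type names from the outermost error back to the root."""
    names = []
    while exc is not None:
        names.append(type(exc).__name__)
        exc = exc.__cause__ or exc.__context__
    return names
```

Catching the `OSError` from `resolve_file_path(...)` and passing it to `describe_chain` walks the chain back to the HTTP error, which is the root cause to look at in the log: the `nptag_v3.pdparams` URL returns 404.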
