Qwen1.5-0.5b-chat: error when running finetune.py from the examples #77

128Ghe980 · 2024-02-22T10:11:07Z

I tried both True and False for --lazy_preprocess in the bash script, and both produce the same error.

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-gpu workers training, these options should be manually set for each worker.
# After setting the options, please run the script on each worker.

# Number of GPUs per GPU worker
GPUS_PER_NODE=1

# Number of GPU workers, for single-worker training, please set to 1
NNODES=${NNODES:-1}

# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
NODE_RANK=${NODE_RANK:-0}

# The ip address of the rank-0 worker, for single-worker training, please set to localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="/home/tione/notebook/model/Qwen1.5-0.5b-chat" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/tione/notebook/data/summary/train_summary_aft_supplement_0221.json"
DS_CONFIG_PATH="/home/tione/notebook/code/Qwen1_5/ds_config_zero3.json"
USE_LORA=False
Q_LORA=False

function usage() {
    echo '
Usage: bash finetune/finetune_lora_ds.sh [-m MODEL_PATH] [-d DATA_PATH] [--deepspeed DS_CONFIG_PATH] [--use_lora USE_LORA] [--q_lora Q_LORA]
'
}

while [[ "$1" != "" ]]; do
    case $1 in
        -m | --model )
            shift
            MODEL=$1
            ;;
        -d | --data )
            shift
            DATA=$1
            ;;
        --deepspeed )
            shift
            DS_CONFIG_PATH=$1
            ;;
        --use_lora  )
            shift
            USE_LORA=$1
            ;;
        --q_lora    )
            shift
            Q_LORA=$1
            ;;
        -h | --help )
            usage
            exit 0
            ;;
        * )
            echo "Unknown argument ${1}"
            exit 1
            ;;
    esac
    shift
done

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output/0222 \
    --num_train_epochs 7 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --lazy_preprocess False \
    --use_lora ${USE_LORA} \
    --q_lora ${Q_LORA} \
    --gradient_checkpointing \
    --deepspeed ${DS_CONFIG_PATH}

Error:

[2024-02-22 17:45:02,301] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-22 17:45:08,507] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-02-22 17:45:08,507] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-02-22 17:45:08,507] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/home/tione/notebook/code/Qwen1_5/finetune.py", line 375, in <module>
    train()
  File "/home/tione/notebook/code/Qwen1_5/finetune.py", line 296, in train
    config = transformers.AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1022, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 723, in __getitem__
    raise KeyError(key)
KeyError: 'qwen2'
[2024-02-22 17:45:10,304] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 13542) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-22_17:45:10
  host      : nb-994692758484892032-7k255ty76cu8
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13542)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

It looks like qwen2 is not in CONFIG_MAPPING_NAMES in configuration_auto.py.
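
One way to confirm this observation on a given installation is to query the auto-config registry directly. The sketch below assumes that `CONFIG_MAPPING_NAMES` can be imported from `transformers.models.auto.configuration_auto`, the module named in the traceback. The helper name `knows_model_type` and its `mapping` parameter are hypothetical and exist only for illustration.

```python
from typing import Mapping, Optional


def knows_model_type(model_type: str, mapping: Optional[Mapping] = None) -> Optional[bool]:
    """Report whether `model_type` is registered.

    Returns True or False when a registry is available, or None when
    transformers cannot be imported. `mapping` is a hypothetical hook that
    lets the check run against any dict-like registry.
    """
    if mapping is None:
        try:
            # Same module that raised KeyError: 'qwen2' in the traceback.
            from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
        except ImportError:
            return None
        mapping = CONFIG_MAPPING_NAMES
    return model_type in mapping
```

Calling `knows_model_type("qwen2")` with no second argument checks the installed package; a result of `False` would be consistent with the `KeyError: 'qwen2'` shown above.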

smhd001 · 2024-02-22T15:42:02Z

You are probably using an old version of transformers.
From the docs:

Requirements

transformers>=4.37.0.

Warning
🚨 This is a must because transformers integrated Qwen2 codes since 4.37.0.
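
A quick way to test the requirement quoted above is to compare the installed version against 4.37.0. This is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `installed_transformers`) are hypothetical, and the comparison looks at release numbers only.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def version_tuple(text: str) -> Tuple[int, int, int]:
    """Parse the leading numeric parts, e.g. '4.37.0.dev0' -> (4, 37, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def meets_minimum(installed: str, minimum: str = "4.37.0") -> bool:
    """Compare release numbers only; suffixes such as 'rc1' are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_transformers() -> Optional[str]:
    """Return the installed transformers version, or None if it is absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers()
    if found is None:
        print("transformers is not installed")
    else:
        print(f"transformers {found}: new enough for Qwen2 = {meets_minimum(found)}")
```

Run it with the same interpreter that torchrun uses (`/usr/bin/python` in the log above), since a different environment may hold a different transformers version.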

128Ghe980 · 2024-02-23T02:13:13Z

You are probably using an old version of transformers. From the docs:

Requirements
transformers>=4.37.0.
Warning
🚨 This is a must because transformers integrated Qwen2 codes since 4.37.0.

But I used a Qwen1 model (qwen1.8b-chat) before, and it loaded successfully with:

# Set RoPE scaling factor
    config = transformers.AutoConfig.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        trust_remote_code=True,
    )
    config.use_cache = False

    # Load model and tokenizer
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        config=config,
        cache_dir=training_args.cache_dir,
        device_map=device_map,
        trust_remote_code=True,
        quantization_config=GPTQConfig(
            bits=4, disable_exllama=True
        )
        if training_args.use_lora and lora_args.q_lora
        else None,
        **model_load_kwargs,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=False,
        trust_remote_code=True,
    )
    tokenizer.pad_token_id = tokenizer.eod_id

That is not much different from the script I'm using now. The only difference is:

trust_remote_code=True,

I have also checked the other transformers versions, and none of them has 'qwen' in CONFIG_MAPPING_NAMES in configuration_auto.py.
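
A possible explanation for this difference, offered as an assumption rather than something confirmed in this thread: with trust_remote_code=True, a checkpoint whose config.json carries an `auto_map` entry can load its configuration class from Python files shipped alongside the weights, so its `model_type` does not need to appear in CONFIG_MAPPING_NAMES. A checkpoint without `auto_map` depends on the installed transformers recognizing its `model_type`. The sketch below only reads config.json and reports both fields; the helper name `describe_checkpoint` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def describe_checkpoint(model_dir: str) -> Dict[str, Any]:
    """Summarize how a local checkpoint expects its config class to be found."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    auto_map = config.get("auto_map") or {}
    return {
        # Key looked up in CONFIG_MAPPING when no remote code is used.
        "model_type": config.get("model_type"),
        # True when the checkpoint points AutoConfig at its own Python file.
        "has_remote_config": "AutoConfig" in auto_map,
    }
```

Running this on both model directories would show whether the two checkpoints differ in these fields, for example `describe_checkpoint("/home/tione/notebook/model/Qwen1.5-0.5b-chat")`.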

wangz1200 · 2024-03-03T04:57:35Z

In model=/home/tione/notebook/model/Qwen1.5-0.5b-chat, the path name "Qwen1.5" should not contain a dot. I traced transformers while it loads the module, and it mistakes the dot for a module separator.
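
To illustrate the behavior described in this comment, the sketch below shows how splitting on dots breaks the directory name apart, and derives a dot-free sibling path that could serve as the target of a rename or symlink. It is not the transformers code path, only a demonstration of the naming issue; the helper name `dot_free_path` and the choice of `_` as the replacement character are assumptions.

```python
from pathlib import PurePosixPath
from typing import List


def split_like_module(name: str) -> List[str]:
    """Show what a dotted-module parser would see in a directory name."""
    return name.split(".")


def dot_free_path(model_dir: str) -> str:
    """Return the same path with dots in the final component replaced by '_'."""
    path = PurePosixPath(model_dir)
    return str(path.with_name(path.name.replace(".", "_")))


if __name__ == "__main__":
    original = "/home/tione/notebook/model/Qwen1.5-0.5b-chat"
    print(split_like_module(PurePosixPath(original).name))
    print(dot_free_path(original))
```

The helper only computes a name; actually renaming the directory or creating a symlink, and then updating MODEL in the shell script, is left to the user.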

samir1224 · 2024-03-29T02:58:08Z

In finetune.sh, remove $DISTRIBUTED_ARGS from the torchrun command. Then, right below import os in finetune.py, specify which GPU(s) to use.
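
The comment does not say how the GPU should be specified. One common approach, assumed here, is to set the CUDA_VISIBLE_DEVICES environment variable before any CUDA-using library is imported. The helper name `pin_gpus` is hypothetical.

```python
import os


def pin_gpus(device_ids: str = "0") -> str:
    """Restrict this process to the given comma-separated GPU indices.

    Assumption: this runs before `import torch` or `import transformers`,
    because CUDA reads the variable when it is first initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Placed near the top of the script, directly after `import os`.
pin_gpus("0")
```

For a multi-GPU machine, a value such as `"0,1"` would expose the first two devices to the process.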

zhanghaobucunzai · 2024-04-09T01:05:23Z

May I ask how this file (DATA="/home/tione/notebook/data/summary/train_summary_aft_supplement_0221.json") was obtained?

jklj077 closed this as completed Apr 30, 2024
