Only part of the dataset is loaded, and the model trained on the reduced set produces only empty outputs #3800

Closed
1 task done
pangpang-xuan opened this issue May 18, 2024 · 2 comments
Labels
solved This problem has been already solved.

Comments

@pangpang-xuan

Reminder

  • I have read the README and searched the existing issues.

Reproduction

dataset=Bilingual-code-v10-en
output_dir=outputs/Bilingual-code-v10-en-LLaMA3-8B-5epoch
ds_config=configs/deepspeed/ds_config_zero2.json  # DeepSpeed ZeRO-2 config
model_name_or_path=/home/LLM/LLaMA3-8B
template=custom_template  # placeholder for the author's own custom template
date +"%Y-%m-%d %H:%M:%S"
torchrun --nnodes ${NODES} \
    --nproc_per_node ${NUM_GPUS} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    src/train_bash.py \
    --deepspeed ${ds_config} \
    --stage sft \
    --do_train \
    --finetuning_type lora \
    --lora_target all \
    --lora_rank 64 \
    --model_name_or_path ${model_name_or_path} \
    --template ${template} \
    --dataset ${dataset} \
    --output_dir ${output_dir} \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_strategy epoch \
    --learning_rate 3e-4 \
    --num_train_epochs 5.0 \
    --warmup_ratio 0.1 \
    --plot_loss \
    --fp16 \
    --flash_attn \
    --seed 42 \
    --ddp_timeout 1800000 \
    --dataloader_num_workers 1 \
    --cutoff_len 2048 \
    --quantization_bit 4 >> "$OUTPUT_LOG" 2>&1

Expected behavior

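The dataset should contain 94822 examples in total, and I have verified that the token length of every example is below the cutoff value, yet only 60646 of them are loaded during training.
The dataset format is shown in the first image and the number of examples actually loaded for training in the second. After training on those 60646 examples, the model produces only empty outputs.
The LoRA weight files after fine-tuning are shown in the third image, and the full model weights after merging the LoRA in the fourth.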
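[Image 1: dataset format]
[Image 2: number of examples loaded for training]
[Image 3: LoRA weight files after fine-tuning]
[Image 4: full model weights after merging the LoRA]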

System Info

No response

Others

No response

@hiyouga
Owner

hiyouga commented May 18, 2024

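When fewer examples are loaded than the dataset actually contains, it is usually because the dataset includes malformed samples.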

@hiyouga hiyouga added the pending This problem is yet to be addressed. label May 18, 2024
@pangpang-xuan
Author

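When fewer examples are loaded than the dataset actually contains, it is usually because the dataset includes malformed samples.

OK, thanks, I understand the cause now: the dataset loader only keeps examples whose prompt and output are both non-empty (not "").
Some of my examples have "" as their output, which is why the number of loaded examples dropped. Thanks.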
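
For anyone who runs into the same thing, the rule described above can be checked before training with a few lines of Python. This is only a rough sketch of the rule as I understand it, not the project's actual loading code. It assumes an alpaca-style record with `instruction`, `input`, and `output` fields; those field names and the helper `split_by_emptiness` are my own illustrative assumptions, so adjust them to the real dataset format.

```python
from typing import Iterable


def split_by_emptiness(
    examples: Iterable[dict],
    prompt_keys: tuple = ("instruction", "input"),  # assumed field names
    output_key: str = "output",                     # assumed field name
):
    """Split examples into (kept, dropped).

    An example is kept only if its concatenated prompt fields and its
    output field are both non-empty strings. Missing or None fields are
    treated as empty. Whitespace-only strings are NOT treated as empty.
    """
    kept, dropped = [], []
    for ex in examples:
        prompt = "".join(str(ex.get(k) or "") for k in prompt_keys)
        output = str(ex.get(output_key) or "")
        if prompt and output:
            kept.append(ex)
        else:
            dropped.append(ex)
    return kept, dropped


if __name__ == "__main__":
    # Tiny hand-made sample, not taken from the real dataset.
    sample = [
        {"instruction": "Write a loop.", "input": "", "output": "for i in range(3): pass"},
        {"instruction": "Explain this.", "input": "x = 1", "output": ""},
        {"instruction": "", "input": "", "output": "orphan answer"},
    ]
    kept, dropped = split_by_emptiness(sample)
    print(f"kept={len(kept)} dropped={len(dropped)}")
```

If empty fields are the only cause, running this over the full file should report 94822 - 60646 = 34176 dropped examples. A different number would suggest that some samples are being skipped for another reason.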

@hiyouga hiyouga added solved This problem has been already solved. and removed pending This problem is yet to be addressed. labels May 18, 2024