[Bug]: float division by zero when training #5679

@xiehuanyi

Description

Software Environment

paddle-bfloat                  0.1.7
paddle2onnx                    1.0.0
paddlefsl                      1.1.0
paddlehub                      2.3.0
paddlenlp                      2.4.2
paddlepaddle-gpu               2.4.0.post112
tb-paddle                      0.3.6

Duplicate Check

  • I have searched the existing issues

Bug Description

I ran into the following error while using paddlenlp:

```
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_595/4032920361.py in <module>
----> 1 trainer.train()

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py in train(self, resume_from_checkpoint, ignore_keys_for_eval)
    650 
    651         self._total_loss_scalar += tr_loss.item()
--> 652         train_loss = self._total_loss_scalar / self.state.global_step
    653 
    654         metrics = speed_metrics("train",

ZeroDivisionError: float division by zero
```
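The division at `trainer_base.py:652` fails because `self.state.global_step` is still 0 when the average loss is computed, i.e. no optimizer step ever ran during training. A minimal guard, sketched outside of PaddleNLP (the function name `average_train_loss` is hypothetical, not the library's API):

```python
# Sketch of a guard for the failing line; assumes global_step counts
# completed optimizer updates, as in the traceback above.
def average_train_loss(total_loss_scalar, global_step):
    # If no optimizer step ever ran (e.g. an empty train dataloader),
    # global_step stays 0 and a plain division would raise ZeroDivisionError.
    if global_step == 0:
        return 0.0
    return total_loss_scalar / global_step

print(average_train_loss(12.5, 5))  # 2.5
print(average_train_loss(0.0, 0))   # 0.0 instead of ZeroDivisionError
```

The real fix in the library would likely also want to warn that zero training steps were executed, since that usually points at a dataset or batch-size misconfiguration rather than a successful run.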

My training arguments are as follows:

```python
args = TrainingArguments(
    output_dir='output',
    do_train=True,
    do_eval=True,
)
trainer = Trainer(
    model=model,
    criterion=paddle.nn.CrossEntropyLoss(),
    data_collator=collate_fn,
    train_dataset=train_set,
    eval_dataset=None,
    tokenizer=tokenzier,
    args=args,
)
trainer.train()
```

Steps to Reproduce & Code

I can't provide this; it's code for a competition project, so it's not really shareable.
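Since the project code can't be shared, here is a hypothetical sketch (not a real reproduction, and not PaddleNLP code) of the arithmetic that can leave `global_step` at 0: if the training set yields zero complete batches per epoch (e.g. fewer samples than one batch with incomplete batches dropped), no optimizer update ever runs, and the final `self._total_loss_scalar / self.state.global_step` divides by zero.

```python
# Hypothetical helper illustrating how the number of optimizer updates
# can come out as zero; all names and parameters here are assumptions.
def updates_per_run(num_samples, batch_size, epochs, drop_last=True):
    if drop_last:
        batches_per_epoch = num_samples // batch_size  # incomplete batch dropped
    else:
        batches_per_epoch = -(-num_samples // batch_size)  # ceiling division
    return batches_per_epoch * epochs

print(updates_per_run(6, 8, 3))                   # 0 -> division by zero when averaging loss
print(updates_per_run(6, 8, 3, drop_last=False))  # 3
```

Checking the sizes of `train_set` and the effective batch size against each other would be a first step toward a reproduction that can be shared.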

Labels: bug (Something isn't working), triage
