Optimize the log and enable printing the number of tokens per second. #7853
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #7853      +/-   ##
===========================================
- Coverage    56.56%   56.42%   -0.15%
===========================================
  Files          589      589
  Lines        89964    90258     +294
===========================================
+ Hits         50889    50924      +35
- Misses       39075    39334     +259

☔ View full report in Codecov by Sentry.
@@ -1230,12 +1230,14 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval,
                self.args.train_batch_size * self.args.gradient_accumulation_steps * self.args.dataset_world_size
            )
            num_steps = self.state.global_step - self._globalstep_last_logged
            seq_length = getattr(self.model.config, "seq_length", None) if hasattr(self.model, "config") else None
            logs.update(
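A minimal sketch of the idea behind this change (hypothetical helper and variable names, not the PR's actual code): read seq_length from the model config when it is available and skip the token metric otherwise.

```python
# Hedged sketch only — names and structure are assumptions, not the PR's code.
def tokens_per_sec_per_device(num_samples, elapsed_seconds, model, world_size):
    # seq_length may be missing from the config (e.g. SFT with variable-length samples).
    seq_length = getattr(getattr(model, "config", None), "seq_length", None)
    if not seq_length or elapsed_seconds <= 0:
        return None  # better to skip the metric than to log a misleading number
    samples_per_second = num_samples / elapsed_seconds
    return samples_per_second * seq_length / world_size
```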
Can this seq_length conveniently be obtained from the config? Also, are you sure this field is always reset according to the training configuration during training?
If seq_len is not fixed in SFT, that is indeed a problem.
> Can this seq_length conveniently be obtained from the config? Also, are you sure this field is always reset according to the training configuration during training?

In pretraining, the user-specified max_seq_length is currently used to set config.seq_length. The code is here:

Line 428 in 37e85e6

config.seq_length = data_args.max_seq_length
> If seq_len is not fixed in SFT, that is indeed a problem.

Pretraining currently always goes through the PretrainingTrainer class, so I added an is_pretraining flag to that class; the fine-tuning logs are not affected. What do you think?
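A rough sketch of what such a flag could look like (only PretrainingTrainer and is_pretraining come from the comment above; everything else is assumed for illustration, not the actual implementation):

```python
from paddlenlp.trainer import Trainer

# Hedged sketch — illustrates gating the token metric on an is_pretraining flag
# so fine-tuning logs stay unchanged. Method and argument names are assumptions.
class PretrainingTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.is_pretraining = True  # pretraining uses a fixed, known seq_length

    def _log_token_throughput(self, logs, samples_per_second, seq_length, world_size):
        if self.is_pretraining and seq_length:
            logs["interval_tokens_per_second_per_device"] = round(
                samples_per_second * seq_length / world_size, 4
            )
```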
paddlenlp/trainer/trainer_utils.py (Outdated)
tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
Here, most of the metrics the trainer prints are global metrics, so this one is a bit inconsistent with the others.
Pretraining mostly uses tokens/s/card nowadays. We can consider whether a global tokens/s metric should also be added, though that would make the log longer.
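For illustration, the two variants being discussed differ only by the world-size factor (a hedged sketch with assumed names, not the PR's code):

```python
import paddle.distributed as dist

# Hedged sketch — assumed names; relates the global tokens/s metric to the
# per-device tokens/s/card metric discussed above.
def throughput_metrics(samples_per_second, seq_length):
    world_size = dist.get_world_size()
    tokens_per_second = samples_per_second * seq_length            # global, all devices
    tokens_per_second_per_device = tokens_per_second / world_size  # per card
    return {
        "tokens_per_second": round(tokens_per_second, 4),
        "tokens_per_second_per_device": round(tokens_per_second_per_device, 4),
    }
```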
paddlenlp/trainer/trainer_utils.py (Outdated)
tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
Suggested change:

tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
- result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
+ result[f"{split}_tokens_per_sec_per_device"] = round(tokens_per_second_per_device, 4)
done
PR types
Others
PR changes
Others
Description
Optimize the log so that every training step prints tokens/s/device. The pretraining log is shown below (updated: the original interval_samples_per_second is kept, and pretraining only adds a new interval_tokens_per_second_per_device entry):