
Optimize the log and enable printing the number of tokens per second. #7853

Merged
merged 6 commits into PaddlePaddle:develop from log_ips on Mar 6, 2024

Conversation

Xreki
Contributor

@Xreki Xreki commented Jan 17, 2024

PR types

Others

PR changes

Others

Description

Optimize the logging so that training prints tokens/s/device at every step. The pre-training log looks like the following (updated: the original interval_samples_per_second entry is kept; pre-training just adds a new interval_tokens_per_second_per_device entry):
[screenshot of the pre-training log output]
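
For context, the new number is derived from the existing samples/s figure. Below is a minimal sketch (illustrative only, not the PR's actual helper; the function name and arguments are assumptions) of how an interval tokens/s/device value relates to interval_samples_per_second, following the formula used in this PR (samples/s * seq_length / world size):

import paddle.distributed as dist

def interval_speed_metrics(num_samples, elapsed_seconds, seq_length=None):
    # Sketch only: per-interval throughput numbers for the training log.
    metrics = {}
    samples_per_second = num_samples / elapsed_seconds
    metrics["interval_samples_per_second"] = round(samples_per_second, 4)
    if seq_length is not None:
        # tokens/s/device = (samples/s) * (tokens per sample) / (number of devices)
        tokens_per_device = samples_per_second * seq_length / dist.get_world_size()
        metrics["interval_tokens_per_second_per_device"] = round(tokens_per_device, 4)
    return metrics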


paddle-bot bot commented Jan 17, 2024

Thanks for your contribution!


codecov bot commented Feb 28, 2024

Codecov Report

Attention: Patch coverage is 77.77778%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 56.42%. Comparing base (37e85e6) to head (e5b85ff).
Report is 3 commits behind head on develop.

Files                               Patch %   Lines
paddlenlp/trainer/trainer_utils.py  66.66%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7853      +/-   ##
===========================================
- Coverage    56.56%   56.42%   -0.15%     
===========================================
  Files          589      589              
  Lines        89964    90258     +294     
===========================================
+ Hits         50889    50924      +35     
- Misses       39075    39334     +259     

☔ View full report in Codecov by Sentry.

@@ -1230,12 +1230,14 @@ def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval,
self.args.train_batch_size * self.args.gradient_accumulation_steps * self.args.dataset_world_size
)
num_steps = self.state.global_step - self._globalstep_last_logged
seq_length = getattr(self.model.config, "seq_length", None) if hasattr(self.model, "config") else None
logs.update(
Contributor

Is it convenient to get this seq_length from the config? Also, are you sure this field is always reset according to the training configuration during training?

Contributor

If the seq_len for SFT is uncertain, then there is indeed a problem.

Contributor Author

Is it convenient to get this seq_length from the config? Also, are you sure this field is always reset according to the training configuration during training?

In pre-training, the user-specified max_seq_length is currently used to set config.seq_length; the code is as follows:

config.seq_length = data_args.max_seq_length

If the seq_len for SFT is uncertain, then there is indeed a problem.

Pre-training currently goes through the PretrainingTrainer class uniformly, so I added an is_pretraining flag to that class; it does not affect the fine-tuning logs. What do you think?
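
As a rough sketch of that idea (assumed shape only, not the PR's actual diff), the flag could be set on the pre-training trainer subclass and checked before the token-rate entry is added, so SFT logs stay unchanged:

from paddlenlp.trainer import Trainer

class PretrainingTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Marks this trainer as pre-training, where config.seq_length is known to be
        # set from max_seq_length, so token-rate logging can be enabled safely.
        self.is_pretraining = True

# In the logging path (illustrative, shown as a comment because it lives inside the
# base Trainer): only look up seq_length when the trainer is a pre-training one.
# seq_length = getattr(self.model.config, "seq_length", None) if getattr(self, "is_pretraining", False) else None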

Comment on lines 330 to 331
tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
Contributor

Here, most of the metrics the trainer prints are global metrics, so this one is slightly inconsistent with the others.

Contributor Author

Pre-training now mostly reports tokens/s/card. We can check whether a global tokens/s metric is still needed, but that would make the log rather long.
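
For reference, a global figure would just scale the per-device value by the device count. A hypothetical sketch (not part of this PR; the key name is an assumption):

import paddle.distributed as dist

def add_global_token_rate(result, split, tokens_per_second_per_device):
    # Hypothetical extra entry: aggregate tokens/s across all devices.
    tokens_per_second = tokens_per_second_per_device * dist.get_world_size()
    result[f"{split}_tokens_per_second"] = round(tokens_per_second, 4)
    return result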

Comment on lines 330 to 331
tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
Contributor

Suggested change
  tokens_per_second_per_device = samples_per_second * seq_length / paddle.distributed.get_world_size()
- result[f"{split}_tokens(tokens/sec/device)"] = round(tokens_per_second_per_device, 4)
+ result[f"{split}_tokens_per_sec_per_device"] = round(tokens_per_second_per_device, 4)

Contributor Author

done

ZHUI previously approved these changes Feb 29, 2024
@wawltor wawltor merged commit 092c845 into PaddlePaddle:develop Mar 6, 2024
7 of 10 checks passed
@Xreki Xreki deleted the log_ips branch March 6, 2024 02:39
Labels: none yet
Projects: none yet
Linked issues: none yet
3 participants