[Trainer] remove redundant memory metrics and set enable as default #8374
base: develop
Conversation
Thanks for your contribution!
paddlenlp/trainer/trainer.py
Outdated
logs["current_memory_allocated"] = current_memory_allocated / divisor | ||
logs["current_memory_reserved"] = current_memory_reserved / divisor | ||
logs["max_memory_allocated"] = max_memory_allocated / divisor | ||
logs["max_memory_reserved"] = max_memory_reserved / divisor |
PaddleNLP/scripts/distribute/ci_case_dy.sh
Line 454 in 09a0ce7
mem=`cat $log_dir/workerlog.0 | grep 'global_step: 30' | awk -F 'gpu_max_memory_reserved: ' '{print $2}' | awk -F ',' '{print $1}'`
This spot should be updated to match as well: gpu_max_memory_reserved -> max_memory_reserved.
Fixed.
paddlenlp/trainer/trainer.py
Outdated
max_memory_reserved = core.device_memory_stat_peak_value("Reserved", device_id)
logs["current_memory_allocated"] = current_memory_allocated / divisor
logs["current_memory_reserved"] = current_memory_reserved / divisor
logs["max_memory_allocated"] = max_memory_allocated / divisor
logs["max_memory_allocated"] = max_memory_allocated / divisor | |
logs["max_memory_allocated"] = max_memory_allocated >> 20 |
This was previously reported in MB, so I suggest keeping the original convention rather than changing it. Division produces a float, which brings trailing decimal places along with it; better to use a bit shift directly and keep MB as the unit.
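A minimal sketch of the difference, using a hypothetical byte count (the value is illustrative, not taken from the PR):

    max_memory_allocated = 3_321_225_472  # hypothetical raw byte count from the allocator

    # Float division carries decimal places into the logs.
    print(max_memory_allocated / (1024 * 1024))  # 3167.367431640625

    # Right-shifting by 20 divides by 2**20 (one MiB) and stays an integer.
    print(max_memory_allocated >> 20)  # 3167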
Fixed.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #8374      +/-   ##
===========================================
+ Coverage    55.36%   55.41%    +0.04%
===========================================
  Files          614      615        +1
  Lines        96016    96241      +225
===========================================
+ Hits         53164    53335      +171
- Misses       42852    42906       +54

☔ View full report in Codecov by Sentry.
paddlenlp/trainer/training_args.py
Outdated
@@ -738,7 +738,7 @@ class TrainingArguments:
     metadata={"help": "The path to a folder with a valid checkpoint for your model."},
 )
 skip_memory_metrics: bool = field(
-    default=True, metadata={"help": "Whether or not to skip adding of memory profiler reports to metrics."}
+    default=False, metadata={"help": "Whether or not to skip adding of memory profiler reports to metrics."}
Let's leave this one unchanged; if you need the metrics, just turn them on yourself.
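For anyone who does want the reports, the flag can be flipped per run instead of changing the library default; a minimal sketch, assuming only the TrainingArguments field shown in the diff (the output_dir value is illustrative):

    from paddlenlp.trainer import TrainingArguments

    # Opt in to memory profiler reports for this run only,
    # leaving the library default (skip_memory_metrics=True) unchanged.
    args = TrainingArguments(
        output_dir="./checkpoints",
        skip_memory_metrics=False,
    )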
LGTM
PR types
Others
PR changes
Others
Description
Remove redundant memory metrics and enable printing of memory metrics by default.