Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Trainer] remove redundant memory metrics and set enable as default #8374

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

SylarTiaNII
Copy link
Contributor

PR types

Others

PR changes

Others

Description

remove redundant memory metrics and enable memory metrics print as default

Copy link

paddle-bot bot commented May 7, 2024

Thanks for your contribution!

logs["current_memory_allocated"] = current_memory_allocated / divisor
logs["current_memory_reserved"] = current_memory_reserved / divisor
logs["max_memory_allocated"] = max_memory_allocated / divisor
logs["max_memory_reserved"] = max_memory_reserved / divisor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mem=`cat $log_dir/workerlog.0 | grep 'global_step: 30' | awk -F 'gpu_max_memory_reserved: ' '{print $2}' | awk -F ',' '{print $1}'`

这处也对应改了吧,gpu_max_memory_reserved -> max_memory_reserved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

max_memory_reserved = core.device_memory_stat_peak_value("Reserved", device_id)
logs["current_memory_allocated"] = current_memory_allocated / divisor
logs["current_memory_reserved"] = current_memory_reserved / divisor
logs["max_memory_allocated"] = max_memory_allocated / divisor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logs["max_memory_allocated"] = max_memory_allocated / divisor
logs["max_memory_allocated"] = max_memory_allocated >> 20

这个之前是MB单位,建议不要改变了,保持原来写法。用除法的话,是浮点数,还有小数位的问题,建议直接位运算,MB为单位

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

Copy link

codecov bot commented May 7, 2024

Codecov Report

Attention: Patch coverage is 13.33333% with 13 lines in your changes are missing coverage. Please review.

Project coverage is 55.41%. Comparing base (09a0ce7) to head (4821ce2).
Report is 16 commits behind head on develop.

Files Patch % Lines
paddlenlp/trainer/trainer.py 13.33% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8374      +/-   ##
===========================================
+ Coverage    55.36%   55.41%   +0.04%     
===========================================
  Files          614      615       +1     
  Lines        96016    96241     +225     
===========================================
+ Hits         53164    53335     +171     
- Misses       42852    42906      +54     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -738,7 +738,7 @@ class TrainingArguments:
metadata={"help": "The path to a folder with a valid checkpoint for your model."},
)
skip_memory_metrics: bool = field(
default=True, metadata={"help": "Whether or not to skip adding of memory profiler reports to metrics."}
default=False, metadata={"help": "Whether or not to skip adding of memory profiler reports to metrics."}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个别改了吧,你们需要用,自己打看

Copy link
Contributor

@ZHUI ZHUI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators May 8, 2024
@PaddlePaddle PaddlePaddle unlocked this conversation May 8, 2024
@ZHUI ZHUI closed this May 9, 2024
@ZHUI ZHUI reopened this May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants