
[AutoTuner] Add auto tuner to obtain optimal configuration #54460

Merged
8 commits merged into PaddlePaddle:develop on Jun 14, 2023

Conversation

Caozhou1995
Contributor

@Caozhou1995 Caozhou1995 commented Jun 8, 2023

PR types

New features

PR changes

Others

Description

Pcard-72023

Finding the optimal configuration for distributed training or inference of a large model usually requires designing multiple sets of experiments based on experience (network, parameter size, GPU memory, FLOPs, etc.) and comparing the results. This process relies heavily on human expertise, and the configuration it produces may not be globally optimal. Whenever any condition changes, the whole process has to be repeated, which makes large models hard to use.

To address the above issues, we have implemented AutoTuner based on Profiling, with the main modules as follows:

  1. Provide a clear JSON configuration so users can use AutoTuner directly, without extra coding work.
  2. Launch multiple tasks one by one, and schedule and monitor them automatically.
  3. Implement a search module and a pruning module, supporting multiple search algorithms and pruning strategies (see the sketch after this list).
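
The search and pruning modules follow a standard grid-search-with-pruning pattern. A minimal, self-contained sketch of that pattern (illustrative only; it does not use the PR's AutoTuner API, and the dimension names and pruning rule are made up):

import itertools

# Illustrative grid search with pruning; NOT the PR's AutoTuner API.
search_space = {
    "dp_degree": [1, 2, 4],
    "mp_degree": [1, 2],
    "micro_batch_size": [1, 2, 4],
}

def prune(cfg, world_size=8):
    # Example pruning rule: skip configs whose parallel degrees exceed the device count.
    return cfg["dp_degree"] * cfg["mp_degree"] > world_size

def run_trial(cfg):
    # Stand-in for launching and profiling one training task; returns a cost metric.
    return sum(cfg.values())

best_cfg, best_metric = None, float("inf")
for values in itertools.product(*search_space.values()):
    cfg = dict(zip(search_space, values))
    if prune(cfg):
        continue
    metric = run_trial(cfg)
    if metric < best_metric:
        best_cfg, best_metric = cfg, metric
print(best_cfg, best_metric)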

At present, grid search is built in for 8 dimensions: dp degree, mp degree, pp degree, micro batch size (mbs), sharding degree, sharding stage, recompute, and recompute granularity. An example JSON is shown below:
[screenshot: example auto_tuner JSON configuration]
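
The screenshot is not reproduced here; the snippet below conveys the shape of such a configuration. Except for task_limit and max_time_per_task, which appear in the code quoted later in this review, the key names are illustrative guesses and the real schema may differ:

import json

# Hypothetical auto-tuner configuration; key names other than task_limit and
# max_time_per_task are guesses, not the PR's actual schema.
tuner_cfg = {
    "dp_degree": "auto",
    "mp_degree": "auto",
    "pp_degree": "auto",
    "micro_batch_size": "auto",
    "sharding_degree": "auto",
    "sharding_stage": "auto",
    "use_recompute": "auto",
    "recompute_granularity": "auto",
    "task_limit": 100,
    "max_time_per_task": 1800,
}

with open("test.json", "w") as f:
    json.dump(tuner_cfg, f, indent=4)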

The usage is as follows:
python -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --auto_tuner_json=test.json your_train.py your_args

NOTE: Since the auto_tuner is non-invasive, users need to expose the corresponding args in their training script so that the configurations generated by auto_tuner can take effect.
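
For example, a training script can expose the tunable dimensions as command-line arguments so that the flags generated by auto_tuner actually take effect (a minimal, hypothetical sketch; the argument names are placeholders and must match whatever the tuner is configured to pass):

# your_train.py -- hypothetical sketch; argument names are placeholders.
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dp_degree", type=int, default=1)
    parser.add_argument("--mp_degree", type=int, default=1)
    parser.add_argument("--pp_degree", type=int, default=1)
    parser.add_argument("--micro_batch_size", type=int, default=1)
    parser.add_argument("--use_recompute", action="store_true")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # ... build the distributed strategy from args and start training ...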

@paddle-bot

paddle-bot bot commented Jun 8, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

Contributor

@XieYunshen XieYunshen left a comment


LGTM for set_tests_properties(test_auto_tuner PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 100)

Contributor

@zhiqiu zhiqiu left a comment


LGTM, you can refine the code based on the comments in the next PR.


process = subprocess.Popen(cmd)
process.wait()
self.assertEqual(process.returncode, 0)
Contributor


Check the config searched?

Comment on lines +297 to +305
import copy
import json
import signal
import sys
import time

from ..auto_tuner.tuner import AutoTuner
from ..auto_tuner.utils import gen_new_args
from . import controllers
Contributor


Better to import these at the top of the file.

cur_cfg = auto_tuner.search_once()

# get max time per task run
max_time_per_task = tuner_cfg.get("max_time_per_task", 1800)
Contributor


max_time_per_task -> max_time_in_seconds_per_task?
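
For reference, a per-task time budget like this is typically enforced by waiting on the launched process with a timeout (an illustrative sketch, not necessarily how the launcher in this PR implements it):

import subprocess

max_time_per_task = 1800  # seconds; matches the default in the snippet above
cmd = ["python", "your_train.py"]  # placeholder launch command for one trial
process = subprocess.Popen(cmd)
try:
    process.wait(timeout=max_time_per_task)
except subprocess.TimeoutExpired:
    process.kill()   # treat the trial as failed / over budget
    process.wait()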


def __init__(self, tuner_cfg):
self.cur_task_id = 1
self.task_limit = tuner_cfg.get("task_limit", 100)
Contributor


DEFAULT_MAX_TASK_LIMIT = 100 ?
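
That is, something along these lines (a sketch of the reviewer's suggestion, not code from the PR):

# Name the default instead of inlining the magic number 100.
DEFAULT_MAX_TASK_LIMIT = 100

def __init__(self, tuner_cfg):
    self.cur_task_id = 1
    self.task_limit = tuner_cfg.get("task_limit", DEFAULT_MAX_TASK_LIMIT)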

@zhiqiu zhiqiu merged commit e12d286 into PaddlePaddle:develop Jun 14, 2023