multi client launch #5372

Merged
merged 53 commits into from
Jul 5, 2021

daquexian
Contributor

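This PR contains the parts of #5008 that concern multi client itself. Some of the changes come from binbin.

  1. Add an is_multi_client attribute to ProcessCtx and expose the LocalRank() and IsMultiClient() interfaces
  2. Remove the num_process_per_node attribute from BootstrapConf
  3. Add MultiClientSync as the counterpart of SingleClientSync
  4. Change the logic in init.py: when the five environment variables are detected, automatically run the multi-client env_init(); otherwise run init_default_physical_env() as before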

daquexian and others added 30 commits May 26, 2021 18:22
Signed-off-by: daquexian <daquexian566@gmail.com>
…f_multi_devices

Signed-off-by: daquexian <daquexian566@gmail.com>
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 4, 2021 05:15
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 4, 2021 14:46
Signed-off-by: daquexian <daquexian566@gmail.com>
@oneflow-ci-bot oneflow-ci-bot removed their request for review July 5, 2021 01:41
oneflow/init.py Outdated
@@ -69,7 +69,7 @@


if env_util.HasAllMultiClientEnvVars():
env_util.env_init(True)
env_util.api_env_init()
Contributor

This looks wrong, doesn't it?

Contributor Author

Right, I changed it by mistake.

Signed-off-by: daquexian <daquexian566@gmail.com>
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 5, 2021 02:58
@oneflow-ci-bot oneflow-ci-bot self-requested a review July 5, 2021 03:51
@clackhan clackhan requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 5, 2021 03:54
@oneflow-ci-bot oneflow-ci-bot removed their request for review July 5, 2021 06:59
@oneflow-ci-bot oneflow-ci-bot merged commit 5806e2b into master Jul 5, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the multi_client branch July 5, 2021 06:59
@@ -33,4 +33,7 @@ ONEFLOW_API_PYBIND11_MODULE("", m) {

m.def("GetRank", &GetRank);
m.def("GetWorldSize", &GetWorldSize);
m.def("GetNodeSize", &GetNodeSize);
m.def("GetLocalRank", &GetLocalRank);
m.def("IsMultiClient", &IsMultiClient);
Contributor

@chengtbf chengtbf Jul 6, 2021

Is this how it is used on the Python side?

import oneflow._oneflow_internal

oneflow._oneflow_internal.IsMultiClient()

@@ -203,3 +208,8 @@ def get_world_size():

"""
return oneflow._oneflow_internal.GetWorldSize()


@oneflow_export("distributed.is_multi_client")
Contributor

Oh, I see it here now.

@@ -388,6 +384,33 @@ def GetEnvDefaultParallelConf(device_tag):
return device_tag2default_parallel_conf[device_tag]


def HasAllMultiClientEnvVars():
return (
os.getenv("MASTER_ADDR")
Contributor

This condition does not look right: getenv returns the corresponding string value, but the result of the `and` chain is 0. Tested:

import os

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12139"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"

is_multi_client = (os.getenv("MASTER_ADDR") and os.getenv("MASTER_PORT")
                   and os.getenv("WORLD_SIZE") and os.getenv("RANK") and os.getenv("LOCAL_RANK"))
print("is_multi_client", is_multi_client)

Output:

is_multi_client 0

@daquexian @clackhan

Contributor Author

I tried it out: this 0 is the string "0", the value of the last string in the chain (os.getenv("LOCAL_RANK")), so `if is_multi_client` still takes the True branch. That said, HasAllMultiClientEnvVars() really should return True/False; I'll fix it.
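For reference, one way a check like this could return a real boolean is sketched below. It is an assumption about what such a fix might look like, not the change that was eventually made in the repository; `has_all_multi_client_env_vars` is a hypothetical stand-in for HasAllMultiClientEnvVars(), and the variable names are the five used in the test above.

```python
import os

# Variable names as used in the reviewer's test above.
MULTI_CLIENT_ENV_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")

# `and` returns its last operand when every operand is truthy, so this chain
# evaluates to the string "0" (which is truthy), not to the integer 0.
chained = "127.0.0.1" and "12139" and "1" and "0" and "0"


def has_all_multi_client_env_vars(environ=None):
    """Hypothetical variant of the check that always returns a bool."""
    environ = os.environ if environ is None else environ
    # all() yields True or False; an unset or empty variable counts as missing,
    # which matches the truthiness of the original `and` chain.
    return all(bool(environ.get(name)) for name in MULTI_CLIENT_ENV_VARS)
```

Wrapping the chain in `all()` keeps the same truthiness semantics as the original expression while making the return type unambiguous for callers that print or compare the result.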

@chengtbf chengtbf mentioned this pull request Jul 11, 2021
6 tasks

5 participants