
Distributed FATE training is slower: same data, same partitions, same parallelism, same training parameters #567

Open
henshy opened this issue Feb 22, 2022 · 4 comments


henshy commented Feb 22, 2022

Data size: 400k+ samples × 450 features on each side
Model: SecureBoost
Machines & environment: multiple nodes (32 cores, 64 GB each), internal network
Distributed setup: k8s cluster, 2 nodes per party (2 : 2)
Job parameters:

    "common": {
        "job_type": "train",
        "task_cores": 32,
        "task_parallelism": 1,
        "computing_partitions": 32
    }
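For orientation, the interaction among these three settings can be checked with a quick calculation. This is a hedged reading, assuming the usual FATE semantics where the data is split into `computing_partitions` partitions and each task may use up to `task_cores` cores:

```python
# Hedged sketch: sanity-check the resource settings in the job config above.
# Assumes (not confirmed in this thread) that partitions are processed in
# parallel across the available cores.

task_cores = 32
task_parallelism = 1
computing_partitions = 32

# One partition per core: every core processes exactly one partition per task.
partitions_per_core = computing_partitions / (task_cores * task_parallelism)
print(partitions_per_core)  # 1.0

# In a 2-node cluster, the same 32 partitions are spread across both nodes.
nodes = 2
partitions_per_node = computing_partitions / nodes
print(partitions_per_node)  # 16.0
```

With this configuration there is no queueing of partitions behind cores, so any distributed slowdown would have to come from per-partition overhead rather than core starvation.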
Algorithm parameters:

    "common": {
        "hetero_secure_boost_0": {
            "task_type": "classification",
            "objective_param": {
                "objective": "cross_entropy"
            },
            "validation_freqs": 1,
            "encrypt_param": {
                "method": "Paillier",
                "key_length": 2048
            },
            "learning_rate": 0.1,
            "num_trees": 10,
            "tree_param": {
                "max_depth": 5
            }
        },
        "evaluation_0": {
            "eval_type": "binary"
        },
        "data_transform_0": {
            "input_format": "sparse"
        },
        "data_transform_1": {
            "input_format": "sparse"
        }
    }
Training time (taking the host side as an example):
Distributed computation:
(screenshot)
Standalone computation:
(screenshot)
Distributed communication:
(screenshot)
Standalone communication:
(screenshot)
Timing summary:
In distributed mode, mapReducePartitions took over 8000 seconds, while standalone took only just over 4000 seconds.
Network communication shows the same pattern: in distributed mode, the get of encrypted_grad_and_hess took 4343 seconds versus 2511 seconds standalone.
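The slowdown ratios implied by the numbers above (the mapReducePartitions figures are approximate, "8000+" vs "4000+"; the communication figures are as reported):

```python
# Ratios of distributed vs. standalone timings reported above (seconds).
distributed_mrp = 8000   # approximate ("8000+ s"), mapReducePartitions
standalone_mrp = 4000    # approximate ("4000+ s")
distributed_get = 4343   # get of encrypted_grad_and_hess, distributed
standalone_get = 2511    # get of encrypted_grad_and_hess, standalone

mrp_ratio = round(distributed_mrp / standalone_mrp, 2)
get_ratio = round(distributed_get / standalone_get, 2)
print(mrp_ratio)  # 2.0
print(get_ratio)  # 1.73
```

So both computation and communication are roughly 1.7x to 2x slower in the distributed run, despite identical data, partitions, parallelism, and training parameters.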

@sunyinggang

Could you tell us whether this dataset is real or synthetically generated?

gxcuit commented May 19, 2022

Which version are you running? Has the problem been resolved?

@JayzzJie

I hit the same problem: a 2.2.0 standalone deployment (nodemanager1) takes about half the time of a 2.4.3 cluster deployment (nodemanager2).

@dylan-fan (Collaborator)

Scheduling between distributed nodes adds overhead. The advantage of distributed mode should be that it can use more cores; try giving the distributed job more cores and see.
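This suggestion can be illustrated with a toy cost model (not FATE code; all numbers are made up for illustration): if each remote partition pays a fixed scheduling/serialization cost, distributed mode at the same core count is strictly slower than standalone, and only additional cores claw the time back.

```python
def wall_time(total_work_s, partitions, cores, per_partition_overhead_s):
    """Toy model: partitions run in waves of `cores`; each partition pays
    its share of the compute plus a fixed scheduling/transfer overhead."""
    waves = -(-partitions // cores)  # ceiling division
    return waves * (total_work_s / partitions + per_partition_overhead_s)

WORK = 4000.0  # seconds of pure compute (illustrative, not measured)

# Standalone: 32 partitions on 32 local cores, negligible overhead.
standalone = wall_time(WORK, partitions=32, cores=32, per_partition_overhead_s=1.0)

# Distributed, same 32 cores: every partition now pays remote scheduling cost.
dist_same = wall_time(WORK, partitions=32, cores=32, per_partition_overhead_s=120.0)

# Distributed with twice the cores (and partitions to match): the per-partition
# compute share halves, so the overhead is amortized over less idle time.
dist_more = wall_time(WORK, partitions=64, cores=64, per_partition_overhead_s=120.0)

print(standalone, dist_same, dist_more)  # 126.0 245.0 182.5
```

In this model, the fixed overhead dominates at equal core counts, matching the roughly 2x slowdown reported in the issue; adding cores (and partitions) shrinks the gap but does not eliminate it, which is consistent with distributed mode only paying off for workloads large enough to saturate the extra cores.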
