
Distributed FATE training is slower: same data, same partitions, same parallelism, same training parameters #567

Open
henshy opened this issue Feb 22, 2022 · 4 comments


henshy commented Feb 22, 2022

Data size: 400k+ samples × 450 features on each side
Model: SecureBoost
Machines & environment: multiple nodes (32 cores, 64 GB each), internal network
Distributed setup: k8s cluster, 2 nodes per party (2 : 2)
Job parameters:

    "common": {
        "job_type": "train",
        "task_cores": 32,
        "task_parallelism": 1,
        "computing_partitions": 32
    }
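For orientation, the interaction among these three settings can be checked with a quick calculation. This is a hedged reading, assuming the usual FATE semantics where the data is split into `computing_partitions` partitions and each task may use up to `task_cores` cores:

```python
# Hedged sketch: sanity-check the resource settings in the job config above.
# Assumes (not confirmed in this thread) that partitions are processed in
# parallel across the available cores.

task_cores = 32
task_parallelism = 1
computing_partitions = 32

# One partition per core: every core processes exactly one partition per task.
partitions_per_core = computing_partitions / (task_cores * task_parallelism)
print(partitions_per_core)  # 1.0

# In a 2-node cluster, the same 32 partitions are spread across both nodes.
nodes = 2
partitions_per_node = computing_partitions / nodes
print(partitions_per_node)  # 16.0
```

With this configuration there is no queueing of partitions behind cores, so any distributed slowdown would have to come from per-partition overhead rather than core starvation.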
Algorithm parameters:

    "common": {
        "hetero_secure_boost_0": {
            "task_type": "classification",
            "objective_param": {
                "objective": "cross_entropy"
            },
            "validation_freqs": 1,
            "encrypt_param": {
                "method": "Paillier",
                "key_length": 2048
            },
            "learning_rate": 0.1,
            "num_trees": 10,
            "tree_param": {
                "max_depth": 5
            }
        },
        "evaluation_0": {
            "eval_type": "binary"
        },
        "data_transform_0": {
            "input_format": "sparse"
        },
        "data_transform_1": {
            "input_format": "sparse"
        }
    }
Training time (taking the host side as an example):
Distributed computation:
(screenshot)
Standalone computation:
(screenshot)
Distributed communication:
(screenshot)
Standalone communication:
(screenshot)
Timing summary:
In distributed mode, mapReducePartitions took over 8000 seconds, while standalone took only just over 4000 seconds.
Network communication shows the same pattern: in distributed mode, the get of encrypted_grad_and_hess took 4343 seconds versus 2511 seconds standalone.
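The slowdown ratios implied by the numbers above (the mapReducePartitions figures are approximate, "8000+" vs "4000+"; the communication figures are as reported):

```python
# Ratios of distributed vs. standalone timings reported above (seconds).
distributed_mrp = 8000   # approximate ("8000+ s"), mapReducePartitions
standalone_mrp = 4000    # approximate ("4000+ s")
distributed_get = 4343   # get of encrypted_grad_and_hess, distributed
standalone_get = 2511    # get of encrypted_grad_and_hess, standalone

mrp_ratio = round(distributed_mrp / standalone_mrp, 2)
get_ratio = round(distributed_get / standalone_get, 2)
print(mrp_ratio)  # 2.0
print(get_ratio)  # 1.73
```

So both computation and communication are roughly 1.7x to 2x slower in the distributed run, despite identical data, partitions, parallelism, and training parameters.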

@sunyinggang

Could you tell us whether this dataset is real or synthetically generated?

gxcuit commented May 19, 2022

Which version are you running? Has the problem been resolved?

@JayzzJie

I hit the same problem: a 2.2.0 standalone deployment (nodemanager1) takes about half the time of a 2.4.3 cluster deployment (nodemanager2).

@dylan-fan (Collaborator)

Scheduling between distributed nodes adds overhead. The advantage of distributed mode should be that it can use more cores; try giving the distributed job more cores and see.
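This suggestion can be illustrated with a toy cost model (not FATE code; all numbers are made up for illustration): if each remote partition pays a fixed scheduling/serialization cost, distributed mode at the same core count is strictly slower than standalone, and only additional cores claw the time back.

```python
def wall_time(total_work_s, partitions, cores, per_partition_overhead_s):
    """Toy model: partitions run in waves of `cores`; each partition pays
    its share of the compute plus a fixed scheduling/transfer overhead."""
    waves = -(-partitions // cores)  # ceiling division
    return waves * (total_work_s / partitions + per_partition_overhead_s)

WORK = 4000.0  # seconds of pure compute (illustrative, not measured)

# Standalone: 32 partitions on 32 local cores, negligible overhead.
standalone = wall_time(WORK, partitions=32, cores=32, per_partition_overhead_s=1.0)

# Distributed, same 32 cores: every partition now pays remote scheduling cost.
dist_same = wall_time(WORK, partitions=32, cores=32, per_partition_overhead_s=120.0)

# Distributed with twice the cores (and partitions to match): the per-partition
# compute share halves, so the overhead is amortized over less idle time.
dist_more = wall_time(WORK, partitions=64, cores=64, per_partition_overhead_s=120.0)

print(standalone, dist_same, dist_more)  # 126.0 245.0 182.5
```

In this model, the fixed overhead dominates at equal core counts, matching the roughly 2x slowdown reported in the issue; adding cores (and partitions) shrinks the gap but does not eliminate it, which is consistent with distributed mode only paying off for workloads large enough to saturate the extra cores.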
