
Add distributed training robust cases into fluid benchmark test #11206

Closed
velconia opened this issue Jun 5, 2018 · 6 comments
@velconia (Collaborator) commented Jun 5, 2018

The PaddlePaddle robustness feature combination includes:

| Parallel Executor | slice_var_up | sync_mode | Place | Parameters Update Method | Trainer Count | Model | LearningRate Decay | Regularization | Gradient Clipping | Error Clipping |
|---|---|---|---|---|---|---|---|---|---|---|
| True | True | True | GPU | pserver | 2 | Mnist | True | L1 | True | True |
| False | False | False | CPU | nccl2 | | | False | L2 | False | False |
| | | | | | | | | None | | |

I construct the test-case pairs with AllPairs as follows:

import os
import metacomm.combinatorics.all_pairs2
all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

parameters = [ [ "--gpus 2", "--gpus 1" ] # use ParallelExecutor or not
             , [ "", "--no_split_var"] # split variables into blocks or not
             , [ "", "--async_mode" ] # async SGD or not
             , [ "--device GPU", "--device CPU" ] # train with GPU or CPU
             , [ "--model mnist" ] # model
             , [ "--update_method pserver", "--update_method nccl2" ] # parameter update method
             , [ "--learning_rate_decay_method exponential", "--learning_rate_decay_method natural_exp" ] # learning rate decay
             , [ "--weight_decay_regularizer_method L1", "--weight_decay_regularizer_method L2" ] # regularization
             , [ "--gradient_clip_method Norm", "--gradient_clip_method GlobalNorm" ] # gradient clipping
             , [ "--error_clip_method Value", "" ] # error clipping
             ]

pairwise = all_pairs(parameters)

print "PAIRWISE:"
for i, v in enumerate(pairwise):
    cmd = "python fluid_benchmark.py"
    for arg in v:
        cmd += ' ' + arg
    print "%i:\t%s" % (i, cmd)
@panyx0718 (Contributor) commented:

When this is done, I recommend adding more model variations to the suite.

@velconia (Collaborator, Author) commented Jun 8, 2018

After communicating with @typhoonzero, I divided the table above into two parts: a CI part and a CE part.

The features in the CE part should be added as test cases in ce-latest-kpi and implemented in the fluid benchmark. The CE-part features include:

| Parallel Executor | slice_var_up | sync_mode | Place | Parameters Update Method | Trainer Count | Model |
|---|---|---|---|---|---|---|
| True | True | True | GPU | pserver | 2 | resnet |
| False | False | False | CPU | nccl2 | | seq2seq |

The features in the CI part should be covered by unit tests, as listed in #11213. The CI-part features include the following (a pairwise sketch for these switches is given after the table):

| LearningRate Decay | Regularization | Gradient Clipping | Error Clipping |
|---|---|---|---|
| True | L1 | True | True |
| False | L2 | False | False |
| | None | | |
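The same AllPairs construction can also enumerate the CI-side switches. The following is only a sketch; the `key:value` switch names are illustrative, not actual unit-test options:

```python
import metacomm.combinatorics.all_pairs2

all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

# Candidate values for each CI-side feature, taken from the table above.
ci_parameters = [ [ "lr_decay:True", "lr_decay:False" ]
                , [ "regularizer:L1", "regularizer:L2", "regularizer:None" ]
                , [ "gradient_clip:True", "gradient_clip:False" ]
                , [ "error_clip:True", "error_clip:False" ]
                ]

for i, case in enumerate(all_pairs(ci_parameters)):
    print("%i:\t%s" % (i, ",".join(case)))
```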

@velconia (Collaborator, Author) commented Jun 8, 2018

The test-case pairs in the CE part were constructed with AllPairs as follows:

import os
import metacomm.combinatorics.all_pairs2

all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

def generate_aws_pserver_cmd(i, v):
    args = []
    for arg in v:
        if arg:
            args.append(arg)
    trainer_command = ','.join(args)
    pserver_command = trainer_command
    print trainer_command


aws_parameters = [ [ "gpus:2", "gpus:1" ] # use ParallelExecutor or not
                , [ "", "no_split_var:"] # split variables into blocks or not
                , [ "", "async_mode:" ] # async SGD or not
                , [ "device:GPU", "device:CPU" ] # train with GPU or CPU
                , [ "model:resnet", "model:machine_translation" ] # models
                , [ "update_method:pserver", "update_method:nccl2" ] # parameter update method
                ]

aws_pairwise = all_pairs(aws_parameters)

print "PAIRWISE:"
for i, v in enumerate(aws_pairwise):
    print("%i:" % i)
    generate_aws_pserver_cmd(i, v)

And 7 cases were constructed as follows:

PAIRWISE:
0:
gpus:2,device:GPU,model:resnet,update_method:pserver
1:
gpus:1,no_split_var:,async_mode:,device:CPU,model:machine_translation,update_method:pserver
2:
gpus:1,async_mode:,device:GPU,model:machine_translation,update_method:nccl2
3:
gpus:2,no_split_var:,device:CPU,model:resnet,update_method:nccl2
4:
gpus:2,no_split_var:,async_mode:,device:GPU,model:resnet,update_method:nccl2
5:
gpus:1,device:CPU,model:resnet,update_method:nccl2
6:
gpus:2,device:CPU,model:machine_translation,update_method:nccl2
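To turn the `key:value` case strings above back into `fluid_benchmark.py` style flags, a small helper like the following could be used. The flag mapping is an assumption for illustration, not taken from the benchmark scripts:

```python
def case_to_flags(case_str):
    # Convert "gpus:2,no_split_var:,device:GPU,..." into "--gpus 2 --no_split_var --device GPU ...".
    # A key with an empty value (e.g. "no_split_var:") becomes a bare flag.
    flags = []
    for item in case_str.split(","):
        key, _, value = item.partition(":")
        flags.append("--%s %s" % (key, value) if value else "--" + key)
    return " ".join(flags)

print(case_to_flags("gpus:2,device:GPU,model:resnet,update_method:pserver"))
# --gpus 2 --device GPU --model resnet --update_method pserver
```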

gongweibao self-assigned this Jul 10, 2018
gongweibao reopened this Jul 10, 2018
@kolinwei (Contributor) commented:

Models that the distributed CE needs to verify:

| Direction | Model | Data |
|---|---|---|
| Image classification | resnet50 | flowers |
| Image classification | Se-resnext50 | imagenet1000 |
| OCR recognition | RCNN-CTC | single-line Chinese character image dataset |
| Language model | GRU-RNN | ptb |
| Machine translation | transformer | vmt16 |

In addition, the training scale needs to cover: multi-node single-CPU, multi-node single-GPU, multi-node multi-CPU, and multi-node multi-GPU.
(Image models only need to be verified on GPU; NLP models need to be verified on both CPU and GPU.)
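As a sketch, that scale matrix can be enumerated directly. The GPU-only rule for image models follows the comment above, while the field names and the `itertools.product` enumeration are only illustrative:

```python
import itertools

# (direction, model, dataset, gpu_only) rows from the table above;
# gpu_only encodes "image models are verified on GPU only".
models = [ ("image classification", "resnet50", "flowers", True)
         , ("image classification", "Se-resnext50", "imagenet1000", True)
         , ("OCR recognition", "RCNN-CTC", "single-line Chinese character images", False)
         , ("language model", "GRU-RNN", "ptb", False)
         , ("machine translation", "transformer", "vmt16", False)
         ]

devices = ["CPU", "GPU"]
cards_per_node = ["single", "multi"]   # multi-node single-card vs. multi-node multi-card

for (direction, model, data, gpu_only), device, cards in itertools.product(models, devices, cards_per_node):
    if gpu_only and device == "CPU":
        continue
    print("%s / %s: multi-node %s-%s (data: %s)" % (direction, model, cards, device, data))
```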

kolinwei reopened this Jul 10, 2018
@gongweibao (Contributor) commented Jul 11, 2018

Models that the distributed CE needs to verify:

| Direction | Model | Data |
|---|---|---|
| Image classification | resnet50 | implementation |
| Object detection | | implementation |
| OCR recognition | | implementation |
| NLP | machine translation | implementation |
| NLP | transformer | implementation |

In addition, the training scale needs to cover: multi-node single-CPU, multi-node single-GPU, multi-node multi-CPU, and multi-node multi-GPU.
(Image models only need to be verified on GPU; NLP models need to be verified on both CPU and GPU.)

@shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it; in that case, we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you for your support of PaddlePaddle!
