
Add distributed training robust cases into fluid benchmark test #11206

Closed
velconia opened this issue Jun 5, 2018 · 6 comments
@velconia (Collaborator) commented Jun 5, 2018

The PaddlePaddle robustness feature combination includes:

| Parallel Executor | slice_var_up | sync_mode | Place | Parameters Update Method | Trainer Count | Model | LearningRate Decay | Regularization | Gradient Clipping | Error Clipping |
|---|---|---|---|---|---|---|---|---|---|---|
| True | True | True | GPU | pserver | 2 | Mnist | True | L1 | True | True |
| False | False | False | CPU | nccl2 | | | False | L2 | False | False |
| | | | | | | | | None | | |

I construct the test-case pairs with AllPairs as follows:

import os
import metacomm.combinatorics.all_pairs2
all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

parameters = [ [ "--gpus 2", "--gpus 1" ] # use ParallelExecutor or not
             , [ "", "--no_split_var"] # split variables into blocks or not
             , [ "", "--async_mode" ] # async SGD or not
             , [ "--device GPU", "--device CPU" ] # train with GPU or CPU
             , [ "--model mnist" ] # model
             , [ "--update_method pserver", "--update_method nccl2" ] # parameter update method
             , [ "--learning_rate_decay_method exponential", "--learning_rate_decay_method natural_exp" ] # learning rate decay
             , [ "--weight_decay_regularizer_method L1", "--weight_decay_regularizer_method L2" ] # regularization
             , [ "--gradient_clip_method Norm", "--gradient_clip_method GlobalNorm" ] # gradient clipping
             , [ "--error_clip_method Value", "" ] # error clipping
             ]

pairwise = all_pairs(parameters)

print "PAIRWISE:"
for i, v in enumerate(pairwise):
    cmd = "python fluid_benchmark.py"
    for arg in v:
        cmd += ' ' + arg
    print "%i:\t%s" % (i, cmd)
@panyx0718 (Contributor) commented:

When this is done, I recommend adding more model variations to the suite.

@velconia (Collaborator, Author) commented Jun 8, 2018

After communicating with @typhoonzero, I divided the table above into two parts: a CI part and a CE part.

The features in the CE part should be added as test cases in ce-latest-kpi and implemented in the fluid benchmark. The CE-part features include:

| Parallel Executor | slice_var_up | sync_mode | Place | Parameters Update Method | Trainer Count | Model |
|---|---|---|---|---|---|---|
| True | True | True | GPU | pserver | 2 | resnet |
| False | False | False | CPU | nccl2 | | seq2seq |

The features in the CI part should be covered by unit tests, as listed in #11213. The CI-part features include the following (a pairwise sketch for these switches is given after the table):

| LearningRate Decay | Regularization | Gradient Clipping | Error Clipping |
|---|---|---|---|
| True | L1 | True | True |
| False | L2 | False | False |
| | None | | |
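The same AllPairs construction can also enumerate the CI-side switches. The following is only a sketch; the `key:value` switch names are illustrative, not actual unit-test options:

```python
import metacomm.combinatorics.all_pairs2

all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

# Candidate values for each CI-side feature, taken from the table above.
ci_parameters = [ [ "lr_decay:True", "lr_decay:False" ]
                , [ "regularizer:L1", "regularizer:L2", "regularizer:None" ]
                , [ "gradient_clip:True", "gradient_clip:False" ]
                , [ "error_clip:True", "error_clip:False" ]
                ]

for i, case in enumerate(all_pairs(ci_parameters)):
    print("%i:\t%s" % (i, ",".join(case)))
```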

@velconia (Collaborator, Author) commented Jun 8, 2018

The test-case pairs in the CE part were constructed with AllPairs as follows:

import os
import metacomm.combinatorics.all_pairs2

all_pairs = metacomm.combinatorics.all_pairs2.all_pairs2

def generate_aws_pserver_cmd(i, v):
    args = []
    for arg in v:
        if arg:
            args.append(arg)
    trainer_command = ','.join(args)
    pserver_command = trainer_command
    print trainer_command


aws_parameters = [ [ "gpus:2", "gpus:1" ] # use ParallelExecutor or not
                , [ "", "no_split_var:"] # split variables into blocks or not
                , [ "", "async_mode:" ] # async SGD or not
                , [ "device:GPU", "device:CPU" ] # train with GPU or CPU
                , [ "model:resnet", "model:machine_translation" ] # models
                , [ "update_method:pserver", "update_method:nccl2" ] # parameter update method
                ]

aws_pairwise = all_pairs(aws_parameters)

print "PAIRWISE:"
for i, v in enumerate(aws_pairwise):
    print("%i:" % i)
    generate_aws_pserver_cmd(i, v)

And 7 cases were constructed as follows:

PAIRWISE:
0:
gpus:2,device:GPU,model:resnet,update_method:pserver
1:
gpus:1,no_split_var:,async_mode:,device:CPU,model:machine_translation,update_method:pserver
2:
gpus:1,async_mode:,device:GPU,model:machine_translation,update_method:nccl2
3:
gpus:2,no_split_var:,device:CPU,model:resnet,update_method:nccl2
4:
gpus:2,no_split_var:,async_mode:,device:GPU,model:resnet,update_method:nccl2
5:
gpus:1,device:CPU,model:resnet,update_method:nccl2
6:
gpus:2,device:CPU,model:machine_translation,update_method:nccl2
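To turn the `key:value` case strings above back into `fluid_benchmark.py` style flags, a small helper like the following could be used. The flag mapping is an assumption for illustration, not taken from the benchmark scripts:

```python
def case_to_flags(case_str):
    # Convert "gpus:2,no_split_var:,device:GPU,..." into "--gpus 2 --no_split_var --device GPU ...".
    # A key with an empty value (e.g. "no_split_var:") becomes a bare flag.
    flags = []
    for item in case_str.split(","):
        key, _, value = item.partition(":")
        flags.append("--%s %s" % (key, value) if value else "--" + key)
    return " ".join(flags)

print(case_to_flags("gpus:2,device:GPU,model:resnet,update_method:pserver"))
# --gpus 2 --device GPU --model resnet --update_method pserver
```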

gongweibao self-assigned this Jul 10, 2018
gongweibao reopened this Jul 10, 2018
@kolinwei (Contributor) commented:

Models that the distributed CE needs to verify:

| Direction | Model | Data |
|---|---|---|
| Image classification | resnet50 | flowers |
| Image classification | Se-resnext50 | imagenet1000 |
| OCR recognition | RCNN-CTC | single-line Chinese character image dataset |
| Language model | GRU-RNN | ptb |
| Machine translation | transformer | vmt16 |

In addition, the training scale needs to cover: multi-node single-CPU, multi-node single-GPU, multi-node multi-CPU, and multi-node multi-GPU.
(Image models only need to be verified on GPU; NLP models need to be verified on both CPU and GPU.)
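As a sketch, that scale matrix can be enumerated directly. The GPU-only rule for image models follows the comment above, while the field names and the `itertools.product` enumeration are only illustrative:

```python
import itertools

# (direction, model, dataset, gpu_only) rows from the table above;
# gpu_only encodes "image models are verified on GPU only".
models = [ ("image classification", "resnet50", "flowers", True)
         , ("image classification", "Se-resnext50", "imagenet1000", True)
         , ("OCR recognition", "RCNN-CTC", "single-line Chinese character images", False)
         , ("language model", "GRU-RNN", "ptb", False)
         , ("machine translation", "transformer", "vmt16", False)
         ]

devices = ["CPU", "GPU"]
cards_per_node = ["single", "multi"]   # multi-node single-card vs. multi-node multi-card

for (direction, model, data, gpu_only), device, cards in itertools.product(models, devices, cards_per_node):
    if gpu_only and device == "CPU":
        continue
    print("%s / %s: multi-node %s-%s (data: %s)" % (direction, model, cards, device, data))
```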

kolinwei reopened this Jul 10, 2018
@gongweibao (Contributor) commented Jul 11, 2018

Models that the distributed CE needs to verify:

| Direction | Model | Data |
|---|---|---|
| Image classification | resnet50 | implementation |
| Object detection | | implementation |
| OCR recognition | | implementation |
| NLP | machine translation | implementation |
| NLP | transformer | implementation |

In addition, the training scale needs to cover: multi-node single-CPU, multi-node single-GPU, multi-node multi-CPU, and multi-node multi-GPU.
(Image models only need to be verified on GPU; NLP models need to be verified on both CPU and GPU.)

@shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it; in that case, we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you for your support of PaddlePaddle!
