This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

CE model alignment #45

Open
guochaorong opened this issue May 14, 2018 · 1 comment

Comments

@guochaorong
Collaborator

guochaorong commented May 14, 2018

Add multi-GPU support to the CE models; the multi-GPU speedup-ratio metric for Model CE still needs to be verified.
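The issue does not define the speedup ratio. A common convention, assumed here, is multi-GPU throughput divided by single-GPU throughput, with scaling efficiency being that ratio divided by the GPU count. A minimal sketch under that assumption (the function name and the sample numbers are hypothetical, not taken from CE):

```python
def speedup_ratio(single_gpu_speed, multi_gpu_speed, num_gpus):
    """Return (speedup, efficiency) for a multi-GPU run.

    Both speeds must be in the same unit, e.g. samples per second.
    This definition is an assumption; the issue does not specify one.
    """
    if single_gpu_speed <= 0 or num_gpus <= 0:
        raise ValueError("single_gpu_speed and num_gpus must be positive")
    speedup = multi_gpu_speed / single_gpu_speed
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Made-up numbers for illustration: 100 samples/s on 1 GPU, 340 on 4 GPUs.
speedup, efficiency = speedup_ratio(100.0, 340.0, 4)
```

An efficiency close to 1.0 would mean near-linear scaling across the cards.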

@guochaorong guochaorong created this issue from a note in Evaluations coming to bowl (To do) May 14, 2018
@guochaorong
Collaborator Author

I went through the models in CE (see the table attached below).
The models are:
image_classification, vgg16, mnist, object_detection, resnet30, resnet50,
seq2seq, sequence_tagging_for_ner, text_classification, transformer, language_model, lstm

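The items that need to be added and aligned are:

  1. Switch all models to run on multiple GPUs (4 cards). (Later I will move the GPU selection outside the model scripts, so that each model runs once on a single GPU and once on multiple GPUs.)

  2. Each model's evaluation metrics need to include these 4 values: acc/ppl, cost, mem, and duration.

  3. At the moment only the diff of the 4 metrics above is monitored. I have seen two unexpected cases: (1) the run is very short and acc is very low (0.1); (2) the run goes through many passes and acc is still very low (0.1; the model itself has a problem).
    Interim plan: lengthen the runs that have very few passes (to about 30 min) and bring acc above 0.5 for every model.
    (Later I will add an alert on an absolute acc threshold.)

  4. Use pre-downloaded datasets everywhere (rather than downloading them on every run), placed in the default /root/.cache/paddle/dataset directory.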
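Item 3 can be illustrated with a small check that combines the existing diff monitoring with the planned absolute threshold. This is only a sketch: the function name, the 5% diff tolerance, and the message format are assumptions, not CE's actual implementation; the 0.5 floor comes from the interim plan above.

```python
def check_kpi(name, baseline, actual, diff_ratio=0.05, min_value=None):
    """Return a list of alert messages for one KPI value.

    diff_ratio: allowed relative change against the baseline (assumed 5%).
    min_value:  optional absolute floor, e.g. 0.5 for accuracy.
    """
    alerts = []
    # Relative-diff check: catches regressions against the stored baseline.
    if baseline != 0:
        rel = abs(actual - baseline) / abs(baseline)
        if rel > diff_ratio:
            alerts.append("%s changed by %.1f%% vs baseline" % (name, rel * 100))
    # Absolute-floor check: catches a value that is stable but too low.
    if min_value is not None and actual < min_value:
        alerts.append("%s=%.4f is below the floor %.4f" % (name, actual, min_value))
    return alerts


# An accuracy stuck at 0.1 passes the diff check but trips the floor.
print(check_kpi("train_acc", baseline=0.10, actual=0.10, min_value=0.5))
```

The diff check alone stays silent when a model is consistently bad, which is the gap the threshold alert is meant to close.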

| Model | Dataset | Passes and current run status | Evaluation metrics | Parameters |
| --- | --- | --- | --- | --- |
| Lstm | Movie reviews; Layers:words DynamicRNN; paddle.dataset.imdb as imdb; http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | 1 pass. Pass = 0, Iter = 49, Loss = 0.713064, Accuracy = 0.593750 | `nvidia-smi --id=%s --query-compute-apps=used_memory --format=csv -lms 1 > memory.txt`; imdb_32_train_speed, imdb_32_gpu_memory | batch_size: 32, device: GPU, emb_dim: 512, gpu_id: 0, hidden_dim: 512, iterations: 50, skip_batch_num: 5 |
| object_detection | pascalvoc and coco; expected under /data/, but not present there | Passes: 2. IOError: [Errno 2] No such file or directory: '/data/pascalvoc/label_list'. The data needs to be placed under /data | train_cost_kpi, train_speed_kpi | batch_size: 64, is_toy: 0, iterations: 120, learning_rate: 0.001, num_passes: 2, parallel: True, use_gpu: True |
| Resnet50 | Flowers, cifar; http://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz | Passes: 29 (does not converge). Pass:2, Loss:3.229035, Train Accuray:0.247656, Test Accuray:0.176471, Handle Images Duration: 63.949636. Pass:29, Loss:0.026319, Train Accuray:0.993359, Test Accuray:0.559400, Handle Images Duration: 22.501337 | cifar10_128_train_acc_kpi, cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_64_train_speed_kpi, flowers_64_gpu_memory_kpi. A thread is started to collect mem info; acc etc. is not evaluated | batch_size: 64, data_format: NCHW, data_set: flowers, device: GPU, infer_only: False, iterations: 80, model: resnet_imagenet, pass_num: 3, skip_batch_num: 5 |
| language_model | /root/.cache/paddle/dataset/imikolov/simple-examples.tgz | ppl:61.667, time_cost(s):18.544248 | | |
| sequence_tagging_for_ner | http://cs224d.stanford.edu/assignment2/assignment2.zip | Passes: 22. download data error! OK after adding the data directory. [TestSet] pass_id:2200 (pass num increases by 100 each time) pass_precision:[0.18181819] pass_recall:[0.125] pass_f1_score:[0.14814815] | train_acc_kpi, pass_duration_kpi | |
| text_classification | Imdb; http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | Pass:14, avg_acc: 0.999800, avg_cost: 0.002255 | | |
| Vgg16 | flowers/imagelabels.mat; http://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat | 1 pass, cifar10. Pass: 1, Loss: 1.810090, Train Accuray: 0.234375. Pass: 49, Loss: 3.561218, Train Accuray: 0.093750 | cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_32_train_speed_kpi, flowers_32_gpu_memory_kpi. A thread is started to collect mem info; acc etc. is not evaluated | |
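The Lstm row samples GPU memory by redirecting `nvidia-smi` output to memory.txt. A rough sketch of reducing such a log to one peak-memory value follows. It assumes each sample line has the form `1234 MiB` and skips anything else, such as a CSV header; the function name is hypothetical and the exact log layout should be checked against a real memory.txt.

```python
import re

# Matches one sample line such as "1234 MiB" (assumed log format).
_MIB_LINE = re.compile(r"^\s*(\d+)\s*MiB\s*$")


def peak_gpu_memory_mib(log_text):
    """Return the largest MiB sample found in the log text, or None."""
    samples = []
    for line in log_text.splitlines():
        match = _MIB_LINE.match(line)
        if match:  # header and blank lines do not match and are skipped
            samples.append(int(match.group(1)))
    return max(samples) if samples else None


# Illustrative log content, not real measurements.
example_log = "used_gpu_memory [MiB]\n512 MiB\n2048 MiB\n1024 MiB\n"
peak = peak_gpu_memory_mib(example_log)
```

In practice the text would be read from memory.txt once the training run has finished.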
           

@guochaorong guochaorong moved this from To do to In progress in Evaluations coming to bowl May 14, 2018
@guochaorong guochaorong moved this from In progress to Done in Evaluations coming to bowl Aug 13, 2018