
random failure in processing seqpool_concat_fuse_pass when doing stress test for test_analyzer_small_dam #16586

Closed
LeoZhao-Habana opened this issue Apr 1, 2019 · 18 comments · Fixed by #16756

@LeoZhao-Habana (Contributor) commented Apr 1, 2019

System information
- PaddlePaddle version: develop branch
- CPU: MKL ON/OFF
- GPU: No
- OS Platform: Linux
- Python version: N/A
To Reproduce
Run `ctest -R test_analyzer_small_dam -V` for more than 5000 iterations.

Describe your current behavior
Random failure during seqpool_concat_fuse_pass processing, in builds with both cmake options -DWITH_MKL=ON and -DWITH_MKL=OFF.

Other info / logs

```
116: Test command: /home/leozhao/Paddle/build/paddle/fluid/inference/tests/api/test_analyzer_small_dam "--inference_model_dir=/home/leozhao/Paddle/build/third_party/inference_demo/word2vec/word2vec.inference.model" "--infer_model=/home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model" "--infer_data=/home/leozhao/Paddle/build/third_party/inference_demo/small_dam/data.txt" "--max_turn_num=1"
116: Environment variables:
116:  FLAGS_cudnn_deterministic=true
116: Test timeout computed to be: 600
116: [==========] Running 8 tests from 1 test case.
116: [----------] Global test environment set-up.
116: [----------] 8 tests from Analyzer_dam
116: [ RUN      ] Analyzer_dam.profile
116: WARNING: Logging before InitGoogleLogging() is written to STDERR
116: I0331 21:58:43.398690 27293 analyzer_dam_tester.cc:178] The number of samples to be test: 1
116: I0331 21:58:43.398905 27293 tester_helper.h:65] AnalysisConfig {
116:   NativeConfig {
116:     PaddlePredictor::Config {
116:       model_dir:
116:     }
116:     use_gpu: 0
116:     device: 0
116:     fraction_of_gpu_memory: 0
116:     specify_input_name: 1
116:     cpu_num_threads: 1
116:   }
116:   prog_file: /home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model/__model__
116:   param_file: /home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model/param
116:   enable_ir_optim: 1
116:   enable_ir_optim: 1
116:   use_feed_fetch_ops: 1
116:   use_tensorrt: 0
116:   use_mkldnn: 0
116: }
116: --- Running analysis [ir_graph_build_pass]
116: --- Running analysis [ir_analysis_pass]
116: --- Running IR pass [infer_clean_graph_pass]
116: --- Running IR pass [attention_lstm_fuse_pass]
116: --- Running IR pass [seqpool_concat_fuse_pass]
1/1 Test #116: test_analyzer_small_dam ..........***Exception: Other  1.25 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   1.28 sec
```
@LeoZhao-Habana (Contributor, Author)

withlog.zip

@luotao1 (Contributor) commented Apr 1, 2019

Please use ``` or ` to mark up code blocks; the issue formatting is a bit messy.

@tensor-tang (Contributor) commented Apr 2, 2019

> --- Running IR pass [seqpool_concat_fuse_pass]
> 1/1 Test : test_analyzer_small_dam ..........***Exception: Other 1.25 sec

How can you confirm it failed inside seqpool_concat_fuse_pass? Do you have a traceback?

@LeoZhao-Habana (Contributor, Author)

I can't confirm; it's just from log analysis, since the run never enters the next fuse pass.
I haven't had a chance to get a backtrace; the reproduce rate is very low.

@tensor-tang (Contributor)

I still could not reproduce it with either the release or the debug build.

Could it be related to #16688?

@LeoZhao-Habana (Contributor, Author)

It is quite hard to reproduce, and I'm not sure whether it is related to #16688.

@LeoZhao-Habana (Contributor, Author)

I think it is due to some system issue. I ran 3 processes doing the same test simultaneously, and they all hit the same failure within about one second of each other.
Sometimes it fails in another place, like:

```
118: [       OK ] Analyzer_dam.profile_mkldnn (1801 ms)
118: [ RUN      ] Analyzer_dam.fuse_statis
118: --- Running analysis [ir_graph_build_pass]
1/1 Test #118: test_analyzer_small_dam ..........***Exception: Other  4.01 sec
```

@tensor-tang (Contributor)

I ran 10000 times in both release and debug mode with the command `./paddle/fluid/inference/tests/api/test_analyzer_small_dam --infer_model=third_party/inference_demo/small_dam/model/ --infer_data=third_party/inference_demo/small_dam/data.txt --max_turn_num=1 --gtest_filter=Analyzer_dam.profile`, and all runs passed.

@LeoZhao-Habana (Contributor, Author) commented Apr 8, 2019

My command is `ctest -R test_analyzer_small_dam -V` in a batch-run script; it runs the whole test suite in sequence rather than a single case, otherwise some failures can't be reproduced.

@tensor-tang (Contributor)

`make test ARGS="-R test_analyzer_small_dam -V"` has run 2181 times so far and still passes.

> it needs a test suite sequence

Could it be related to some environment exports?

@luotao1 (Contributor) commented Apr 8, 2019

Since seqpool_concat_fuse_pass is not commonly used, how about disabling it by default? @tensor-tang

@tensor-tang (Contributor)

OK for me; we can enable seqpool_concat_fuse_pass only in the specific tests.

But the problem would still be there, since as @LeoZhao-Intel said, it sometimes fails in another place.

@LeoZhao-Habana (Contributor, Author)

I suspect this is some kind of system issue: it impacts all test processes at the same time, not just one, and the reproduce rate is very low. I suggest we keep this issue open to monitor it.

@tensor-tang (Contributor)

Attaching more results from tests running all night: `make test ARGS="-R test_analyzer_small_dam -V"` passed 3977 times in release mode and 1155 times in debug mode.

@LeoZhao-Habana (Contributor, Author)

What is seqpool_concat_fuse used for? From the log, this pass seems to spend a lot of time on graph pattern detection.

@tensor-tang (Contributor)

```cpp
/**
 * Fuse SequencePool (with sum pooltype yet) and Concat;
 *
 * Before fuse:
 *      |      |     ...     |
 *  seq_pool, seq_pool, ... seq_pool
 *      \      |     ...    /
 *            concat
 *              |
 * After fuse:
 *      \       |       /
 *     FusionSeqPoolConcat
 *              |
 */
```

Because the pattern is very large and supports up to 200 inputs.
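For intuition, here is a minimal numeric sketch (plain Python, not Paddle code) of the equivalence this pass relies on: several sum-pooled sequences feeding a concat produce the same output as one fused op that pools each input and writes straight into the concatenated buffer. All names below are illustrative, not Paddle APIs.

```python
def seq_pool_sum(seq):
    """Sum-pool a variable-length sequence of feature vectors into one vector."""
    return [sum(col) for col in zip(*seq)]

def unfused(seqs):
    """Original graph shape: seq_pool on each input, then concat."""
    pooled = [seq_pool_sum(s) for s in seqs]
    return [x for vec in pooled for x in vec]  # concat

def fusion_seqpool_concat(seqs):
    """Fused op: one pass that pools each input and appends directly
    to the concatenated output, avoiding the intermediate tensors."""
    out = []
    for s in seqs:
        out.extend(seq_pool_sum(s))
    return out

# Two sequences of different lengths, feature dim 2.
seqs = [[[1, 2], [3, 4]], [[5, 6]]]
assert unfused(seqs) == fusion_seqpool_concat(seqs) == [4, 6, 5, 6]
```

The fusion saves the per-input intermediate outputs and the separate concat kernel, but the pattern matcher first has to find N parallel seq_pool→concat edges in the graph, which is why detection gets expensive when N can be as large as 200.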

@luotao1 (Contributor) commented Apr 9, 2019

Then could you create a PR to disable seqpool_concat_fuse_pass by default, which would also reduce the CI time? @tensor-tang

@tensor-tang (Contributor)

Yes, WIP.
