
random failure in processing seqpool_concat_fuse_pass when doing stress test for test_analyzer_small_dam #16586

Closed
LeoZhao-Habana opened this issue Apr 1, 2019 · 18 comments · Fixed by #16756

@LeoZhao-Habana (Contributor) commented Apr 1, 2019

System information
- PaddlePaddle version: develop branch
- CPU: MKL ON/OFF
- GPU: No
- OS Platform: Linux
- Python version: N/A
To Reproduce
Run `ctest -R test_analyzer_small_dam -V` for more than 5000 iterations.

Describe your current behavior
Random failure during seqpool_concat_fuse_pass processing, in builds with both cmake options -DWITH_MKL=ON and -DWITH_MKL=OFF.

Other info / logs

```
116: Test command: /home/leozhao/Paddle/build/paddle/fluid/inference/tests/api/test_analyzer_small_dam "--inference_model_dir=/home/leozhao/Paddle/build/third_party/inference_demo/word2vec/word2vec.inference.model" "--infer_model=/home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model" "--infer_data=/home/leozhao/Paddle/build/third_party/inference_demo/small_dam/data.txt" "--max_turn_num=1"
116: Environment variables:
116:  FLAGS_cudnn_deterministic=true
116: Test timeout computed to be: 600
116: [==========] Running 8 tests from 1 test case.
116: [----------] Global test environment set-up.
116: [----------] 8 tests from Analyzer_dam
116: [ RUN      ] Analyzer_dam.profile
116: WARNING: Logging before InitGoogleLogging() is written to STDERR
116: I0331 21:58:43.398690 27293 analyzer_dam_tester.cc:178] The number of samples to be test: 1
116: I0331 21:58:43.398905 27293 tester_helper.h:65] AnalysisConfig {
116:   NativeConfig {
116:     PaddlePredictor::Config {
116:       model_dir:
116:     }
116:     use_gpu: 0
116:     device: 0
116:     fraction_of_gpu_memory: 0
116:     specify_input_name: 1
116:     cpu_num_threads: 1
116:   }
116:   prog_file: /home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model/__model__
116:   param_file: /home/leozhao/Paddle/build/third_party/inference_demo/small_dam/model/param
116:   enable_ir_optim: 1
116:   enable_ir_optim: 1
116:   use_feed_fetch_ops: 1
116:   use_tensorrt: 0
116:   use_mkldnn: 0
116: }
116: --- Running analysis [ir_graph_build_pass]
116: --- Running analysis [ir_analysis_pass]
116: --- Running IR pass [infer_clean_graph_pass]
116: --- Running IR pass [attention_lstm_fuse_pass]
116: --- Running IR pass [seqpool_concat_fuse_pass]
1/1 Test #116: test_analyzer_small_dam ..........***Exception: Other  1.25 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   1.28 sec
```
@LeoZhao-Habana (Contributor, Author)

withlog.zip

@luotao1 (Contributor) commented Apr 1, 2019

Please use ``` or ` to mark up code blocks; the issue formatting is a bit messy.

@tensor-tang (Contributor) commented Apr 2, 2019

> --- Running IR pass [seqpool_concat_fuse_pass]
> 1/1 Test : test_analyzer_small_dam ..........***Exception: Other 1.25 sec

How can you confirm it failed inside seqpool_concat_fuse_pass? Do you have a traceback?

@LeoZhao-Habana (Contributor, Author)

I can't confirm; it's just from log analysis, since the run never enters the next fuse pass.
I haven't had a chance to get a backtrace; the reproduce rate is very low.

@tensor-tang (Contributor)

I still could not reproduce it with either the release or the debug build.

Could it be related to #16688?

@LeoZhao-Habana (Contributor, Author)

It is quite hard to reproduce, and I'm not sure whether it is related to #16688.

@LeoZhao-Habana (Contributor, Author)

I think it is due to some system issue. I ran 3 processes doing the same test simultaneously, and they all hit the same failure within about one second of each other.
Sometimes it fails in another place, like:

```
118: [       OK ] Analyzer_dam.profile_mkldnn (1801 ms)
118: [ RUN      ] Analyzer_dam.fuse_statis
118: --- Running analysis [ir_graph_build_pass]
1/1 Test #118: test_analyzer_small_dam ..........***Exception: Other  4.01 sec
```

@tensor-tang (Contributor)

I ran 10000 times in both release and debug mode with the command `./paddle/fluid/inference/tests/api/test_analyzer_small_dam --infer_model=third_party/inference_demo/small_dam/model/ --infer_data=third_party/inference_demo/small_dam/data.txt --max_turn_num=1 --gtest_filter=Analyzer_dam.profile`, and all runs passed.

@LeoZhao-Habana (Contributor, Author) commented Apr 8, 2019

My command is `ctest -R test_analyzer_small_dam -V` in a batch-run script; it runs the whole test suite in sequence rather than a single case, otherwise some failures can't be reproduced.

@tensor-tang (Contributor)

`make test ARGS="-R test_analyzer_small_dam -V"` has run 2181 times so far and still passes.

> it needs a test suite sequence

Could it be related to some environment exports?

@luotao1 (Contributor) commented Apr 8, 2019

Since seqpool_concat_fuse_pass is not commonly used, how about disabling it by default? @tensor-tang

@tensor-tang (Contributor)

OK for me; we can enable seqpool_concat_fuse_pass only in the specific tests.

But the problem would still be there, since as @LeoZhao-Intel said, it sometimes fails in another place.

@LeoZhao-Habana (Contributor, Author)

I suspect this is some kind of system issue: it impacts all test processes at the same time, not just one, and the reproduce rate is very low. I suggest we keep this issue open to monitor it.

@tensor-tang (Contributor)

Attaching more results from tests running all night: `make test ARGS="-R test_analyzer_small_dam -V"` passed 3977 times in release mode and 1155 times in debug mode.

@LeoZhao-Habana (Contributor, Author)

What is seqpool_concat_fuse used for? From the log, this pass seems to spend a lot of time on graph pattern detection.

@tensor-tang (Contributor)

```cpp
/**
 * Fuse SequencePool (with sum pooltype yet) and Concat;
 *
 * Before fuse:
 *      |      |     ...     |
 *  seq_pool, seq_pool, ... seq_pool
 *      \      |     ...    /
 *            concat
 *              |
 * After fuse:
 *      \       |       /
 *     FusionSeqPoolConcat
 *              |
 */
```

Because the pattern is very large and supports up to 200 inputs.
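For intuition, here is a minimal numeric sketch (plain Python, not Paddle code) of the equivalence this pass relies on: several sum-pooled sequences feeding a concat produce the same output as one fused op that pools each input and writes straight into the concatenated buffer. All names below are illustrative, not Paddle APIs.

```python
def seq_pool_sum(seq):
    """Sum-pool a variable-length sequence of feature vectors into one vector."""
    return [sum(col) for col in zip(*seq)]

def unfused(seqs):
    """Original graph shape: seq_pool on each input, then concat."""
    pooled = [seq_pool_sum(s) for s in seqs]
    return [x for vec in pooled for x in vec]  # concat

def fusion_seqpool_concat(seqs):
    """Fused op: one pass that pools each input and appends directly
    to the concatenated output, avoiding the intermediate tensors."""
    out = []
    for s in seqs:
        out.extend(seq_pool_sum(s))
    return out

# Two sequences of different lengths, feature dim 2.
seqs = [[[1, 2], [3, 4]], [[5, 6]]]
assert unfused(seqs) == fusion_seqpool_concat(seqs) == [4, 6, 5, 6]
```

The fusion saves the per-input intermediate outputs and the separate concat kernel, but the pattern matcher first has to find N parallel seq_pool→concat edges in the graph, which is why detection gets expensive when N can be as large as 200.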

@luotao1 (Contributor) commented Apr 9, 2019

Then could you create a PR to disable seqpool_concat_fuse_pass by default, which would also reduce the CI time? @tensor-tang

@tensor-tang (Contributor)

Yes, WIP.
