Fix Jenkins issue by atamazov · Pull Request #240 · ROCm/MIOpen

atamazov · 2020-05-25T19:32:27Z

This PR contains several changes. PLEASE DO NOT SQUASH COMMITS WHEN MERGING. Also, I recommend reviewing each commit individually.

Fixes:
- adf4fdb Fixes file-based binary cache, https://github.com/AMDComputeLibraries/MLOpen/issues/2450
- 1ff1a09 Tidy fix that is required after merging "add igemm bwd v4r1 xdlops kernel" #167 but is missing from "[NFC] Fix annoying build warnings and clang-tidy errors in ROCm 3.3" #212.
- 167a6da Permanent workaround for ConvOclBwdWrW53 (NaN on gfx908)
Workarounds:
- 0335a64 Switches to using the file-based binary cache by default.
  - This should resolve this [Jenkins] Segmentation Fault in test_conv_extra config #226 (comment)
- a019ef7 [test_sqlite_perfdb] Reduce amount of work.
- 8ea9512 Disabled ConvHipImplicitGemmV4R4GenXdlopsWrWFp32. Reworked special case in ConvHipImplicitGemmV4R4GenWrWXdlops::IsApplicable().
- 661308b [test_conv_for_implicit_gemm] Disable config that fails often.
- b4d6805 [test_soft_max] Workaround for cases that
  - often fail on Jenkins
  - always fail with ROCm3.3/Radeon VII
- ea710fe [test_rnn_vanilla] [test_rnn_extra] The "Forward Inference RNN vanilla" sub-test is disabled.
- 9899bdd [test_ctc] Disabled some tests that fail on Jenkins from time to time.
- c71697a [test_lstm] The "Forward Inference LSTM" sub-test is disabled.
- a983655 [test_conv_group] Disabled WrW configs that fail during "Full long tests"
Improvements:
- 441bb75 + d15e895 [Jenkins] Added possibility to customize env settings for "make check".
- a9a1a1e Tests reordered to better utilize parallelism of ctest.
  https://github.com/ROCmSoftwarePlatform/MIOpen/blob/1ff1a09352d7d14f0ca26ecab31820cd4e36797e/test/CMakeLists.txt#L153-L163
- 0ae054c Disabled test_conv_for_implicit_gemm for INT8. Disabled many custom tests that were previously run with implied --float during non-FP32 Jenkins jobs. These tests are just wasting time when run during INT8 or HALF stages.
- d45bba4 [Jenkins] Permanently enabled "MIOPEN_LOG_LEVEL=5" in "FP32 gfx908 Hip Release All subset". Why: those who do not have gfx908 HW will be able to see failing Solvers at least.
- 0a78938 Full tests: No more than 3 parallel tests to reduce probability of failure of each individual stage and thus lower the occupancy of CI machines when the task is restarted after a failure.

Next steps

Create a ticket for each workaround introduced in this PR and/or for each /// \todo comment added to the source code.
Create a ticket related to 0ae054c. I am not sure if the initial intent was to run these custom tests with FP16, INT8, BF16. These shall be reconsidered and properly enabled, if necessary.


atamazov · 2020-05-25T20:19:19Z

I forgot to revert #228. These configs worked fine in my branch. Done now.

Let's keep previous job (#2) running on Jenkins, to collect statistics.

JehandadKhan · 2020-05-26T15:24:17Z



-def cmake_build(compiler, flags, prefixpath="/opt/rocm"){
+def cmake_build(compiler, flags, env4make, prefixpath){


Perhaps: makeenv or makeEnv

This is intended to allow modifying the environment for make (e.g. MIOPEN_LOG_LEVEL=5), so "environment for make" -> env4make. To me, makeEnv sounds like "make environment", beginning with a verb.

Of course I can change to makeEnv, if you wish, just confirm please.

Just a suggestion, since we never use variables in that format. I trust your judgement

JehandadKhan · 2020-05-26T15:24:47Z

 set(MIOPEN_ENABLE_SQLITE On CACHE BOOL "")
 # Use SQLITE for compiled kernels, when turned off this will use raw files
-set(MIOPEN_ENABLE_SQLITE_KERN_CACHE On CACHE BOOL "")
+set(MIOPEN_ENABLE_SQLITE_KERN_CACHE Off CACHE BOOL "")


This would disable a feature which is a customer requirement. If there is a bug, we should fix it rather than disable the feature. It is not clear to me from the linked comment whether we know for sure that the kernel cache is buggy.

Yes, @JehandadKhan please create a branch and investigate. This issue is blocking our Jenkins CI. We need help on resolution. Disable all testing on your branch except the long tests and track this down.

Does the CI pass if we disable this?

Yes. I started preparing this PR when my experimental branch passed 3 times out of 4 (only the "Full Long Tests" stage was used). Without this, I often saw failures like the one shown in #226 (comment).

JehandadKhan · 2020-05-26T15:29:56Z

+# Please notice that each list is also sorted and it is highly recommended
+# to keep this sorting when adding new tests.
+
+set( LONG_TESTS


This would make it complicated to add new tests, which so far is automatic (i.e. it only requires adding a test to a directory). I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.

On the other hand, this allows disabling individual tests, which may be handy sometimes.

I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.

I do not know how to do this with cmake.

You can add blacklist variables to CMakeLists.txt.
Anyway, we should confine this PR to fixing Jenkins, not modifying the overall design of the tests.

When I was working on the experimental branch, I needed to go through the tests as quickly as possible, and I found such a solution. It saved me a couple of hours, I guess. This, of course, can be moved to a separate PR. Please note that if we revert this, then the level of parallelism will be reduced (some tests will be performed alone at the end), and the likelihood of a failure will decrease.

On the other hand, if we do not squash this PR, then nothing prevents us from reverting this specific commit in two clicks and re-implementing it in whatever way we prefer.

JehandadKhan · 2020-05-26T15:31:35Z

 unsigned int DBMultiThreadedTestWork::threads_count    = 16;
-unsigned int DBMultiThreadedTestWork::common_part_size = 32;
-unsigned int DBMultiThreadedTestWork::unique_part_size = 32;
+unsigned int DBMultiThreadedTestWork::common_part_size = 16;


Reducing the parallelism can potentially mask errors which only show up under stress.

There is clearly something going on. Please help investigate.

There is clearly something going on. Please help investigate.

This was done to reduce work, not because there is an error. Please see the description of the PR.

We have problems with perfdb tests (both file-based and sqlite-based). I was working on it when comgr arrived, so I had to postpone it. Reducing load helps to reduce the probability of test failure.

How does the perf db error manifest?

terminate called after throwing an instance of 'miopen::Exception' what(): /var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/src/sqlite_db.cpp:114: Internal error while accessing SQLite database: UNIQUE constraint failed: perf_db.solver, perf_db.config, perf_db.arch, perf_db.num_cu

See http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/wrw-igemm-v4r4xdlops-fp32-fix/14/pipeline/13/ for an example.
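For readers unfamiliar with this class of error, the sketch below reproduces the same kind of SQLite failure in isolation. It is an illustration only: the table layout is guessed from the column names in the message above, the inserted values are placeholders, and `make_db`, `insert_record`, and `try_duplicate_insert` are hypothetical helpers, not MIOpen code. It shows that a plain INSERT of a second row with the same (solver, config, arch, num_cu) key is rejected.

```python
import sqlite3


def make_db():
    # In-memory database. The schema is an assumption derived from the
    # column names that appear in the error text, not the real perf db.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE perf_db ("
        " solver TEXT, config INTEGER, arch TEXT, num_cu INTEGER, params TEXT,"
        " UNIQUE(solver, config, arch, num_cu))"
    )
    return conn


def insert_record(conn, solver, config, arch, num_cu, params):
    # Plain INSERT: SQLite rejects it if a row with the same key exists.
    conn.execute(
        "INSERT INTO perf_db VALUES (?, ?, ?, ?, ?)",
        (solver, config, arch, num_cu, params),
    )


def try_duplicate_insert():
    # Insert the same key twice and return the error text, if any.
    conn = make_db()
    insert_record(conn, "SolverA", 1, "gfx908", 120, "params-1")
    try:
        insert_record(conn, "SolverA", 1, "gfx908", 120, "params-2")
    except sqlite3.IntegrityError as err:
        return str(err)
    finally:
        conn.close()
    return None


if __name__ == "__main__":
    print(try_duplicate_insert())
```

In a multi-threaded test, two writers that each decide a key is absent and then insert it could trigger this. Whether that is what happens in test_sqlite_perfdb is not established by this thread.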

#227 fixes this.

Great! Just revert 9154e70 in your branch after rebasing to develop, when/if this PR is merged, and everything will be restored.

I agree, let's get things in shape first; that PR can wait for now.

@pfultz2

Since these tests use a lot of threads we should set the RUN_SERIAL property in cmake for the tests instead of lowering the number of threads.

Good to know, thanks, we'll make this change. The problem is that the test itself is unstable -- e.g. it may fail even if run alone, depending on the compiler (clang or GCC), build options (-O), etc. Moreover, I believe it shouldn't fail even when the system is under high load, but this is not so(((

Requested changes not important for the purpose at hand.

daniellowell · 2020-05-26T18:46:25Z

I don't think you need all these modifications to the Jenkinsfile; this can probably be done with test/CMakeLists.txt. What design are you trying to achieve?

daniellowell · 2020-05-26T18:47:53Z

If we are just fixing the reliably failing Jenkins, don't we just want to start here?

Switches to using the file-based binary cache by default.
This should resolve this #226 (comment)

atamazov · 2020-05-26T18:54:58Z

I don't think you need all these modifications to the Jenkinsfile; this can probably be done with test/CMakeLists.txt. What design are you trying to achieve?

If you mean 441bb75, then this is a side product of my investigations. It makes it easy to customize CTEST_PARALLEL_LEVEL or MIOPEN_LOG_LEVEL, disable algorithms, etc. for each testing step individually.

The design is cumbersome, but to me, it is definitely better than nothing. Of course I can revert it, if we do not need this feature. Just confirm.

daniellowell · 2020-05-26T18:56:01Z

Proceed. Let's see where this goes.

atamazov · 2020-05-26T18:58:16Z

@daniellowell

If we are just fixing the reliably failing Jenkins, don't we just want to start here?

I have already moved on from that starting point.

atamazov · 2020-05-26T19:06:25Z

Switches to using the file-based binary cache by default.
This should resolve this #226 (comment)

Please note that this resolves only one comment from #226, but not the entire ticket. There is some other stuff that affects reliability. One you already know -- unreliable perfdb tests, and reducing load helps. It is quite possible that more will come... stuff is under testing now.

atamazov · 2020-05-28T20:50:41Z

Added another three W/As. PR description updated.

atamazov · 2020-05-28T20:53:50Z

@daniellowell

Hmm the error and diff look fine on the rnn vanilla test.

I guess that the test does more than one verification on each pass. The test prints error/diff only for the first verification (test harness limitation?)

atamazov · 2020-05-29T15:07:13Z

Description updated. Please note the bullet that describes 0ae054c.

atamazov · 2020-05-30T11:33:53Z

Two more commits. Description updated.

atamazov · 2020-05-30T23:36:15Z

As of d45bba4, the tests completed successfully. We can merge it to unblock our CI. However, I recommend waiting a little longer and merging the next commit.

Right now "Full Long Tests" consists of 7 jobs. The probability of a random failure seems pretty high. And each re-run occupies 7 machines.

The next commit, 0a78938, rearranges "Full tests" so that each stage contains no more than 3 jobs running in parallel. This is expected to reduce probability of failure of each individual stage. The cost is extra ~2 hours in the pipeline (~6.5 hours instead of purely theoretical ~4.5), but the total load on CI machines should be much less (provided that our CI will continue suffering from random failures).
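The reasoning about stage size can be sketched numerically. The model below is an assumption, not a measurement: it treats every job as failing independently with the same probability `p_job`, and the value 0.05 is a made-up placeholder. Under that model a stage fails if any of its parallel jobs fails, so fewer jobs per stage means a lower per-stage failure probability and fewer machines tied up by each restart.

```python
def stage_failure_probability(jobs, p_job):
    """Probability that at least one of `jobs` independent jobs fails."""
    return 1.0 - (1.0 - p_job) ** jobs


if __name__ == "__main__":
    p = 0.05  # hypothetical per-job random failure rate
    for jobs in (7, 3):
        prob = stage_failure_probability(jobs, p)
        # A restart of the stage re-occupies `jobs` machines.
        print(f"{jobs} parallel jobs: stage fails with probability {prob:.3f}")
```

Real CI failures need not be independent or equally likely per job, so this only illustrates the direction of the effect.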

atamazov · 2020-05-31T07:03:43Z

0a78938 -- all tests passed at 1st attempt. Running the 2nd one...

atamazov · 2020-05-31T20:49:09Z

2nd attempt passed, but with one restart. Running the 3rd...

atamazov · 2020-05-31T20:56:37Z

And two more attempts, in parallel with the 3rd, ETA is by morning.

atamazov · 2020-05-31T21:05:36Z

I would like to know the stats for the baseline. This would allow us to evaluate whether a PR is more or less stable than develop and thus decide whether it is good to merge or needs more testing/fixing. For example, is it Ok to merge a PR that has been tested once (and, of course, successfully), but with two restarts? Or do we need to run the tests again?

(I am assuming that our CI testing is affected by some random factor.)

atamazov · 2020-06-01T12:00:48Z

Final stats:

9 runs total; run #30 is not counted (it did not actually fail but was terminated, reason unknown). So we have 8 actual runs: 5 of them succeeded and 3 failed. This could serve us as a baseline for the acceptable failure rate:

NumOfFailedRuns / NumOfActualRuns <= 3/8 (0.375)

atamazov · 2020-06-01T12:01:01Z

Ready to merge.

atamazov · 2020-06-01T12:20:32Z

Acceptable CI Testing Failure Rate (proposal)

Base acceptance criterion:

NumOfFailedRuns / NumOfActualRuns <= 3/8 (0.375)

All the attempts shall be started from the same commit. Runs that did not pass due to termination, GitHub access problems, etc. are not counted.
A PR that does not satisfy the condition is considered to introduce instability. It shall be investigated and fixed.
If a developer is quite sure that there are no bugs, they are allowed to make additional testing attempts (it's Ok to skip static checks) until the condition is met.
PRs that do not meet the base criterion should not be merged into develop under any conditions.

Examples:

The 1st attempt passed (no restarts): Ok to merge.
The 1st attempt passed with one restart, the 2nd one passed: 1/3, Ok.
Both 1st and 2nd attempts passed, each with one restart: 2/4, NOT Ok.
Same as previous, plus one more (3rd) attempt passed: 2/5 (0.4), NOT Ok.
Same as previous, plus 4th attempt passed: 2/6 (0.333), Ok.
The 1st attempt passed with two restarts, 2nd and 3rd passed: 2/5, NOT Ok.
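The examples above can be checked mechanically. The sketch below encodes one reading of the proposal, inferred from the examples: each restart counts as one failed run, and each attempt that eventually passes adds one successful run. The function names and the list-of-restart-counts input are illustrative assumptions, and runs excluded by the rules (terminations, GitHub access problems) are expected to be left out of the input.

```python
from fractions import Fraction

# Threshold taken from the proposal: 3 failed runs out of 8 actual runs.
THRESHOLD = Fraction(3, 8)


def failure_rate(restarts_per_attempt):
    """Failed runs divided by actual runs.

    Each list entry is the number of restarts that one eventually
    passing attempt needed.
    """
    failed = sum(restarts_per_attempt)
    actual = failed + len(restarts_per_attempt)
    return Fraction(failed, actual)


def ok_to_merge(restarts_per_attempt):
    return failure_rate(restarts_per_attempt) <= THRESHOLD


if __name__ == "__main__":
    cases = [[0], [1, 0], [1, 1], [1, 1, 0], [1, 1, 0, 0], [2, 0, 0]]
    for restarts in cases:
        rate = failure_rate(restarts)
        verdict = "Ok" if ok_to_merge(restarts) else "NOT Ok"
        print(restarts, rate, verdict)
```

Exact fractions are used instead of floats so that a rate equal to 3/8 compares as acceptable without rounding surprises.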

atamazov added 6 commits May 25, 2020 19:32

fix-jenkins-failures(01) Fix file-based binary cache https://github.c…

adf4fdb

…om/AMDComputeLibraries/MLOpen/issues/2450

fix-jenkins-failures(02) Switch to file-based binary cache by default.

0335a64

fix-jenkins-failures(03) [test_sqlite_perfdb] Reduce amount of work.

a019ef7

fix-jenkins-failures(04) [Jenkins] Added possibility to customize env…

441bb75

…vars for "make check"

fix-jenkins-failures(05) [tests] Test reordered to better utilize par…

a9a1a1e

…allelizm.

fix-jenkins-failures(06) Tidy fix.

1ff1a09

atamazov requested review from JehandadKhan, daniellowell and pfultz2 May 25, 2020 19:32

atamazov added the value_high label May 25, 2020

fix-jenkins-failures(07) Revert "Disabled 2 configs. (#228)"

9154e70

This reverts commit ef1c192.

JehandadKhan previously requested changes May 26, 2020

atamazov mentioned this pull request May 26, 2020

Use RUN_SERIAL property in cmake for the tests that use a lot threads. #242

Closed

atamazov added 4 commits May 28, 2020 00:44

fix-jenkins-failures(08) Fix bug Jekinsfile that affects "Full Long T…

d15e895

…ests"/ "Hip Clang Release All"

fix-jenkins-failures(09) Disable ConvHipImplicitGemmV4R4GenXdlopsWrWF…

8ea9512

…p32. Rework ConvHipImplicitGemmV4R4GenWrWXdlops::IsApplicable().

fix-jenkins-failures(10) [test_conv_for_implicit_gemm] Disable config…

661308b

… that fails often.

fix-jenkins-failures(11) [test_soft_max] Workaround for cases that fa…

b4d6805

…il often on Jenkins and for stable ROCm3.3/Radeon VII failures.

atamazov requested a review from ce1adon May 27, 2020 22:03

fix-jenkins-failures(15) [library] Permanent workaround for ConvOclBw…

167a6da

…dWrW53 (NaN on gfx908)

atamazov added 2 commits May 29, 2020 00:14

fix-jenkins-failures(16) Tidy fix.

f027a64

fix-jenkins-failures(17) Disabled test_conv_for_implicit_gemm for INT…

0ae054c

…8. Disabled many custom tests that were previously run with implied "--float" during non-FP32 Jenkins jobs, thus wasting time.

aserio mentioned this pull request May 29, 2020

[Jenkins] Segmentation Fault in test_conv_extra config #226

Closed

fix-jenkins-failures(18) Reapply "Disabled 2 configs. (#228)"

28cd1ea

This reverts commit 9154e70. RESOLVED Conflicts: test/CMakeLists.txt

atamazov mentioned this pull request May 29, 2020

Workspace needed for sub-sample #218

Closed

atamazov added 2 commits May 30, 2020 14:24

fix-jenkins-failures(19) [test_conv_group] Disabled WrW configs that …

a983655

…fail during "Full long tests"

fix-jenkins-failures(20) [Jenkins] Permanently enabled "MIOPEN_LOG_LE…

d45bba4

…VEL=5" in "FP32 gfx908 Hip Release All subset"

fix-jenkins-failures(21) [Jenkins] Full tests: No more than 3 paralle…

0a78938

…l tests to reduce probability of failure of each individual stage and thus lower the occupancy of CI machines when the task is restarted after a failure.

daniellowell approved these changes Jun 1, 2020

daniellowell merged commit 4d82698 into develop Jun 1, 2020

This was referenced Jun 8, 2020

lstm unit test segfault on CI #268

Closed

[Test][CI][gfx1030][test_soft_max] Investigate & fix soft_max failures and instabilities #285

Closed

atamazov mentioned this pull request Jun 16, 2020

Re-enable SQLite binary cache #297

Merged

atamazov deleted the fix-jenkins-failures branch June 30, 2020 17:51

junliume added a commit that referenced this pull request Sep 22, 2021

fix #1167: revert WA from #240

d2e7e72

atamazov mentioned this pull request Sep 24, 2021

Properly fix W/A from #240, commit ea710fe - The "Forward Inference RNN vanilla" sub-test fails too often #1177

Closed
