…vars for "make check"
This reverts commit ef1c192.
I forgot to revert #228. These configs worked fine in my branch. Done now. Let's keep previous job (
def cmake_build(compiler, flags, prefixpath="/opt/rocm"){
def cmake_build(compiler, flags, env4make, prefixpath){
Perhaps: makeenv or makeEnv
This is intended to allow modifying the environment for make (e.g. MIOPEN_LOG_LEVEL=5), so "environment for make" -> env4make. To me, makeEnv sounds like "make environment", beginning with a noun.
Of course I can change to makeEnv, if you wish, just confirm please.
Just a suggestion, since we never use variables in that format. I trust your judgement
set(MIOPEN_ENABLE_SQLITE On CACHE BOOL "")
# Use SQLITE for compiled kernels, when turned off this will use raw files
set(MIOPEN_ENABLE_SQLITE_KERN_CACHE On CACHE BOOL "")
set(MIOPEN_ENABLE_SQLITE_KERN_CACHE Off CACHE BOOL "")
This would disable a feature which is a customer requirement. If there is a bug, we should fix it rather than disable the feature. It is not clear to me from the linked comment whether we know for sure that the kernel cache is buggy.
Yes, @JehandadKhan please create a branch and investigate. This issue is blocking our Jenkins CI. We need help on resolution. Disable all testing on your branch except the long tests and track this down.
Does the CI pass if we disable this?
Yes. I started preparing this PR when my experimental branch passed 3 times out of 4 (only the "Full Long Tests" stage was used). Without this, I often saw failures like the ones shown in #226 (comment).
# Please notice that each list is also sorted and it is highly recommended
# to keep this sorting when adding new tests.
set( LONG_TESTS
This would make it complicated to add new tests, which so far is automatic (i.e. it only requires adding a test to a directory). I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.
On the other hand, this allows disabling individual tests, which may be handy sometimes.
> I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.
I do not know how to do this with cmake.
You can add blacklist variables to CMakeLists.txt.
Anyway, we should confine this PR to fixing Jenkins, not modifying the overall design of the tests.
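One possible reading of the blacklist suggestion is sketched below. It is only an illustration, not the layout of MIOpen's test/CMakeLists.txt: `ALL_TESTS`, `SHORT_TESTS` and `SKIP_TESTS` are hypothetical names, the test names are made up, and `cmake -E echo` stands in for real test executables.

```cmake
cmake_minimum_required(VERSION 3.5)
project(test_lists_sketch NONE)
enable_testing()

# Hypothetical stand-in for an automatically discovered list of tests
# (a real tree might build it with file(GLOB ...)).
set(ALL_TESTS test_a test_b test_c test_d)

# Long tests are named by hand. The list is non-empty here, which matters
# because list(REMOVE_ITEM) needs at least one item to remove.
set(LONG_TESTS test_b test_d)

# Blacklist, overridable at configure time:
#   cmake -DSKIP_TESTS="test_a;test_c" ...
set(SKIP_TESTS "" CACHE STRING "Semicolon-separated list of tests to disable")

# Everything that is neither long nor blacklisted is a short test.
set(SHORT_TESTS ${ALL_TESTS})
list(REMOVE_ITEM SHORT_TESTS ${LONG_TESTS} ${SKIP_TESTS})

# Blacklisted long tests are dropped as well.
foreach(name IN LISTS SKIP_TESTS)
    list(REMOVE_ITEM LONG_TESTS ${name})
endforeach()

# Register whatever is left in both lists.
foreach(name IN LISTS SHORT_TESTS LONG_TESTS)
    add_test(NAME ${name} COMMAND ${CMAKE_COMMAND} -E echo "running ${name}")
endforeach()
```

With this shape, adding a new test still requires no list edits unless the test is long, and an individual test can be disabled from the command line without touching the file.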
When I was working on the experimental branch, I needed to go through the tests as quickly as possible, and I found such a solution. It saved me a couple of hours, I guess. This, of course, can be moved to a separate PR. Please note that if we revert this, then the level of parallelism will be reduced (some tests will be performed alone at the end), and the likelihood of a failure will decrease.
On the other hand, if we do not squash this PR, then nothing prevents us from reverting this specific commit in two clicks and re-implementing it in whatever way we prefer.
unsigned int DBMultiThreadedTestWork::threads_count = 16;
unsigned int DBMultiThreadedTestWork::common_part_size = 32;
unsigned int DBMultiThreadedTestWork::unique_part_size = 32;
unsigned int DBMultiThreadedTestWork::common_part_size = 16;
Reducing the parallelism can potentially mask errors which only show up under stress.
There is clearly something going on. Please help investigate.
> There is clearly something going on. Please help investigate.
This was done to reduce work, not because there is an error. Please see the description of the PR.
We have problems with perfdb tests (both file-based and sqlite-based). I was working on it when comgr arrived, so I had to postpone it. Reducing load helps to reduce the probability of test failure.
How does the perf db error manifest?
terminate called after throwing an instance of 'miopen::Exception'
what(): /var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/src/sqlite_db.cpp:114: Internal error while accessing SQLite database: UNIQUE constraint failed: perf_db.solver, perf_db.config, perf_db.arch, perf_db.num_cu
See http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/wrw-igemm-v4r4xdlops-fp32-fix/14/pipeline/13/ for an example.
Great! Just revert 9154e70 in your branch after rebasing to develop, when/if this PR is merged, and everything will be restored.
I agree, let's get things in shape first; that PR can wait for now.
> Since these tests use a lot of threads we should set the RUN_SERIAL property in cmake for the tests instead of lowering the number of threads.
Good to know, thanks, we'll make this change. The problem is that the test itself is unstable, e.g. it may fail even if run alone, depending on the compiler (clang or GCC), build options (-O), etc. Moreover, I believe it shouldn't fail even when the system is under high load, but this is not so(((
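For reference, the RUN_SERIAL suggestion could look roughly like the following. This is a minimal sketch, not the project's actual test setup: both test names are hypothetical, and `cmake -E echo` stands in for the real test executables.

```cmake
cmake_minimum_required(VERSION 3.5)
project(run_serial_sketch NONE)
enable_testing()

# Hypothetical stand-ins for a thread-heavy test and an ordinary one.
add_test(NAME test_perfdb_stress COMMAND ${CMAKE_COMMAND} -E echo "perfdb stress")
add_test(NAME test_ordinary COMMAND ${CMAKE_COMMAND} -E echo "ordinary")

# RUN_SERIAL asks ctest not to run any other test in parallel with this one,
# even when ctest is invoked with -j N.
set_tests_properties(test_perfdb_stress PROPERTIES RUN_SERIAL TRUE)
```

Running `ctest -j 8` from the build directory would then still parallelize the other tests, while the marked test gets the machine to itself. As noted in the reply, this only removes contention from sibling tests; it does not help if the test is unstable when run alone.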
Requested changes not important for the purpose at hand.
I don't think you need all these modifications to the Jenkinsfile; this can probably be done with test/CMakeLists.txt. What design are you trying to achieve?
If we are just fixing the reliably failing Jenkins, don't we just want to start here?
If you mean 441bb75, then this is a side product of my investigations. This allows to easily customize The design is cumbersome, but to me, it is definitely better than nothing. Of course I can revert it, if we do not need this feature. Just confirm.
Proceed. Let's see where this goes.
I have already gone the way from that starting point.
Please note that this resolves only one comment from #226, but not the entire ticket. There is some other stuff that affects reliability. One you already know -- unreliable perfdb tests, and reducing load helps. It is quite possible that more will come... stuff is under testing now.
…ests"/ "Hip Clang Release All"
…p32. Rework ConvHipImplicitGemmV4R4GenWrWXdlops::IsApplicable().
… that fails often.
…il often on Jenkins and for stable ROCm3.3/Radeon VII failures.
…dWrW53 (NaN on gfx908)
Added another three W/As. PR description updated.
I guess that the test does more than one verification on each pass. The test prints error/diff only for the first verification (test harness limitation?)
…8. Disabled many custom tests that were previously run with implied "--float" during non-FP32 Jenkins jobs, thus wasting time.
This reverts commit 9154e70. RESOLVED Conflicts: test/CMakeLists.txt
Description updated. Please note the bullet that describes 0ae054c.
…fail during "Full long tests"
…VEL=5" in "FP32 gfx908 Hip Release All subset"
Two more commits. Description updated.
…l tests to reduce probability of failure of each individual stage and thus lower the occupancy of CI machines when the task is restarted after a failure.
As of d45bba4, the tests completed successfully. We can merge it to unblock our CI. However, I recommend waiting a little longer and merging the next commit. Right now "Full Long Tests" consists of 7 jobs. The probability of a random failure seems pretty high. And each re-run occupies 7 machines. The next commit, 0a78938, rearranges "Full tests" so that each stage contains no more than 3 jobs running in parallel. This is expected to reduce probability of failure of each individual stage. The cost is extra ~2 hours in the pipeline (~6.5 hours instead of purely theoretical ~4.5), but the total load on CI machines should be much less (provided that our CI will continue suffering from random failures).
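The argument for smaller stages can be made concrete with a back-of-the-envelope estimate. Assume (this is an assumption for illustration, not a measurement from this PR) that each job fails independently with the same probability p, and that a stage fails if any of its n parallel jobs fails. The value p = 0.05 below is purely hypothetical.

```latex
P_{\text{stage fails}}(n) = 1 - (1 - p)^{n},
\qquad
P_{\text{stage fails}}(7) = 1 - 0.95^{7} \approx 0.30,
\qquad
P_{\text{stage fails}}(3) = 1 - 0.95^{3} \approx 0.14
```

Under these assumptions a 7-job stage would need a restart roughly twice as often as a 3-job stage, and each restart would occupy 7 machines rather than at most 3.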
0a78938 -- all tests passed at 1st attempt. Running the 2nd one...
2nd attempt passed, but with one restart. Running the 3rd...
And two more attempts, in parallel with the 3rd, ETA is by morning.
I would like to know the stats for the baseline. This would allow us to evaluate whether a PR is more or less stable than develop and thus decide whether it is good to merge or needs more testing/fixing. For example, is it OK to merge a PR that has been tested once (and, of course, successfully), but with two restarts? Or do we need to run the tests again? (I am assuming that our CI testing is affected by some random factor.)
Ready to merge.
Acceptable CI Testing Failure Rate (proposal)
Base acceptance criterion:
Examples:

This PR contains several changes. PLEASE DO NOT SQUASH COMMITS WHEN MERGING. Also I recommend reviewing each commit individually.
ConvHipImplicitGemmV4R4GenXdlopsWrWFp32. Reworked special case in ConvHipImplicitGemmV4R4GenWrWXdlops::IsApplicable().
ctest.
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/1ff1a09352d7d14f0ca26ecab31820cd4e36797e/test/CMakeLists.txt#L153-L163
test_conv_for_implicit_gemm for INT8. Disabled many custom tests that were previously run with implied --float during non-FP32 Jenkins jobs. These tests are just wasting time when run during INT8 or HALF stages.
Next steps
/// \todo comment added to the source code.