Skip to content

Fix Jenkins issue#240

Merged
daniellowell merged 21 commits intodevelopfrom
fix-jenkins-failures
Jun 1, 2020
Merged

Fix Jenkins issue#240
daniellowell merged 21 commits intodevelopfrom
fix-jenkins-failures

Conversation

@atamazov
Copy link
Copy Markdown
Contributor

@atamazov atamazov commented May 25, 2020

Thix PR contains several changes. PLEASE DO NOT SQUASH COMMITS WHEN MERGING. Also I recommend reviewing each commit individually.

Next steps

  • Create a ticket for each workaround introduced in this PR and/or for each /// \todo comment added to the source code.
  • Create ticket related to 0ae054c. I am not sure if the initial intent was to run these custom tests with FP16, INT8, BF16. These shall be reconsidered and properly enabled, if necessary.

@atamazov
Copy link
Copy Markdown
Contributor Author

I've missed to revert #228. These configs worked fine in my branch. Done now.

Let's keep previous job (#2) running on Jenkins, to collect statistics.

Comment thread Jenkinsfile


def cmake_build(compiler, flags, prefixpath="/opt/rocm"){
def cmake_build(compiler, flags, env4make, prefixpath){
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps: makeenv or makeEnv

Copy link
Copy Markdown
Contributor Author

@atamazov atamazov May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff is intended to allow to modifying environment for make (e.g. MIOPEN_LOG_LEVEL=5) so "environment for make" -> env4make. To me, makeEnv sounds like "make environment", beginning with a noun.

Of course I can change to makeEnv, if you wish, just confirm please.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion, since we never use variables in that format. I trust your judgement

Comment thread CMakeLists.txt
set(MIOPEN_ENABLE_SQLITE On CACHE BOOL "")
# Use SQLITE for compiled kernels, when turned off this will use raw files
set(MIOPEN_ENABLE_SQLITE_KERN_CACHE On CACHE BOOL "")
set(MIOPEN_ENABLE_SQLITE_KERN_CACHE Off CACHE BOOL "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would disable a feature which is a customer requirement, if there is a bug we should fix it rather than disabling the feature. It is not clear to me from the linked comment whether we know for sure if the kernel cache is buggy.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @JehandadKhan please create a branch and investigate. This issue is blocking our Jenkins CI. We need help on resolution. Disable all testing on your branch except the long tests and track this down.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the CI pass if we disable this ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I started preparing this PR when my experimental branch passed 3 times out of 4 (only the "Full Long Tests" stage was used). Without this, I've often saw failures like shown in #226 (comment).

Comment thread test/CMakeLists.txt
# Please notice that each list is also sorted and it is highly recommended
# to keep this sorting when adding new tests.

set( LONG_TESTS
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make it complicated to add new tests, which so far is automatic, ( ie. only requires adding a test to a directory). I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the other hand this allows disabling individual tests, which may be handy sometimes.

I suggest we just identify the long tests and add them to one list manually and add everything else to the short list.

I do not know how to do this with cmake.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add blacklist variables to CMakeLists.txt
Anyways we should confine this PR to fixing Jenkins not modifying the overall design of the tests.

Copy link
Copy Markdown
Contributor Author

@atamazov atamazov May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was working on the experimental branch, I needed to go through the tests as quickly as possible, and I found such a solution. It saved me a couple of hours, I guess. This, of course, can be moved to a separate PR. Please note that if we revert this, then level of parallelism will be reduced (some tests will be performed alone at the end), and the likelihood of a failure will decrease.

From the other hand, if we do not squash this PR, then nothing prevents us from reverting this specific commit in two clicks and re-implementing it in whatever way we prefer.

Comment thread test/sqlite_perfdb.cpp
unsigned int DBMultiThreadedTestWork::threads_count = 16;
unsigned int DBMultiThreadedTestWork::common_part_size = 32;
unsigned int DBMultiThreadedTestWork::unique_part_size = 32;
unsigned int DBMultiThreadedTestWork::common_part_size = 16;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the parallelism can potentially mask errors which only show up in under stress.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is clearly something going on. Please help investigate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is clearly something going on. Please help investigate.

This was done to reduce work, not because there is an error. Please see description of the PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have problems with perfdb tests (both file-based and sqlite-based). I was working on it when comgr arrived so I had to postpone it. Reducing load helps to reduce propability of test failure.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the perf db error manifest ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

terminate called after throwing an instance of 'miopen::Exception'
  what():  /var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/src/sqlite_db.cpp:114: Internal error while accessing SQLite database: UNIQUE constraint failed: perf_db.solver, perf_db.config, perf_db.arch, perf_db.num_cu

See http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/wrw-igemm-v4r4xdlops-fp32-fix/14/pipeline/13/ for an example.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#227 fixes this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Just revert 9154e70 in your branch after rebasing to develop, when/if this PR will be merged in, and everything will be restored.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, lets get things in shape first, that PR can wait for now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfultz2

Since these tests use a lot of threads we should set the RUN_SERIAL property in cmake for the tests instead of lowering the number of threads.

Good to know, thanks, we'll make this change. The problem is that the test itself is unstable -- e.g. may fail even if run alone, depending on the compiler (clang or GCC), build options (-O) etc. Moreover I believe it shouldn't fail even when system is under high load, but this is not so(((

@JehandadKhan JehandadKhan dismissed their stale review May 26, 2020 18:34

Requested changes not important for the purpose at hand.

@atamazov

This comment has been minimized.

@JehandadKhan

This comment has been minimized.

@atamazov

This comment has been minimized.

@daniellowell
Copy link
Copy Markdown

I don't think you need all these modifications to the Jenkinsfile this can probably be done with test/CMakeLists.txt. What design are you trying to achieve?

@daniellowell
Copy link
Copy Markdown

If we are just fixing the reliably failing Jenkins, don't we just want to start here?

Switches to using the file-based binary cache by default.
This should resolve this #226 (comment)

@atamazov
Copy link
Copy Markdown
Contributor Author

I don't think you need all these modifications to the Jenkinsfile this can probably be done with test/CMakeLists.txt. What design are you trying to achieve?

If you mean 441bb75, then this is a side product of my investigations. This allows to easily customize CTEST_PARALLEL_LEVEL or MIOPEN_LOG_LEVEL or disable algorithms etc. for each testing step individually.

The design is cumbersome, but to me, it is definitely better than nothing. Of course I can revert it, if we do not need this feature. Just confirm.

@daniellowell
Copy link
Copy Markdown

Proceed. Let's see where this goes.

@atamazov
Copy link
Copy Markdown
Contributor Author

@daniellowell

If we are just fixing the reliably failing Jenkins, don't we just want to start here?

I have already gone the way from that starting point.

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 26, 2020

Switches to using the file-based binary cache by default.
This should resolve this #226 (comment)

Please note that this resolves only one comment from #226, but not the entire ticket. There is some other stuff that affects reliability. One you already know -- unreliable perfdb tests, and reducing load helps. It is quite possible that more will come... stuff is under testing now.

@atamazov

This comment has been minimized.

@atamazov atamazov requested a review from ce1adon May 27, 2020 22:03
@atamazov
Copy link
Copy Markdown
Contributor Author

Added another three W/As. PR description updated.

@atamazov
Copy link
Copy Markdown
Contributor Author

@daniellowell

Hmm the error and diff look fine on the rnn vanilla test.

I guess that the test does more than one verification on each pass. The test prints error/diff only for the first verification (test harness limitation?)

atamazov added 2 commits May 29, 2020 00:14
…8. Disabled many custom tests that were previously run with implied "--float" during non-FP32 Jenkins jobs, thus wasting time.
This reverts commit 9154e70.

RESOLVED Conflicts:
	test/CMakeLists.txt
@atamazov
Copy link
Copy Markdown
Contributor Author

Description updated. Please notice bullet that describes 0ae054c.

@atamazov
Copy link
Copy Markdown
Contributor Author

Two more commits. Description updated.

…l tests to reduce probability of failure of each individual stage and thus lower the occupancy of CI machines when the task is restarted after a failure.
@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 30, 2020

As of d45bba4, the tests completed successfully. We can merge it to unblock our CI. However, I recommend waiting a little longer and merging the next commit.

Right now "Full Long Tests" consists of 7 jobs. The probability of a random failure seems pretty high. And each re-run occupies 7 machines.

The next commit, 0a78938, rearranges "Full tests" so that each stage contains no more than 3 jobs running in parallel. This is expected to reduce probability of failure of each individual stage. The cost is extra ~2 hours in the pipeline (~6.5 hours instead of purely theoretical ~4.5), but the total load on CI machines should be much less (provided that our CI will continue suffering from random failures).

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 31, 2020

0a78938 -- all tests passed at 1st attempt. Running the 2nd one...

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 31, 2020

2nd attempt passed, but with one restart. Running the 3rd...

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 31, 2020

And two more attempts, in parallel with the 3rd, ETA is by morning.

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented May 31, 2020

I would like to know the stats for the baseline. This would allow us to evaluate if a PR is more or less stable than develop, and, thus, decide, is it good to merge or it needs more testing/fixing. For example, is it Ok to merge a PR that has been tested once (and, of course, successfully), but with two restarts? or do we need to run the tests again?

(I am assuming that our CI testing is affected by some random factor.)

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented Jun 1, 2020

Final stats:
image

9 runs total, run #30 is not counted (is was not actually failed, but terminated, reason unknown). So we have 8 actual runs, 5 of them succeeded, 3 failed. This could serve us as a baseline for acceptable failure rate:

NumOfFailedRuns / NumOfActualRuns <= 3/8 (0.375)

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented Jun 1, 2020

Ready to merge.

@atamazov
Copy link
Copy Markdown
Contributor Author

atamazov commented Jun 1, 2020

Acceptable CI Testing Failure Rate (proposal)

Base acceptance criterion:

NumOfFailedRuns / NumOfActualRuns <= 3/8 (0.375)

  • All the attempts shall be started from the same commit. The runs that that didn't passed due to termination, guthib access problems etc -- not counted.
  • The PR that does not satisfy the condition is considering as introducing instability. It shall be investigated and fixed.
  • If a developer is quite sure that there are no bugs, then they are allowed to made additional testing attempts shall be made (it's Ok to skip static checks) until the condition is met.
  • PRs that do not meet the basic criteria should not be merged into develop under any conditions.

Examples:

  • The 1st attempt passed (no restarts): Ok to merge.
  • The 1st attempt passed with one restart, the 2nd one passed: 1/3, Ok.
  • Both 1st and 2nd attempts passed, each with one restart: 2/4, NOT Ok.
  • Same as previous, plus one more (3rd) attempt passed: 2/5 (0.4), NOT Ok.
  • Same as previous, plus 4th attempt passed: 2/6 (0.333), Ok.
  • The 1st attempt passed with two restarts, 2nd and 3rd passed: 2/5, NOT Ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants