Investigating CI Instability #1147

Open
4 of 11 tasks
atamazov opened this issue Sep 9, 2021 · 15 comments

Comments

@atamazov
Contributor Author

atamazov commented Sep 9, 2021

Statistics of several CI runs launched in a row

Runs 287-289 are of the `develop` branch, others are of `ci-instability-investigation`
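
| CI# | Re-run of | Result | Failing stage | Node | Failing test | Error message | Reboot after |
|---|---|---|---|---|---|---|---|
| 287 | c636bf2 | | Fp32 Hip Debug COMGR | ixt-rack-55 | test_tensor_trans | MAF * | ? |
| 288 | 287 | 🔴 | Fp16 Hip MLIR gfx908 | v340i-2 * | (build) | RCF * | F |
| 289 | 288 | 🟢 | | | | | |
| 1 | c636bf2 | 🔴 | Fp16 Hip All Install gfx908 | v340i-1 | (build) | RCF | F |
| 2 | 1 | Terminated | | | | | |
| 3 | 8b2f260 | 🟢 | | | | | |
| 4 | 8b2f260 | 🟢 | | | | | |
| 5 | 8b2f260 | 🟢 | | | | | |
| 6 | 8b2f260 | 🟢 | | | | | |
| 7 | 8b2f260 | 🟢 | | | | | |
| ⚠️ ⚠️ ⚠️ JENKINS UPGRADE ⚠️ ⚠️ ⚠️ ⚠️ | | | | | | | |
| 8 | 8b2f260 | 🔴 | | | (build) | download failure (boost) | |
| 9 | 8b2f260 | 🟢 | | | | | |
| 10 | 8b2f260 | 🟢 | | | | | |
| 11 | 8b2f260 | 🟢 | | | | | |
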
  • Abbreviations:
    • v340i-2 = rocm-frameworks-v340i-2.amd.com
    • MAF = Memory access fault
    • RCF = Cannot contact... hudson.remoting.ChannelClosingException... Remote call failed
    • ? = Unknown

develop branch

| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
|---|---|---|---|---|---|---|---|---|
| 311 | 068bb6b1 | 🟢 | - | - | - | - | - | - |
| 312 | 502df2b | 🔴 | Fp32 Debug /opt/rocm | MI100 | Hip | trex-vg-20 | (build) | Fetch |
| 313 | R 312 | 🔴 | Fp32 Install All | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.0442317 |
| 314 | R 313 | 🔴 | Bf16 Install All | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
| 315 | R 314 | - | - | - | - | - | - | Aborted by user |
| 316 | R 313 | 🟢 | - | - | - | - | - | - |
| 317 | 2509c87 | 🔴 | Fp16 /opt/rocm | gfx90a | Hip | dell113-r01-u05-06 | (build) | GPU detection |
| 318 | 1a52c62 | | Fp32 /opt/rocm | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
| 319 | 3b1b3ac | 🔴 | Fp32 Debug /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
| 320 | 2d7b2dd | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
| 321..322 | | 🟢 | - | - | - | - | - | - |
| 323 | 030a369 | | Fp32 MLIR | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
| 324 | R 323 | 🟢 | - | - | - | - | - | - |
| 325 | 1a1df9f | | Fp32 Debug | Vega10 | Hip | prj47-rack-09 | clinfo | Timeout 5 minutes |
| 326 | R 325 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_for_implicit_gemm | FRDI * |
| 327 | R 326 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
| 328..329 | | 🟢 | - | - | - | - | - | - |
| 330 | bc464d5 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv3d | FRDI |
| 331..334 | | 🟢 | - | - | - | - | - | - |
| 335 | f21cdc1 | 🔴 | Bf16 Install | gfx908 | Hip | pytorch-vg20-1 | (REBOOT SLAVES!) | connect timed out |
| 336 | 886fc21 | 🔴 | Fp16 MLIR | gfx908 | Hip | MI100-5 | (REBOOT SLAVES!) | api.github.com |
| 337..338 | | 🟢 | - | - | - | - | - | - |
| 339 | 568c2e4 | 🔴 | - | - | - | hpe-rack-16 | docker build | (boost download) |
| 340 | R 339 | 🔴 | - | - | - | hpe-rack-16 | docker build | (boost download) |
| 341 | R 340 | - | - | - | - | - | - | (aborted by user) |
| 342 | R 341 | 🔴 | Fp32 Debug + Codecov | Vega20 | OCL | prj47-rack-91 | test_sqlite_perfdb | runtime error: index 624 out of bounds... |
| 343 | R 342 | 🔴 | Bf16 /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
| 344..347 | | 🟢 | - | - | - | - | - | - |
| 348 | 56215d6 | 🔴 | Int8 All | Vega20 /64 | Hip | ixt-rack-54 | test_tensor_vec | Iteration: 24 \ Mismatch at 4736223: ! != ) |
| 349 | R 348 | 🔴 | Fp16 All Install | gfx908 | Hip | MI100-5 | test_regression_half_mi100 | (error as expected) |
| 350 | 12e52ed | 🔴 | Fp32 All | gfx908 | Hip | v340l-3 | test_conv_group | (none - INTERNAL ERROR?) |
| 351..360 | | 🟢 | - | - | - | - | - | - |
| 361 | e0ded03 | 🔴 | Fp32 | gfx90a | OpenCL | hpe-rack-16 | build | GPU detection |
| 362 | 8498875 | 🔴 | Fp32 | gfx90a | OpenCL | hpe-rack-15 | build | GPU detection |
| 363 | 8498875 | - | - | - | - | - | - | (aborted by Jun) |
| 364 | e0ded03 | - | - | - | - | - | - | (aborted by Jun) |
| 365 | R 363 | 🟢 | - | - | - | - | - | - |
| 366 | R 364 | 🟢 | - | - | - | - | - | - |
| 367 | f091329 | 🔴 | Fp32 | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
| 368 | R 367 | 🟢 | - | - | - | - | - | - |
| 369..373 | | - | - | - | - | - | - | (aborted by Artem) |
| 374 | R 373 | 🟢 | - | - | - | - | - | - |
| 375 | d4f48bd | 🔴 | Fp32 | gfx908 | OCL | pytorch-vg20-1 | mlir testing | (tests broken) |
| 376 | 0a095af | 🔴 | Fp32 | gfx908 | OCL | ixt-sjc2-11 | (REBOOT SLAVES!) | n/a |
| 377 | R 376 | 🟢 | - | - | - | - | - | - |
| 378 | 5cb2e54 | 🔴 | Fp32 All | gfx1030 | OCL | ixt-sjc2-16 | docker build | permission denied... Docker daemon socket |
| 379 | R 378 | 🟢 | - | - | - | - | - | - |

  • Abbreviations:
    • Fetch = Error fetching remote repo 'origin'
    • MAF = Memory access fault
    • RCF = Cannot contact... hudson.remoting.ChannelClosingException... Remote call failed
    • FRDI = FAILED: filesystem::recursive_directory_iterator increment error: No such file or directory
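
The abbreviations above can be assigned mechanically when triaging console output. The following is a minimal sketch, assuming the log lines contain the phrases quoted in the list; the `PATTERNS` table and the `classify` helper are hypothetical names introduced for illustration and are not part of the MIOpen CI scripts.

```python
import re

# Hypothetical lookup table: abbreviation -> pattern built only from the
# phrases quoted in the abbreviation list of this ticket.
PATTERNS = {
    "Fetch": re.compile(r"Error fetching remote repo"),
    "MAF": re.compile(r"Memory access fault"),
    "RCF": re.compile(r"hudson\.remoting\.ChannelClosingException|Remote call failed"),
    "FRDI": re.compile(r"filesystem::recursive_directory_iterator increment error"),
}


def classify(log_line: str) -> str:
    """Return the abbreviation matching log_line, or "?" (unknown)."""
    for abbreviation, pattern in PATTERNS.items():
        if pattern.search(log_line):
            return abbreviation
    return "?"


if __name__ == "__main__":
    sample = ("FAILED: filesystem::recursive_directory_iterator "
              "increment error: No such file or directory")
    print(classify(sample))
```

Anything that does not match falls back to `?`, the marker the first table uses for an unknown cause, so new failure modes show up in the statistics instead of being silently merged into an existing class.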

@atamazov
Contributor Author

atamazov commented Sep 24, 2021

Testing on Vega10 (prj47-rack-19) often ends with a 5-hour timeout (see runs 318 and 323 at #1147 (comment)). prj47-rack-09 also ran into a timeout (5 minutes) with clinfo (run 325).

@atamazov
Contributor Author

The ticket has been reorganized; all TODO items are now collected in the topmost comment.

@atamazov
Contributor Author

atamazov commented Oct 6, 2021

aserio pinned this issue Oct 7, 2021

@atamazov
Contributor Author

"Statistics of several CI runs launched in a row" updated to cover builds from 352 to 379. There are no new systematically reproducible problems.

@atamazov
Contributor Author

atamazov commented Nov 18, 2021

| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
|---|---|---|---|---|---|---|---|---|
| 380 | | 🟢 | - | - | - | - | - | - |
| 381 | | 🔴 | Fp32 All | gfx90a | Hip | dell113-r01-u27-28 | test_soft_max | FAILED: 0.0283859 |
| 382 | | 🟢 | - | - | - | - | - | - |
| 383..386 | | 🔴 | - | - | - | - | docker build | pcre download issue |
| 387 | | - | - | - | - | - | - | (aborted by Artem) |
| 388 | | 🟢 | - | - | - | - | - | - |
| 389 | | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 390 | R 389 | 🟢 | - | - | - | - | - | - |
| 391 | | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 392 | R 391 | | Fp32 All Install | gfx1030 | Hip | ixt-sjc2-17 | test_? | Timeout 5 hours |
| 393..394 | | 🟢 | - | - | - | - | - | - |
| 395..396 | | 🔴 | - | - | - | - | - | git fetch... Permission denied |
| 397 | R 396 | 🟢 | - | - | - | - | - | - |
| 398..400 | | 🟢 | - | - | - | - | - | - |
| 401 | | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-17 | test_gpu_reference_kernel | MAF |
| 402 | R 401 | 🟢 | - | - | - | - | - | - |
| 403 | | 🔴 | Fp32 All Install | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.01854 |
| 404 | R 403 | 🔴 | Fp32 All Xnack+ | gfx90a | Hip | hpe-rack-16 | test_soft_max | FAILED: 0.00613016 |
| 405 | R 404 | 🟢 | - | - | - | - | - | - |
| 406 | | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 407 | R 406 | 🟢 | - | - | - | - | - | - |
| 408 | | 🔴 | Fp32 | gfx90a | OCL | hpe-rack-15 | MIOpenDriver build | RCF |
| 409 | R 408 | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 410 | | - | - | - | - | - | - | (aborted by Jun) |
| 411 | R 409 | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_find_db | MAF |
| 412 | R 411 | 🔴 | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 413 | R 412 | 🟢 | - | - | - | - | - | - |
| 414 | R 410 | 🌀 | - | - | - | - | - | - |

@atamazov
Contributor Author

atamazov commented Nov 18, 2021

14 successful runs out of 34 (41%)

  • 6 memory access faults on ixt-sjc2-11 -- the node is a candidate for disabling. 2 serious faults on ixt-sjc2-17 -- the node is suspicious. Other gfx1030 nodes (-16, -22, -61): NO faults.
  • 3 test_soft_max errors
  • Other faults are random or resolved.
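
The figures in this summary can be re-derived from the table for runs 380 to 413. The sketch below is illustrative only: the records are transcribed by hand from that table (ranges such as 383..386 expanded into one record per run, error messages reduced to a short class), and the variable names are assumptions, not part of any CI tooling. Run 414 is left out because it was still in progress.

```python
from collections import Counter

# Runs transcribed from the table covering CI runs 380..413.
SUCCESSFUL = [380, 382, 388, 390, 393, 394, 397, 398, 399, 400, 402, 405, 407, 413]
ABORTED = [387, 410]

# (run, node, failing test, error class); node is None where the table gives none.
FAILED = [
    (381, "dell113-r01-u27-28", "test_soft_max", "FAILED"),
    (383, None, "docker build", "download"),
    (384, None, "docker build", "download"),
    (385, None, "docker build", "download"),
    (386, None, "docker build", "download"),
    (389, "ixt-sjc2-11", "test_gpu_reference_kernel", "MAF"),
    (391, "ixt-sjc2-11", "test_gpu_reference_kernel", "MAF"),
    (392, "ixt-sjc2-17", "test_?", "Timeout"),
    (395, None, None, "git fetch"),
    (396, None, None, "git fetch"),
    (401, "ixt-sjc2-17", "test_gpu_reference_kernel", "MAF"),
    (403, "prj47-rack-91", "test_soft_max", "FAILED"),
    (404, "hpe-rack-16", "test_soft_max", "FAILED"),
    (406, "ixt-sjc2-11", "test_gpu_reference_kernel", "MAF"),
    (408, "hpe-rack-15", "MIOpenDriver build", "RCF"),
    (409, "ixt-sjc2-11", "test_gpu_reference_kernel", "MAF"),
    (411, "ixt-sjc2-11", "test_find_db", "MAF"),
    (412, "ixt-sjc2-11", "test_gpu_reference_kernel", "MAF"),
]

total = len(SUCCESSFUL) + len(ABORTED) + len(FAILED)
success_rate = round(100 * len(SUCCESSFUL) / total)

# Memory access faults per node, and all faults per node.
maf_by_node = Counter(node for _, node, _, err in FAILED if err == "MAF")
faults_by_node = Counter(node for _, node, _, _ in FAILED if node)
soft_max_errors = sum(1 for _, _, test, _ in FAILED if test == "test_soft_max")

if __name__ == "__main__":
    print(f"{len(SUCCESSFUL)} successful runs out of {total} ({success_rate}%)")
    print("MAF per node:", dict(maf_by_node))
    print("faults on ixt-sjc2-17:", faults_by_node["ixt-sjc2-17"])
    print("test_soft_max errors:", soft_max_errors)
```

Counting aborted runs in the denominator is what reproduces the 34 in the summary; excluding them would give a different rate.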

@junliume
Collaborator

junliume commented Nov 18, 2021

@okakarpa ixt-sjc2-11 (navi21)
Nov 18, 2021 4:17:29 PM
Disconnected: Memory access fault on develop run #409, #411, #412

@atamazov
Contributor Author

atamazov commented Dec 6, 2021

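| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
|---|---|---|---|---|---|---|---|---|
| 415..417 | | 🟢 | - | - | - | - | - | - |
| 418 | | 🔴 | Fp32 | Vega20 /64 | Hip | ixt-rack-54 | test_gpu_nchw_nhwc_transpose | Core dumped |
| 419 | R 418 | 🟢 | - | - | - | - | - | - |
| 420 | | 🟢 | - | - | - | - | - | - |
| 421 | | 🔴 | - | - | - | ixt-hq-35 | docker build | Connection timed out (sourceforge) |
| 422 | R 421 | - | - | - | - | - | - | (aborted by Artem) |
| 423..424 | | 🟢 | - | - | - | - | - | - |
| 425 | | 🔴 | - | - | - | - | - | git checkout error after force-push |
| 426 | | 🏃‍♂️ | - | - | - | - | - | - |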

All faults seem either random, or expected, or already resolved.
