Investigating CI Instability #1147
Statistics of several CI runs launched in a row
Runs 287-289 are of the `develop` branch, others are of `ci-instability-investigation`.
CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
---|---|---|---|---|---|---|---|---|
311 | 068bb6b1 | 🟢 | - | - | - | - | ||
312 | 502df2b | 🔴 | Fp32 Debug /opt/rocm | MI100 | Hip | trex-vg-20 | (build) | Fetch |
313 | R 312 | 🔴 | Fp32 Install All | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.0442317 |
314 | R 313 | 🔴 | Bf16 Install All | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
315 | R 314 | ⚫ | - | - | - | - | Aborted by user | |
316 | R 313 | 🟢 | - | - | - | - | - | |
317 | 2509c87 | 🔴 | Fp16 /opt/rocm | gfx90a | Hip | dell113-r01-u05-06 | (build) | GPU detection |
318 | 1a52c62 | ⚫ | Fp32 /opt/rocm | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
319 | 3b1b3ac | 🔴 | Fp32 Debug /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
320 | 2d7b2dd | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
321.. ..322 | 🟢 | - | - | - | - | - | ||
323 | 030a369 | ⚫ | Fp32 MLIR | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
324 | R 323 | 🟢 | - | - | - | - | - | |
325 | 1a1df9f | ⚫ | Fp32 Debug | Vega10 | Hip | prj47-rack-09 | clinfo | Timeout 5 minutes |
326 | R 325 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_for_implicit_gemm | FRDI * |
327 | R 326 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
328.. ..329 | 🟢 | - | - | - | - | - | ||
330 | bc464d5 | 🔴 | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv3d | FRDI |
331.. ..334 | 🟢 | - | - | - | - | - | ||
335 | f21cdc1 | 🔴 | Bf16 Install | gfx908 | Hip | pytorch-vg20-1 | (REBOOT SLAVES!) | connect timed out |
336 | 886fc21 | 🔴 | Fp16 MLIR | gfx908 | Hip | MI100-5 | (REBOOT SLAVES!) | api.github.com |
337.. ..338 | 🟢 | - | - | - | - | - | ||
339 | 568c2e4 | 🔴 | - | - | hpe-rack-16 | docker build | (boost download) | |
340 | R 339 | 🔴 | - | - | hpe-rack-16 | docker build | (boost download) | |
341 | R 340 | ⚫ | - | - | - | - | (aborted by user) | |
342 | R 341 | 🔴 | Fp32 Debug + Codecov | Vega20 | OCL | prj47-rack-91 | test_sqlite_perfdb | runtime error: index 624 out of bounds... |
343 | R 342 | 🔴 | Bf16 /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
344.. ..347 | 🟢 | - | - | - | - | - | ||
348 | 56215d6 | 🔴 | Int8 All | Vega20 /64 | Hip | ixt-rack-54 | test_tensor_vec | Iteration: 24 \ Mismatch at 4736223: ! != ) |
349 | R 348 | 🔴 | Fp16 All Install | gfx908 | Hip | MI100-5 | test_regression_half_mi100 | (error as expected) |
350 | 12e52ed | 🔴 | Fp32 All | gfx908 | Hip | v340l-3 | test_conv_group | (none - INTERNAL ERROR?) |
351.. ..360 | - | 🟢 | - | - | - | - | - | |
361 | e0ded03 | 🔴 | Fp32 | gfx90a | OpenCL | hpe-rack-16 | build | GPU detection |
362 | 8498875 | 🔴 | Fp32 | gfx90a | OpenCL | hpe-rack-15 | build | GPU detection |
363 | 8498875 | ⚫ | - | - | - | - | - | (aborted by Jun) |
364 | e0ded03 | ⚫ | - | - | - | - | - | (aborted by Jun) |
365 | R 363 | 🟢 | - | - | - | - | - | |
366 | R 364 | 🟢 | - | - | - | - | - | |
367 | f091329 | 🔴 | Fp32 | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
368 | R 367 | 🟢 | - | - | - | - | - | |
369.. ..373 | - | ⚫ | - | - | - | - | (aborted by Artem) | |
374 | R 373 | 🟢 | - | - | - | - | - | |
375 | d4f48bd | 🔴 | Fp32 | gfx908 | OCL | pytorch-vg20-1 | mlir testing | (tests broken) |
376 | 0a095af | 🔴 | Fp32 | gfx908 | OCL | ixt-sjc2-11 | (REBOOT SLAVES!) | n/a |
377 | R 376 | 🟢 | - | - | - | - | - | |
378 | 5cb2e54 | 🔴 | Fp32 All | gfx1030 | OCL | ixt-sjc2-16 | docker build | permission denied... Docker daemon socket |
379 | R 378 | 🟢 | - | - | - | - | - |
- Abbreviations:
- Fetch = Error fetching remote repo 'origin'
- MAF = Memory access fault
- RCF = Cannot contact... hudson.remoting.ChannelClosingException... Remote call failed
- FRDI = FAILED: filesystem::recursive_directory_iterator increment error: No such file or directory
Testing on Vega10 (prj47-rack-19) often ends with a 5-hour timeout (see runs 318 and 323 at #1147 (comment)). prj47-rack-09 also ran into a timeout (5 minutes, run 325).
The ticket has been reorganized; all TODO items are now collected in the topmost comment.
"Statistics of several CI runs launched in a row" has been updated to cover builds 352 to 379. There are no new systematically reproducible problems.
14 successful runs out of 34 (41%).
All faults seem to be either random, expected, or already resolved.
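For anyone re-checking the success figure after further runs, here is a minimal sketch of the arithmetic. The function name `success_rate` and the `"other"` placeholder are hypothetical, and the 20 non-green runs are not split into failed versus aborted because the comment above does not give that split.

```python
from typing import Iterable


def success_rate(statuses: Iterable[str], ok_marker: str = "🟢") -> float:
    """Return the percentage of runs whose status equals ok_marker."""
    runs = list(statuses)
    if not runs:
        raise ValueError("no runs to summarize")
    passed = sum(1 for status in runs if status == ok_marker)
    return 100.0 * passed / len(runs)


# 14 green runs out of 34 in total; the remaining 20 are lumped together
# under a placeholder label (failed or aborted, not distinguished here).
runs = ["🟢"] * 14 + ["other"] * 20
print(f"{success_rate(runs):.0f}%")  # prints 41%
```

Counting aborted runs (⚫) in the denominator lowers the rate; whether they should count is a judgment call that the thread does not settle.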
This is an umbrella ticket intended to collect and analyze information related to the CI instability that we currently observe.
A list of tickets that may affect the CI stability
Findings from CI runs launched in a row