Skip to content

Conversation

maleadt
Copy link
Collaborator

@maleadt maleadt commented Oct 10, 2025

Significant refactor of the output tasks in order to make it possible to output additional info. The architecture switches to a separate task that's responsible for all the printing, with data sent to over a channel from several executor tasks.

                                                  |          | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
mtl/metal                                     (3) |    10.80 |   0.58 |  5.4 |    1667.32 |   786.41 |                                                                                                
device/intrinsics/atomics                     (5) |    39.10 |   0.95 |  2.4 |    3614.11 |   729.88 |                                                                
execution                                     (3) |    29.78 |   0.35 |  1.2 |    1680.40 |   786.41 |                                                                
random                                        (6) |    50.82 |   1.23 |  2.4 |    5593.74 |   768.22 |                                                                
mps/matrix                                    (6) |     0.51 |   0.00 |  0.0 |      48.11 |   768.22 |                                                                
device/intrinsics/simd                        (5) |    15.41 |   0.26 |  1.7 |    1791.05 |   788.94 |                                                                
metallib                                      (5) |     0.96 |   0.01 |  1.2 |     208.30 |   806.45 |                                                                
device/intrinsics/arguments                   (6) |     6.57 |   0.10 |  1.6 |     575.51 |   768.22 |                                                                
mps/linalg                                    (3) |    24.38 |   0.44 |  1.8 |    2047.03 |   786.41 |                                                                
                                                                                                                                                                      
Running:  array (2), device/intrinsics/math (4), device/random (5), linalg (6), mps/ndarray (3)                                                                       
Progress: 9/56 tests completed | ETA: ~7 min                                                                                                                          

Fixes #26
Fixes #13

@maleadt maleadt marked this pull request as draft October 10, 2025 16:30
@maleadt
Copy link
Collaborator Author

maleadt commented Oct 10, 2025

Needs some locking to not compete against Distributed.jl printing worker messages:

Running:  array (2), capturing (7), profiling (4), examples (6), kernelabstractions (3)                                                                               
Progress: 21/56 tests completed | ETA: ~3 min      From worker 4:       Run issues were detected (trace is still ready to be viewed):                                 
      From worker 4:    * [Warning] GPU Service reported error: Selected counter profile is not supported on target device

... but naively taking the io_lock caused hangs.

@maleadt maleadt force-pushed the tb/refactor_printing branch from d7d5016 to d42285f Compare October 12, 2025 10:30
@maleadt
Copy link
Collaborator Author

maleadt commented Oct 12, 2025

Turns out IOCapture, as introduced in #25, is somehow incompatible with Distributed. On this PR, running runtests under IOCapture.capture causes hangs in Distributed.jl's message deserialization... Maybe something like JuliaDocs/IOCapture.jl#17?

Anyway, I decided to add stdout and stderr arguments to runtests again and use IOBuffers in testing. Some parts of the runner still relies on IOCapture because annoyingly, even after JuliaLang/julia#59357, functions like Test.record still generate output.

@maleadt maleadt marked this pull request as ready for review October 12, 2025 10:47
@giordano

This comment was marked as resolved.

using Test, IOCapture
using Test

cd(@__DIR__)
Copy link
Collaborator

@giordano giordano Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this cd necessary? If I remove it, tests are still successful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to support running tests out of tree, e.g., doing julia --project test/runtests.jl. Just a little convenience.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, yes, I can reproduce that. But I'd say this is something that this package could handle more nicely in principle by making the tests more working-directory-independent, but haven't looked into what's going on. In any case this issue is orthogonal to this PR.

Copy link
Collaborator

@giordano giordano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get colourful output, like in #27.

@maleadt
Copy link
Collaborator Author

maleadt commented Oct 13, 2025

This PR is getting unwieldy... Outout looks nice though:

image

Thoughts on the test output printing indented? It does require quite a large screen to render without wrapping, but without indentation it looks off.

Or maybe the output should be print separately after the tests.

@giordano
Copy link
Collaborator

One thing which isn't clear to me is where the message is coming from (i.e. which worker/job is printing it).

Is that Metal.jl? Do you have a branch for us to test this locally?

@maleadt
Copy link
Collaborator Author

maleadt commented Oct 13, 2025

Output is now printed when a testset finishes, so it's from the worker listed above.

Metal.jl WIP: https://github.com/JuliaGPU/Metal.jl/tree/tb/paralleltestrunner

@maleadt
Copy link
Collaborator Author

maleadt commented Oct 13, 2025

Alternative:

Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
profiling  (2) |     9.17 |   0.02 |  0.2 |     234.77 |   642.66 |

Output generated during execution of 'profiling':
| Starting recording with the Blank template and GPU, Time Profiler, Metal Application, Metal GPU Counters, Metal Resource Events, os_signpost Instruments. Attaching to: julia (8902). 
| Ctrl-C to stop the recording
| Stopping recording...
| Run issues were detected (trace is still ready to be viewed):
| * [Warning] GPU Service reported error: Selected counter profile is not supported on target device
| 
| * [Warning] Data stream: Signpost intervals with backdated timestamps were emitted that are not visualized on the timeline until immediate mode recording is stopped.
| 
| Recording completed. Saving output file...
| Output file saved as: julia_1.trace
| [ Info: System trace saved to /private/var/folders/wc/z_cydqs118735x03cvsf96tr0000gn/T/jl_mhwDNH/julia_1.trace; open the resulting trace in Instruments

Test Summary: | Pass  Total   Time
  Overall     |    1      1  16.2s
    SUCCESS
     Testing Metal tests passed 

Maybe that's clearer?

@maleadt maleadt force-pushed the tb/refactor_printing branch from 26883a0 to 25a22ba Compare October 13, 2025 10:27
@giordano
Copy link
Collaborator

I think I like it overall, but the captured output doesn't have the colour.

image

The only thing about printing captured log at the end is that you need to wait a possibly long time to see it.

@maleadt maleadt force-pushed the tb/refactor_printing branch from 25a22ba to 86b4b44 Compare October 13, 2025 11:31
@maleadt maleadt force-pushed the tb/refactor_printing branch from 86b4b44 to abf77ec Compare October 13, 2025 11:41
@maleadt
Copy link
Collaborator Author

maleadt commented Oct 13, 2025

I'm occasionally seeing:

Error in testset basic:
Error During Test at none:1
  Got exception outside of a @test
  On worker 7:
  KeyError: key Base.PkgId(Base.UUID("d3525ed8-44d0-4b2c-a655-542cee43accc"), "ParallelTestRunner") not found

... which I'm not sure how to explain

@maleadt maleadt merged commit 96da192 into main Oct 13, 2025
20 checks passed
@maleadt maleadt deleted the tb/refactor_printing branch October 13, 2025 12:32
@vchuravy
Copy link
Member

Very cool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failing tests have partially wrong stacktrace Speed up test interruption

3 participants