Adapt to GPUCompiler 0.18 #1799

maleadt · 2023-03-14T14:18:42Z

Also refactors the compiler instantiation functionality.

maleadt · 2023-03-14T19:08:17Z

Uhh, so this speeds up CI (and local testing) by 30-50% 🤯 There are two major changes that may explain the speed-up:

- we track world age bounds better;
- some allocations on the kernel launch path are gone.

I can't imagine the latter explaining such a significant performance improvement though.

maleadt · 2023-03-15T06:45:23Z

I partially reverted changes to figure out which one was responsible, and it's JuliaGPU/GPUCompiler.jl#394. So probably we weren't actually caching inference results before?

maleadt · 2023-03-15T07:16:01Z

Zooming in on one of the more compilation-heavy test suites. Before:

                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
gpuarrays/reductions/minimum maximum extrema  (2) |   428.64 |   0.02 |  0.0 |       2.19 |   164.00 |  44.59 | 10.4 |   77145.76 |  6116.79 |
      From worker 2:	ROOT                      :  0.29 %   4358156652
      From worker 2:	GC                        : 10.35 %   155737735862
      From worker 2:	LOWERING                  :  0.73 %   10948753338
      From worker 2:	PARSING                   :  0.02 %   240557038
      From worker 2:	INFERENCE                 : 22.96 %   345677035476
      From worker 2:	CODEGEN                   :  1.43 %   21536148027
      From worker 2:	METHOD_LOOKUP_SLOW        :  0.16 %   2336906767
      From worker 2:	METHOD_LOOKUP_FAST        :  0.69 %   10380410781
      From worker 2:	LLVM_OPT                  : 16.83 %   253331561010
      From worker 2:	LLVM_MODULE_FINISH        : 19.40 %   291992202844
      From worker 2:	METHOD_MATCH              :  4.55 %   68471389325
      From worker 2:	TYPE_CACHE_LOOKUP         : 12.11 %   182266100602
      From worker 2:	TYPE_CACHE_INSERT         :  0.01 %   173729519
      From worker 2:	STAGED_FUNCTION           :  0.05 %   703772998
      From worker 2:	MACRO_INVOCATION          :  0.00 %   64829275
      From worker 2:	AST_COMPRESS              :  0.67 %   10062717010
      From worker 2:	AST_UNCOMPRESS            :  1.54 %   23135474442
      From worker 2:	SYSIMG_LOAD               :  0.06 %   847003392
      From worker 2:	ADD_METHOD                :  0.62 %   9337180473
      From worker 2:	INIT_MODULE               :  0.01 %   79426959
Testing finished in 7 minutes, 13 seconds, 143 milliseconds

After:

                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
gpuarrays/reductions/minimum maximum extrema  (2) |   216.24 |   0.02 |  0.0 |       2.19 |   164.00 |   9.52 |  4.4 |   22780.41 |  3063.43 |
      From worker 2:	ROOT                      :  0.59 %   4576884704
      From worker 2:	GC                        :  4.46 %   34910530948
      From worker 2:	LOWERING                  :  1.33 %   10411143740
      From worker 2:	PARSING                   :  0.03 %   237914694
      From worker 2:	INFERENCE                 : 14.38 %   112516036469
      From worker 2:	CODEGEN                   :  2.25 %   17627011981
      From worker 2:	METHOD_LOOKUP_SLOW        :  0.20 %   1571032821
      From worker 2:	METHOD_LOOKUP_FAST        :  0.60 %   4718704146
      From worker 2:	LLVM_OPT                  : 23.56 %   184315822978
      From worker 2:	LLVM_MODULE_FINISH        : 24.17 %   189110125544
      From worker 2:	METHOD_MATCH              :  6.08 %   47587361273
      From worker 2:	TYPE_CACHE_LOOKUP         :  3.93 %   30729679106
      From worker 2:	TYPE_CACHE_INSERT         :  0.02 %   167141302
      From worker 2:	STAGED_FUNCTION           :  0.09 %   694362217
      From worker 2:	MACRO_INVOCATION          :  0.01 %   61610775
      From worker 2:	AST_COMPRESS              :  0.83 %   6494759220
      From worker 2:	AST_UNCOMPRESS            :  1.34 %   10460498820
      From worker 2:	SYSIMG_LOAD               :  0.11 %   836257496
      From worker 2:	ADD_METHOD                :  1.17 %   9139799823
      From worker 2:	INIT_MODULE               :  0.01 %   78340762
Testing finished in 3 minutes, 40 seconds, 340 milliseconds

So LLVM times also improved in absolute terms: LLVM_OPT by about 27% and LLVM_MODULE_FINISH by about 35%...
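For reference, the relative reductions can be worked out directly from the raw counter values in the two dumps above. This is only a small arithmetic sketch; the `timings` table and the `reduction` helper are names made up for illustration, and the numbers are copied from the "Before" and "After" listings.

```julia
# Raw counter values copied from the timing dumps above: name => (before, after).
timings = [
    "INFERENCE"          => (345677035476, 112516036469),
    "LLVM_OPT"           => (253331561010, 184315822978),
    "LLVM_MODULE_FINISH" => (291992202844, 189110125544),
    "TYPE_CACHE_LOOKUP"  => (182266100602, 30729679106),
]

# Relative reduction in percent, rounded to one decimal place.
reduction(before, after) = round(100 * (before - after) / before; digits=1)

for (name, (before, after)) in timings
    println(rpad(name, 20), reduction(before, after), " %")
end
```

Note that the percentage columns in the dumps are shares of each run's own total, which is why the LLVM shares go up even though the absolute values go down.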

maleadt · 2023-03-15T07:55:52Z

The GPUCompiler timings are confusing me more than anything else. Before:

────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations
                              ───────────────────────   ────────────────────────
      Tot / % measured:            2446s /   1.3%           24.1GiB /  11.2%
Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
IR post-processing       252    14.2s   44.9%  56.3ms    321MiB   11.6%  1.27MiB
  optimization           223    14.1s   44.7%  63.3ms    320MiB   11.5%  1.43MiB
    nvvmreflect        8.69k    225ms    0.7%  25.9μs   23.7MiB    0.9%  2.80KiB
  clean-up               252   21.5ms    0.1%  85.4μs   11.8KiB    0.0%    48.0B
Validation               475    6.19s   19.6%  13.0ms   1.59GiB   58.8%  3.43MiB
Library linking          252    3.75s   11.9%  14.9ms   38.7MiB    1.4%   157KiB
  target libraries       223    3.52s   11.2%  15.8ms   38.4MiB    1.4%   176KiB
  runtime library        223    222ms    0.7%   998μs   31.4KiB    0.0%     144B
LLVM back-end            223    3.16s   10.0%  14.1ms   58.7MiB    2.1%   269KiB
  machine-code gen...    223    3.12s    9.9%  14.0ms   57.0MiB    2.1%   262KiB
  preparation            223   34.7ms    0.1%   156μs   1.71MiB    0.1%  7.86KiB
IR generation            252    3.09s    9.8%  12.3ms    627MiB   22.6%  2.49MiB
  emission               252    2.32s    7.3%  9.19ms    559MiB   20.2%  2.22MiB
  rewrite                252    458ms    1.4%  1.82ms   36.2MiB    1.3%   147KiB
    lower throw          252    146ms    0.5%   579μs   27.0MiB    1.0%   110KiB
    hide trap            252   13.8ms    0.0%  54.7μs   3.34MiB    0.1%  13.6KiB
  clean-up               252   5.39ms    0.0%  21.4μs   3.61MiB    0.1%  14.7KiB
lower byval              223    1.22s    3.9%  5.48ms   97.9MiB    3.5%   450KiB
Julia front-end          252   5.32ms    0.0%  21.1μs   92.0KiB    0.0%     374B
────────────────────────────────────────────────────────────────────────────────

After:

────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations
                              ───────────────────────   ────────────────────────
      Tot / % measured:            2208s /   2.6%           83.2GiB /   3.3%
Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
Validation               446    33.4s   57.2%  74.9ms   1.69GiB   61.8%  3.89MiB
IR post-processing       223    14.2s   24.3%  63.6ms    332MiB   11.8%  1.49MiB
  optimization           223    14.1s   24.1%  63.3ms    331MiB   11.8%  1.49MiB
    nvvmreflect        8.69k    240ms    0.4%  27.6μs   27.0MiB    1.0%  3.19KiB
  clean-up               223   20.9ms    0.0%  93.8μs   10.5KiB    0.0%    48.0B
Library linking          223    3.81s    6.5%  17.1ms   38.7MiB    1.4%   178KiB
  target libraries       223    3.59s    6.1%  16.1ms   38.5MiB    1.4%   177KiB
  runtime library        223    223ms    0.4%   998μs   31.4KiB    0.0%     144B
LLVM back-end            223    3.10s    5.3%  13.9ms   58.7MiB    2.1%   270KiB
  machine-code gen...    223    3.06s    5.2%  13.7ms   57.0MiB    2.0%   262KiB
  preparation            223   36.6ms    0.1%   164μs   1.73MiB    0.1%  7.96KiB
IR generation            223    2.74s    4.7%  12.3ms    542MiB   19.3%  2.43MiB
  emission               223    1.99s    3.4%  8.95ms    490MiB   17.5%  2.20MiB
  rewrite                223    454ms    0.8%  2.04ms   34.9MiB    1.2%   160KiB
    lower throw          223    150ms    0.3%   675μs   25.8MiB    0.9%   119KiB
    hide trap            223   13.3ms    0.0%  59.8μs   3.29MiB    0.1%  15.1KiB
  clean-up               223   4.37ms    0.0%  19.6μs    912KiB    0.0%  4.09KiB
lower byval              223    1.19s    2.0%  5.32ms    101MiB    3.6%   463KiB
Julia front-end          223   6.15ms    0.0%  27.6μs    203KiB    0.0%     934B
────────────────────────────────────────────────────────────────────────────────

I guess our @timeit blocks don't cover where the speedup occurs. The timings also don't look very trustworthy: the reported totals (2446s and 2208s) are much too large, while the measured sections (1.3% and 2.6% of those totals) don't come near the actual total time (4 to 7 minutes).
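To illustrate how timed sections can miss most of the run, here is a minimal sketch that uses Base's `@elapsed` as a stand-in for `@timeit` blocks (it does not use TimerOutputs itself). The functions `instrumented` and `uninstrumented` are hypothetical placeholders, not anything from CUDA.jl or GPUCompiler.jl.

```julia
# Stand-in for work wrapped in a timed section.
instrumented() = sum(sqrt, 1:10_000)

# Stand-in for work that happens outside any timed section.
uninstrumented() = sleep(0.05)

measured = Ref(0.0)

# Time the whole run, but only accumulate the instrumented part separately.
total = @elapsed begin
    measured[] += @elapsed instrumented()
    uninstrumented()
end

# Share of the total that the timed section accounts for.
coverage = 100 * measured[] / total
println("measured ", round(coverage; digits=1), "% of ", round(total; digits=3), "s")
```

Whatever happens in `uninstrumented` never shows up in the per-section numbers, which is the same situation as a speed-up that occurs outside the instrumented code.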

maleadt · 2023-03-15T09:32:26Z

Ah, the difference isn't in GPUCompiler-related compilation, but in how GPUCompiler.jl itself is compiled by Julia. Here's the (sorted and processed) compiler trace (julia --trace-compile=...) when running the CUDA.jl execution test suite with and without the world-age refactor: https://gist.github.com/maleadt/896e680053d12372710936a44f6bec64
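A rough sketch of how two such traces could be filtered, sorted and diffed is shown here. The trace lines are shortened, hypothetical examples (`Job`, `KernelA` and `KernelB` are placeholders), and `relevant` is a helper invented for illustration; a real trace would be loaded with something like `readlines(path)`.

```julia
# Hypothetical, shortened trace lines standing in for two --trace-compile outputs.
fast = [
    "precompile(Tuple{typeof(Base.hash), Int64, UInt64})",
    "precompile(Tuple{typeof(GPUCompiler.check_ir), Job, LLVM.Module})",
    "precompile(Tuple{typeof(GPUCompiler.isintrinsic), Job, String})",
]
slow = [
    "precompile(Tuple{typeof(Base.hash), Int64, UInt64})",
    "precompile(Tuple{typeof(GPUCompiler.isintrinsic), Job, String})",
    "precompile(Tuple{typeof(GPUCompiler.check_ir), Job{KernelA}, LLVM.Module})",
    "precompile(Tuple{typeof(GPUCompiler.check_ir), Job{KernelB}, LLVM.Module})",
]

# Keep only GPUCompiler-related entries, deduplicated and sorted.
relevant(lines) = sort(unique(filter(l -> occursin("GPUCompiler", l), lines)))

# Entries that appear in only one of the two traces.
only_fast = setdiff(relevant(fast), relevant(slow))
only_slow = setdiff(relevant(slow), relevant(fast))

# Print in diff style: "-" for the fast run, "+" for the slow run.
foreach(l -> println("-", l), only_fast)
foreach(l -> println("+", l), only_slow)
```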

The relevant bits (comparing fast against slow):

-precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Any}}})
-precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Any}}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String}})
...

-precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P} where P where T}, GPUCompiler.FunctionSpec, GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
-precompile(Tuple{Type{GPUCompiler.FunctionSpec}, Type, Type, UInt64})
-precompile(Tuple{Type{GPUCompiler.KernelError}, GPUCompiler.CompilerJob{T, P} where P where T, String, String})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{}, NTuple{8, Int64}}}, CUDA.CUDACompilerParams,
...

-precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}}, Array{LLVM.CallInst, 1}})
...

-precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#hello##...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}})
...

-precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{}, NTuple{8, Int64}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#dp_5arg_kernel##...", NTuple{5, Int64}}}})

+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})

-precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.ReentrantLock})
-precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
...

+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}, Int64})

-precompile(Tuple{typeof(GPUCompiler.cached_compilation), Base.Dict{UInt64, Any}, GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, Type, Type, Function, Function})
-precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, LLVM.Module})
-precompile(Tuple{typeof(GPUCompiler.ci_cache_lookup), GPUCompiler.CodeCache, Core.MethodInstance, UInt64, UInt64})
-precompile(Tuple{typeof(GPUCompiler.classify_arguments), GPUCompiler.CompilerJob{T, P} where P where T, LLVM.FunctionType})
+precompile(Tuple{typeof(GPUCompiler.cached_compilation), Base.Dict{UInt64, Any}, GPUCompiler.CompilerJob{T, P, F} where F where P where T, typeof(CUDA.cufunction_compile), typeof(CUDA.cufunction_link)})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
...

+precompile(Tuple{typeof(GPUCompiler.isintrinsic), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, _A} where _A, String})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})

-precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.KernelObject{Float64}, Tuple{CUDA.CuDeviceArray{Float64, 1, 1}}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, String})
...

Essentially, by dropping the F typevar from CompilerJob we avoid re-specializing a metric ton of code on the target function. This was introduced in JuliaGPU/GPUCompiler.jl#140; I'm surprised we didn't spot that before...
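The effect can be sketched with two toy types. These are hypothetical stand-ins, not GPUCompiler's actual definitions: `JobWithF` carries the function's type as a parameter, `JobNoF` stores the function behind an abstract field.

```julia
# Hypothetical stand-ins for illustration only; not GPUCompiler's real types.

# Variant 1: the function's type is part of the job type.
struct JobWithF{F}
    f::F
end

# Variant 2: the function sits behind an abstract field, so there is one job type.
struct JobNoF
    f::Function
end

# Each distinct function produces a distinct concrete JobWithF type, so any
# method taking such a job is compiled again for every new kernel function.
with_f_types = unique(typeof.([JobWithF(sin), JobWithF(cos), JobWithF(tan)]))

# Without the parameter, all jobs share a single concrete type.
no_f_types = unique(typeof.([JobNoF(sin), JobNoF(cos), JobNoF(tan)]))

println(length(with_f_types))      # 3
println(length(no_f_types))        # 1
println(isconcretetype(JobWithF))  # false: a Dict keyed on it has an abstract key type
println(isconcretetype(JobNoF))    # true
```

The trade-off in this sketch is that `JobNoF` gives up static knowledge of the function's type in exchange for compiling the surrounding code once.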

maleadt force-pushed the tb/world_ages branch 4 times, most recently from 7e55432 to 9b1463b March 14, 2023 16:53

maleadt marked this pull request as ready for review March 14, 2023 16:53

maleadt force-pushed the tb/world_ages branch from 9b1463b to eb79369 March 14, 2023 16:54

maleadt added 3 commits March 14, 2023 19:52

Adapt to GPUCompiler changes.

66dfab6

Use new functionality to cache compiler configuration.

557def9

Streamline creation of GPUCompiler objects.

26edcbd

maleadt force-pushed the tb/world_ages branch from eb79369 to 26edcbd March 14, 2023 18:53

maleadt changed the title ~~Adapt to GPUCompiler changes~~ Adapt to GPUCompiler 0.18 Mar 15, 2023

maleadt merged commit a218b9c into master Mar 15, 2023

maleadt deleted the tb/world_ages branch March 15, 2023 10:31
