Adapt to GPUCompiler 0.18 #1799

Merged: maleadt merged 3 commits into master from tb/world_ages on Mar 15, 2023

maleadt
Member

@maleadt maleadt commented Mar 14, 2023

Also refactors the compiler instantiation functionality.

@maleadt maleadt force-pushed the tb/world_ages branch 4 times, most recently from 7e55432 to 9b1463b Compare March 14, 2023 16:53
@maleadt maleadt marked this pull request as ready for review March 14, 2023 16:53
@maleadt
Member Author

maleadt commented Mar 14, 2023

Uhh, so this speeds up CI (and local testing) by 30-50% 🤯 There are two major new changes that may explain the speed-up:

  • we better track world age bounds;
  • some allocations on the kernel launch path are gone.

I can't imagine the latter explaining such a significant performance improvement though.

@maleadt
Member Author

maleadt commented Mar 15, 2023

I partially reverted changes to figure out which one was responsible, and it's JuliaGPU/GPUCompiler.jl#394. So probably we weren't actually caching inference results before?

@maleadt
Member Author

maleadt commented Mar 15, 2023

Zooming in on one of the more compilation-heavy test suites. Before:

                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
gpuarrays/reductions/minimum maximum extrema  (2) |   428.64 |   0.02 |  0.0 |       2.19 |   164.00 |  44.59 | 10.4 |   77145.76 |  6116.79 |
      From worker 2:	ROOT                      :  0.29 %   4358156652
      From worker 2:	GC                        : 10.35 %   155737735862
      From worker 2:	LOWERING                  :  0.73 %   10948753338
      From worker 2:	PARSING                   :  0.02 %   240557038
      From worker 2:	INFERENCE                 : 22.96 %   345677035476
      From worker 2:	CODEGEN                   :  1.43 %   21536148027
      From worker 2:	METHOD_LOOKUP_SLOW        :  0.16 %   2336906767
      From worker 2:	METHOD_LOOKUP_FAST        :  0.69 %   10380410781
      From worker 2:	LLVM_OPT                  : 16.83 %   253331561010
      From worker 2:	LLVM_MODULE_FINISH        : 19.40 %   291992202844
      From worker 2:	METHOD_MATCH              :  4.55 %   68471389325
      From worker 2:	TYPE_CACHE_LOOKUP         : 12.11 %   182266100602
      From worker 2:	TYPE_CACHE_INSERT         :  0.01 %   173729519
      From worker 2:	STAGED_FUNCTION           :  0.05 %   703772998
      From worker 2:	MACRO_INVOCATION          :  0.00 %   64829275
      From worker 2:	AST_COMPRESS              :  0.67 %   10062717010
      From worker 2:	AST_UNCOMPRESS            :  1.54 %   23135474442
      From worker 2:	SYSIMG_LOAD               :  0.06 %   847003392
      From worker 2:	ADD_METHOD                :  0.62 %   9337180473
      From worker 2:	INIT_MODULE               :  0.01 %   79426959
Testing finished in 7 minutes, 13 seconds, 143 milliseconds

After:

                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
gpuarrays/reductions/minimum maximum extrema  (2) |   216.24 |   0.02 |  0.0 |       2.19 |   164.00 |   9.52 |  4.4 |   22780.41 |  3063.43 |
      From worker 2:	ROOT                      :  0.59 %   4576884704
      From worker 2:	GC                        :  4.46 %   34910530948
      From worker 2:	LOWERING                  :  1.33 %   10411143740
      From worker 2:	PARSING                   :  0.03 %   237914694
      From worker 2:	INFERENCE                 : 14.38 %   112516036469
      From worker 2:	CODEGEN                   :  2.25 %   17627011981
      From worker 2:	METHOD_LOOKUP_SLOW        :  0.20 %   1571032821
      From worker 2:	METHOD_LOOKUP_FAST        :  0.60 %   4718704146
      From worker 2:	LLVM_OPT                  : 23.56 %   184315822978
      From worker 2:	LLVM_MODULE_FINISH        : 24.17 %   189110125544
      From worker 2:	METHOD_MATCH              :  6.08 %   47587361273
      From worker 2:	TYPE_CACHE_LOOKUP         :  3.93 %   30729679106
      From worker 2:	TYPE_CACHE_INSERT         :  0.02 %   167141302
      From worker 2:	STAGED_FUNCTION           :  0.09 %   694362217
      From worker 2:	MACRO_INVOCATION          :  0.01 %   61610775
      From worker 2:	AST_COMPRESS              :  0.83 %   6494759220
      From worker 2:	AST_UNCOMPRESS            :  1.34 %   10460498820
      From worker 2:	SYSIMG_LOAD               :  0.11 %   836257496
      From worker 2:	ADD_METHOD                :  1.17 %   9139799823
      From worker 2:	INIT_MODULE               :  0.01 %   78340762
Testing finished in 3 minutes, 40 seconds, 340 milliseconds

So LLVM times also improved, by roughly 27% (LLVM_OPT) and 35% (LLVM_MODULE_FINISH) going by the raw counter values...
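To put numbers on the two dumps above, the relative change of each counter can be computed directly. The snippet below is only a sketch: the counter values are copied from the "Before" and "After" listings, their unit is not stated there, and the helper name `reduction` is made up for illustration.

```julia
# Counter values copied from the two timing dumps above (unit not stated there).
before = Dict("GC"                 => 155737735862,
              "INFERENCE"          => 345677035476,
              "LLVM_OPT"           => 253331561010,
              "LLVM_MODULE_FINISH" => 291992202844,
              "TYPE_CACHE_LOOKUP"  => 182266100602)
after  = Dict("GC"                 => 34910530948,
              "INFERENCE"          => 112516036469,
              "LLVM_OPT"           => 184315822978,
              "LLVM_MODULE_FINISH" => 189110125544,
              "TYPE_CACHE_LOOKUP"  => 30729679106)

# Relative reduction in percent: 100 * (1 - after / before).
# `reduction` is a hypothetical helper name, not part of any package.
reduction(name) = 100 * (1 - after[name] / before[name])

for name in sort(collect(keys(before)))
    println(rpad(name, 20), round(reduction(name); digits = 1), " % lower")
end
```

Note that the percentage columns in the dumps are shares of each run's own total, so they cannot be compared across runs; the raw counters can.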

@maleadt
Member Author

maleadt commented Mar 15, 2023

The GPUCompiler timings are confusing me more than anything else. Before:

────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations
                              ───────────────────────   ────────────────────────
      Tot / % measured:            2446s /   1.3%           24.1GiB /  11.2%
Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
IR post-processing       252    14.2s   44.9%  56.3ms    321MiB   11.6%  1.27MiB
  optimization           223    14.1s   44.7%  63.3ms    320MiB   11.5%  1.43MiB
    nvvmreflect        8.69k    225ms    0.7%  25.9μs   23.7MiB    0.9%  2.80KiB
  clean-up               252   21.5ms    0.1%  85.4μs   11.8KiB    0.0%    48.0B
Validation               475    6.19s   19.6%  13.0ms   1.59GiB   58.8%  3.43MiB
Library linking          252    3.75s   11.9%  14.9ms   38.7MiB    1.4%   157KiB
  target libraries       223    3.52s   11.2%  15.8ms   38.4MiB    1.4%   176KiB
  runtime library        223    222ms    0.7%   998μs   31.4KiB    0.0%     144B
LLVM back-end            223    3.16s   10.0%  14.1ms   58.7MiB    2.1%   269KiB
  machine-code gen...    223    3.12s    9.9%  14.0ms   57.0MiB    2.1%   262KiB
  preparation            223   34.7ms    0.1%   156μs   1.71MiB    0.1%  7.86KiB
IR generation            252    3.09s    9.8%  12.3ms    627MiB   22.6%  2.49MiB
  emission               252    2.32s    7.3%  9.19ms    559MiB   20.2%  2.22MiB
  rewrite                252    458ms    1.4%  1.82ms   36.2MiB    1.3%   147KiB
    lower throw          252    146ms    0.5%   579μs   27.0MiB    1.0%   110KiB
    hide trap            252   13.8ms    0.0%  54.7μs   3.34MiB    0.1%  13.6KiB
  clean-up               252   5.39ms    0.0%  21.4μs   3.61MiB    0.1%  14.7KiB
lower byval              223    1.22s    3.9%  5.48ms   97.9MiB    3.5%   450KiB
Julia front-end          252   5.32ms    0.0%  21.1μs   92.0KiB    0.0%     374B
────────────────────────────────────────────────────────────────────────────────

After:

────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations
                              ───────────────────────   ────────────────────────
      Tot / % measured:            2208s /   2.6%           83.2GiB /   3.3%
Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
Validation               446    33.4s   57.2%  74.9ms   1.69GiB   61.8%  3.89MiB
IR post-processing       223    14.2s   24.3%  63.6ms    332MiB   11.8%  1.49MiB
  optimization           223    14.1s   24.1%  63.3ms    331MiB   11.8%  1.49MiB
    nvvmreflect        8.69k    240ms    0.4%  27.6μs   27.0MiB    1.0%  3.19KiB
  clean-up               223   20.9ms    0.0%  93.8μs   10.5KiB    0.0%    48.0B
Library linking          223    3.81s    6.5%  17.1ms   38.7MiB    1.4%   178KiB
  target libraries       223    3.59s    6.1%  16.1ms   38.5MiB    1.4%   177KiB
  runtime library        223    223ms    0.4%   998μs   31.4KiB    0.0%     144B
LLVM back-end            223    3.10s    5.3%  13.9ms   58.7MiB    2.1%   270KiB
  machine-code gen...    223    3.06s    5.2%  13.7ms   57.0MiB    2.0%   262KiB
  preparation            223   36.6ms    0.1%   164μs   1.73MiB    0.1%  7.96KiB
IR generation            223    2.74s    4.7%  12.3ms    542MiB   19.3%  2.43MiB
  emission               223    1.99s    3.4%  8.95ms    490MiB   17.5%  2.20MiB
  rewrite                223    454ms    0.8%  2.04ms   34.9MiB    1.2%   160KiB
    lower throw          223    150ms    0.3%   675μs   25.8MiB    0.9%   119KiB
    hide trap            223   13.3ms    0.0%  59.8μs   3.29MiB    0.1%  15.1KiB
  clean-up               223   4.37ms    0.0%  19.6μs    912KiB    0.0%  4.09KiB
lower byval              223    1.19s    2.0%  5.32ms    101MiB    3.6%   463KiB
Julia front-end          223   6.15ms    0.0%  27.6μs    203KiB    0.0%     934B
────────────────────────────────────────────────────────────────────────────────

I guess our @timeit blocks don't cover where the speedup occurs. The timings also don't look very trustworthy; the total time is much too large, while the individual times don't come near the actual total time (4 to 7 minutes).
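A quick sanity check of that observation, as a sketch only: it assumes the two timer tables belong to the same runs as the 7m13s and 3m40s totals reported earlier, and it sums only the top-level sections (nested rows are already included in their parents). The name `covered` is a hypothetical helper.

```julia
# Top-level section times in seconds, copied from the "Before" and "After" timer tables.
# Order before: IR post-processing, Validation, Library linking, LLVM back-end,
#               IR generation, lower byval, Julia front-end.
before_sections = [14.2, 6.19, 3.75, 3.16, 3.09, 1.22, 0.00532]
# Order after:  Validation, IR post-processing, Library linking, LLVM back-end,
#               IR generation, lower byval, Julia front-end.
after_sections  = [33.4, 14.2, 3.81, 3.10, 2.74, 1.19, 0.00615]

# Wall-clock totals reported by the test run (assumed to be the same runs).
before_wall = 7 * 60 + 13.143   # 7 minutes, 13 seconds, 143 milliseconds
after_wall  = 3 * 60 + 40.340   # 3 minutes, 40 seconds, 340 milliseconds

# Share of the wall-clock time that the timed sections account for, in percent.
covered(sections, wall) = 100 * sum(sections) / wall

println("timed before: ", round(sum(before_sections); digits = 1), " s (",
        round(covered(before_sections, before_wall); digits = 1), " % of wall time)")
println("timed after:  ", round(sum(after_sections); digits = 1), " s (",
        round(covered(after_sections, after_wall); digits = 1), " % of wall time)")
println("wall time saved: ", round(before_wall - after_wall; digits = 1), " s")
```

Under those assumptions the timed sections actually take longer after the change (mostly because of Validation), while the wall time drops by several minutes, so the saving has to come from code outside the timed blocks.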

@maleadt
Member Author

maleadt commented Mar 15, 2023

Ah, the difference isn't in GPUCompiler-related compilation, but in how GPUCompiler.jl itself is compiled by Julia. Here's the (sorted and processed) compiler trace (julia --trace-compile=...) when running the CUDA.jl execution test suite with and without the world-age refactor: https://gist.github.com/maleadt/896e680053d12372710936a44f6bec64
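The exact processing behind that gist is not described here. One way such a comparison could be done, as a rough sketch: record a trace per configuration with `julia --trace-compile=<file>`, keep the statements that mention a package of interest, and report the ones unique to each side. The function `trace_diff`, its `filter_on` keyword, and the sample lines are all hypothetical.

```julia
# Hypothetical helper (not from the PR): keep only statements mentioning a given
# module and return, sorted, the ones that occur in just one of the two traces.
function trace_diff(fast::Vector{String}, slow::Vector{String}; filter_on = "GPUCompiler")
    keep(lines) = Set{String}(l for l in lines if occursin(filter_on, l))
    f, s = keep(fast), keep(slow)
    only_fast = sort!(collect(setdiff(f, s)))
    only_slow = sort!(collect(setdiff(s, f)))
    return only_fast, only_slow
end

# Small in-memory stand-ins for two trace files; with real traces one would
# pass readlines("fast.jl") and readlines("slow.jl") instead (file names assumed).
fast = ["precompile(Tuple{typeof(Base.sin), Float64})",
        "precompile(Tuple{typeof(GPUCompiler.JuliaContext), Function})",
        "precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, LLVM.Module})"]
slow = ["precompile(Tuple{typeof(Base.sin), Float64})",
        "precompile(Tuple{typeof(GPUCompiler.JuliaContext), Function})",
        "precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}, LLVM.Module})"]

only_fast, only_slow = trace_diff(fast, slow)
foreach(l -> println("-", l), only_fast)   # statements seen only in the fast run
foreach(l -> println("+", l), only_slow)   # statements seen only in the slow run
```

Statements shared by both traces, and statements that do not mention the filtered module, drop out of the result.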

The relevant bits (comparing fast against slow; lines prefixed with - appear only in the fast trace, lines prefixed with + only in the slow one):

-precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Any}}})
-precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Any}}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String}})
+precompile(Tuple{Type{Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}}, Pair{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String}})
...
-precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P} where P where T}, GPUCompiler.FunctionSpec, GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
-precompile(Tuple{Type{GPUCompiler.FunctionSpec}, Type, Type, UInt64})
-precompile(Tuple{Type{GPUCompiler.KernelError}, GPUCompiler.CompilerJob{T, P} where P where T, String, String})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}, CUDA.CUDACompilerParams, Symbol, Bool})
+precompile(Tuple{Type{GPUCompiler.CompilerJob{T, P, F} where F where P where T}, GPUCompiler.PTXCompilerTarget, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{}, NTuple{8, Int64}}}, CUDA.CUDACompilerParams,
...
-precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}}, Array{LLVM.CallInst, 1}})
+precompile(Tuple{typeof(Base.get!), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}}, Array{LLVM.CallInst, 1}})
...
-precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}})
+precompile(Tuple{typeof(Base.get!), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#hello##...", Tuple{}}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}}, Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, String}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Main.world), Tuple{}}}})
...
-precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P} where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{}, NTuple{8, Int64}}}}})
+precompile(Tuple{typeof(Base.getindex), Base.Dict{GPUCompiler.CompilerJob{T, P, F} where F where P where T, Array{LLVM.CallInst, 1}}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"#dp_5arg_kernel##...", NTuple{5, Int64}}}})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
+precompile(Tuple{typeof(Base.hash), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, UInt64})
-precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.ReentrantLock})
-precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{LLVM.ThreadSafeContext, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
+precompile(Tuple{typeof(Base.lock), GPUCompiler.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}, Base.ReentrantLock})
...
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"..."{Int64}, Tuple{}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int16}, Tuple{Int32, Int8, Int16, Int64, Int16, Int16}}}, Int64})
+precompile(Tuple{typeof(Base.setindex!), Base.Dict{Int64, Union{GPUCompiler.CompilerJob{T, P, F} where F where P where T, GPUCompiler.FunctionSpec{F, TT} where TT where F}}, GPUCompiler.FunctionSpec{Main.var"#child##...", Tuple{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Int32}, Tuple{Int16}}}, Int64})
-precompile(Tuple{typeof(GPUCompiler.cached_compilation), Base.Dict{UInt64, Any}, GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, Type, Type, Function, Function})
-precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, LLVM.Module})
-precompile(Tuple{typeof(GPUCompiler.ci_cache_lookup), GPUCompiler.CodeCache, Core.MethodInstance, UInt64, UInt64})
-precompile(Tuple{typeof(GPUCompiler.classify_arguments), GPUCompiler.CompilerJob{T, P} where P where T, LLVM.FunctionType})
+precompile(Tuple{typeof(GPUCompiler.cached_compilation), Base.Dict{UInt64, Any}, GPUCompiler.CompilerJob{T, P, F} where F where P where T, typeof(CUDA.cufunction_compile), typeof(CUDA.cufunction_link)})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
+precompile(Tuple{typeof(GPUCompiler.check_ir), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, LLVM.Module})
...
+precompile(Tuple{typeof(GPUCompiler.isintrinsic), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, _A} where _A, String})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})
+precompile(Tuple{typeof(GPUCompiler.JuliaContext), CUDA.var"..."{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}}})
-precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Int64, 1, 1}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Main.var"...", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Int64, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel##...", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Tuple{Base.Complex{Float32}, Base.Complex{Float32}}, 1, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(Base.Math.sincos), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Base.Complex{Float32}, 1, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.KernelObject{Float64}, Tuple{CUDA.CuDeviceArray{Float64, 1, 1}}}}, String})
+precompile(Tuple{Type{Pair{A, B} where B where A}, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{Main.var"...", Tuple{}}}, String})
...

Essentially, by dropping the F typevar from CompilerJob we avoid re-specializing a metric ton of code on the target function. That typevar was introduced in JuliaGPU/GPUCompiler.jl#140; I'm surprised we didn't spot this before...
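The effect can be illustrated with a toy example. The two structs below are hypothetical stand-ins, not GPUCompiler's actual definitions: one carries the function's type as a type parameter, the other stores the function behind an abstractly typed field.

```julia
# Toy stand-ins; these are NOT GPUCompiler's real types.
struct SpecializedJob{F}   # the function's type is part of the job's type
    f::F
end

struct PlainJob            # the function's type does not leak into the job's type
    f::Function
end

kernels = (sin, cos, tan)

# Each distinct function produces a distinct concrete job type ...
specialized_types = unique(typeof(SpecializedJob(k)) for k in kernels)
# ... whereas the plain job has one concrete type regardless of the function.
plain_types = unique(typeof(PlainJob(k)) for k in kernels)

println("SpecializedJob concrete types: ", length(specialized_types))
println("PlainJob concrete types:       ", length(plain_types))
```

Since Julia generally compiles a method once per concrete argument type, any code that takes the parameterized job (constructors, `hash`, `get!` on a `Dict`, and so on) may be compiled again for every new kernel, which matches the pattern of repeated `+` lines in the trace above.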

@maleadt maleadt changed the title Adapt to GPUCompiler changes Adapt to GPUCompiler 0.18 Mar 15, 2023
@maleadt maleadt merged commit a218b9c into master Mar 15, 2023
@maleadt maleadt deleted the tb/world_ages branch March 15, 2023 10:31