
add GPUCompiler precompilation caching #425

Draft: wants to merge 16 commits into master

Conversation

@collinwarner (Contributor) commented Apr 9, 2023

Adds the ability to precompile code in GPUCompiler.GLOBAL_CI_CACHES. It taps into the non-GPU caching of global constants to write out the current instance of the global cache and reload it on initialization. The user is required to declare, initialize, and snapshot a local cache, and then call GPUCompiler.precompile_gpucompiler. Mainly, this adds an API for downstream packages such as Enzyme and CUDA to use to cache instances of their functions. A sample SimpleGPU and Example.jl illustrate usage.
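A minimal usage sketch of the initially proposed macro API (macro names as described here and in the example later in this thread; the signature of precompile_gpucompiler is illustrative, and the PR later moves to a function-based API):

module SimpleGPU

import GPUCompiler

GPUCompiler.@declare_cache()    # declare a package-local cache ("anchor")

kernel(x) = x + 1

# compile during precompilation so the result lands in the local cache
# (signature is illustrative, not the final interface)
GPUCompiler.precompile_gpucompiler(kernel, (Int,))

function __init__()
    GPUCompiler.@reinit_cache() # reload the snapshot at package load time
end

GPUCompiler.@snapshot_cache()   # persist the cache into this package's .ji

end # module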

@maleadt (Member) commented Apr 9, 2023

You forgot to commit precompile_native.jl.

@codecov (bot) commented Apr 9, 2023

Codecov Report

Patch coverage has no change; project coverage changes by -10.21% ⚠️

Comparison is base (d5086fb) 87.08% compared to head (cc34d21) 76.87%.

❗ Current head cc34d21 differs from the pull request's most recent head 1951087. Consider uploading reports for commit 1951087 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #425       +/-   ##
===========================================
- Coverage   87.08%   76.87%   -10.21%     
===========================================
  Files          24       25        +1     
  Lines        2943     2993       +50     
===========================================
- Hits         2563     2301      -262     
- Misses        380      692      +312     
Impacted Files                 Coverage Δ
src/GPUCompiler.jl             100.00% <ø> (ø)
src/jlgen.jl                    66.86% <0.00%> (-16.57%) ⬇️
src/precompilation_cache.jl      0.00% <0.00%> (ø)

... and 14 files with indirect coverage changes


@maleadt (Member) commented Apr 11, 2023

Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream lots of functionality is entirely unused (e.g. reinit_cache).

I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., how JET.jl does it https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396
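For reference, a minimal sketch of that JET.jl-style approach, where a package simply exercises representative code during its own precompilation; the function names here are illustrative stand-ins:

module DownstreamPackage

# only run the workload while this package's precompilation image
# is being generated
if ccall(:jl_generating_output, Cint, ()) == 1
    f(x) = x + 1
    precompile(f, (Int,))   # stand-in for a real GPUCompiler-based compile
end

end # module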

maleadt marked this pull request as draft on April 11, 2023, 07:44
@collinwarner (Contributor, Author):

Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream lots of functionality is entirely unused (e.g. reinit_cache).

I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., how JET.jl does it https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396

Updated initial comment and added some example code. Hope this clears some things up!

@maleadt (Member) commented Apr 11, 2023

Not really, sorry. Could you describe what problem you want to solve, why it doesn't work with the current precompilation tools, and why you opted for the design you did? Those global undocumented macros (doing questionable things) are a very non-Julian API.

@collinwarner (Contributor, Author):

Not really, sorry. Could you describe what problem you want to solve, why it doesn't work with the current precompilation tools, and why you opted for the design you did? Those global undocumented macros (doing questionable things) are a very non-Julian API.

The main issue is that GPUCompiler's GLOBAL_CI_CACHES is not persistent across runs. This commit fixes that issue, at the cost of some user input. It would improve time-to-first-x for anything that requires GPUCompiler. I have a pull request for Enzyme in the works as one downstream use case; another is in CUDA.

I've been working with @vchuravy on this, with an eventual extension being to cache binary code between runs, not just inference results.

The reason for so much user involvement and the use of macros is that this was the simplest way forward. We use macros to create a local cache, outside of the user's control, with a unique name that does not conflict with user code. We want a unique cache to eliminate duplication in the cache. Additionally, we tried making all of this run at init time, but that was too late: the caches had already been serialized at that point, so we needed user involvement.

We definitely want to try to reduce this, but it is a first polished attempt at the matter.
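To illustrate the macro point: a macro expands in the calling module, so the generated cache constant gets a unique name and is serialized into the caller's .ji file rather than GPUCompiler's. A hypothetical sketch (this is not the PR's actual @declare_cache):

# hypothetical expansion of a cache-declaring macro
macro declare_cache()
    name = esc(gensym(:LOCAL_CI_CACHE))  # unique name, cannot clash with user code
    quote
        const $name = IdDict()           # lives in the *calling* module
    end
end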

@maleadt (Member) commented Apr 12, 2023

The main issue is that GPUCompiler's GLOBAL_CI_CACHES is not persistent across runs.

Why not? It's just a global dict, why doesn't it get serialized in the .ji file?

@collinwarner (Contributor, Author):

The main issue is that GPUCompiler's GLOBAL_CI_CACHES is not persistent across runs.

Why not? It's just a global dict, why doesn't it get serialized in the .ji file?

It is serialized, it just happens too early in the process. By the time the dependent packages have inserted into the cache, it is too late for the global. Additionally, multiple child packages can now mutate the cache and still see cache improvements.

@maleadt (Member) commented Apr 12, 2023

It is serialized, it just happens too early in the process.

Repeating my comment from Slack: Is this because the global is serialized as part of the GPUCompiler.ji, and isn't part of, e.g., CUDA.jl's precompilation image? In that case, you could override ci_cache and use a Dict that's serialized as part of the downstream package, in order to avoid this complexity.

If that turns out to be the way to do it, we could even remove the global CI cache here to force users to bring their own (and thus get proper precompilation of GPUCompiler-inferred CIs).
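A hedged sketch of that alternative, assuming a downstream package with its own AbstractCompilerParams subtype; the names MyParams and MY_CI_CACHE are made up here, and a zero-argument CodeCache constructor is assumed:

module DownstreamPackage

using GPUCompiler

struct MyParams <: GPUCompiler.AbstractCompilerParams end

# owned by this package, hence serialized into DownstreamPackage.ji
const MY_CI_CACHE = GPUCompiler.CodeCache()

# route this package's compiler jobs to the package-local cache
GPUCompiler.ci_cache(::GPUCompiler.CompilerJob{<:Any,MyParams}) = MY_CI_CACHE

end # module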

@vchuravy (Member):

So the overarching design considerations are:

- Users of Enzyme.jl/CUDA.jl/AMDGPU.jl should be able to "precompile" their code.
- Where can we store these precompilation results, while also ensuring that they get invalidated properly?

Each user package will need to declare an "anchor"/cache that will be serialized into the .ji of this package. So the workflow is something like:

module ParallelStencil

using CUDA
using Enzyme
import GPUCompiler

GPUCompiler.@declare_cache() # anchor

f(x) = 1
CUDA.precompile_cuda(f, (Int, ))
Enzyme.precompile_fwd(f, (Int, ))

function __init__()
    GPUCompiler.@reinit_cache()
end

GPUCompiler.@snapshot_cache()

end # module

So it is not the downstream packages of GPUCompiler that need to bring their own cache, but the users of those packages.

We use the cachefile of ParallelStencil to save the cache entries that were discovered during precompilation of PS, and we then need to re-insert those entries into the global cache at load time.

That's at least the high-level design Collin and I came up with.

@vchuravy (Member) left a review comment:

The new content in examples should likely go into test.

Three outdated review threads on src/precompile_native.jl (resolved). A further thread on this hunk:
CodeCache(cache::CodeCache) = new(GPUCompiler.copyAndFilter(cache.dict))
end

function copyAndFilter(dict::IdDict)
Member:

What is this needed for?

Author (@collinwarner):

That is used in https://github.com/collinwarner/GPUCompiler.jl/blob/3dbe9d5b7c7c5f56f18553f0e4d4bd9c2bdcaca5/src/precompile_native.jl#L102

It creates a CodeCache that contains only unbounded entries (CodeInstances whose max_world is typemax). It is used when snapshotting.

Member:

Can we just write this as filter(validate_codecache, cache.dict), where validate_codecache is:

function validate_codecache(cache)
    for ci in cache
        if ci.max_world < typemax(typeof(ci.max_world))
            return false
        end
        return true
    end
end

But that seems overeager: the early return true means only the first entry is checked. Are we guaranteed just one entry? Or do we want to remove all CIs that don't have an unbounded max_world?
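For comparison, a version that checks every CodeInstance instead of returning after the first one (a sketch, assuming each cache value is a collection of CodeInstances):

function validate_codecache(cis)
    # valid only if *every* CodeInstance is live up to the maximum world age
    all(ci -> ci.max_world == typemax(typeof(ci.max_world)), cis)
end

filtered = filter(kv -> validate_codecache(kv.second), cache.dict)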

@maleadt (Member) commented Apr 14, 2023

Why does the API consist of macros? Why doesn't something like this work:

module DownstreamPackage

using GPUCompiler, CUDA

const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)

__init__() = GPUCompiler.ci_cache_insert(cache)

end
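For context, a rough sketch of what those three helpers could do (illustrative only; as discussed further down, the PR's actual snapshot also filters out CodeInstances with finite world ranges rather than doing a plain copy):

# snapshot: remember the current state of the global caches
ci_cache_snapshot() = copy(GPUCompiler.GLOBAL_CI_CACHES)

# delta: everything added to the global caches since `snapshot` was taken
ci_cache_delta(snapshot) =
    filter(kv -> !haskey(snapshot, kv.first), GPUCompiler.GLOBAL_CI_CACHES)

# insert: merge a saved delta back into the global caches at load time
ci_cache_insert(cache) = merge!(GPUCompiler.GLOBAL_CI_CACHES, cache)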

@collinwarner (Contributor, Author) commented Apr 16, 2023

Why does the API consist of macros? Why doesn't something like this work:

module DownstreamPackage

using GPUCompiler, CUDA

const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)

__init__() = GPUCompiler.ci_cache_insert(cache)

end

That would seem to work. Updating now.

@maleadt (Member) commented Apr 17, 2023

Downstream packages probably should not serialize the entire cache snapshot, and should rather do something like:

module DownstreamPackage

using GPUCompiler, CUDA

const cache = let
    cache_snapshot = GPUCompiler.ci_cache_snapshot()
    include("precompile.jl")
    GPUCompiler.ci_cache_delta(cache_snapshot)
end

__init__() = GPUCompiler.ci_cache_insert(cache)

end

But that doesn't change the actual API.

@collinwarner (Contributor, Author) commented Apr 23, 2023

Changed the API to follow @maleadt's advice, which leads to a cleaner interface. Added an example kernel with caching at test/ExamplePersistentCache/GPUKernel.jl. Using this, you get a persistent cache, which reduces recompilation time on consecutive calls of using Package across Julia restarts.

@collinwarner (Contributor, Author):

Remaining work is to test integration with downstream packages such as Enzyme, Oceananigans, CUDA, AMDGPU, etc. Additionally, there are potentially some algorithmic improvements to the merge algorithm to bring precompile times with and without this feature more in line.

@collinwarner (Contributor, Author):

[screenshot omitted]

Reloads Global Cache from global variable which stores the previous
cached results
"""
function reinit_cache(LOCAL_CACHE)
Member:

What is this used for?

Author (@collinwarner):

Oops, remnant code.

Author (@collinwarner):

Dead code, removed

@maleadt (Member) commented Apr 23, 2023

cc @aviatesk, this may be relevant to DAECompiler (as a workaround, until we have the ability to update another module's globals, i.e., a ci cache).

@collinwarner (Contributor, Author) commented Apr 23, 2023

We see a greater performance improvement when this is used during Enzyme.jl's precompilation phase: EnzymeAD/Enzyme.jl#760

[screenshot: timing results]

export ci_cache_snapshot, ci_cache_delta, ci_cache_insert, precompile_gpucompiler

function ci_cache_snapshot()
    cleaned_cache_to_save = IdDict()
Member:

Is this just copy(GPUCompiler.GLOBAL_CI_CACHES)?

Author (@collinwarner):

There is an additional pass when constructing the CodeCache that removes CodeInstances with finite world ranges. I could potentially split that process into two phases, copying then filtering, but since we were already doing one pass over the data I thought we could add the filtering in directly.

@collinwarner (Contributor, Author) commented May 2, 2023

This also improves downstream CUDA code. Timings for creating two CuArrays and a vectorized add:

[screenshot: timing results]
