Conversation
Also pushed some optimizations. On an artificial alloc-heavy workload by Deniz, before:
After:
Old (but still default) binned allocator:
Basically, the new allocator is quite a bit slower, but it doesn't lean on the GC as much, which should be a (strong) positive in more realistic workloads like https://github.com/JuliaGPU/CuArrays.jl/issues/273. Proof (also to convince myself I'm doing something useful here): resnet50 with the binned memory allocator, no memory cap (32GB GPU):
resnet50 with split memory allocator, no memory cap (32GB GPU):
Similar situation to the artificial benchmark. However, adding some memory pressure to the mix: resnet50 with binned memory allocator, 8GB memory cap:
resnet50 with split memory allocator, 8GB memory cap:
A 50% reduction in allocation time, and gc(false+true) time down from 50s to 20s. There are some optimizations to be made before the new allocator outperforms the old one in all cases, but I'm confident that's possible.
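For readers unfamiliar with the two designs being compared: a binned pool rounds requests up to fixed size classes and caches whole buffers per bin, while a splitting allocator carves the requested size out of a larger free block and coalesces neighbors on free, so memory is reused directly instead of piling up until the GC runs. The sketch below is purely illustrative (it is not the CuArrays implementation; all names are made up) and shows first-fit allocation with block splitting and coalescing over a single arena:

```julia
# Hypothetical sketch of a splitting allocator, NOT the CuArrays code:
# first-fit search over an offset-sorted block list, splitting large
# free blocks on alloc and merging free neighbors on free.

mutable struct Block
    off::Int    # offset into the arena
    sz::Int     # size in bytes
    free::Bool
end

struct SplitPool
    blocks::Vector{Block}   # kept sorted by offset
end

SplitPool(capacity::Int) = SplitPool([Block(0, capacity, true)])

function alloc!(pool::SplitPool, sz::Int)
    for (i, b) in enumerate(pool.blocks)
        b.free && b.sz >= sz || continue
        if b.sz > sz
            # split: the remainder becomes a new free block after this one
            insert!(pool.blocks, i + 1, Block(b.off + sz, b.sz - sz, true))
            b.sz = sz
        end
        b.free = false
        return b
    end
    return nothing  # out of memory: the caller would trigger a GC here
end

function free!(pool::SplitPool, b::Block)
    b.free = true
    i = findfirst(x -> x === b, pool.blocks)
    # coalesce with free neighbors to fight fragmentation
    if i < length(pool.blocks) && pool.blocks[i + 1].free
        b.sz += pool.blocks[i + 1].sz
        deleteat!(pool.blocks, i + 1)
    end
    if i > 1 && pool.blocks[i - 1].free
        pool.blocks[i - 1].sz += b.sz
        deleteat!(pool.blocks, i)
    end
    return nothing
end
```

The split/coalesce bookkeeping is why this scheme is slower per call than popping a cached buffer from a bin, but it returns freed memory to the pool immediately, which is exactly the "doesn't lean on the GC" behavior seen above.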
Introduce an INVALID state for initial and actually freed blocks.
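To illustrate what a commit like this buys: tagging freshly created blocks and blocks whose backing memory has actually been released with an explicit INVALID state lets every transition be checked, catching double frees and use of stale blocks early. The following is a hypothetical sketch of such a state machine (names and transitions are illustrative, not the actual CuArrays internals):

```julia
# Hypothetical block lifecycle sketch, NOT the CuArrays internals.
# INVALID covers blocks with no backing memory: the initial state,
# and blocks whose memory was actually returned to CUDA.

@enum BlockState INVALID AVAILABLE ALLOCATED FREED

mutable struct MemBlock
    state::BlockState
end

function transition!(b::MemBlock, from::BlockState, to::BlockState)
    # checked transition: refuse e.g. freeing an already-freed block
    b.state == from || error("invalid transition: block is $(b.state), expected $from")
    b.state = to
    return b
end

# typical lifecycle
block = MemBlock(INVALID)
transition!(block, INVALID, AVAILABLE)    # backing memory acquired from CUDA
transition!(block, AVAILABLE, ALLOCATED)  # handed out to the user
transition!(block, ALLOCATED, FREED)      # freed by the user; cached in the pool
transition!(block, FREED, INVALID)        # memory actually released to CUDA
```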
Summary of improvements with Flux.jl (still using Tracker.jl) + @KristofferC's resnet50
So, solid improvements across the board. Even stronger improvements on non-GC-heavy workloads (e.g. Knet.jl models, which free memory much more aggressively before it hits the GC). I propose to tag and release a version of CuArrays with the new allocator available but not enabled by default.
Detailed timings for those interested:
Binned pool, 16GB GPU:
Binned pool, 8GB cap:
Split allocator, 16GB GPU:
Split allocator, 8GB cap:
Exciting stuff!