
Rethinking the programming model #143

Open · MikeInnes opened this issue Mar 26, 2019 · 7 comments
Labels: speculative (Not sure about this one yet.)

@MikeInnes (Contributor) commented Mar 26, 2019

Duplicating FluxML/Flux.jl#706 here so that the right people can see it. I think the GPU maintainers generally agree that this is a good idea (please say if not) but we haven't written it down anywhere yet. Ideally we can work out some forward path for putting some effort into this.

@maleadt (Member) commented Mar 26, 2019

I'm working on some of the necessary CUDAdrv improvements over at JuliaGPU/CUDAdrv.jl#133.

@vchuravy (Member) commented

Part of the challenge is that only on very modern Linux systems is any malloc valid. On pretty much anything else you need to use cudaMalloc :/

@maleadt (Member) commented Mar 27, 2019

> Part of the challenge is that only on very modern Linux systems is any malloc valid.

Is there even a version of Linux & CUDA where this works? Sure, HMM is merged in 4.14, but it doesn't work on CUDA 10 + Linux 4.19.

Furthermore, it's not like unified memory is a magic bullet. Workloads that flip between CPU and GPU will still be about as slow as the current allowscalar(true), so I think one would prefer a hard and clear failure when that happens.
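
For illustration, a minimal sketch of that slow path, assuming the CuArrays.jl API of the time (allowscalar toggles whether scalar indexing of a CuArray is permitted):

```julia
using CuArrays

xs = CuArray(rand(Float32, 10_000))

CuArrays.allowscalar(true)
s = 0f0
for i in eachindex(xs)
    global s += xs[i]   # each iteration is a separate device-to-host transfer: correct, but very slow
end

CuArrays.allowscalar(false)
xs[1]                   # now throws an error instead of silently taking the slow path
```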

@MikeInnes (Contributor, Author) commented

Widely-available HMM definitely seems like the major blocker. I think it's worth exploring whether some workarounds are possible. For example, we could swap out Julia's default malloc (and even swap out all existing pointers when CuArrays is loaded). This seems technically feasible, though I don't know if there are downsides to using cudaMalloc by default for all allocations.

If the major downside to this approach is that we have a little extra work to turn slow code into failures/warnings, that seems like an OK position to be in. If cuda is a compiler pass, there's plenty of good tooling and diagnostics we can build around that pretty easily.
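
As a rough sketch of what allocating through the CUDA runtime instead of the system malloc could look like (hedged: alloc_managed is a hypothetical helper, it assumes the CUDA runtime is loadable as libcudart, it uses cudaMallocManaged rather than plain cudaMalloc so the buffer stays CPU-accessible, and error handling is omitted):

```julia
# Hypothetical sketch: allocate CUDA unified (managed) memory and expose it to Julia
# as an ordinary Array, so the same buffer is visible to both CPU and GPU code.
function alloc_managed(::Type{T}, n::Integer) where T
    ptr = Ref{Ptr{Cvoid}}(C_NULL)
    # cudaMallocManaged(void** devPtr, size_t size, unsigned int flags); 0x01 == cudaMemAttachGlobal
    ccall((:cudaMallocManaged, "libcudart"), Cint,
          (Ptr{Ptr{Cvoid}}, Csize_t, Cuint), ptr, n * sizeof(T), 0x01)
    unsafe_wrap(Array, Ptr{T}(ptr[]), n)
end

xs = alloc_managed(Float32, 1024)   # behaves like a normal Vector{Float32} on the CPU side
```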

@maleadt (Member) commented Mar 28, 2019

> a little extra work to turn slow code into failures/warnings

Except that those cases would become very hard to spot. As soon as some shared pointer leaks (which wouldn't be limited to CuArray <-> Ptr conversions, since anything CPU-allocated can leak into GPU code and vice versa) there's the risk of slowing down computation, causing memory traffic, etc.

Isn't the higher abstraction level much more suited for capturing inputs and uploading them to the GPU? I haven't been following Flux.jl, but I think I'd greatly prefer improving it over betting on unified memory (performance cost: unknown) and hoping we don't make things even harder to reason about.

@MikeInnes (Contributor, Author) commented

I think that's where we need some empirical testing, to see how likely this really is to trip people up. My feeling is that while those cases are possible, they are going to be much less common than just running a few simple matmuls in a clearly scoped block, which is going to work fine and have far fewer hazards than the current model. The cost of running the experiment seems low for the potential gains -- and we can decide whether to bet the farm on it later.

FWIW what I'm proposing is also significantly different from the CUDA C unified programming model, where CPU and GPU kernels can be pretty freely mixed, and closer to what we have now. Kernels don't have to be allowed outside a cuda block and scalar indexing can be disabled within it; it can be thought of as simply automating the conversion to CuArray (indeed that might be one way to prototype it).
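
A hedged sketch of that prototyping route, assuming the CuArrays.jl API; cuda_block is a hypothetical helper, not an existing function:

```julia
using CuArrays

# Hypothetical "cuda block": upload array arguments, disallow scalar indexing inside
# the block, and download the result; essentially automating the conversion to CuArray.
function cuda_block(f, xs::AbstractArray...)
    CuArrays.allowscalar(false)      # slow scalar fallbacks become hard errors
    try
        y = f(map(CuArray, xs)...)
        return Array(y)              # bring the result back to the CPU
    finally
        CuArrays.allowscalar(true)
    end
end

# A clearly scoped block of simple matmuls, as described above:
W, x = rand(Float32, 128, 128), rand(Float32, 128)
y = cuda_block((W, x) -> tanh.(W * x), W, x)
```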

Improving Flux is obviously preferable, but I basically think we've hit a wall there. You put conversions in a bunch of places, and if anything is slightly wrong you run out of memory or get an obscure error. The TensorFlow-style approach takes control of that for you, at a very high cost to usability (that's why we're here, after all). Unified memory is the only way I can see to get the best of all worlds, though of course I'm very open to other suggestions.

MikeInnes changed the title from "CUDA Unified Memory" to "Rethinking the programming model" on Apr 11, 2019
@MikeInnes (Contributor, Author) commented

My issue title was misleading and unclear; unified memory is kind of beside the point here, it's just one implementation of a better CUDA programming model (and possibly not the best one).

We discussed this a bit today and came to the conclusion that prototyping this as a simple compiler pass is the right way to try it out. There are various other things – e.g. better array abstractions in Base – that we may need for the full story, but that's a start. I may get time to prototype something soon.
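
One way such a prototype could look, sketched here with Cassette.jl (hedged: the overdubbed methods and the cuda entry point are illustrative only, not an agreed design; a real pass would need to intercept far more calls):

```julia
using Cassette, CuArrays

Cassette.@context CUDACtx

# Redirect a couple of representative allocation calls to GPU arrays.
Cassette.overdub(::CUDACtx, ::typeof(zeros), dims::Integer...) = CuArrays.zeros(Float64, dims...)
Cassette.overdub(::CUDACtx, ::typeof(ones),  dims::Integer...) = CuArrays.ones(Float64, dims...)

# Hypothetical entry point: run `f` with its array inputs moved to the GPU and any
# allocations inside rewritten by the context above.
cuda(f, xs...) = Cassette.overdub(CUDACtx(), f, map(x -> x isa AbstractArray ? CuArray(x) : x, xs)...)
```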

Anyone interested in hacking on this is welcome to reach out and I can help with that too.

maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
maleadt added the speculative label (Not sure about this one yet.) on May 27, 2020